How to Defy Data Gravity
Since changing companies I have been incredibly busy, and my blog has looked neglected of late. I had been aiming for at least a post or two per week, and the tempo will soon be changing to move back toward that….
As a first taste of what will be coming in a couple of weeks I thought I would talk a bit about something I have been thinking a great deal about.
Is it possible to defy Data Gravity?
First a quick review of Data Gravity:
Data Gravity is the theory that data has mass. As data (mass) accumulates, it begins to exert gravity. This Data Gravity pulls services and applications closer to the data. The attraction (gravitational force) is driven by the need of services and applications for higher-bandwidth and/or lower-latency access to the data.
Defying Data Gravity, how?
After considering how this might be possible, I believe that the following strategies/approaches could make it feasible to come close to Defying Data Gravity.
Each of the bullets below can be leveraged to help defy Data Gravity, but each has pros and cons. The strengths of some of these patterns and technologies are the weaknesses of others, which is why they are often combined in highly available and scalable solutions.
All of the patterns below provide an abstraction or transformation of some type to either the data or the network:
- Load Balancing : Abstracts Clients from Services, Systems, and Networks from each other
- CDNs : Abstract Data from its root source to Network Edges
- Queueing (Messaging or otherwise) : Abstracts System and Network Latency
- Proxying : Abstracts Systems from Services (and vice versa)
- Caching : Abstracts Data Latency
- Replication : Abstracts the Single Source of Data (Multiplies the Data, e.g. Geo-Rep or Clustering)
- Statelessness : Abstracts Logic from Data Dependencies
- Sessionless : Abstracts the Client
- Compression (Data/Indexing/MapReduce) : Abstracts (Reduces) the Data Size
- Eventual Consistency : Abstracts Transactional Consistency (Reduces chances of running into Speed of Light problems i.e. Locking)
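As one concrete illustration of the caching pattern above, here is a minimal read-through cache sketch in Python. The `fetch_fn` and TTL are hypothetical stand-ins for a slow, remote data source; the point is simply that repeat reads never leave the local node:

```python
import time

# A read-through cache: serve repeat reads locally instead of going
# back to the remote data source every time.
class ReadThroughCache:
    def __init__(self, fetch_fn, ttl_seconds=60):
        self.fetch_fn = fetch_fn   # stands in for a slow, remote read
        self.ttl = ttl_seconds     # how long a cached entry stays fresh
        self.store = {}            # key -> (value, time cached)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None:
            value, cached_at = entry
            if time.time() - cached_at < self.ttl:
                return value       # cache hit: no trip to the data's home
        value = self.fetch_fn(key) # cache miss: pull from the source
        self.store[key] = (value, time.time())
        return value

# Usage: the lambda stands in for a remote fetch.
cache = ReadThroughCache(fetch_fn=lambda k: k.upper(), ttl_seconds=60)
cache.get("orders")  # first read goes to the source
cache.get("orders")  # second read is served locally
```

The trade-off, as with all of these patterns, is staleness: the TTL is exactly the window in which the cache and the data's source of truth can disagree.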
So to make this work, we have to fake the location and presence of the data so that our services and applications appear to have all of it locally beneath them. While this isn't a perfect answer, it lets us move less of the data around while still getting reasonable performance. Using the above patterns allows an Application, and potentially the services and data it relies on, to move from one place to another – potentially having the effect of Defying Data Gravity. It is important to realize that the stronger the gravitational pull and the Service Energy around the data, the less effective any of these methods will be.
Why is Defying Data Gravity so hard?
The speed of light is the answer. You can only shuffle data around so quickly; even on the fastest networks, you are still bound by distance, bandwidth, and latency. All of these are bound by time, which brings us back to the speed of light. You can only transfer so much data across the distance of your network so quickly (even in a perfect world, the speed of light remains the limit).
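To put rough numbers on this, a quick back-of-the-envelope calculation (the link speed and fiber distance are assumed for illustration):

```python
# Back-of-the-envelope: even on a fast link, bulk transfer time dwarfs
# propagation latency, and neither can beat physics.
PETABYTE_BITS = 1e15 * 8        # 1 PB of data expressed in bits
LINK_BPS = 10e9                 # an assumed 10 Gbit/s network link
FIBER_LIGHT_KM_S = 200_000      # light travels ~200,000 km/s in fiber

transfer_seconds = PETABYTE_BITS / LINK_BPS
transfer_days = transfer_seconds / 86_400
print(round(transfer_days, 1))  # → 9.3 days to move 1 PB at 10 Gbit/s

# One-way propagation delay across an assumed 4,000 km fiber path:
latency_ms = 4_000 / FIBER_LIGHT_KM_S * 1000
print(round(latency_ms))        # → 20 ms, which no added bandwidth removes
```

Bandwidth can be bought; the propagation delay cannot, which is why the patterns above aim to avoid the round trip entirely rather than make it faster.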
The methods described here are simply a pathway to portability. Without standard services, platforms, and the like, even with these patterns it becomes impossible to move an Application, Service, or Workload outside the boundaries of its present location.
A Final Note…
There are two ways to truly Defy Data Gravity (neither of which is very practical):
- Store all of your Data locally with each user and make them responsible for their Data.
- If you want to move, be willing to accept downtime (this could be minutes to months), simply store off all of your data, and ship it somewhere else. This method would work no matter how large the data set, as long as you don't care about being down.
This is a nice summary, but it misses one very important aspect of data: its rate of change. If data isn't changing, it's easy to make copies (e.g. a CDN) that flow out one way from a central point. If data is changing slowly, then eventual consistency may be able to keep it all in sync; but if data is being updated continuously, then it effectively has higher gravity.
You could add this to the analogy by talking about data velocity. In relativity, as velocity approaches the speed of light, an object's effective mass grows without bound, giving it ever higher gravity. I.e., if data is too big and changing too fast to replicate, you have to move compute to be near it.
Another technique to consider is white-listing, where the only data being replicated is data that is known to be needed remotely. This can reduce traffic for cases where all of the data is accessed locally but a white-listed subset can be accessed remotely as well.
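A minimal sketch of that white-listing idea in Python (the keys and change set here are hypothetical; a real replicator would filter its change stream the same way):

```python
# White-list replication: ship only the keys known to be needed
# remotely; everything else stays at the data's home site.
WHITELIST = {"catalog", "prices"}   # hypothetical remotely needed keys

def replicate(changes, whitelist):
    """Return only the changed records whose keys are white-listed."""
    return {k: v for k, v in changes.items() if k in whitelist}

local_changes = {"catalog": "v2", "prices": "v7", "audit_log": "raw"}
to_ship = replicate(local_changes, WHITELIST)
print(sorted(to_ship))  # → ['catalog', 'prices']
```

Here `audit_log` never crosses the wire, so the replicated subset has far less mass than the full data set.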
Maybe you should leave (data) gravity alone and focus more on mobility of execution, which allows many new optimizations. Granted, it does require the (routing) infrastructure to become aware of the service(s)/interaction/data context and to be able to cost, provision, and predict fine-grained workloads.
Microsoft has done a significant amount of research in defining cost (optimized) models and (job) scheduling policies suitable for the cloud.
Forgot to mention that you need to factor in the ratio of data input to data output following processing, as transient data does have a cost – or, in your analogy, pull (friction).
I like how you’ve been extending data gravity. It seems like there are architectural approaches that “defy” or weaken data gravity; some of the items you list enable portability.
Another thing to consider, ‘twould seem, is the amount of data used in any particular service instance/request. That is, if the overall amount of data is quite large (high gravity) but the amount used for some particular service is small, then querying from afar and providing a service is quite feasible. This may be a likely case for analytics where multiple data sets are being combined in some way. This could be considered a form of compression (run a query to get a much smaller subset…).
Various optimizations of this may be what Mr. Louth is referring to. Alternatively, this is layering of services where the ‘initial service’ (the query service) must be close to the data well.
The best way to solve the data gravity problem is for all cloud services to allow any interaction to specify both the data storage and data processing locations (services). Any content not pertinent to the specifics of the service provider/vendor should be stored externally, as specified by customers during account creation or overridden on a per-interaction basis. Customers should own the data created directly via their interactions. In fact, it should be a mandatory requirement for company compliance that all data (not entirely specific to a vendor’s operations) be stored elsewhere, in an open format, for backup/import/…. purposes.
I have previously outlined such a scenario in a blog entry titled “Metering in the Cloud: Visualizations Part 1”, in which a customer instructs Salesforce to direct its storage/querying requirements to an external provider, in this case AWS S3.