CAP Theorem and Clouds

A background on CAP Theorem:

CAP Theorem is firmly anchored in the SOA (Service Oriented Architecture) movement and is showing promise as a way of classifying different types of Cloud Solution Architectures. What follows is an explanation of CAP Theorem, how it works, and why it is so relevant to anyone looking at Clouds (Public, Private, Hybrid, or otherwise).

Distributed Systems Theory – The CAP Theorem:

CAP Theorem was first proposed by Eric Brewer in 2000 (co-founder and Chief Scientist of Inktomi at the time) and was formally proven two years later by Seth Gilbert and Nancy Lynch. CAP stands for Consistency, Availability, and Partition tolerance (also written Partitioning tolerance). CAP Theorem states that a system can fully provide only TWO of the three capabilities. So you can have Consistency and Availability, but then you don’t have Partitioning tolerance. You could have Availability and Partitioning tolerance without rigid Consistency. Finally, you could have Consistency and Partitioning tolerance without Availability.

The KEY assumption is that the system needs to persist data and/or has state of some type. If you don’t need either data persistence or state ANYWHERE, you can get very close to having Consistency, Availability, and Partitioning tolerance simultaneously.

Understanding Consistency, Availability, and Partitioning:

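Consistency is a system’s ability to ensure that every read reflects the most recent write, so that all clients see the same data no matter which node they ask. It is often described in terms of the ACID properties of transactions (a common characteristic of modern RDBMS), although the C in CAP is a narrower guarantee than ACID as a whole. Another way to think about this is how strict or rigid the system is about maintaining the integrity of reads/writes and ensuring there are no conflicts. In an RDBMS this is typically done through some type of locking.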

Availability is a system’s ability to successfully respond to ALL requests made. Think of data or state information split between two machines: a request is made, machine 1 has some of the data, and machine 2 has the rest. If either machine goes down, not ALL requests can be fulfilled, because neither machine holds all of the data or state information on its own.
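The two-machine scenario above can be sketched in a few lines of Python. This is only an illustration: the Machine class, the handle_request function, and the sample keys are hypothetical names invented for this example, not part of any real system.

```python
class Machine:
    """A hypothetical storage node that holds part of the data."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def handle_request(keys, machines):
    """Answer a request only if every key can be read from a live machine."""
    result = {}
    for key in keys:
        # Find the machine that owns this key (assumes exactly one owner).
        owner = next(m for m in machines if key in m.data)
        result[key] = owner.get(key)  # raises ConnectionError if the owner is down
    return result


machine1 = Machine("machine1", {"name": "Alice"})
machine2 = Machine("machine2", {"email": "alice@example.com"})
machines = [machine1, machine2]

# Both machines are up: the full request succeeds.
full = handle_request(["name", "email"], machines)

# Machine 2 goes down: the same request can no longer be served.
machine2.up = False
try:
    handle_request(["name", "email"], machines)
    served = True
except ConnectionError:
    served = False
```

Requests that only touch machine 1 still succeed after the failure, but the system as a whole is no longer Available in the CAP sense, because some requests cannot be answered.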

Partitioning tolerance is the ability of a system to gracefully handle Network Partitioning events. A Network Partitioning event occurs when a system is no longer accessible (think of a network connection failing). A different way of considering Partitioning tolerance is to think of it in terms of message passing. If an individual system can no longer send/receive messages to/from other systems, it has been effectively “partitioned” out of the network.

A great deal of discussion has occurred over Partitioning, and some have argued that it should instead be referred to as Latency. The idea is that if Latency is high enough, then even if an individual system is able to respond, it will be treated by other systems as if it has been partitioned.
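To make the Latency argument concrete, here is a minimal sketch of how other systems might decide that a peer is partitioned. The Peer class, the treated_as_partitioned function, and the 500 ms timeout are all assumptions made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    """A hypothetical view of another system in the network."""
    name: str
    reachable: bool     # can messages get through at all?
    latency_ms: float   # round-trip time when they do


def treated_as_partitioned(peer, timeout_ms=500.0):
    """Return True if other systems would give up on this peer.

    A peer that cannot exchange messages is partitioned. A peer that
    answers more slowly than the timeout looks exactly the same to
    the caller, so it is treated the same way.
    """
    if not peer.reachable:
        return True
    return peer.latency_ms > timeout_ms


healthy = Peer("node-a", reachable=True, latency_ms=20.0)
unplugged = Peer("node-b", reachable=False, latency_ms=0.0)
slow = Peer("node-c", reachable=True, latency_ms=4000.0)

status = {p.name: treated_as_partitioned(p) for p in (healthy, unplugged, slow)}
```

In this sketch node-c is technically able to respond, yet it ends up in the same bucket as the unplugged node-b, which is the point the Latency argument makes.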

In Practice:

CAp – Think of a traditional Relational Database (e.g. MS SQL, DB2, Oracle 11g, Postgres). If any of these systems lose their connection or experience high latency, they cannot service all requests and are therefore NOT Partitioning tolerant. (There are ways to solve this problem, but none are perfect.)

cAP – A NoSQL Store (e.g. Cassandra, MongoDB, Voldemort). These systems are highly resilient to Network Partitioning (assuming that you have several servers supporting any of these systems) and they offer Availability. This is achieved by giving up a certain amount of Consistency: these solutions follow an Eventual Consistency model.
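As a rough illustration of Eventual Consistency, the sketch below shows two replicas that keep accepting writes while they cannot reach each other and then converge once they can exchange data again. The Replica class is hypothetical, and the last-write-wins rule based on timestamps is just one possible conflict-resolution strategy chosen for simplicity; it is not a description of how any particular NoSQL store works internally.

```python
class Replica:
    """A hypothetical replica that stays available during a partition."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Last-write-wins: keep the value with the newest timestamp.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Replay the other replica's entries; newer timestamps win.
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)


a = Replica("a")
b = Replica("b")

# During the partition each replica accepts writes on its own.
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)
during_partition = (a.read("cart"), b.read("cart"))

# Once the partition heals, the replicas exchange data and converge.
a.sync_from(b)
b.sync_from(a)
after_heal = (a.read("cart"), b.read("cart"))
```

While the partition lasts, the two replicas give different answers for the same key, which is the Consistency being given up. After the sync they agree again, which is the "eventual" part.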

CaP – This isn’t an attractive option, as your system will not always be available, which makes it of limited use, at least in a Cloud environment. An example would be a system in which, if one of the nodes fails, the other nodes can’t respond to requests. Think of a solution that has a head-end where, if the head-end fails, it takes down all of the nodes with it.
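The head-end example can be sketched as follows. The HeadEnd and Node classes are hypothetical and deliberately simplistic: the head-end holds the only copy of the data, so every node gives the same answer (Consistency), but no node can answer at all once the head-end is gone (no Availability).

```python
class HeadEnd:
    """A hypothetical coordinator that holds the single authoritative copy."""

    def __init__(self):
        self.up = True
        self.data = {}


class Node:
    """A hypothetical node that forwards every request to the head-end."""

    def __init__(self, name, head_end):
        self.name = name
        self.head_end = head_end

    def _require_head_end(self):
        # Refuse to answer rather than risk returning stale or conflicting data.
        if not self.head_end.up:
            raise RuntimeError(f"{self.name}: head-end unreachable, refusing request")

    def write(self, key, value):
        self._require_head_end()
        self.head_end.data[key] = value

    def read(self, key):
        self._require_head_end()
        return self.head_end.data.get(key)


head_end = HeadEnd()
node1 = Node("node1", head_end)
node2 = Node("node2", head_end)

# While the head-end is up, a write through one node is visible through the other.
node1.write("balance", 100)
seen_by_node2 = node2.read("balance")

# When the head-end fails, every node stops responding.
head_end.up = False
failures = []
for node in (node1, node2):
    try:
        node.read("balance")
    except RuntimeError:
        failures.append(node.name)
```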

A Balancing Act

When CAP Theorem is put into practice, it is more of a balancing act in which you never truly leave out C, A, or P. It is a matter of which two of the three the system is closest to (as seen below).

NOTE: Post to follow tying this more closely to the Cloud coming tomorrow.