The CAP Theorem

In this tutorial, we are going to explore about the CAP Theorem and its usage in distributed systems.

Background

In distributed systems, different types of failures can occur, e.g., servers can crash or fail permanently, disks can go bad resulting in data losses, or network connection can be lost, making a part of the system inaccessible. How can a distributed system model itself to get the maximum benefits out of different resources available?

Solution

The CAP theorem states that a distributed data store can only guarantee two out of the three following properties at any given time:

Consistency ( C )

All nodes see the same data at the same time. This means users can read or write from/to any node in the system and will receive the same data. It is equivalent to having a single up-to-date copy of the data.
Every read operation will return the most recent write or an error.
Example: After a write operation, all users see the updated data immediately.

Availability ( A )

Every request (read or write) will receive a response (either success or failure), even if some of the nodes are unavailable.
The system is highly responsive, but this does not guarantee that the most up-to-date data is returned.
In simple terms, availability refers to a system’s ability to remain accessible even if one or more nodes in the system go down.
Example: The system will continue to work even if part of it is down.

Partition tolerance ( P )

A partition is a communication break (or a network failure) between any two nodes in the system, i.e., both nodes are up but cannot communicate with each other.
The system continues to function despite network partitions (communication breakdowns between nodes).
Even if some nodes are not communicating with others, the system as a whole should still be available and provide data access.
A partition-tolerant system continues to operate even if there are partitions in the system. Such a system can sustain any network failure that does not result in the failure of the entire network.
Example: The system can handle latency or failure in network communication between distributed systems.

According to the CAP theorem, any distributed system needs to pick two out of the three properties. The three options are CA, CP, and AP. However, CA is not really a coherent option, as a system that is not partition-tolerant will be forced to give up either Consistency or Availability in the case of a network partition. Therefore, the theorem can really be stated as: In the presence of a network partition, a distributed system must choose either Consistency or Availability.

We cannot build a general data store that is continually available, sequentially consistent, and tolerant to any partition failures. We can only build a system that has any two of these three properties. Because, to be consistent, all nodes should see the same set of updates in the same order. But if the network loses a partition, updates in one partition might not make it to the other partitions before a client reads from the out-of-date partition after having read from the up-to-date one. The only thing that can be done to cope with this possibility is to stop serving requests from the out-of-date partition, but then the service is no longer 100% available.

Understanding the Trade-offs

According to the CAP Theorem, a distributed system can guarantee at most two of these properties at the same time:

CP (Consistency and Partition Tolerance):

Guarantees consistency and partition tolerance but sacrifices availability.
Example: HBase, Zookeeper. If a network partition occurs, some requests may not receive a response to ensure data consistency.

AP (Availability and Partition Tolerance)

Guarantees availability and partition tolerance but sacrifices consistency.Example: CouchDB, Cassandra.
These systems will respond to requests, but some may return stale data during a partition.

CA (Consistency and Availability)

Guarantees consistency and availability but sacrifices partition tolerance.
This is rarely seen in modern distributed systems because partitions are almost always possible in large-scale systems. If partitioning occurs, the system would not be able to tolerate it.

Example Systems and Their CAP Choices

Lets explore about the CAP theorem examples and their CAP choices

Cassandra: Prioritizes Availability and Partition Tolerance (AP). It may serve stale data but ensures that the system continues to respond.
MongoDB: Can be configured to prioritize Consistency and Partition Tolerance (CP) or Availability and Partition Tolerance (AP) depending on the configuration.
HBase: Prioritizes Consistency and Partition Tolerance (CP), sacrificing availability during network partitions.

The CAP Theorem highlights the inherent trade-offs in distributed systems, forcing developers to make critical choices depending on the needs of their application (e.g., whether data consistency or system availability is more important).

That’s all about the CAP Theorem. If you have any queries or feedback, please write us at contact@waytoeasylearn.com. Enjoy learning, Enjoy system design interview series..!!

The CAP Theorem