Cassandra Architecture

In this tutorial, we discuss the Cassandra architecture. Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers without a single point of failure. Its architecture provides high availability, fault tolerance, and linear scalability, making it a strong choice for systems that require massive data storage and fast queries.

Cassandra common terms

Before digging deep into Cassandra architecture, let’s first go through some of its common terms:

Column: A column is a key-value pair and is the most basic unit of data structure.

Column key: Uniquely identifies a column in a row.
Column value: Stores one value or a collection of values.

Row: A row is a container for columns referenced by primary key. Cassandra does not store a column that has a null value; this saves a lot of space.

Table: A table is a collection of rows, similar to a table in a relational database but with a more flexible structure. Columns are declared in the table's schema, yet a row stores only the columns that have values, so rows in the same table can hold different numbers of columns.

Keyspace: A keyspace is a container for tables and can span one or more Cassandra nodes.

Cluster: A cluster is the set of nodes that together host one or more keyspaces.

Node: Node refers to a computer system running an instance of Cassandra. A node can be a physical host, a machine instance in the cloud, or even a Docker container.

NoSQL: Cassandra is a NoSQL database, which means there are no joins between tables and no foreign keys. Queries are also restricted: by default, the WHERE clause can filter only on primary key columns, unless a secondary index is defined or filtering is explicitly allowed (at a performance cost). Keep these constraints in mind before deciding to use Cassandra.

Here’s an in-depth look at Cassandra architecture:

1. Distributed and Decentralized Architecture (Peer-to-Peer Model)

Cassandra follows a peer-to-peer architecture where every node (server) in the cluster is equal, meaning there is no master node. This decentralization improves reliability and scalability by avoiding bottlenecks:

Nodes: Each node in Cassandra is identical, participating equally in data storage and query processing.
Ring Topology: The nodes are logically arranged in a ring, and data is distributed across all nodes in the cluster. There is no single point of control, and every node can serve read/write requests.
Gossip Protocol: Nodes communicate with each other using a gossip protocol, a peer-to-peer communication protocol that helps them discover the state of other nodes and keep updated about the cluster’s health.

2. Partitioning and Data Distribution

Cassandra uses a distributed hash table (DHT) to distribute data across nodes. It breaks up the data into partitions using a consistent hashing mechanism:

Partition Key: Cassandra uses a partition key to determine where data is stored. The key is hashed, and the resulting hash value determines which node stores the data.
Token Ranges: Each node is assigned a range of tokens (hash values), and nodes are responsible for all data whose partition key hashes fall within their assigned token range.
Automatic Data Distribution: As data is written to the cluster, it is automatically distributed across multiple nodes based on the partition key.
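
The mapping from partition key to node can be illustrated with a small token ring. This is only a minimal sketch under stated assumptions: the class TokenRing, the node names, and the token values are hypothetical, each node owns a single token, and MD5 is used only because it ships with Python (Cassandra's default partitioner is Murmur3-based).

```python
import bisect
import hashlib


def token(partition_key):
    """Hash a partition key to an unsigned 64-bit token (illustrative hash choice)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical ring: each node owns the range ending at its own token."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so the ring can be binary-searched.
        self._ring = sorted((t, n) for n, t in node_tokens.items())
        self._tokens = [t for t, _ in self._ring]

    def owner_of_token(self, key_token):
        # The owner is the first node whose token is >= the key's token;
        # past the last token, wrap around to the start of the ring.
        i = bisect.bisect_left(self._tokens, key_token)
        return self._ring[i % len(self._ring)][1]

    def owner(self, partition_key):
        return self.owner_of_token(token(partition_key))


ring = TokenRing({"node-a": 0, "node-b": 2**62, "node-c": 2**63})
print(ring.owner("user:42"))  # always the same node for the same key
```

Because the owner depends only on the key's hash and the sorted tokens, any node can compute where a partition lives without asking a central coordinator.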

3. Replication

Cassandra provides fault tolerance through data replication across multiple nodes. Each piece of data can be replicated to multiple nodes in the cluster for redundancy:

Replication Factor (RF): This defines how many copies of the data are stored on different nodes. If the RF is set to 3, Cassandra will replicate each piece of data on three different nodes.
Replication Strategies:
- SimpleStrategy: Primarily for single data centers. Replicates data to adjacent nodes in the ring.
- NetworkTopologyStrategy: Designed for multi-data center deployments. It allows you to specify how many copies of the data should be replicated across different data centers for high availability and disaster recovery.
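
As a rough illustration of SimpleStrategy-style placement, the sketch below walks clockwise around a ring and collects distinct nodes until the replication factor is met. The function name, ring layout, and token values are assumptions made for the example; rack and data center awareness (the job of NetworkTopologyStrategy) is deliberately left out.

```python
import bisect


def simple_strategy_replicas(ring, key_token, rf):
    """Return up to rf distinct nodes, starting at the key's owner and moving clockwise.

    ring is a list of (token, node) pairs sorted by token.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, key_token)
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == rf:
            break
    return replicas


ring = [(0, "a"), (100, "b"), (200, "c"), (300, "d")]
print(simple_strategy_replicas(ring, 150, 3))  # ['c', 'd', 'a']
```

With RF set to 3, the key with token 150 lands on its owner "c" and on the next two nodes in the ring, wrapping around from "d" to "a".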

4. Consistency and Tunable Consistency

Cassandra is designed to be highly available and partition-tolerant (the AP side of the CAP theorem), but it offers tunable consistency so that each operation can trade consistency against availability:

Consistency Levels: Cassandra offers a wide range of consistency levels for read/write operations. Examples include:
- ONE: Data is read from or written to a single replica.
- QUORUM: Data is read from or written to the majority of replicas.
- ALL: Data is read from or written to all replicas.
Eventual Consistency: By default, Cassandra offers eventual consistency, meaning that updates will eventually propagate to all replicas, but not necessarily immediately.
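
The arithmetic behind these levels can be sketched in a few lines. The helper names below are hypothetical; the rule they encode is the commonly cited one that a read is guaranteed to see the latest acknowledged write when the replicas contacted for reads plus those contacted for writes exceed the replication factor.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


print(quorum(3))                                      # 2
print(read_sees_latest_write("QUORUM", "QUORUM", 3))  # True
print(read_sees_latest_write("ONE", "ONE", 3))        # False
```

With RF = 3, QUORUM reads and writes each touch 2 replicas, so at least one replica is common to both. ONE/ONE touches only 1 + 1 replicas and can return stale data until the update propagates.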

5. Data Model

Cassandra’s data model is a wide-column store: data is organized into tables whose columns are declared in a schema, but each row stores only the columns that actually have values:

Keyspace: A keyspace is the top-level data container, akin to a database in relational systems. It defines the replication strategy and number of replicas.
Tables: Inside keyspaces, data is organized into tables. Tables in Cassandra are similar to relational tables but are more flexible in terms of schema.
Columns: Columns are grouped into rows, and a row stores only the columns that have been set, so rows in the same table can hold different numbers of columns. Each row is identified by a primary key, which consists of a partition key (used for data distribution) and optional clustering columns (used for sorting data within a partition).

6. Writes in Cassandra

Cassandra is optimized for high-speed writes and uses a Log-Structured Merge-tree (LSM) storage model:

Memtable: Writes are first written to an in-memory data structure called a memtable.
Commit Log: Writes are also appended to a commit log for durability in case of node failure.
SSTables (Sorted String Tables): When the memtable is full, data is flushed to disk as immutable SSTables. Over time, SSTables are merged during compaction to reduce data fragmentation.
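
The write path described above can be approximated with a toy in-memory model. Everything here is an assumption for illustration: the class name StorageNode, the entry-count flush threshold, and the use of Python lists in place of on-disk files. Real Cassandra flushes on memory thresholds and keeps the commit log in segments on disk.

```python
class StorageNode:
    """Toy model of a node's write path: commit log, memtable, and SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory map holding the most recent writes
        self.sstables = []     # each flushed SSTable is an immutable sorted tuple
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Record the write for durability, then apply it in memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs the log for crash recovery.
        self.commit_log.clear()


node = StorageNode(memtable_limit=2)
node.write("b", 1)
node.write("a", 2)
print(node.sstables)  # [(('a', 2), ('b', 1))]
```

No existing file is modified on a write: data is appended to the log, updated in memory, and later written out as a new sorted file, which is what makes writes cheap in an LSM design.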

7. Reads in Cassandra

Cassandra reads data by checking multiple places:

Memtable: First, Cassandra checks the in-memory memtable for recent writes.
SSTables: If data is not found in the memtable, it checks on-disk SSTables, reading indexes and data files.
Bloom Filters: Cassandra uses bloom filters to optimize reads by quickly identifying whether a specific SSTable might contain the requested data.
Row Cache (optional): Frequently read data can be cached in memory to speed up reads.
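
To make the role of bloom filters concrete, here is a simplified read path. The BloomFilter class, its sizes, and the read function are hypothetical, and the lookup order mirrors the description above; a real node also merges versions from several SSTables by timestamp, which this sketch omits.

```python
import hashlib


class BloomFilter:
    """Small bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def read(key, memtable, sstables):
    """Check the memtable first, then SSTables (newest first) whose filter matches."""
    if key in memtable:
        return memtable[key]
    for bloom, data in sstables:
        # Skip the lookup entirely when the filter says "definitely absent".
        if bloom.might_contain(key) and key in data:
            return data[key]
    return None


bloom = BloomFilter()
bloom.add("user:1")
sstables = [(bloom, {"user:1": "Ada"})]
print(read("user:1", {}, sstables))  # Ada
```

The filter answers "definitely not here" or "possibly here", so most SSTables that do not hold the key are skipped without touching their index or data files.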

8. Compaction

Cassandra uses compaction to merge SSTables, reducing fragmentation and reclaiming disk space:

Compaction Strategy: Cassandra offers various compaction strategies, such as Size-Tiered Compaction (merging similarly sized SSTables) and Leveled Compaction (designed to minimize read latency).
Garbage Collection: During compaction, overwritten data is discarded, and tombstones (markers for deleted data) are purged from disk once their grace period (gc_grace_seconds) has passed.
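
A compaction pass can be pictured as a merge that keeps the newest version of each key and drops expired tombstones. The function below is a sketch under assumptions: SSTables are modelled as dictionaries of key to (timestamp, value), a value of None stands for a tombstone, and the now and gc_grace_seconds parameters are plain integers chosen for the example.

```python
TOMBSTONE = None  # assumed marker for a deleted value in this sketch


def compact(sstables, now, gc_grace_seconds):
    """Merge SSTables into one, keeping the newest version of every key."""
    newest = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)

    merged = {}
    for key in sorted(newest):
        timestamp, value = newest[key]
        # A tombstone is dropped only after its grace period has elapsed.
        if value is TOMBSTONE and now - timestamp >= gc_grace_seconds:
            continue
        merged[key] = (timestamp, value)
    return merged


older = {"a": (1, "x"), "b": (1, "y")}
newer = {"a": (2, "z"), "b": (5, TOMBSTONE)}
print(compact([older, newer], now=100, gc_grace_seconds=10))  # {'a': (2, 'z')}
```

Here the overwritten value of "a" and the deleted key "b" both disappear from the merged output, which is how compaction reclaims disk space.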

9. Fault Tolerance and Recovery

Data Replication: Replication across multiple nodes ensures high availability. If a node fails, other replicas can serve the data.
Hinted Handoff: If a replica node is temporarily down, the coordinator node stores a hint for the missed write and replays it to that node when it comes back online.
Read Repair: During read operations, if inconsistencies are detected between replicas, Cassandra will repair the out-of-date replicas.
Anti-Entropy Repair: A repair process, typically run on a regular schedule (for example with nodetool repair), compares replicas and synchronizes them so that all hold correct, up-to-date data.
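
Read repair can be sketched as "pick the newest version, then push it to any replica that disagrees". The function name and the in-memory replica layout below are assumptions for illustration; the reconciliation rule shown (the value with the highest write timestamp wins) is the last-write-wins approach Cassandra uses.

```python
def read_with_repair(replicas, key):
    """Read a key from every replica, return the newest value, and fix stale copies.

    replicas maps a node name to a dict of key -> (timestamp, value).
    """
    responses = {node: data.get(key) for node, data in replicas.items()}
    found = [r for r in responses.values() if r is not None]
    if not found:
        return None

    newest = max(found, key=lambda r: r[0])  # highest timestamp wins
    for node, response in responses.items():
        if response != newest:
            replicas[node][key] = newest     # repair the stale or missing copy
    return newest[1]


replicas = {
    "n1": {"k": (2, "new")},
    "n2": {"k": (1, "old")},
    "n3": {},
}
print(read_with_repair(replicas, "k"))  # new
print(replicas["n2"]["k"])              # (2, 'new')
```

After the read, all three replicas hold the same version, so later reads at a low consistency level are less likely to return stale data.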

10. Scaling

Cassandra is designed to scale linearly, meaning read and write throughput grows roughly in proportion to the number of nodes in the cluster:

Horizontal Scaling: Instead of scaling up (adding more powerful hardware), Cassandra is designed for scaling out by adding more nodes.
No Downtime Scaling: Nodes can be added or removed without taking the system offline, and Cassandra automatically redistributes data across the cluster.
Token Assignment: New nodes are assigned token ranges, and data is rebalanced across the ring to accommodate the new node’s range.
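
The following sketch shows why adding a node disturbs only part of the data under consistent hashing. It assumes one token per node, derives tokens by hashing hypothetical node names with MD5, and uses made-up keys; production clusters usually assign many virtual tokens per node, which spreads the moved ranges more evenly.

```python
import bisect
import hashlib


def token(value):
    """Hash a string to an unsigned 64-bit token (illustrative hash choice)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes):
    # One token per node, derived from the node name for this example.
    return sorted((token(n), n) for n in nodes)


def owner(ring, key):
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token(key))
    return ring[i % len(ring)][1]


def moved_keys(old_nodes, new_node, keys):
    """Map each key whose owner changes to its (old_owner, new_owner) pair."""
    before = build_ring(old_nodes)
    after = build_ring(old_nodes + [new_node])
    moved = {}
    for key in keys:
        old, new = owner(before, key), owner(after, key)
        if old != new:
            moved[key] = (old, new)
    return moved


keys = [f"user:{i}" for i in range(500)]
moved = moved_keys(["node-a", "node-b", "node-c"], "node-d", keys)
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that moves ends up on the new node, and all other keys stay where they were, which is why a cluster can grow without reshuffling its entire dataset.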

11. Multi-Datacenter Support

Cassandra architecture natively supports deployments across multiple data centers. This ensures:

Geographic Fault Tolerance: Data is replicated across data centers, ensuring that the system remains operational even if an entire data center goes offline.
Low Latency: Reads and writes can be served from the nearest data center, reducing latency for geographically distributed applications.

Key Advantages of Cassandra Architecture

Here are the key advantages of the Cassandra architecture:

No Single Point of Failure: Its peer-to-peer architecture ensures that no single node controls the cluster, and replication guarantees high availability.
High Availability: Data replication across nodes and data centers ensures redundancy.
Scalability: It can easily scale horizontally by adding more nodes without disruption.
Fault Tolerance: It has robust mechanisms like hinted handoff, replication, and automatic failover to handle node failures gracefully.
Tunable Consistency: You can adjust the consistency level based on the application’s needs for data accuracy versus availability.

Conclusion

Cassandra's architecture makes it well suited to large-scale, high-throughput applications that require fault tolerance, high availability, and horizontal scalability. Its decentralized, peer-to-peer design eliminates single points of failure, while its tunable consistency options allow users to balance performance against data accuracy for their specific use cases. It is a particularly good fit for applications that need low-latency writes and large datasets spread across geographically distributed locations.

That’s all about the Cassandra architecture in advanced system design concepts. If you have any questions or feedback, please email us at contact@waytoeasylearn.com. Enjoy learning, enjoy system design!
