Cassandra Architecture Fundamentals

Pramod Shehan

13 min readNov 22, 2019

1) NoSQL

Do not follow the traditional relational database management systems principles.
ACID principles is reduced. (Atomicity, Consistency, Isolation and Durability).

Atomicity- a transaction must be treated as an atomic unit, that is, either all of its operations are executed or none.
Consistency- The database must remain in a consistent state after any transaction.
Isolation- In a database system where more than one transaction are being executed simultaneously and in parallel. No transaction will affect the existence of any other transaction.
Durability- The database should be durable enough to hold all its latest updates even if the system fails or restarts

Normalization is not mandatory. databases are de-nomarlized.

2) CAP Theorem

CAP theorem is Consistency, Availability and partition tolerance. This theorem was proposed by Eric Brewer in 1999. CAP is basis of distributed databases.

CAP theorem is a concept that distributed database system can have two of above three. Not possible to guarantee all three concepts simultaneously in the distributed databases.

Consistency- Same data must be available in all nodes in the cluster at the same time.
Availability- Every request must get response whether it is succeeded or failed.
Partition Tolerance- The system continues to run without losing messages. If part of the system is failed, it is not affected to the entire system.

All the distributed databases must be come up with partition tolerance aspect. It is mandatory. Without partition tolerance, entire system is down.

When we consider Cassandra it is mainly satisfied only Partition tolerance and Availability. But we can improve consistency also.

Consistency + Partition Tolerance- mongoDB, Reddis, Apache HBASE
Availability + Partition Tolerance- Cassandra

3) Cassandra Features

`Dynamo + Big Table = Cassandra`

Architecture requirements

No single point of failure.
Massive scalability .
Possibility to add nodes any time.
Highly distributed.
High performance.

Features of Cassandra Architecture

No masters and slaves (Peer to peer).
Ring type architecture
Automatic data distribution across all nodes
Replication of data across nodes
Data kept in memory and written to disk in a lazy fashion
Hash values of the keys are used to distribute data among nodes

Additional features

Supports multiple data centers
Data can be replicated across data centers

4) What is the coordinator?

When Cassandra client is hit a write and read request on the node in the cluster. That node is called as coordinator. According to the below picture, coordinator node may be changed every time.

Coordinator is selected by the cassandra driver based on the policy, you have set. Most common policies are DCAwareRoundRobinPolicy and TokenAwarePolicy.

5) Partitioning

Cassandra organizes data into partitions. This is a common concept of distributed data systems. All the data is distributed into chucks called partition. Partitioning is very important for performance and scalability.

In Cassandra, Partition is performed using hashing. In here Cassandra is using column called partitioning key.

6) Partition key

The partition key is responsible for distributing data among nodes. In Cassandra, we can only access data from the partitioning key.

In above table, Car name is a partitioning key. When data is coming to the cassandra, it gets token of that incoming data by providing partitioning key to the hash function.

BMW -> Hash Function -> 9

Toyata -> Hash Function -> 17

Audi -> Hash Function -> 31

Benz -> Hash Function -> 25

below diagram is displayed how the data is distributed among the cluster by using partition key.

7) Clustering key

Clustering columns order data within a partition. Each primary key column after the partition key is considered a clustering key.

The database uses the clustering information to identify where the data is within the partition.

8) Cassandra Write Operation

When Cassandra get a write request,

Write in a commit log
Write in a memtable
Send acknowledgement to the client.
After reaching configured limit, Flushing data from the memtable.
Storing data on disk in SSTable.
Data in the commitlog is purged after the flush.

Commit log -The commit log is a cash recovery mechanism. Commit log is a file on hard disk. there is no data structure. Commit log stores log record of every transaction happening in Cassandra. This ensures durability.

memtable- MemTable is data structure which defining in the memory.The memtable stores writes in sorted order. Simply Memtable is dedicated in memory cache created for each cassandra table.

Why Cassandra write operation is fast?

SSTable are immutable. When Cassandra writes into the SSTable, it couldn’t update them. If Cassandra needs to update column, it adds new record to the SSTable. So this is very slow to write to SSTable directly.

To get a better performance, Cassandra does not write directly on SSTable. It writes commit log and memtable and send an acknowledgement to the client. It keeps writes on memory after exceeding configurable limit flushes those changes to the SSTable using sequential I/O. Random I/0 is avoided. So there is a minimal disk I/O at the time. That is why cassandra write operation is very fast.

What happened Cassandra machine crash before flushing data to the SSTable?

All the data writes on memtable. We think machine is crashed. After that we fix the machine and start the machine again. All the memtable data are lost.

To avoid loosing data, Cassandra is using commit logs. When something changes on memTable, Cassandra writes the changes to commit log to keep track changes of memtable.

Why Cassandra writes data to commit log instead of SSTable?

The commit log stores changes in a single file. So disk does not do a huge I/O operation. Moreover disk does not need to do a bunch of seeks when it is receiving updates for multiple column families at the same time.

9) Cassandra Read Operation

When read operation comes to Cassandra, that operation hit on one node. that node is coordinator node.

The row key must be supplied for every read operation.

The row key is another name for the PRIMARY KEY.(partition key + clustering keys)

These are the steps when reading data from Cassandra,

Check the memtable
Check row cache, if enabled

Row cache pulls entire partitions into memory. if any part of partition has been modified , the entire cache for that row is invalidated. So invalidating big pieces of memory.

Checks Bloom filter
Checks partition key cache, if enabled

The key cache holds the location of keys in memory on a per-column family basis. Key cache helps where a particular partition begins in the SSTable.

Goes directly to the compression offset map if a partition key is found in the partition key cache, or checks the partition summary if not If the partition summary is checked, then the partition index is accessed

Partition summery is sampling of partition index to speedup the access to index on disk.Default sampling ratio is 128, meaning that for every 128 records for a index in index file, we have 1 records in partition summary. Each of these records of partition summary will hold key value and offset position in index.

Locates the data on disk using the compression offset map

Compression offset maps holds the offset information for compressed blocks.

Fetches the data from the SSTable on disk

When Cassandra gets read operation, First data gets from memtable. After that data gets from SSTables. If compaction is not run recently, so there may be several SSTables. Therefore data gets from every SSTables. There is timestamp column for every row. We get a latest record by that timestamp.

Below image is described how to get latest BMW record from Cassandra when BMW record are available in memtable and several SSTables.

10) Snitch

The job of a snitch is to determine which data centers and racks are to be written to and read from(relative host proximity/host is relatively nearer).

Snitch gathers information about topology(simple Strategy or Network Topology Strategy).

There are several types of snitches. Depending on Snitches, It can efficiently route the read and write requests among the cluster.

Simple Snitch

This is the default Snitch in Cassandra. Simple Snitch assumes all the nodes are in same data center and same rack(only for single data center).

2. Property File Snitch

This Snitch is using cassandra-topology.properties file. That snitch can be used to determine proximity by using data center and rack.

Every nodes in the cluster must be listed in the casssandra-topology.properties file.

This cassandra-topology.properties file must be same on every node in the cluster.

# datacenter One

175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1

120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2

# datacenter Two

110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1

50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2

If we add new node into the cluster we must add it as well into that cassandra-topology.properties file by manually.

3. Gossiping Property File Snitch

This snitch is using file called cassandra-rackdc.properties. Every node has this file and we define rack and data center information in that property file.

dc=DC1
rack=RAC1

Those mentioned information of nodes are distributed among other nodes via gossiping. datacenter and rack names are case-sensitive.

4. Dynamic Snitch

5. Rack Inferring Snitch

11) Gossip Protocol

gossip is communication protocol. It is exchanged all the information about themselves and other nodes in the cluster. This process runs in every second. So all the nodes quickly know about status of other nodes.

What kind of data is exchanged between nodes,

Heartbeat State-

When the node started.
When gossiping session sent.

Application State-

Node status (Normal/Leaving/Joining)
Data center
Rack
Schema
Load(Performance of the node)
Severity(I/O)

Load and Severity give us good indication of current performance of the node.

12) Replication strategy

Cassandra is support replication. It stores data on multiple nodes. So Cassandra is ensured fault tolerance and reliability(the quality of being trustworthy).

Replication strategy determines where the replicas are placed across the data centers. There are two replication strategies.

Simple Strategy

This strategy is for using single data center and one rack.

First select the appropriate node using partition key.

Example- Partition key is BMW. Hash value is 5 of BMW. node 01 is stored data , if hash value is between 0–10. So first copy is stored in node 01.

Other replicas are selected using clockwise in the ring.(step 02 & step 03) So other two replicas are stored in node 02 and node 03.
So replication factor is 3, Cassandra is stored data only in 3 nodes.
If replication factor is 2, it is stored data in 2 nodes.

2. Network Topology Strategy

When you are using multiple data centers , This is the replication strategy for replication. We can define how many replicas in each data centers.

NetworkTopology tries to place replicas on different racks. Rack can fail at the same time due to power, network and different issues. If all the nodes are same racks , when rack is failed, that data center’s replicas can not be used at that time.

REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3};

DC1 called datacenter has 3 replicas and DC2 called datacenter has 2 replicas.

REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 2 };

we can restrict the replication of a keyspace to selected datacenters.

REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 0, 'DC2' : 3, 'DC3' : 0 };

13) Hinted Handoff

When a node becomes down or unresponsive, hinted handoff allows Cassandra to continue its write operation without any problems.

Now we can enabled and disabled hinted handoff by using cassandra.conf.

when node is down, all the writes data which belongs to that down node key range are stored for a period of time.

The hint consists as below,

target Id- downed node
hint Id- time uuid
message id- cassandra version
data is stored as a blob.

Hints are flushed to disk every 10 seconds.

When gossip knows that down node is up and running again, all the remaining hint are written to the new node. After that hint file is deleted.

There is an another configuration called “max_hint_window_in_ms” in cassandra.conf file.

If node is down for longer than max_hint_window_in_ms, stops writing new hints.

14) Tombstones (Soft delete)

Cassandra is not deleted data from the disk immediately. If Cassandra do it, it takes lots of time. That is why Cassandra comes up with Tombstones. Tombstones are a mechanism which allows Cassandra to write fast.

So Cassandra use marker as special value called tombstones to determine which data is deleted.

Tombstones prevent deleted data from being returned during reads.

Tombstone is generated by,

DELETE statement
TTL (time-to-live)
INSERT or UPDATE with null values
Update with collection column

tombstone_warn_threshold - Cassandra will display a warning, if the scanned tombstone count is exceeded tombstone_warn_threshold by a query.

tombstone_failure_threshold - Cassandra abort the query, if the scanned tombstone count is exceed tombstone_failure_threshold value by a query.

There is a setting called Garbage Collection Grace Seconds(gc_grace_seconds). This is the amount of time that the server will wait to garbage-collect a tombstone. default value is 10 days(864000 seconds). Tombstones will be dropped during compaction after gc_grace_second has passed.

Tombstones will not be removed until a compaction event even if gc_grace_seconds has elapsed.

Problems of Tombstones

This is how cassandra.yaml is described about a lots of tombstones.

Lot of tombstones are come up with performance problems and server heap.

Cassandra abort the query, if the scanned tombstone count is exceed tombstone_failure_threshold value by a query.
cluster is rapidly filling up.

15) Compaction

SSTables are immutable. Mutations , adding new data, updating data, deleting data are inserted into memtable. Always adding new record to the memtable when doing above mentioned operations. After that memtable is periodically flushed to the different SSTables.

When we consider update operation in Cassandra, there may be old value and new value in different SSTables or same SSTable. In that case Cassandra is using timestamps to figure out which is the most recent value. In here Cassandra is using lots of disk space. In this case, we are trying to read some data from the Cassandra, query might need to read from several SSTables to get a result. So read operation may be slow. That is why Cassandra needs a operation called compaction.

Compaction is doing read all the existing SSTables and merge all the rows with most recent information into the one SSTable. Basic idea of compaction is created new SSTable instead of existing SSTables.

As we discussed earlier, SSTables are immutable. If Cassandra need to update an existing row, it will add another new row in same SSTable or different SSTable.

When doing compaction, all the tombstones are permanently removed.

There are several compaction strategies.

Size Tiered Compaction Strategy- default one
Leveled Compaction Strategy
DateTiered Compaction Strategy
Time Window Compaction Strategy

16) Bloom Filter

Bloom filter is a data structure. It is used to test whether an element is a member of a set. False positive matches are possible and false negative are not. This is extremely very fast.

Cassandra is using bloom filter to check whether requested partition key is available in any of the SSTables without reading existing data of SSTables. By using blooom filter, we can avoid expensive I/O operations.

There are corresponding bloom filers in memory per each SSTables.

17) Cassandra Consistency

This is concept of CAP theorem. We can configure consistency for a session or per individual read and write operation.

So we can improve consistency level using below strategies.

Write consistency level

Any- a write must succeed on any available node.(Highest availability)
One- a write must succeed on any node responsible for that row.(either primary or replica)
Two- a write must succeed on two nodes.
Quorum- a write must succeed on a quorum of replica nodes.

quorum nodes = (replication factor/2) + 1

Local Quorum- a write must succeed on a quorum of replica nodes in the same data center as the coordinator nodes.
Each Quorum- a write must succeed on a quorum of replica nodes in all data centers.
All- a write must succeed on all replica nodes.(Lowest availability)

Read consistency level

One- reads from the closest node holding the data.
Quorum- return a result from quorum of servers with the most recent timestamp for the data.
Local Quorum- return a result from a quorum of servers with the most recent timestamp for the data in the same data center as the coordinator node.
Each Quorum- return a result from a quorum of servers with the most recent timestamp in all data centers.
All- return a result from all replica nodes.