Cassandra Architecture Fundamentals

1) NoSQL

  • NoSQL databases do not follow the principles of traditional relational database management systems.
  1. Atomicity- a transaction must be treated as an atomic unit, that is, either all of its operations are executed or none.
  • Normalization is not mandatory; data is usually denormalized.

2) CAP Theorem

CAP stands for Consistency, Availability and Partition tolerance. The theorem was proposed by Eric Brewer in 1999 and is the basis of distributed database design.

The CAP theorem states that a distributed database system can provide only two of these three properties. It is not possible to guarantee all three simultaneously.

  • Consistency- The same data must be visible on all nodes in the cluster at the same time.

Every distributed database must provide partition tolerance. It is mandatory: without it, a network partition between nodes would bring the entire system down.

Cassandra mainly satisfies partition tolerance and availability, but its consistency can be tuned as well.

Consistency + Partition Tolerance- MongoDB, Redis, Apache HBase

Availability + Partition Tolerance- Cassandra

3) Cassandra Features

Dynamo + Big Table = Cassandra

Architecture requirements

  • No single point of failure.

Features of Cassandra Architecture

  • No masters and slaves (Peer to peer).

Additional features

  1. Supports multiple data centers

4) What is the coordinator?

When a Cassandra client sends a read or write request to a node in the cluster, that node is called the coordinator. As the picture below shows, the coordinator node may change with every request.

The coordinator is selected by the Cassandra driver based on the load balancing policy you have set. The most common policies are DCAwareRoundRobinPolicy and TokenAwarePolicy.

5) Partitioning

Cassandra organizes data into partitions, a common concept in distributed data systems. All the data is distributed in chunks called partitions. Partitioning is very important for performance and scalability.

In Cassandra, partitioning is performed by hashing. Cassandra hashes a column (or set of columns) called the partition key.

6) Partition key

The partition key is responsible for distributing data among nodes. In Cassandra, data can be accessed efficiently only through the partition key.

In the table above, car name is the partition key. When data arrives, Cassandra obtains its token by passing the partition key to the hash function.

BMW -> Hash Function -> 9

Toyota -> Hash Function -> 17

Audi -> Hash Function -> 31

Benz -> Hash Function -> 25

The diagram below shows how data is distributed across the cluster using the partition key.
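The mapping from partition key to node can be sketched in a few lines of Python. This is a simplified model, not Cassandra's implementation: the toy hash (MD5 reduced to a token space of 40) and the four-node ring with its token bounds are assumptions chosen for illustration, whereas Cassandra's default partitioner uses Murmur3 over a much larger token range.

```python
import hashlib
from bisect import bisect_left

# Hypothetical ring: each node owns the tokens up to and including its bound.
RING = [(9, "node1"), (19, "node2"), (29, "node3"), (39, "node4")]
TOKEN_SPACE = 40

def token_for(partition_key: str) -> int:
    """Toy hash: map a partition key to a token in [0, TOKEN_SPACE)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % TOKEN_SPACE

def node_for_token(token: int) -> str:
    """Return the node whose token range contains the given token."""
    bounds = [upper for upper, _ in RING]
    return RING[bisect_left(bounds, token)][1]

for car in ["BMW", "Toyota", "Audi", "Benz"]:
    token = token_for(car)
    print(car, "-> token", token, "->", node_for_token(token))
```

With this ring, the example tokens from the text (9, 17, 31, 25) would land on node1, node2, node4 and node3 respectively.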

7) Clustering key

Clustering columns order data within a partition. Each primary key column after the partition key is considered a clustering key.

The database uses the clustering information to identify where the data is within the partition.
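As a rough illustration of how clustering keys order rows inside one partition, here is a minimal Python sketch. The column choice (model year as the clustering key, model name as the row value) is hypothetical and only serves the example.

```python
from bisect import insort
from collections import defaultdict

# partition key -> list of (clustering key, row), kept sorted by clustering key
partitions = defaultdict(list)

def insert(partition_key, clustering_key, row):
    """Insert a row at its sorted position within the partition."""
    insort(partitions[partition_key], (clustering_key, row))

def read_partition(partition_key):
    """Rows come back in clustering-key order, whatever the insert order was."""
    return [row for _, row in partitions[partition_key]]

def read_range(partition_key, low, high):
    """Return rows whose clustering key lies in [low, high]."""
    return [row for ck, row in partitions[partition_key] if low <= ck <= high]

insert("BMW", 2019, "X5")
insert("BMW", 2015, "M3")
insert("BMW", 2021, "i4")
print(read_partition("BMW"))
print(read_range("BMW", 2016, 2021))
```

Because rows are stored sorted, a range query on the clustering key only has to touch a contiguous slice of the partition.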

8) Cassandra Write Operation

When Cassandra gets a write request,

  1. Write in a commit log

Commit log- The commit log is a crash recovery mechanism. It is an append-only file on disk with no further data structure. The commit log stores a record of every write happening in Cassandra, which ensures durability.

Memtable- The memtable is an in-memory data structure that stores writes in sorted order. Put simply, a memtable is a dedicated in-memory cache created for each Cassandra table.

Why is the Cassandra write operation fast?

SSTables are immutable. Once Cassandra has written an SSTable, it cannot update it. If Cassandra needs to update a column, it adds a new record rather than modifying the old one. Writing every change directly to an SSTable would therefore be very slow.

To get better performance, Cassandra does not write directly to SSTables. It writes to the commit log and the memtable and then sends an acknowledgement to the client. It keeps writes in memory and, once a configurable limit is exceeded, flushes them to an SSTable using sequential I/O. Random I/O is avoided, so disk I/O at write time is minimal. That is why the Cassandra write operation is very fast.
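The write path described above can be sketched as a toy Python class. Everything here is an illustrative assumption rather than Cassandra's actual code: the `ToyNode` name, the `key=value` text format, the file names and the flush threshold of three entries.

```python
import os
import tempfile

class ToyNode:
    def __init__(self, data_dir, flush_threshold=3):
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold  # assumed limit, in entries
        self.memtable = {}
        self.sstable_count = 0
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")

    def write(self, key, value):
        # 1. Append to the commit log (sequential write) for durability.
        self.commit_log.write(f"{key}={value}\n")
        self.commit_log.flush()
        # 2. Update the in-memory memtable.
        self.memtable[key] = value
        # 3. Flush to a new immutable SSTable once the memtable is full.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()
        return "ack"

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable-{self.sstable_count}.txt")
        with open(path, "w") as sstable:
            for key in sorted(self.memtable):  # SSTables are written sorted
                sstable.write(f"{key}={self.memtable[key]}\n")
        self.sstable_count += 1
        self.memtable.clear()

node = ToyNode(tempfile.mkdtemp())
for car in ["BMW", "Audi", "Toyota"]:
    node.write(car, "in stock")
print(node.sstable_count, node.memtable)
```

The client is acknowledged after the log append and the memtable update; the SSTable file is only written in one sequential pass when the threshold is reached.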

What happens if a Cassandra machine crashes before flushing data to an SSTable?

All writes go to the memtable. Suppose the machine crashes, and we then repair and restart it. All the data that was in the memtable is lost.

To avoid losing data, Cassandra uses the commit log. Whenever something changes in the memtable, Cassandra also writes the change to the commit log, so the memtable can be rebuilt by replaying the log.
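Crash recovery from the commit log can be illustrated with a small Python sketch. The tab-separated log format and the function names are assumptions made for this example only.

```python
import os
import tempfile

def append_to_commit_log(path, key, value):
    """Record a write durably before it is considered accepted."""
    with open(path, "a") as log:
        log.write(f"{key}\t{value}\n")

def replay_commit_log(path):
    """Rebuild the memtable from the commit log after a restart."""
    memtable = {}
    if not os.path.exists(path):
        return memtable
    with open(path) as log:
        for line in log:
            key, value = line.rstrip("\n").split("\t", 1)
            memtable[key] = value  # later entries overwrite earlier ones
    return memtable

log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
append_to_commit_log(log_path, "BMW", "blue")
append_to_commit_log(log_path, "Audi", "red")
append_to_commit_log(log_path, "BMW", "black")

# Simulated crash: the in-memory memtable is gone, only the log file survives.
recovered = replay_commit_log(log_path)
print(recovered)
```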

Why does Cassandra write data to the commit log instead of an SSTable?

The commit log stores changes in a single file that is only appended to, so the disk does not perform heavy I/O. Moreover, the disk does not need to do a bunch of seeks when it receives updates for multiple column families at the same time.

9) Cassandra Read Operation

When a read operation comes to Cassandra, it hits one node. That node is the coordinator node.

The row key must be supplied for every read operation.

The row key is another name for the PRIMARY KEY (partition key + clustering keys).

These are the steps when reading data from Cassandra:

  • Check the memtable

The row cache pulls entire partitions into memory. If any part of a partition is modified, the entire cache entry for that row is invalidated, so large pieces of memory may be invalidated at once.

  • Checks Bloom filter

The key cache holds the location of keys in memory on a per-column-family basis. The key cache helps find where a particular partition begins in the SSTable.

  • Goes directly to the compression offset map if the partition key is found in the partition key cache, or checks the partition summary if not. If the partition summary is checked, the partition index is then accessed.

The partition summary is a sampling of the partition index that speeds up access to the index on disk. The default sampling ratio is 128, meaning that for every 128 records in the index file there is 1 record in the partition summary. Each record of the partition summary holds a key value and its offset position in the index.
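To make the sampling idea concrete, the following Python sketch builds a summary from every 128th entry of a made-up partition index and uses it to narrow a lookup. The key names, offsets and in-memory lists are hypothetical; on a real node the index lives in a file on disk.

```python
from bisect import bisect_right

SAMPLING_INTERVAL = 128  # the sampling ratio mentioned in the text

# Hypothetical partition index: sorted (partition key, offset in data file).
partition_index = [(f"key{i:05d}", i * 100) for i in range(1000)]

# The summary keeps every 128th index entry: (key, position in the index).
partition_summary = [
    (partition_index[pos][0], pos)
    for pos in range(0, len(partition_index), SAMPLING_INTERVAL)
]

def lookup(key):
    """Return the data-file offset for key, or None if it is absent."""
    summary_keys = [k for k, _ in partition_summary]
    slot = bisect_right(summary_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first indexed key
    start = partition_summary[slot][1]
    # Scan at most one sampling interval of the index instead of all of it.
    for k, offset in partition_index[start:start + SAMPLING_INTERVAL]:
        if k == key:
            return offset
    return None

print(len(partition_summary), lookup("key00500"))
```

The summary is small enough to keep in memory, and each lookup reads at most 128 index entries rather than the whole index.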

  • Locates the data on disk using the compression offset map

The compression offset map holds the offset information for compressed blocks.

  • Fetches the data from the SSTable on disk

When Cassandra gets a read operation, it first gets data from the memtable and then from the SSTables. If compaction has not run recently, there may be several SSTables, so data is read from every relevant SSTable. Every row carries a write timestamp, and the latest record is chosen by that timestamp.

The image below shows how the latest BMW record is obtained when BMW records are present in the memtable and in several SSTables.
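The timestamp-based merge can be sketched in Python as follows. The dictionaries standing in for the memtable and the SSTables, and the integer timestamps, are assumptions for illustration.

```python
# Each store maps partition key -> (value, write timestamp).
memtable = {"BMW": ("black", 30)}
sstables = [
    {"BMW": ("blue", 10), "Audi": ("red", 12)},
    {"BMW": ("white", 20)},
]

def read(key):
    """Return the value with the highest timestamp across all stores."""
    candidates = [store[key] for store in [memtable, *sstables] if key in store]
    if not candidates:
        return None
    value, _ = max(candidates, key=lambda cell: cell[1])
    return value

print(read("BMW"), read("Audi"), read("Benz"))
```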

10) Snitch

The job of a snitch is to determine which data centers and racks are written to and read from, based on relative host proximity (which hosts are nearer).

The snitch gathers information about the network topology, which the replication strategies (Simple Strategy or Network Topology Strategy) use to place replicas.

There are several types of snitches. Depending on the snitch, Cassandra can efficiently route read and write requests within the cluster.

  1. Simple Snitch

This is the default snitch in Cassandra. Simple Snitch assumes all the nodes are in the same data center and the same rack (suitable only for a single data center).

2. Property File Snitch

This snitch uses a property file to determine proximity by data center and rack.

Every node in the cluster must be listed in the file.

The file must be the same on every node in the cluster.

# datacenter One

# datacenter Two

If we add a new node to the cluster, we must also add it to that file manually.

3. Gossiping Property File Snitch

This snitch uses a property file. Every node has this file, and we define that node's rack and data center in it.

This information is then distributed to the other nodes via gossip. Data center and rack names are case-sensitive.

4. Dynamic Snitch

5. Rack Inferring Snitch

11) Gossip Protocol

Gossip is a communication protocol. Nodes use it to exchange information about themselves and about other nodes in the cluster. The process runs every second, so all nodes quickly learn the status of the other nodes.

The following kinds of data are exchanged between nodes:

Heartbeat State-

  • When the node started.

Application State-

  • Node status (Normal/Leaving/Joining)

Load and Severity give us a good indication of the current performance of the node.
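One way to picture a gossip exchange is as a merge of two views of the cluster, where the newer state for each node wins. The sketch below is a simplification under stated assumptions: the dictionary layout and the field names `generation` (when the node started), `version` and `status` are illustrative, and the real protocol exchanges digests before sending full state.

```python
def merge_gossip(local, remote):
    """Merge a peer's view into ours, keeping the newest state per node."""
    merged = dict(local)
    for node, state in remote.items():
        current = merged.get(node)
        if current is None or (state["generation"], state["version"]) > (
            current["generation"],
            current["version"],
        ):
            merged[node] = state
    return merged

view_a = {
    "node1": {"generation": 100, "version": 5, "status": "NORMAL"},
    "node2": {"generation": 100, "version": 2, "status": "JOINING"},
}
view_b = {
    "node2": {"generation": 100, "version": 7, "status": "NORMAL"},
    "node3": {"generation": 101, "version": 1, "status": "LEAVING"},
}
merged = merge_gossip(view_a, view_b)
print(merged["node2"]["status"], sorted(merged))
```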

12) Replication strategy

Cassandra supports replication: it stores copies of data on multiple nodes. This ensures fault tolerance and reliability (the quality of being trustworthy).

The replication strategy determines where replicas are placed across the data centers. There are two replication strategies.

  1. Simple Strategy

This strategy is for a single data center and one rack.

  • First, select the appropriate node using the partition key.

Example- The partition key is BMW and its hash value is 5. Node 01 stores data whose hash value is between 0 and 10, so the first copy is stored on node 01.

  • The other replicas are placed on the next nodes clockwise in the ring (step 02 and step 03), so the other two replicas are stored on node 02 and node 03.
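The clockwise placement described above can be sketched in Python. The four-node ring and its token bounds are assumptions that extend the example in the text (node 01 owning hash values 0 to 10); this is not Cassandra's own placement code.

```python
from bisect import bisect_left

# Hypothetical ring in clockwise order: (upper token bound, node name).
RING = [(10, "node01"), (20, "node02"), (30, "node03"), (40, "node04")]

def replicas_for(token, replication_factor):
    """Pick the owning node, then walk clockwise for the other replicas."""
    bounds = [upper for upper, _ in RING]
    first = bisect_left(bounds, token) % len(RING)
    count = min(replication_factor, len(RING))
    return [RING[(first + i) % len(RING)][1] for i in range(count)]

print(replicas_for(5, 3))   # BMW from the example, hash value 5
print(replicas_for(35, 3))  # wraps around the end of the ring
```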

2. Network Topology Strategy

When you use multiple data centers, this is the replication strategy to use. We can define how many replicas are kept in each data center.

Network Topology Strategy tries to place replicas on different racks, because nodes in the same rack can fail at the same time due to power, network or other issues. If all the replicas are in the same rack and that rack fails, that data center's replicas cannot be used.

REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3};

The data center called DC1 has 3 replicas and the data center called DC2 has 2 replicas.

REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 2 };

We can restrict the replication of a keyspace to selected data centers.

REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC1' : 0, 'DC2' : 3, 'DC3' : 0 };

13) Hinted Handoff

When a node goes down or becomes unresponsive, hinted handoff allows Cassandra to continue write operations without problems.

Hinted handoff can be enabled or disabled in cassandra.yaml.

When a node is down, all writes that belong to that node's key range are stored as hints for a period of time.

A hint consists of the following:

  • target Id- downed node

Hints are flushed to disk every 10 seconds.

When gossip discovers that the downed node is up and running again, all the remaining hints are written to that node. After that, the hint file is deleted.

There is another setting called “max_hint_window_in_ms” in the cassandra.yaml file.

If a node is down for longer than max_hint_window_in_ms, the coordinator stops writing new hints for it.
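The hinted handoff flow can be sketched as a toy coordinator in Python. The class, its method names, the in-memory `storage` dictionary and the three-hour window are all assumptions for illustration; real hints are persisted to disk and include more fields.

```python
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # assumed window of 3 hours

class Coordinator:
    def __init__(self):
        self.hints = {}       # target node -> list of (key, value)
        self.down_since = {}  # target node -> time first seen down (ms)

    def write(self, target, key, value, now_ms, target_is_up, storage):
        if target_is_up:
            storage.setdefault(target, {})[key] = value
            return "written"
        first_down = self.down_since.setdefault(target, now_ms)
        if now_ms - first_down > MAX_HINT_WINDOW_MS:
            return "hint dropped"  # node down longer than the hint window
        self.hints.setdefault(target, []).append((key, value))
        return "hint stored"

    def replay(self, target, storage):
        """Called when gossip reports that the target is up again."""
        for key, value in self.hints.pop(target, []):
            storage.setdefault(target, {})[key] = value
        self.down_since.pop(target, None)

storage = {}
coordinator = Coordinator()
r1 = coordinator.write("node2", "BMW", "blue", 0, False, storage)
r2 = coordinator.write("node2", "Audi", "red", MAX_HINT_WINDOW_MS + 1, False, storage)
coordinator.replay("node2", storage)
print(r1, r2, storage)
```

Writes that arrive after the window has expired are not hinted, which is why a node that was down for a long time needs a repair rather than relying on hints alone.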

14) Tombstones (Soft delete)

Cassandra does not delete data from disk immediately, because doing so would take a lot of time. Instead, Cassandra uses tombstones, a mechanism that keeps writes fast.

A tombstone is a special marker value that Cassandra writes to indicate which data has been deleted.

Tombstones prevent deleted data from being returned during reads.

Tombstones are generated by:

  • DELETE statement

tombstone_warn_threshold - Cassandra logs a warning if the number of tombstones scanned by a query exceeds tombstone_warn_threshold.

tombstone_failure_threshold - Cassandra aborts the query if the number of tombstones scanned by a query exceeds tombstone_failure_threshold.

There is a setting called Garbage Collection Grace Seconds (gc_grace_seconds). This is the amount of time the server waits before a tombstone becomes eligible for garbage collection. The default value is 10 days (864000 seconds). Tombstones are dropped during compaction after gc_grace_seconds has passed.

Tombstones are not removed until a compaction runs, even if gc_grace_seconds has elapsed.
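The effect of the two thresholds on a read can be sketched in Python. The threshold values, the `TOMBSTONE` marker object and the `TombstoneLimitExceeded` error are assumptions made for this illustration; the real limits are whatever cassandra.yaml configures.

```python
import warnings

# Assumed values for illustration; the real limits are set in cassandra.yaml.
TOMBSTONE_WARN_THRESHOLD = 1000
TOMBSTONE_FAILURE_THRESHOLD = 100000

TOMBSTONE = object()  # marker standing in for a deleted value

class TombstoneLimitExceeded(Exception):
    """Hypothetical error raised when a query scans too many tombstones."""

def scan(cells, warn=TOMBSTONE_WARN_THRESHOLD, fail=TOMBSTONE_FAILURE_THRESHOLD):
    """Return live cells, skipping tombstones and enforcing both thresholds."""
    live = []
    tombstones = 0
    for key, value in cells:
        if value is TOMBSTONE:
            tombstones += 1
            if tombstones > fail:
                raise TombstoneLimitExceeded(f"scanned more than {fail} tombstones")
            continue  # deleted data is never returned to the client
        live.append((key, value))
    if tombstones > warn:
        warnings.warn(f"scanned {tombstones} tombstones (threshold {warn})")
    return live

rows = [("BMW", "blue"), ("Audi", TOMBSTONE), ("Benz", "grey")]
live_rows = scan(rows)
print(live_rows)
```

Even though tombstoned rows are filtered out of the result, the query still pays the cost of reading past them, which is what the thresholds guard against.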

Problems of Tombstones

  • This is how cassandra.yaml describes the effect of a large number of tombstones.

A large number of tombstones causes performance problems and puts pressure on the server heap.

  • Cassandra aborts the query if the number of tombstones scanned by a query exceeds tombstone_failure_threshold.

15) Compaction

SSTables are immutable. Mutations (adding new data, updating data, deleting data) are inserted into the memtable, always as new records. The memtable is then periodically flushed to a new SSTable.

When data is updated in Cassandra, the old value and the new value may sit in different SSTables. Cassandra uses timestamps to figure out which value is the most recent. This consumes a lot of disk space, and a query might need to read from several SSTables to produce a result, so reads may be slow. That is why Cassandra needs an operation called compaction.

Compaction reads the existing SSTables and merges the rows, keeping the most recent information, into one SSTable. The basic idea of compaction is to create a new SSTable that replaces the existing ones.

As discussed earlier, SSTables are immutable. If Cassandra needs to update an existing row, it writes a new version of the row, which is later flushed to another SSTable.

During compaction, tombstones older than gc_grace_seconds are permanently removed.
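A minimal Python sketch of compaction, under simplifying assumptions: each SSTable is modelled as a dictionary of key to (value, timestamp), timestamps are in seconds, and a `None` value stands in for a tombstone. Real compaction strategies also decide which SSTables to merge, which this sketch ignores.

```python
TOMBSTONE = None           # a None value marks a deleted row in this sketch
GC_GRACE_SECONDS = 864000  # 10 days, the default mentioned in the text

def compact(sstables, now):
    """Merge SSTables into one, keeping the newest version of each key."""
    newest = {}
    for table in sstables:
        for key, (value, timestamp) in table.items():
            if key not in newest or timestamp > newest[key][1]:
                newest[key] = (value, timestamp)
    merged = {}
    for key in sorted(newest):
        value, timestamp = newest[key]
        expired = value is TOMBSTONE and now - timestamp > GC_GRACE_SECONDS
        if not expired:  # young tombstones must survive the compaction
            merged[key] = (value, timestamp)
    return merged

old = {"BMW": ("blue", 100), "Audi": ("red", 100)}
new = {"BMW": ("black", 200), "Audi": (TOMBSTONE, 200), "Benz": (TOMBSTONE, 900000)}
result = compact([old, new], now=1000000)
print(result)
```

In this run the Audi tombstone is older than gc_grace_seconds and disappears together with the data it shadowed, while the more recent Benz tombstone is kept.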

There are several compaction strategies.

  1. Size Tiered Compaction Strategy- default one

16) Bloom Filter

A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Lookups are extremely fast.

Cassandra uses Bloom filters to check whether a requested partition key may be present in an SSTable without reading the SSTable's data. Bloom filters let us avoid expensive I/O operations.

Each SSTable has a corresponding Bloom filter in memory.
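For readers who have not seen one, here is a small Bloom filter in Python. It is a generic textbook sketch, not Cassandra's implementation: the bit-array size, the number of hash functions and the use of seeded SHA-256 are assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        """Derive num_hashes bit positions from seeded hashes of the key."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))

sstable_filter = BloomFilter()
for car in ["BMW", "Audi", "Toyota"]:
    sstable_filter.add(car)
print(sstable_filter.might_contain("BMW"))
```

A `False` answer lets a read skip the SSTable entirely; a `True` answer only means the SSTable has to be checked.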

17) Cassandra Consistency

This relates to the consistency aspect of the CAP theorem. We can configure the consistency level for a session or for each individual read and write operation.

We can improve consistency using the levels below.

Write consistency level

  • Any- a write must succeed on any available node (highest availability).

quorum nodes = (replication factor / 2) + 1, with the division rounded down

  • Local Quorum- a write must succeed on a quorum of replica nodes in the same data center as the coordinator node.
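The quorum formula uses integer division, which is easy to check with a few lines of Python. The helper names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)

for rf in (1, 2, 3, 5):
    print("RF", rf, "quorum", quorum(rf), "can lose", tolerated_failures(rf))
```

With a replication factor of 3, a quorum is 2 replicas, so one replica can be down without failing quorum reads or writes.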

Read consistency level

  • One- reads from the closest node holding the data.