2016-12-20



The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

What is Cassandra?

Cassandra is a NoSQL database technology. NoSQL stands for "not only SQL" and refers to alternatives to traditional relational database technologies such as MySQL, Oracle, and MSSQL. Apache Cassandra is a distributed database: the database does not live on only one server but is spread across multiple servers. This allows the database to grow almost without limit, since it is no longer dependent on the specifications of a single server.

It is a big data technology which provides massive scalability.

Why Use Cassandra?

The top features that make Cassandra comprehensive and powerful can be summarized as follows:

PROVEN - Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets.

FAULT TOLERANT - Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.

PERFORMANT - Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices.

DECENTRALIZED - There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.

SCALABLE - Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million requests per day), and eBay (over 100 nodes, 250 TB).

DURABLE - Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.

YOU'RE IN CONTROL - Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Handoff and Read Repair.

ELASTIC - Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.

Cassandra Read/Write Mechanism

Now we will discuss how Cassandra's read/write mechanism works. First, let's have a look at the important components.

Following are the key components of Cassandra.

Node – This is the most basic component of Cassandra; it is where data is stored.

Data Center – In the simplest terms, a datacenter is nothing but a collection of nodes. A datacenter can be physical or virtual.

Cluster – A collection of one or more datacenters is termed a cluster.

Commit Log – Every write operation is written to the commit log, which is used for crash recovery. After all of its data has been flushed to SSTables, the commit log can be archived, deleted, or recycled.

Mem-table – A mem-table is a memory-resident data structure. Data is written to the commit log and the mem-table simultaneously. Data stored in mem-tables is temporary; it is flushed to SSTables when a mem-table reaches its configured threshold.

SSTable – This is the disk file to which data is flushed when the mem-table reaches a certain threshold.

Bloom filter – A Bloom filter is a quick, nondeterministic algorithm for testing whether an element is a member of a set. It acts as a special kind of cache and is consulted on every read to avoid unnecessary disk lookups.

Cassandra Keyspace – A keyspace is similar to a schema in the RDBMS world; it is the container for all your application data.

CQL Table – A collection of ordered columns fetched by table row. A table consists of columns and has a primary key (see the sketch just below).
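To make the last two components concrete, here is a minimal sketch of creating a keyspace and a table from the shell with cqlsh once a cluster is running (installation is covered later in this guide). The keyspace name demo_ks, the users table, and the replication factor of 3 are hypothetical examples, not part of the original setup.

# create a keyspace (the container for application data) with a simple replication strategy
cqlsh -e "CREATE KEYSPACE IF NOT EXISTS demo_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"

# create a table in that keyspace; user_id is the primary key
cqlsh -e "CREATE TABLE IF NOT EXISTS demo_ks.users (user_id int PRIMARY KEY, name text, email text);"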

Cassandra Write Operation

Since Cassandra is a distributed database technology where data is spread across the nodes of the cluster and there is no master-slave relationship between the nodes, it is important to understand how data is stored (written) within the database.

Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending with a write of data to disk:

Logging data in the commit log

Writing data to the memtable

Flushing data from the memtable

Storing data on disk in SSTables

When a new piece of data is written, it is written in two places: the mem-table and the commit log on disk (for data durability). The commit log receives every write made to a Cassandra node, and these durable writes survive permanently even if power fails on a node.

Mem-tables are essentially a write-back cache of data partitions. Writes in mem-tables are stored in sorted order, and when a mem-table reaches its threshold, the data is flushed to SSTables.

Flushing data from the Mem-Table

Data from the mem-table is flushed to SSTables in the same order as it was stored in the mem-table. Data is flushed under either of the following two conditions:

When the memtable content exceeds the configurable threshold

When the commit log space exceeds commitlog_total_space_in_mb

If either condition is reached, Cassandra places the memtables in a queue that is flushed to disk. The size of this queue can be configured using the memtable_heap_space_in_mb or memtable_offheap_space_in_mb options in the cassandra.yaml configuration file. A mem-table can also be flushed manually using the command nodetool flush.
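As a quick sketch (assuming the /opt/apache-cassandra install path used later in this guide and the hypothetical keyspace demo_ks), you can inspect these thresholds and trigger a manual flush like this:

# show the flush-related thresholds configured in cassandra.yaml
grep -E "memtable_heap_space_in_mb|memtable_offheap_space_in_mb|commitlog_total_space_in_mb" /opt/apache-cassandra/conf/cassandra.yaml

# flush the memtables of one keyspace to SSTables on disk
/opt/apache-cassandra/bin/nodetool flush demo_ks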

Data in the commit log is purged after its corresponding data in the memtable is flushed to an SSTable on disk.

Storing data on disk in SSTables

Memtables and SSTables are maintained per table, while the commit log is shared among tables. The memtable is flushed to an immutable structure called an SSTable (Sorted String Table).

Every SSTable creates three files on disk

Data (Data.db) – The SSTable data

Primary Index (Index.db) – Index of the row keys with pointers to their positions in the data file

Bloom filter (Filter.db) – A structure stored in memory that checks if row data exists in the memtable before accessing SSTables on disk

Over a period of time, a number of SSTables are created. This results in the need to read multiple SSTables to satisfy a single read request. Compaction is the process of combining SSTables so that related data can be found in a single SSTable. This makes reads much faster.
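To see this in practice, you can watch compactions that are currently running and, if needed, trigger a major compaction yourself. A brief sketch (the keyspace demo_ks is hypothetical, and forcing a major compaction should be done sparingly on production clusters):

# show compactions currently in progress
/opt/apache-cassandra/bin/nodetool compactionstats

# force a major compaction of the SSTables belonging to one keyspace
/opt/apache-cassandra/bin/nodetool compact demo_ks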

The commit log is used for playback in case data from the memtable is lost due to node failure, for example if the node has a power outage or someone accidentally shuts it down before the memtable can be flushed.



Read Operation

Cassandra processes data at several stages on the read path to discover where the data is stored, starting with the memtable and finishing with SSTables; a tracing example follows this list:

Check the memtable

Check the row cache, if enabled

Check the Bloom filter

Check the partition key cache, if enabled

Go directly to the compression offset map if a partition key is found in the partition key cache, or check the partition summary if not

If the partition summary is checked, then the partition index is accessed

Locate the data on disk using the compression offset map

Fetch the data from the SSTable on disk
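One way to observe these stages is query tracing, which prints the steps the coordinator and replicas perform for a single read. A minimal sketch, reusing the hypothetical demo_ks.users table from earlier (depending on your cqlsh version you may prefer to run the two statements in an interactive cqlsh session):

# enable tracing and run a read; the trace shows memtable, Bloom filter and SSTable activity
cqlsh -e "TRACING ON; SELECT * FROM demo_ks.users WHERE user_id = 1;"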



There are three types of read requests that a coordinator sends to replicas.

Direct request

Digest request

Read repair request

The coordinator sends a direct request to one of the replicas. It then sends a digest request to a number of replicas determined by the consistency level and checks whether the returned data is up to date.

After that, the coordinator sends digest requests to all the remaining replicas. If any node returns an out-of-date value, a background read repair request updates that data. This process is called the read repair mechanism.
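How many replicas must answer before a result is returned is controlled by the consistency level of the request. A hedged sketch, again using the hypothetical demo_ks.users table (run the statements interactively if your cqlsh version does not accept special commands with -e):

# require a quorum of replicas to respond before the read succeeds
cqlsh -e "CONSISTENCY QUORUM; SELECT * FROM demo_ks.users WHERE user_id = 1;"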

Installing Cassandra on Red Hat Enterprise Linux 6

Before we start with the installation, let's first discuss a few concepts that will help you understand the installation process.

Bootstrapping - Bootstrapping is the process in which a newly joining node gets the data it will be responsible for from its neighbors in the ring, so that it can join the ring with that data. Typically, a bootstrapping node joins the ring without any state or token, learns the ring structure after starting gossip with the seed nodes, and then chooses a token to bootstrap.

During the bootstrap, the bootstrapping node will receive writes for the range that it will be responsible for after the bootstrap is completed. These additional writes ensure that the node doesn't miss any new data written between the point when streaming was requested and the point at which the node comes online.

Seed Nodes - The seed node designation has no purpose other than bootstrapping the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they have any other special purpose in cluster operations beyond the bootstrapping of nodes.

In cluster formation, nodes see each other and “join”. They do not join just any node that speaks the protocol, however; that would be risky: old partitioned replicas, different clusters, even malicious nodes, and so on. So a cluster is defined by some initial nodes that are available at well-known addresses, and they become a reference point for that cluster so that any new node can join in a trustworthy way. The seed nodes can go away after some time, and the cluster will keep running.

Prerequisites

Static IP addresses on all 4 nodes; make sure all 4 nodes can reach each other via hostname/IP.

The hostname should be defined in the /etc/sysconfig/network file.

Update the /etc/hosts file accordingly.

Since we are working in a lab environment, the following hostnames and IP addresses will be used throughout this guide, and we will define the Cassandra nodes in the /etc/hosts file as shown below. You can update the /etc/hosts file by executing the following command on each node.

vi /etc/hosts

192.168.10.70 cassdb01 #SEED-node
192.168.10.71 cassdb02 #Worker-node1
192.168.10.72 cassdb03 #Worker-node2
192.168.10.73 cassdb04 #Worker-node3

Save and exit

Open Firewall Ports

If you are using iptables, you need to allow access to Cassandra's port 7000 (inter-node communication) and port 9160 (Thrift client API) by running the following commands on each node.

iptables -A INPUT -p tcp --dport 7000 -j ACCEPT
iptables -A INPUT -p tcp --dport 9160 -j ACCEPT
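If your client applications connect over the native CQL protocol (port 9042, which appears in the startup log later in this guide) or you want to run nodetool against remote nodes over JMX (port 7199, used later in this guide), you may also want to open those ports; for example:

iptables -A INPUT -p tcp --dport 9042 -j ACCEPT
iptables -A INPUT -p tcp --dport 7199 -j ACCEPT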

Create Cassandra user with sudo permissions.

You can download and use the following script to create a user with sudo permissions on the server.

The following steps will be performed on cassdb01 node.

wget https://raw.githubusercontent.com/zubayr/create_user_script/master/create_user_script.sh
chmod 777 create_user_script.sh
sh create_user_script.sh -s cassandra

When you are done with the above commands, verify that the cassandra user and group have been created on the server with the following commands.

cat /etc/passwd | grep cassandra

Output
cassandra:x:501:501::/home/cassandra:/bin/bash

cat /etc/group | grep cassandra

Output
cassandra:x:501:

Installing Java

We need to install Oracle Java (JDK or JRE) version 7 or greater and define JAVA_HOME accordingly. You can install Java with the rpm-based installer or using the tar file.

Cassandra 3.0 and later require Java 8u40 or later. I'll be installing the rpm package.

rpm -Uvh jdk-8u111-linux-x64.rpm

Note: If you have OpenJDK installed on your system, remove it before installing Oracle Java.
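If JAVA_HOME is not set yet, you can append the exports to the user's ~/.bash_profile. A minimal sketch, assuming the rpm installed Java under /usr/java/jdk1.8.0_111 and the /opt/apache-cassandra install location used later in this guide:

# add Java and Cassandra to the environment (adjust paths to your system)
cat >> ~/.bash_profile << 'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_111
export CASSANDRA_HOME=/opt/apache-cassandra
export PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$CASSANDRA_HOME/bin
EOF
source ~/.bash_profile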

Verify that JAVA_HOME is set correctly and that you get output similar to the following, along with output from the java -version command.

cat .bash_profile | grep JAVA_HOME

Output
JAVA_HOME=/usr/java/jdk1.8.0_111
PATH=$PATH:$HOME/bin/:$JAVA_HOME/bin:$CASSANDRA_HOME/bin
export PATH JAVA_HOME CASSANDRA_HOME

java -version

Output
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Install and Configure Cassandra

The latest Cassandra version can be found on the Cassandra home page. Download and extract the apache-cassandra tar.gz file into a directory of your choice. I used /opt as the destination directory.

tar -zxvf apache-cassandra-3.9-bin.tar.gz -C /opt
ln -s /opt/apache-cassandra-3.9 /opt/apache-cassandra
chown cassandra:cassandra -R /opt/apache-cassandra
chown cassandra:cassandra -R /opt/apache-cassandra-3.9

Now, create the necessary directories (for Cassandra to store data) and assign ownership of those directories.

mkdir -p /var/lib/cassandra/data
mkdir -p /var/log/cassandra
mkdir -p /var/lib/cassandra/commitlog
chown -R cassandra:cassandra /var/lib/cassandra/data
chown -R cassandra:cassandra /var/log/cassandra/
chown -R cassandra:cassandra /var/lib/cassandra/commitlog
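A tarball install stores data under the installation directory by default; if you want Cassandra to use the directories created above instead, you can point /opt/apache-cassandra/conf/cassandra.yaml at them, for example:

data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog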

Start the Cassandra service by executing the following command

$CASSANDRA_HOME/bin/cassandra -f -R

You will see the messages below on the command prompt, which show that Cassandra has started without any issues.

Output
INFO 11:31:15 Starting listening for CQL clients on localhost/127.0.0.1:9042 (unencrypted)...
INFO 11:31:15 Not starting RPC server as requested. Use JMX (StorageService->startRPCServer()) or nodetool (enablethrift) to start it
INFO 11:31:24 Scheduling approximate time-check task with a precision of 10 milliseconds
INFO 11:31:25 Created default superuser role 'cassandra'

If you want to run Cassandra as a service, you can use this script from GitHub. Change the values of the following variables as per your environment.

CASS_HOME=/opt/apache-cassandra
CASS_BIN=$CASS_HOME/bin/cassandra
CASS_LOG=/var/log/cassandra/system.log
CASS_USER="root"
CASS_PID=/var/run/cassandra.pid

Save the file in /etc/init.d directory.

Now execute the following commands to add Cassandra as a service:

chmod +x /etc/init.d/cassandra
chkconfig --add cassandra
chkconfig cassandra on

Start the Cassandra service and verify its status, then check the system.log file.

service cassandra status

Output
Cassandra is running.

The system.log file contains the following line on my system, which means all is well.

INFO  12:45:50 Node localhost/127.0.0.1 state jump to NORMAL

You can also verify using nodetool; its output shows the node as Up and Normal (UN).
$CASSANDRA_HOME/bin/nodetool status

Output
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 168.62 KiB 256 100.0% 14ba62c6-59e4-404b-a6a6-30c9503ef3a4 rack1

Adding Node To Cassandra Cluster

Before we start installing Apache Cassandra on the remaining nodes, we need to make the following configuration changes in the cassandra.yaml file.

Navigate to the cassandra.yaml file, which is located under the cassandra_install_dir/conf folder. Open the file in an editor of your choice and look for the following options:

listen_address: The address gossip will listen on. This address cannot be localhost or 0.0.0.0, because the other nodes will try to connect to this address.

rpc_address: This is the address Thrift will listen on. We must put an existing IP address here (it may be localhost if we want), or 0.0.0.0 to listen on all interfaces. This is the address that client applications use to interact with the Cassandra database.

seeds: Seed nodes provide cluster information to new nodes that are bootstrapping and are ready to join the cluster. Seed nodes become a reference for any new node to join the cluster in a trustworthy way.

The above settings need to be configured in the cassandra.yaml file on each node that we want to put into the cluster.
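For the lab used in this guide, the relevant lines in cassandra.yaml on the seed node cassdb01 (192.168.10.70) would look roughly like the following sketch; the cluster name matches the default 'Test Cluster' that nodetool describecluster reports later:

cluster_name: 'Test Cluster'
listen_address: 192.168.10.70
rpc_address: 192.168.10.70
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.10.70"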

Note: You must install the same version of Cassandra on the remaining nodes in the cluster.

Adding new nodes in Cassandra cluster

Install Cassandra on the new nodes, but do not start Cassandra service.

Set the following properties in the cassandra.yaml file and, depending on the snitch, the cassandra-topology.properties or cassandra-rackdc.properties configuration files:

auto_bootstrap – This property is not listed in the default cassandra.yaml configuration file, but it might have been added and set to false by other operations. If it is not defined in cassandra.yaml, Cassandra uses true as the default value. For this operation, search for this property in the cassandra.yaml file; if it is present, set it to true or delete it.

cluster_name – The name of the cluster the new node is joining. Ensure that the cluster name is the same on all nodes that will be part of the cluster.

listen_address – Can usually be left blank. Otherwise, use IP address or hostname that other Cassandra nodes use to connect to the new node.

endpoint_snitch – The snitch Cassandra uses for locating nodes and routing requests. In my lab I am using the SimpleSnitch, which is the default in the cassandra.yaml file, so I did not change it.

num_tokens – The number of vnodes to assign to the node. If the hardware capabilities vary among the nodes in your cluster, you can assign a proportional number of vnodes to the larger machines.

seeds – Determines which nodes the new node contacts to learn about the cluster and establish the gossip process. Make sure that the seeds list includes the address of at least one node in the existing cluster.

The installation and configuration steps remain the same as those we already performed on cassdb01. Once you have installed Cassandra and made the configuration changes mentioned above on your second node, start the Cassandra service on it, watch nodetool status on the first node, and you will observe the new node joining the cluster.
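A brief sketch of that sequence (paths and addresses follow the lab setup above; adjust them to your environment):

# on cassdb02: set cluster_name, listen_address/rpc_address to 192.168.10.71 and seeds to "192.168.10.70" in cassandra.yaml, then start Cassandra
/opt/apache-cassandra/bin/cassandra -R

# on cassdb01: watch the new node go from UJ (Up/Joining) to UN (Up/Normal)
watch /opt/apache-cassandra/bin/nodetool status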

Nodetool Status Output

You can see in the output below that our second node has been added to the cluster and is up and running.

Every 2.0s: /opt/apache-cassandra/bin/nodetool status Thu 20 23:31:44 2016
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address               Load           Tokens Owns (effective) Host ID Rack
UN 192.168.10.70 214.99 KiB  256 100.0%    14ba62c6-59e4-404b-a6a6-30c9503ef3a4   rack1
UN 192.168.10.71 103.47 KiB  256 100.0%    3b19bc83-f483-4a60-82e4-109c90c49a14   rack1

You need to repeat the same steps for each node which you want to add in Cassandra cluster.

How Nodetool Works

We will now explain nodetool and see how we can manage a Cassandra cluster using the nodetool utility.

The nodetool utility is a command line interface for managing a cluster. It exposes the operations and attributes available in Cassandra. There are hundreds of options available with the nodetool utility, but we will cover only those which are used most often.

Nodetool version: This provides the version of Cassandra running on the specified node.

[root@cassdb01 ~]# nodetool version
ReleaseVersion: 3.9

Nodetool status: This is one of the most common commands you will use in a Cassandra cluster. It provides information about the cluster, such as state, load, and IDs. It also tells you the name of the datacenter where your nodes reside and what their state is.

The state 'UN' refers to Up and Normal. When a new node is added to the cluster, you might see its state as 'UJ', which means the node is up and in the process of joining the cluster.

Nodetool status gives you the IP address of each of your nodes and also what percentage of the load each node owns. It is not necessary that each node owns exactly the same percentage of load. For example, in a 4-node cluster it is not necessary that each node owns exactly 25% of the total load; one node might own 30% and another 22% or so. There should not, however, be a large difference in the percentage of load owned by each node.

[root@cassdb04 ~]# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.10.72 175.92 KiB 256 48.8% 32da4275-0b20-4805-ab3e-2067f3b2b32b rack1
UN 192.168.10.73 124.63 KiB 256 50.0% c04ac5dd-02db-420c-9933-181b99848c4f rack1
UN 192.168.10.70 298.79 KiB 256 50.8% 14ba62c6-59e4-404b-a6a6-30c9503ef3a4 rack1
UN 192.168.10.71 240.57 KiB 256 50.4% 3b19bc83-f483-4a60-82e4-109c90c49a14 rack1

Nodetool info: This returns information about a specific node. In the output below you can see the gossip state (whether it is active or not), the load on the node, and the rack and datacenter where the node is placed.

[root@cassdb04 ~]# nodetool info
ID : c04ac5dd-02db-420c-9933-181b99848c4f
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 124.63 KiB
Generation No : 1484323575
Uptime (seconds) : 1285
Heap Memory (MB) : 65.46 / 1976.00
Off Heap Memory (MB) : 0.00
Data Center : datacenter1
Rack : rack1
Exceptions : 0
Key Cache : entries 11, size 888 bytes, capacity 98 MiB, 323 hits, 370 requests, 0.873 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 49 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 16, size 1 MiB, capacity 462 MiB, 60 misses, 533 requests, 0.887 recent hit rate, 70.149 microseconds miss latency
Percent Repaired : 100.0%
Token : (invoke with -T/--tokens to see all 256 tokens)

Note: To query information from a remote node, you can use the -h and -p switches with the nodetool info command. -h takes the IP/FQDN of the remote node and -p is the JMX port.

[root@cassdb04 ~]# nodetool -h 192.168.10.70 -p 7199 info
ID : 14ba62c6-59e4-404b-a6a6-30c9503ef3a4
Gossip active : true
Thrift active : false
Native Transport active: true
Load : 198.57 KiB
Generation No : 1484589468
Uptime (seconds) : 165
Heap Memory (MB) : 91.97 / 1986.00
Off Heap Memory (MB) : 0.00
Data Center : datacenter1
Rack : rack1
Exceptions : 0
Key Cache : entries 17, size 1.37 KiB, capacity 99 MiB, 71 hits, 102 requests, 0.696 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 49 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache : entries 12, size 768 KiB, capacity 464 MiB, 78 misses, 230 requests, 0.661 recent hit rate, 412.649 microseconds miss latency
Percent Repaired : 100.0%
Token : (invoke with -T/--tokens to see all 256 tokens)

Nodetool describecluster: This command gives you the name of the Cassandra cluster, the partitioner used in the cluster, the type of snitch being used, and the schema versions.

[root@cassdb01 ~]# nodetool describecluster
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
86afa796-d883-3932-aa73-6b017cef0d19: [192.168.10.72, 192.168.10.73, 192.168.10.70, 192.168.10.71]

Nodetool ring: This command tells you which node is responsible for which token ranges. If you are using virtual nodes (vnodes), each node will be responsible for 256 token ranges by default. This command produces a very lengthy output, as it displays every token associated with each node.

[root@cassdb04 ~]# nodetool ring

Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token
9209474870556602003
192.168.10.70 rack1 Up Normal 240.98 KiB 50.81% -9209386221367757374
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -9194836959115518616
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -9189566362031437022
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -9173836129733051192
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -9164925147537642235
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -9140745004897827128
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -9139635271358393037
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -9119385776093381962
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -9109674978522278948
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -9091325795617772970
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -9063930024148859956
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -9038394199082806631
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -9023437686068220058
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -9021385173053652727
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -9008429834541495946
192.168.10.70 rack1 Up Normal 240.98 KiB 50.81% -9003901886367509605
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -8981251185746444704
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -8976243976974462778
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -8914749982949440380
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -8896728810258422258
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -8889132896797497885
192.168.10.73 rack1 Up Normal 124.63 KiB 49.99% -8883470066211443416
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -8872886845775707512
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -8872853960586482247
192.168.10.72 rack1 Up Normal 175.92 KiB 48.80% -8842804282688091715
192.168.10.71 rack1 Up Normal 240.57 KiB 50.40% -8836328750414937464
192.168.10.70 rack1 Up Normal 240.98 KiB 50.81% -8818194298147545683

Nodetool cleanup: Nodetool cleanup is used to remove data from a node that the node is no longer responsible for.

When a node auto-bootstraps, the data is not removed from the node that was previously responsible for it. This is so that, if the new node were to go down shortly after coming online, the data would still exist.

The command to do data cleanup is as below.

[root@cassdb01 ~]# nodetool cleanup
WARN 16:54:29 Small cdc volume detected at /opt/apache-cassandra/data/cdc_raw; setting cdc_total_space_in_mb to 613. You can override this in cassandra.yaml

WARN 16:54:29 Only 52.796GiB free across all data volumes. Consider adding more capacity to your cluster or removing obsolete snapshots

Note: To remove data from a remote node, modify the cleanup command as shown below.

[root@cassdb01 ~]# nodetool -h 192.168.10.72 cleanup

To see what this command does, monitor nodetool status; you will see the load decreasing on the node where cleanup was run.

Conclusion

We have successfully completed the Apache Cassandra installation on all 4 nodes and added them to the Cassandra cluster. I hope this guide will be useful for installing and setting up your own Apache Cassandra cluster in your production environment.
