Apache Cassandra is an open-source, non-relational (NoSQL) database. It is highly scalable and can handle large amounts of data across multiple servers (here we will use Amazon EC2 instances), which ensures high availability. To provide reliability and fault tolerance, we will replicate data across multiple Availability Zones (AZs) and see how data remains available even if an entire AZ goes down.
The initial setup consists of a Cassandra cluster of six nodes (EC2 instances) spread across the Availability Zones us-east-1a, us-east-1b, and us-east-1c.
Initial Setup
The Cassandra cluster has six nodes:
AZ-1a (us-east-1a): Node 1, Node 2
AZ-1b (us-east-1b): Node 3, Node 4
AZ-1c (us-east-1c): Node 5, Node 6
Next, we need to make changes to the Cassandra configuration. The main configuration file is cassandra.yaml; it controls how the nodes within a cluster are configured, including inter-node communication and replica placement. The key value we need to set here is the snitch. A snitch tells Cassandra which region and Availability Zone each cluster member belongs to. It provides information about the network topology so that requests are routed efficiently, and Cassandra's replication strategies place replicas based on the information the snitch provides. There are several types of snitches; we will use Ec2Snitch here, since all our cluster nodes are located within a single region.
The snitch value will be set as shown below.
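A minimal excerpt of cassandra.yaml with this setting might look like the following (the cluster name is an assumption; it only needs to be identical on every node):

```
# cassandra.yaml (excerpt) -- applied identically on all six nodes
cluster_name: 'MultiAZCluster'     # assumed name; must match on every node
endpoint_snitch: Ec2Snitch         # reads the region and AZ from EC2 instance metadata
```

With Ec2Snitch, the region (us-east) is treated as the datacenter and the Availability Zone (1a, 1b, 1c) as the rack.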
Because we are running multiple nodes, we also need to group them into a single cluster. This is done by setting the seeds key in the cassandra.yaml configuration file. Cassandra nodes use the seed list to find each other and learn the topology of the ring; it is what a node uses to discover the cluster during startup.
For example, on Node 1 the seeds value is set as shown below; the same procedure is applied on Node 2 through Node 6.
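A sketch of the corresponding cassandra.yaml section (the private IP addresses are hypothetical; the same list is used on every node):

```
# cassandra.yaml (excerpt) -- seed list, identical on all nodes
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # hypothetical private IPs: one seed node per Availability Zone
          - seeds: "10.0.1.10,10.0.2.10,10.0.3.10"
```

Listing at least one seed per AZ means a joining node can still discover the cluster even if one AZ is unreachable.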
Cassandra nodes use this list of hosts to find each other and learn the topology of the ring. The nodetool utility provides a command-line interface for managing a cluster, and its status command checks the state of the cluster. The Owns field indicates how much of the data each node owns; at this point it is empty because no keyspaces (databases) exist yet.

Let's create a sample keyspace. A keyspace is created with a data replication strategy and a replication factor. The replication strategy determines the nodes on which replicas are placed, and the replication factor is the number of replicas kept in the cluster. Because our cluster is distributed across multiple Availability Zones, we will use the NetworkTopologyStrategy replication strategy. NetworkTopologyStrategy places replicas on distinct racks/AZs, because nodes in the same rack/AZ can fail at the same time due to power, cooling, or network issues.
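Before creating the keyspace, we can check the cluster from any node with nodetool (shipped with a standard Cassandra installation):

```
# Lists each node's datacenter, rack (AZ), load, tokens and ownership.
# With no keyspaces created yet, the Owns column shows no meaningful value.
nodetool status
```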
Let's set the replication factor to 3 for our keyspace, "first".
```
CREATE KEYSPACE "first"
  WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'us-east' : 3 };
```

The above CQL command creates a keyspace "first" with the class NetworkTopologyStrategy and 3 replicas in us-east (in this case, one replica in AZ/rack 1a, one in AZ/rack 1b, and one in AZ/rack 1c). Cassandra provides the Cassandra Query Language Shell, also known as cqlsh, which lets users communicate with the cluster. Using cqlsh, you can execute queries written in the Cassandra Query Language (CQL).
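For example, you can connect to any node with cqlsh and confirm that the keyspace was created (the node IP is a placeholder):

```
# Connect to one of the cluster nodes
cqlsh 10.0.1.10

# Inside the shell, list keyspaces and inspect the replication settings
cqlsh> DESCRIBE KEYSPACES;
cqlsh> DESCRIBE KEYSPACE "first";
```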
Next, we will create a user table and insert five records into it for testing.
```
CREATE TABLE user (
  user_id text,
  login   text,
  region  text,
  PRIMARY KEY (user_id)
);
```
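A sketch of the five test records (the values are made-up placeholders, not data from the original setup):

```
-- Run inside the "first" keyspace (e.g. after USE "first";)
INSERT INTO user (user_id, login, region) VALUES ('u1', 'alice', 'us-east-1a');
INSERT INTO user (user_id, login, region) VALUES ('u2', 'bob',   'us-east-1b');
INSERT INTO user (user_id, login, region) VALUES ('u3', 'carol', 'us-east-1c');
INSERT INTO user (user_id, login, region) VALUES ('u4', 'dave',  'us-east-1a');
INSERT INTO user (user_id, login, region) VALUES ('u5', 'erin',  'us-east-1b');

-- Verify the rows were written
SELECT * FROM user;
```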