What is Apache Cassandra®?
Apache Cassandra® is an open source, non-relational (NoSQL) database that enables continuous availability, tremendous scale, and data distribution across multiple data centers and cloud availability zones.
Simply put, Cassandra provides a highly reliable data storage engine for applications requiring immense scale.
Open Source Apache Cassandra®
Apache Cassandra was originally developed at Facebook, which released it as an open source project on Google Code in 2008.
In 2010, it became a top-level Apache project.
The open source version of the Cassandra database is used by some of the largest technology companies in the world to run mission-critical applications.
The largest deployment of the open source version of the Cassandra database is widely reported to be at Apple. Netflix is also a very large user of open source Apache Cassandra, a foundation for big data. It is estimated that over 50% of Fortune 500 companies have deployed Cassandra.
How Cassandra Works
Cassandra was designed with scale, performance, and continuous availability as its foundational architectural principles. It uses a masterless ring architecture and does not rely on a master-slave relationship.
In Cassandra, all nodes play an identical role. There is no master node, and all nodes communicate with each other through a distributed, scalable gossip protocol.
Writes are distributed among nodes using a hash function, and reads are routed to the specific nodes that hold the requested data.
Cassandra stores data by dividing the data evenly around its cluster of nodes. Each node is responsible for part of the data. The act of distributing data across nodes is referred to as data partitioning.
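As a rough illustration of hash-based data partitioning, the sketch below places keys on a ring of nodes. It is a simplified model, not Cassandra's implementation: the `TokenRing` class, the `token_for` helper, the node names, and the use of MD5 (chosen only because it ships with the Python standard library) are all assumptions made for this example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a token on the ring (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    """Toy ring: each node owns the token range that ends at its own token."""

    def __init__(self, nodes):
        # Hypothetical scheme: derive one token per node by hashing its name.
        self._ring = sorted((token_for(name), name) for name in nodes)
        self._tokens = [token for token, _ in self._ring]

    def owner(self, key: str) -> str:
        """Return the node whose token is the first one >= the key's token."""
        idx = bisect.bisect_left(self._tokens, token_for(key))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = TokenRing(["node-a", "node-b", "node-c"])
placement = {key: ring.owner(key) for key in ("user:1", "user:2", "user:3", "user:4")}
for key, node in placement.items():
    print(f"{key} -> {node}")
```

Because the owner is computed from the key alone, any node can work out where a given key lives without consulting a central coordinator.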
Cassandra has a built-for-scale architecture: it can handle large amounts of data and millions of concurrent users or operations per second, even across multiple data centers, as easily as it can manage much smaller amounts of data and user traffic. To add more capacity, you simply add new nodes to an existing cluster without having to take it down first.
Unlike master-slave or sharded systems, Cassandra has no single point of failure and can therefore offer true continuous availability and uptime.
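One reason a ring-style layout lets a cluster grow while it stays online can be shown with a small sketch: when a node joins, only the keys that fall into the new node's range change owner. The `owner` and `token_for` functions and the node names below are hypothetical, and the model uses a single token per node for brevity.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Hash a string to a ring position (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def owner(nodes, key):
    """Return the node that owns `key` on a ring built from `nodes`."""
    ring = sorted((token_for(name), name) for name in nodes)
    tokens = [token for token, _ in ring]
    idx = bisect.bisect_left(tokens, token_for(key))
    return ring[idx % len(ring)][1]  # wrap around the ring


keys = [f"user:{i}" for i in range(1000)]
before = ["node-a", "node-b", "node-c"]
after = before + ["node-d"]  # capacity added to the running cluster

# Keys whose owner changed when node-d joined the ring.
moved = [key for key in keys if owner(before, key) != owner(after, key)]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

In this model, every key that moves lands on the new node, and the existing nodes keep serving the rest of their data unchanged.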
The key components of the Cassandra architecture include the following terms and concepts:
- Node: the specific instance where data is stored.
- Rack: a set of nodes with a correlated chance of failure.
- Datacenter: a collection of related nodes with a complete set of data.
- Cluster: a component that contains one or more datacenters.
- Commit log: a crash-recovery mechanism. Every write operation is written to the commit log.
- Mem-table: a memory-resident data structure. It acts as a write-back cache, holding writes that have not yet been flushed to disk.
- SSTable: a Sorted String Table (SSTable) is an ordered, immutable key-value map. It is an efficient way of storing large sorted data segments in a file.
- Bloom filter: an extremely fast way to test whether an item is a member of a set. A Bloom filter can tell whether an item might exist in a set or definitely does not exist in it. Bloom filters are a good way of avoiding expensive I/O operations.
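The way the commit log, mem-table, SSTable, and Bloom filter fit together can be sketched with a toy single-node store. This is a minimal model under stated assumptions, not Cassandra's storage engine: the `ToyStore` and `BloomFilter` classes, the flush threshold, and the filter sizing are all hypothetical, everything lives in memory, and details such as compaction and commit log cleanup are left out.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToyStore:
    """Hypothetical store: commit log, then mem-table, then SSTables."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory writes not yet flushed
        self.sstables = []     # (bloom filter, sorted immutable rows) pairs
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # record the write first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        rows = tuple(sorted(self.memtable.items()))  # sorted and immutable
        self.sstables.append((bloom, rows))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, rows in reversed(self.sstables):  # newest table first
            if not bloom.might_contain(key):
                continue  # definitely absent, so skip scanning this table
            for row_key, value in rows:
                if row_key == key:
                    return value
        return None


store = ToyStore()
store.write("a", 1)
store.write("b", 2)  # reaches the threshold and triggers a flush
store.write("c", 3)  # stays in the mem-table
print(store.read("a"), store.read("c"), store.read("missing"))
```

In this sketch a read checks the mem-table first and consults each table's Bloom filter before scanning it, which mirrors the idea of avoiding expensive I/O for keys that are definitely not present.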
NB: This article originally appeared in a slightly longer form. Instaclustr®, one of the founding members of the OpenSI, has kindly given permission for its reuse.
December 10, 2021