
What is Apache Kafka®?
Apache Kafka® is the leading distributed streaming and queuing technology for large-scale, always-on applications. Kafka has built-in features of horizontal scalability, high-throughput, and low-latency. It is highly reliable, has high-availability, and allows geographically distributed data streams and stream processing applications.
Kafka has garnered a lot of interest since 2011 and is now being broadly used by many organisations across the globe including Netflix, Twitter, Spotify, and Uber.
Kafka has a similar shared-nothing, replicated architecture to Apache Cassandra that allows it to operate with similar extreme levels of scalability, reliability, and availability. In any Big Data application, Kafka has three core functions:
- Message Transport – Enabling transportation of data between various publisher and subscriber endpoints.
- Message Aggregation – Aggregating a number of various data streams for use by distributed processing applications.
- Message Store – Storing data streams as a cache in a replicated, fault-tolerant storage environment.
Use of Messaging Queuing Technology
Messaging queuing technology has been in use for some time now—enabling different applications and endpoints to communicate with each other. These applications can be on the same or different physical platforms. With queuing technology, the producer and consumer do not need to interact with the message queue at the same time. Messages placed onto the queue are stored until the consumer retrieves them.
Kafka is part of the general family of technologies that provide queuing, messaging, and/or streaming—sometimes known as a streaming engine. Fundamentally it is a system that takes streams of messages from applications known as producers, stores them reliably on a central cluster (known as brokers) and allows those streams to be received by applications that process the messages (applications called consumers).
Apache Kafka® Advantages
Most Message Service brokers either don’t persist messages at all or only store them until they are consumed and acknowledged. Kafka has a completely different model where it stores all messages before and even after they are successfully received by subscribing applications.
Kafka comes with the following advantages:
- Highly scalable
- Fault-tolerant
- Highly reliable due to built-in replication, supporting true always-on operation
- Handles high-velocity and high-volume data
- Supports message throughput of thousands of messages per second
- Little or no downtime
- Rich ecosystem including many connectors
- Integrates with external stream processing layers
- Strong open source community is supporting it
- Powers some of the biggest organisations across the globe.
Apache Kafka® Architecture
Kafka’s main architectural components include Producers, Topics, Consumers, Consumer Groups, Clusters, Brokers, Partitions, Replicas, Leaders, and Followers.
Important aspects to note:
- Kafka clusters can have one or more brokers.
- Brokers can host multiple replicas.
- Topics can have one or more partitions.
- A broker can host zero or one replica per partition.
- A partition has one leader replica and zero or more follower replicas.
Kafka enables developers to meet important requirements for applications such as:
- Safely dealing with spikes in workloads by temporarily storing data until it’s ready for processing;
- Allowing data from one source to be processed by many downstream applications;
- Allowing data from many sources to be gathered in a single sport for processing by downstream applications; and
- Facilitating stream processing where calculations can be performed on the streams of data.
Apache Kafka® Use Cases
Kafka can be used for the variety of use cases such as generating matrix, log aggregation, messaging, audit trail, stream processing, website activity tracking, monitoring, and more.
For example in event sourcing, you consider the sequence of changes made (as opposed to the result of those changes) to be the source of truth for the state of your application. Kafka is really useful for this use case, as it is designed for reliably storing a series of events and can provide an ideal data store for this purpose.
NB: This article originally appeared in a slightly longer form and permission has been kindly given by Instaclustr®, one of the founding members of the OpenSI, for it reuse.