Integrating systems that grow larger every day is a complex task. Apache Kafka is software that tries to solve this by using events. This article is an Apache Kafka introduction, so that you get an understanding of what it is and how to get started with it.
Why are events different from messages?
You might think that events sound an awful lot like messages, but there are some important differences.
- With events, we can guarantee the ordering, which we can’t always rely on with messages.
- Scalability. This is of course not always true, but it is usually easier to achieve horizontal scaling with an event stream.
- Event streams allow us to use polling instead of push. This is quite a big deal: what if one of many consumers is a lot slower or faster than the others? With polling, each consumer can process data in a reactive way and at its own pace, which makes it easier to handle backpressure.
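To make the polling point a bit more concrete, here is a minimal sketch of two consumers that pull from the same stream at different speeds. It is a toy model in plain Python, not the Kafka client API; the names PollingConsumer, max_batch and position are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PollingConsumer:
    """Toy consumer that pulls events at its own pace (not a Kafka client)."""
    name: str
    max_batch: int            # how many events this consumer can handle per poll
    position: int = 0         # index of the next event to read
    seen: list = field(default_factory=list)

    def poll(self, stream):
        # Pull at most max_batch events, starting where we left off.
        batch = stream[self.position:self.position + self.max_batch]
        self.position += len(batch)
        self.seen.extend(batch)
        return batch


stream = [f"event-{i}" for i in range(6)]
fast = PollingConsumer("fast", max_batch=4)
slow = PollingConsumer("slow", max_batch=1)

# One round of polling: each consumer takes only what it can handle.
fast_batch = fast.poll(stream)
slow_batch = slow.poll(stream)
print(fast.name, fast_batch)
print(slow.name, slow_batch)
```

Because each consumer keeps its own position, the slow consumer is never flooded and the fast one never has to wait for it.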
Apache Kafka Introduction
Apache Kafka is an event ledger that we can feed with data as events, and different systems can then integrate and consume these events. Think of it as a distributed commit log to which you only ever append data.
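The commit log idea can be sketched in a few lines. This is only an in-memory illustration of "append only, read from any position"; the CommitLog class and its method names are hypothetical and have nothing to do with how Kafka is implemented internally.

```python
class CommitLog:
    """Toy append-only log: entries can be added and read, never changed."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        # New data always goes at the end; return its position in the log.
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, position):
        # Reading does not remove anything, so many systems can read the same data.
        return tuple(self._entries[position:])


log = CommitLog()
first = log.append("order-created")
second = log.append("order-paid")

print(log.read_from(0))       # everything that has happened so far
print(log.read_from(second))  # only the events from the second one onwards
```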
Two of the biggest benefits of Apache Kafka are its speed and scalability. It scales linearly, meaning that you can very easily scale up by adding more nodes to the cluster. This is what is referred to as horizontal scaling. You don’t need to buy more expensive and powerful hardware; instead, you simply add more nodes.
As mentioned, Apache Kafka is also incredibly fast. Jay Kreps wrote a really interesting post, Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines), that I recommend reading. Apache Kafka is heavily optimized and it is quick for many reasons. A few of them are:
- Rather than copying data through the application layer, Apache Kafka asks the OS kernel to move data directly from the page cache to the network socket (zero-copy).
- Records are immutable and only ever appended, which means that the disk can be accessed in a sequential manner and random I/O operations can be avoided.
- Data is batched into chunks, which minimizes the cost of cross-machine latency.
All of these optimizations mean that Apache Kafka operates at nearly network speed, and they are how Jay Kreps achieved 2 million writes per second in his blog post.
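As a rough illustration of why batching helps, the sketch below counts how many network sends would be needed with and without batching. The batches helper and the batch size of 4 are assumptions chosen for the example; real Kafka producers decide batch boundaries by their own configuration.

```python
def batches(records, batch_size):
    """Yield consecutive chunks of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = [f"record-{i}" for i in range(10)]

# Without batching, every record is its own round trip.
unbatched_sends = len(records)

# With batching, one round trip carries a whole chunk.
chunks = list(batches(records, batch_size=4))
batched_sends = len(chunks)

print(unbatched_sends, "sends without batching")
print(batched_sends, "sends with batching")
```

The per-send latency is paid once per chunk instead of once per record.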
When should you use Apache Kafka?
Apache Kafka is usually used as an integration hub between multiple servers and services. If your system includes multiple servers or services that need to integrate with each other, then you could probably benefit from Apache Kafka. Kafka makes complete sense in a microservices environment.
Kafka is also a popular tool for big data ingestion. Sometimes you need to buffer large amounts of data, for example if you have a consumer that can’t handle too much at once. Apache Kafka can buffer the data so that the consumer can ask for it when it is ready. Apache Kafka also lets you go back in history and check what has occurred.
What are events in Apache Kafka?
Events are stored as records. Similar to documents in a NoSQL database such as Couchbase, events are stored with a key, a value and a timestamp. The biggest difference, though, is that these records are immutable. Each record is a piece of history of what has happened and, similar to reality, history cannot be changed, at least not until someone invents time travel. These records are persisted to disk for durability.
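A record with a key, a value and a timestamp that cannot be changed after creation can be modelled like this. It is a sketch using a frozen Python dataclass; the Record class and the sample key and value are invented for the example and are not Kafka types.

```python
import time
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Record:
    """Toy immutable record: key, value and timestamp (not a Kafka class)."""
    key: str
    value: str
    timestamp: float


record = Record(key="user-42", value="signed-up", timestamp=time.time())

# Trying to rewrite history fails, because the dataclass is frozen.
try:
    record.value = "signed-out"
    mutated = True
except FrozenInstanceError:
    mutated = False

print(record.value, "mutated:", mutated)
```

If something changes later, we don’t edit the old record; we append a new one that describes the change.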
Good to know
Before we get started with an example, there are a couple of things that are good to know.
Topics & partitions
A topic is a logical name with one or more partitions. Partitions are replicated, and the ordering is guaranteed within a partition because each event gets a sequential ID in that partition when it arrives, so the first event might have an ID of 0, the next 1, and so on. This ID is the offset, and it makes it possible for consumers to retrieve events starting from a given offset.
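Here is a small sketch of how a topic, its partitions and the per-partition IDs fit together. It is a toy model with assumptions: the Topic class is hypothetical, IDs start at 0 in this sketch, and the partition is picked with a CRC32 hash of the key purely to keep the example deterministic, which is not the hash Kafka itself uses.

```python
import zlib


class Topic:
    """Toy topic: a name plus a list of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: CRC32 of the key, so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        # The ID (offset) is simply the position inside that partition.
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # Consumers can start reading from any offset they like.
        return self.partitions[partition][offset:]


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.append("customer-1", "created")
p2, o2 = topic.append("customer-1", "paid")

print("partition", p1, "offsets", o1, o2)
print(topic.read(p1, o2))
```

Both events for customer-1 land in the same partition, one after the other, which is why their relative order is preserved.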
Consumer groups
A consumer group is a logical name for one or more consumers. Perhaps you have started three instances of the same process that are going to consume the same data from Kafka in order to parallelize some workload. When doing that, you don’t want the same event to be consumed by many consumers. You want to make sure that only one consumer consumes each event. Consumer groups take care of this by load balancing and spreading the consumption across all the consumers in the group.
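The load balancing can be pictured as handing out partitions to the consumers in the group, so that each partition (and therefore each event) has exactly one consumer. The sketch below uses a simple round-robin assignment; it is a simplified illustration with made-up names (assign_partitions, worker-a and so on), not Kafka’s actual assignment code.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin (simplified illustration)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


assignment = assign_partitions(
    partitions=[0, 1, 2, 3, 4, 5],
    consumers=["worker-a", "worker-b", "worker-c"],
)

for consumer, owned in assignment.items():
    print(consumer, "reads partitions", owned)
```

Every partition shows up under exactly one consumer, so no event is processed twice within the group.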
Running Apache Kafka in Docker
The next step in the Apache Kafka introduction is to get it up and running. To get started quickly, we are going to run a single-broker (single-node) cluster in Docker. If you don’t know what Docker is, then I recommend checking out the Docker introduction post.
Create a file named docker-compose.yml anywhere on your disk containing the following.
Navigate there in a terminal and run docker-compose up -d. This is going to start an Apache Kafka broker node and ZooKeeper to manage the cluster.
This has been a brief Apache Kafka introduction where we went through what Kafka is, what its purpose is and what you need in order to get started with it.
In this Apache Kafka introduction, we set up Apache Kafka, and the ZooKeeper instance that it depends on, in Docker. Now we are finally ready to start producing and consuming events. The next episode is going to focus more on Java development with Spring. We are going to use spring-kafka to quickly connect and start using our Kafka cluster.