Integrating systems that grow larger every day is a complex task. Apache Kafka is a piece of software that tries to solve this by using events. In this article, we are going to give you an Apache Kafka introduction so that you get an understanding of what it is and how to get started with it.
You might think that events sound an awful lot like messages, but that isn't quite true: an event records something that has happened, rather than being a command addressed to a specific recipient.
Apache Kafka is an event ledger that we can feed data into as events, and different systems can then integrate with it and consume those events. Think of it as a distributed commit log to which you only ever append more data.
One of the biggest benefits of Apache Kafka is its speed and scalability. It scales linearly, meaning that if you add more nodes to the cluster, you can very easily scale up, which is what is referred to as horizontal scaling. You don't need to buy more expensive and powerful hardware; instead, you simply add more nodes and scale horizontally.
As mentioned, Apache Kafka is also incredibly fast. Jay Kreps wrote a really interesting post about benchmarking Kafka that I recommend reading. Apache Kafka is extremely optimized and quick for many reasons, a couple of which are:

- Writes are sequential appends to disk, which is far faster than random access.
- Data is sent from disk to the network socket with zero-copy, avoiding unnecessary copies through user space.
- Messages are batched and can be compressed, reducing both network and disk overhead.
All of these optimizations mean that Apache Kafka operates at close to network speed, and they are how Jay Kreps achieved 2 million writes per second in his blog post.
Apache Kafka is usually used as an integration hub between multiple servers and services. If your system includes multiple servers or services that need to integrate with each other, then you could probably benefit from Apache Kafka. Kafka makes complete sense in a microservices environment.
Kafka is also a popular tool for big data ingestion. Sometimes you need to buffer large amounts of data, for example when a consumer can't keep up with the incoming rate. Apache Kafka can be used to buffer the data so that the consumer can ask for it when it is ready. Apache Kafka also lets you go back in history and replay what has occurred.
Events are stored as records, and similar to documents in a NoSQL database such as Couchbase, records consist of a key, a value and a timestamp. The biggest difference, though, is that these records are immutable. Each record is a piece of history of what has happened, and just like reality, history cannot be changed, at least not until someone invents time travel. These records are persisted to disk for durability.
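As a taste of what that looks like in code, here is a minimal sketch of producing such a record with the plain Apache Kafka Java client. The topic name and key are hypothetical, and the broker address assumes the local Docker setup later in this post; Kafka assigns the timestamp and offset itself when the record is appended.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProduceRecord {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // broker from the Docker setup below
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A record is an immutable key/value pair; Kafka adds the timestamp
            // and a sequential offset when it appends the record to a partition.
            producer.send(new ProducerRecord<>("Topic1", "user-42", "signed-up"));
        }
    }
}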
Before we get started with the setup, there are a couple of things that are good to know.
A topic is a logical name with one or more partitions. Partitions are replicated, and ordering is guaranteed within a partition, because each event gets a sequential ID, called an offset, when it arrives in the partition: the first event might get offset 0, the next 1, and so on. This also makes it possible for consumers to retrieve events based on an offset.
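To illustrate offsets, here is a rough sketch, again with the plain Java client, of a consumer that is manually assigned a partition and jumps back to the first offset to re-read history. The topic name and partition number are assumptions matching the Docker setup below.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReadFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("enable.auto.commit", "false"); // no group management, so don't commit offsets
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("Topic1", 0);
            consumer.assign(Collections.singletonList(partition));
            consumer.seek(partition, 0); // jump back to the first offset in the partition

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}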
A consumer group is a logical name for one (or more) consumers. Perhaps you have scaled up three instances of the same process that are going to consume the same data from Kafka in order to parallelize some workload. When doing that, you don't want the same event to be consumed by several consumers; you want to make sure that each event is consumed by exactly one of them. Consumer groups take care of this by load balancing and spreading out consumption across all the consumers in the group.
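As a sketch of how that looks in practice: every instance of the consumer below that is started with the same group.id (a hypothetical name here) joins the same consumer group, and Kafka divides the topic's partitions between the instances so each event is handled only once within the group.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors"); // all instances with this group.id share the work
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("Topic1"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Each event is delivered to exactly one consumer in the group.
                    System.out.printf("partition=%d value=%s%n", record.partition(), record.value());
                }
            }
        }
    }
}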
The next step in this Apache Kafka introduction is to get it up and running. To quickly get started, we are going to run a single-broker (single-node) cluster in Docker. If you have no idea what Docker is, I recommend checking out the Docker introduction post.
Create a file named docker-compose.yml anywhere on your disk containing the following.
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: localhost
      # topic name : partitions : replication factor (must be 1, since we only run one broker)
      KAFKA_CREATE_TOPICS: "Topic1:1:1"
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
Navigate there with a terminal prompt and run docker-compose up -d. This is going to start an Apache Kafka broker node and a ZooKeeper instance to manage the cluster.
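To check that the broker is reachable, a quick sketch using the Kafka Java AdminClient can list the topics on the freshly started broker. This assumes the broker is exposed on localhost:9092 as in the compose file above.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;

public class ListTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Should print at least "Topic1", created automatically by the compose file.
            System.out.println(admin.listTopics().names().get());
        }
    }
}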
This has been a brief Apache Kafka introduction where we went through what Kafka is, what its purpose is, and what you need in order to get started with it.
We also set up Apache Kafka, together with the ZooKeeper instance it depends on, in Docker. Now we are finally ready to start producing and consuming events. The next episode will focus more specifically on Java development with Spring, where we will use spring-kafka to quickly connect to and start using our Kafka cluster.