Apache Kafka Introduction

Integrating systems that grow larger every day is a complex task. Apache Kafka is a piece of software that tries to solve this by using events. In this article, we are going to give you an introduction to Apache Kafka, so that you get an understanding of what it is and how to get started with it.

Why are events different from messages?

You might think that events sound an awful lot like messages, but that isn't really true.

  • With events, we can guarantee ordering, which we can't always rely on with messages.
  • Scalability: this is of course not always true, but it is usually easier to scale an event stream horizontally.
  • Event streams allow us to use polling instead of push, which is quite a big deal: what if one of many consumers is a lot slower or faster than the others? With polling, each consumer processes data at its own pace, which makes backpressure much easier to handle (see the sketch after this list).
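
To make the polling point concrete, here is a minimal sketch of a pull-based consumer using the plain Java kafka-clients API. The broker address, topic name, and partition are assumptions for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PollingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Read partition 0 of a hypothetical topic directly, without a group.
            consumer.assign(List.of(new TopicPartition("example-topic", 0)));
            while (true) {
                // poll() pulls records at the pace of this consumer: a slow consumer
                // simply polls less often instead of being flooded by pushed messages.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```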

What is Apache Kafka?

Apache Kafka is an event ledger into which we can feed data as events, and from which different systems can integrate and consume those events. Think of it as a distributed commit log to which you only ever append more data.
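
As a small illustration of that append-only idea, a producer only ever adds new records to the end of the log. A minimal sketch with the plain Java client; the topic name, key, and value are made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AppendingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each send appends a new record to the end of the log;
            // nothing is ever updated in place.
            producer.send(new ProducerRecord<>("example-topic", "user-42", "signed-up"));
        }
    }
}
```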

Horizontal vs. vertical scaling

One of the biggest benefits of Apache Kafka is its speed and scalability. It scales linearly: if you need more capacity, you add more nodes to the cluster, which is what is referred to as horizontal scaling. You don't need to buy more expensive and powerful hardware; instead, you simply add more nodes and scale out.

As mentioned, Apache Kafka is also incredibly fast. Jay Kreps wrote a really interesting post, Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines), that I recommend reading. Apache Kafka is extremely optimized and quick for many reasons, a couple of which are:

  • Rather than copying data through the application layer, Apache Kafka uses the kernel's zero-copy mechanism to transfer data straight from the page cache to the network socket.
  • Records are immutable and only ever appended, which means the disk can be accessed sequentially and expensive random I/O operations can be avoided.
  • Data is batched into chunks, which amortizes the cost of network round trips between machines.

All of these optimizations mean that Apache Kafka operates at nearly network speed, and they are also how Jay Kreps achieved 2 million writes per second in his blog post.
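
As a concrete example of the batching point above, the producer exposes settings that control how records are grouped before being sent over the network. A minimal sketch: linger.ms and batch.size are real producer settings, but the values here are arbitrary:

```java
import java.util.Properties;

public class BatchingConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Wait up to 10 ms so several records can be grouped into one batch...
        props.put("linger.ms", "10");
        // ...and let each batch grow up to 32 KB before it is sent.
        props.put("batch.size", String.valueOf(32 * 1024));
        return props;
    }
}
```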

When should you use Apache Kafka?

Apache Kafka is usually used as an integration hub between multiple servers and services. If your system includes multiple servers or services that need to integrate with each other, then you could probably benefit from Apache Kafka. Kafka makes complete sense in a microservices environment.

Kafka is also a popular tool for big data ingestion. Sometimes you need to buffer large amounts of data, for example when a consumer can't handle everything that is thrown at it at once. Apache Kafka can buffer the data so that the consumer can come and ask for it when it is ready. Apache Kafka also lets you go back in history and replay what has occurred.
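
As a sketch of that replay capability, a consumer can rewind its assigned partitions back to the earliest retained offset. The broker address, group name, and topic below are assumptions for illustration:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayFromStart {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "replay-group");            // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic")); // hypothetical topic
            consumer.poll(Duration.ofSeconds(1));         // join the group, get partitions assigned
            // Rewind every assigned partition to the earliest retained offset,
            // replaying the history of what has occurred.
            consumer.seekToBeginning(consumer.assignment());
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.value()));
        }
    }
}
```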

What are events in Apache Kafka?

Events are stored as records, and similar to documents in the NoSQL database Couchbase, records are stored with a key, a value, and a timestamp. The biggest difference, though, is that these records are immutable. Each record is a piece of the history of what has happened, and just like reality, that history cannot be changed, at least not until someone invents time travel. The records are persisted to disk for durability.
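
For illustration, this is how those three parts of a record look through the Java client; a consumed record exposes them as read-only accessors (the helper method is hypothetical, not part of any API):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class RecordFields {
    // A consumed record exposes exactly the parts described above; it has no
    // setters, so once written the event can never be changed.
    static void describe(ConsumerRecord<String, String> record) {
        System.out.println("key:       " + record.key());
        System.out.println("value:     " + record.value());
        System.out.println("timestamp: " + record.timestamp());
    }
}
```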

Good to know

Before we get started with an example, there are a couple of things that I want to bring up that are good to know.

Topics & partitions

A topic is a logical name with one or more partitions. Partitions are replicated, and ordering is guaranteed within a partition because each event gets a sequential ID, called an offset, when it arrives in a partition: the first event might have an offset of 0, the next 1, and so on. This also makes it possible for consumers to retrieve events starting from a given offset.
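
For example, topics and their partition counts can be created programmatically with the admin client. A minimal sketch; the topic name and partition count are arbitrary, and the replication factor is 1 because the Docker setup later in this post has a single broker:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical topic with three partitions, kept in a single
            // replica since our cluster only has one broker.
            NewTopic topic = new NewTopic("example-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```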

Consumer groups

A consumer group is a logical name for one or more consumers. Perhaps you have scaled up three instances of the same process that are going to consume the same data from Kafka in order to parallelize some workload. When doing that, you don't want the same event to be consumed by several of them; you want to make sure that each event is consumed by exactly one consumer. Consumer groups take care of this by load balancing the consumption across all the consumers in the group.
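
A minimal sketch of a grouped consumer follows. The group.id property is what places a consumer in a group: starting several instances of this process with the same value would split the partitions between them. The names and addresses are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Every instance of this process uses the same group id, so Kafka
        // assigns each partition to exactly one of them.
        props.put("group.id", "order-processors");        // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("example-topic")); // hypothetical topic
            while (true) {
                consumer.poll(Duration.ofSeconds(1))
                        .forEach(r -> System.out.println("processing " + r.value()));
            }
        }
    }
}
```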

Running Apache Kafka in Docker

The next step in this Apache Kafka introduction is to get it up and running. To quickly get started, we are going to run a single-broker (single-node) cluster in Docker. If you don't have any idea what Docker is, then I recommend checking out the Docker introduction post.

Create a file named docker-compose.yml anywhere on your disk containing the following.
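
Here is a minimal sketch of such a compose file, assuming the Confluent community images; the image tags, ports, and service names are illustrative and may need adjusting:

```yaml
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest   # illustrative image choice
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:latest       # illustrative image choice
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092
      # Advertise localhost so clients on the host machine can connect.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      # A single broker cannot replicate the internal offsets topic 3 times.
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```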

Navigate there in a terminal and run docker-compose up -d. This is going to start an Apache Kafka broker node and a ZooKeeper instance that manages the cluster.

Final words

We have now had a brief Apache Kafka introduction, where we went through what Kafka is, what its purpose is, and what you need in order to get started with it.

In this Apache Kafka introduction, we set up Apache Kafka, and the ZooKeeper instance it depends on, in Docker. Now we are finally ready to start producing and consuming events. The next episode is going to focus more on Java development with Spring. We are going to use spring-kafka to quickly connect to and start using our Kafka cluster.
