Long-Term Storage In Kafka

Kevin Feasel

2017-09-18

Hadoop

Jay Kreps shows us that you can use Kafka as a primary data store:

The short answer is that it’s not insane, people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases, here’s a few:

  1. You may be building an application using event sourcing and need a store for the log of changes. Theoretically you could use any system to store this log, but Kafka directly solves a lot of the problems of an immutable log and “materialized views” computed off of that. The New York Times does this for all their article data as the heart of their CMS.

  2. You may have an in-memory cache in each instance of your application that is fed by updates from Kafka. A very simple way of building this is to make the Kafka topic log compacted, and have the app simply start fresh at offset zero whenever it restarts to populate its cache.

  3. Stream processing jobs do computation off a stream of data coming via Kafka. When the logic of the stream processing code changes, you often want to recompute your results. A very simple way to do this is just to reset the offset for the program to zero to recompute the results with the new code. This sometimes goes by the somewhat grandiose name of The Kappa Architecture.

  4. Kafka is often used to capture and distribute a stream of database updates (this is often called Change Data Capture or CDC). Applications that consume this data in steady state just need the newest changes, however new applications need start with a full dump or snapshot of data. However performing a full dump of a large production database is often a very delicate and time consuming operation. Enabling log compaction on the topic containing the stream of changes allows consumers of this data to simple reload by resetting to offset zero.

This is a great article, especially the part about how Kafka is not the data storage system; there are reasons you’d want data in other formats as well (like relational databases, which are great for random access queries).

Related Posts

Five Books For Learning Kafka

Data Flair has a guide to five books to help you learn Apache Kafka: The book “Kafka: The Definitive Guide” is written by engineers from Confluent andLinkedIn who are responsible for developing Kafka. They explain how to deploy production Kafka clusters, write reliable event-driven microservices, and build scalable stream-processing applications with this platform. It contains detailed examples as well. […]

Read More

Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams: Now we can query from it and show the aggregate window timestamp alongside the result: ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \ HOST, INVALID_LOGIN_COUNT \ FROM INVALID_USERS_LOGINS_PER_HOST; 1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1 1521646620000 | […]

Read More

Categories

September 2017
MTWTFSS
« Aug Oct »
 123
45678910
11121314151617
18192021222324
252627282930