Long-Term Storage In Kafka

Kevin Feasel

2017-09-18

Hadoop

Jay Kreps shows us that you can use Kafka as a primary data store:

The short answer is that it’s not insane, people do this all the time, and Kafka was actually designed for this type of usage. But first, why might you want to do this? There are actually a number of use cases, here’s a few:

  1. You may be building an application using event sourcing and need a store for the log of changes. Theoretically you could use any system to store this log, but Kafka directly solves a lot of the problems of an immutable log and “materialized views” computed off of that. The New York Times does this for all their article data as the heart of their CMS.

  2. You may have an in-memory cache in each instance of your application that is fed by updates from Kafka. A very simple way of building this is to make the Kafka topic log compacted, and have the app simply start fresh at offset zero whenever it restarts to populate its cache.

  3. Stream processing jobs do computation off a stream of data coming via Kafka. When the logic of the stream processing code changes, you often want to recompute your results. A very simple way to do this is just to reset the offset for the program to zero to recompute the results with the new code. This sometimes goes by the somewhat grandiose name of The Kappa Architecture.

  4. Kafka is often used to capture and distribute a stream of database updates (this is often called Change Data Capture or CDC). Applications that consume this data in steady state just need the newest changes, however new applications need start with a full dump or snapshot of data. However performing a full dump of a large production database is often a very delicate and time consuming operation. Enabling log compaction on the topic containing the stream of changes allows consumers of this data to simple reload by resetting to offset zero.

This is a great article, especially the part about how Kafka is not the data storage system; there are reasons you’d want data in other formats as well (like relational databases, which are great for random access queries).

Related Posts

Deploying Cloudera Enterprise On Azure

Xavier Morera announces a new Cloudera course: You will start by learning the Microsoft Azure services required to deploy a secure, elastic, Cloudera Enterprise cluster. These core services include security, networking, virtual machine management, and storage, just to name a few. Then, you’ll learn best practices and patterns for cloud-based clusters, including tips and caveats for security […]

Read More

Working With The Databricks API Via Powershell

Gerhard Brueckl has a Powershell module for interacting with Databricks, either Azure or AWS: As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended […]

Read More

Categories

September 2017
MTWTFSS
« Aug Oct »
 123
45678910
11121314151617
18192021222324
252627282930