
Category: Kafka / Flink

Kafka Internals: Handling a Producer Request

Danica Fine continues a series on Kafka internals:

Welcome to the second installment of our blog series to understand the inner workings of the beautiful black box that is Apache Kafka®. 

We’re diving headfirst into Kafka to see how we actually interact with the cluster through producers and consumers. Along the way, we explore the configurations that affect each step of this epic journey and the metrics that we can use to more effectively monitor the process. 

In the last blog, we explored what the Kafka producer client does behind the scenes each time we call producer.send() (or similar, depending on your language of choice). In this post, we follow our brave hero, a well-formed produce request, that’s on its way to the broker to be processed and have its data stored on the cluster.
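For context, here is a minimal sketch (not taken from the post) of the client-side call that kicks off that journey, with a placeholder broker address and topic name; acks is one of the producer settings that governs how the broker acknowledges the resulting produce request.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceRequestExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address and topic name, for illustration only.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // acks controls how many replicas must acknowledge the produce request.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("example-topic", "key", "value");
            // send() is asynchronous; the callback fires once the broker has
            // processed (or rejected) the produce request.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored at %s-%d offset %d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```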

Click through to learn more about how it all works.


Testing Kafka Messages with RecordCaptor

Anton Belyaev shows off an open-source utility:

Let’s take a Telegram bot that forwards requests to the OpenAI API and returns the result to the user as an example. If the request to OpenAI violates the system’s security rules, the client will be notified. Additionally, a message will be sent to Kafka for the behavioral control system so that the manager can contact the user, explain that their request was too sensitive even for our bot, and ask them to review their preferences.

The interaction contracts with services are described in a simplified manner to emphasize the core logic. Below is a sequence diagram demonstrating the application’s architecture. I understand that the design may raise questions from a system architecture perspective, but please approach it with understanding — the main goal here is to demonstrate the approach to writing tests.
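RecordCaptor's own API isn't reproduced here; purely as a hypothetical sketch of the capture idea, this is a plain KafkaConsumer drained during a test so that assertions can inspect every message the application produced.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical helper, not RecordCaptor itself: drains a topic during a test
// so assertions can inspect every message the application produced.
public class TestRecordCollector {

    public static List<ConsumerRecord<String, String>> drain(
            KafkaConsumer<String, String> consumer, int expectedCount, Duration timeout) {
        List<ConsumerRecord<String, String>> collected = new ArrayList<>();
        long deadline = System.currentTimeMillis() + timeout.toMillis();
        // Keep polling until we have the expected number of records or time runs out.
        while (collected.size() < expectedCount && System.currentTimeMillis() < deadline) {
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(200));
            batch.forEach(collected::add);
        }
        return collected;
    }
}
```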

Read on to see how it all works, as well as links to Anton’s GitHub repo for testing in Kafka.


Comparing Azure Event Hubs to Apache Kafka

Dharmbir Kashyap makes a comparison:

In the realm of event streaming and real-time data processing, choosing the right platform is critical to the success of your project. Two of the most popular options available today are Azure Event Hub and Apache Kafka. Both platforms offer robust solutions for handling large volumes of streaming data, but they are designed with different architectures, features, and use cases in mind. This blog post will delve into the key differences between Azure Event Hub and Kafka, helping you determine which platform is best suited for your specific needs.
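One concrete point of overlap worth knowing: Event Hubs exposes a Kafka-compatible endpoint, so an existing Kafka client can often be repointed with configuration alone. A rough sketch, with a placeholder namespace and connection string:

```java
import java.util.Properties;

public class EventHubsKafkaConfig {

    // Placeholder namespace; the port and SASL settings follow Event Hubs'
    // documented Kafka compatibility layer.
    public static Properties eventHubsProperties(String connectionString) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "my-namespace.servicebus.windows.net:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        // The literal username "$ConnectionString" tells Event Hubs to read the
        // namespace connection string from the password field.
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"$ConnectionString\" "
            + "password=\"" + connectionString + "\";");
        return props;
    }
}
```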

Read on for an overview of each product and where each product fits.


Updates in Apache Kafka 3.8

Josep Prat announces a slew of changes:

We are proud to announce the release of Apache Kafka 3.8.0. This release contains many new features and improvements. This blog post will highlight some of the more prominent features. For a full list of changes, be sure to check the release notes.

See the Upgrading to 3.8.0 from any version 0.8.x through 3.7.x section in the documentation for the list of notable changes and detailed upgrade steps.

This also puts Kafka one step closer to getting rid of its ZooKeeper dependency altogether.


Test Isolation with Kafka

Anton Belyaev builds some tests:

The experience of running Kafka in test scenarios has reached a high level of convenience thanks to the use of Testcontainers and enhanced support in Spring Boot 3.1 with the @ServiceConnection annotation. However, writing and maintaining integration tests with Kafka remains a challenge. This article describes an approach that significantly simplifies the testing process by ensuring test isolation and providing a set of tools to achieve this goal. With the successful implementation of isolation, Kafka tests can be organized in such a way that at the stage of result verification, there is full access to all messages that have arisen during the test, thereby avoiding the need for forced waiting methods such as Thread.sleep().

This method is suitable for use with Testcontainers, Embedded Kafka, or other methods of running the Kafka service (e.g., a local instance).
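As a minimal sketch of the setup Anton builds on (assuming the Spring Boot Testcontainers and Kafka test dependencies are on the classpath; the container image tag is arbitrary):

```java
import org.junit.jupiter.api.Test;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.testcontainers.service.connection.ServiceConnection;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

// A minimal Spring Boot 3.1+ integration test: @ServiceConnection wires the
// containerized broker's address into the application's Kafka properties,
// so no manual property registration is needed.
@SpringBootTest
@Testcontainers
class KafkaIntegrationTest {

    @Container
    @ServiceConnection
    static KafkaContainer kafka =
        new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.6.1"));

    @Test
    void contextLoads() {
        // Test bodies would produce to and consume from the containerized broker.
    }
}
```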

Click through for that approach.


Building a Full-Stack App with Kafka and Node.js

Lucia Cerchie builds an application:

A well-known debate: tabs or spaces? Sure, we could set up a Google Form to collect this data, but where’s the fun in that? Let’s settle the debate, Kafka-style. We’ll use the new confluent-kafka-javascript client (not in general availability yet) to build an app that produces the current state of the vote counts to a Kafka topic and consumes from that same topic to surface them to a JavaScript frontend. 

Why are we using this client in particular? It comes from Confluent and is intended for use with Apache Kafka® and Confluent Platform. It’s compatible with Confluent’s cloud offering as well. It builds on concepts from the two most popular Kafka JavaScript client libraries: KafkaJS and node-rdkafka. The functionality is based on node-rdkafka; however, it also provides a way to interface with the library via methods similar to those in KafkaJS due to their developer-friendly nature. There are two APIs: the first implements the functionality based on node-rdkafka; the second is a promisified API with methods akin to those in KafkaJS. By choosing this client, we get access to a wide range of functionality and a smooth developer experience via the developer-friendly methods.

Click through for the code and explanation. Meanwhile, tabs in my heart, spaces in my job.


Hot and Cold Partitions for Apache Kafka Data

Gautam Goswami splits the data:

At first, data tiering was a tactic used by storage systems to reduce data storage costs. This involved grouping data that was not accessed as often into more affordable, if less effective, storage array choices. Data that has been idle for a year or more, for example, may be moved from an expensive Flash tier to a more affordable SATA disk tier. Even though they are quite costly, SSDs and flash can be categorized as high-performance storage classes. Smaller datasets that are actively used and require the maximum performance are usually stored in Flash.

Cloud data tiering has gained popularity as customers seek alternative options for tiering or archiving data to a public cloud. Public clouds presently offer a mix of object and file storage options. Object storage classes such as Amazon S3 and Azure Blob (Azure Storage) deliver significant cost efficiency and all the benefits of object storage without the complexities of setup and management. 

Read on for an architecture that uses hot and cold tiers, as well as how you can set it up on an existing Kafka topic.
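As a rough sketch of the topic-level piece of that setup (assuming the brokers already have tiered storage and a remote storage plugin enabled; the topic name and retention value are placeholders), the Java AdminClient can flip the relevant configs on an existing topic:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            List<AlterConfigOp> ops = List.of(
                // Route older segments to the remote (cold) tier...
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                                  AlterConfigOp.OpType.SET),
                // ...while keeping only the most recent hour on local (hot) disks.
                new AlterConfigOp(new ConfigEntry("local.retention.ms", "3600000"),
                                  AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```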


Transforming a REST API into a Data Stream

Lucia Cerchie and Dave Troiano build a stream:

In the space of APIs for consuming up-to-date data (say, events or state available within an hour of occurring) many API paradigms exist. There are file- or object-based paradigms, e.g., S3 access. There’s database access, e.g., direct Snowflake access. Last, we have decoupled client-server APIs, e.g., REST APIs, gRPC, webhooks, and streaming APIs. In this context, “decoupled” means that the client usually communicates with the server over a language-agnostic standard network protocol like HTTP/S, usually receives data in a standard format like JSON, and, in contrast to direct database access, typically doesn’t know what data store backs the API.

Of the above styles, more often than not, API developers settle on HTTP-based REST APIs for a number of reasons. They are incredibly popular. More developers know how to use REST APIs and are using them in production compared to other API technologies. For example, Rapid API’s 2022 State of APIs reports 69.3% of survey respondents using REST APIs in production, well above the percentage using alternatives like gRPC (8.2%), GraphQL (18.6%), or webhooks (34.6%). 

Click through for a demonstration of how to take an existing REST API and build a data stream out of it using Apache Kafka and Apache Flink.
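As a very rough illustration of the general shape of that pipeline (not the post’s implementation; the endpoint, topic, and polling interval are all hypothetical), a producer can poll the REST API and forward each response as a record for Flink to pick up downstream:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RestToKafka {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint and topic names, for illustration only.
        URI endpoint = URI.create("https://api.example.com/latest-events");
        HttpClient http = HttpClient.newHttpClient();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // Poll the REST API and forward the JSON payload as a record;
                // downstream, Flink (or any consumer) can process it as a stream.
                HttpResponse<String> response = http.send(
                    HttpRequest.newBuilder(endpoint).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
                producer.send(new ProducerRecord<>("api-events", response.body()));
                Thread.sleep(5_000); // simple fixed polling interval
            }
        }
    }
}
```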


Combining Flink SQL, Streamlit, and Kafka

Lucia Cerchie has a pair of posts. First up, Lucia sets the stage:

In part 1 of this series, we’ll make an app, hosted on Streamlit, that allows a user to select a stock, in this case SPY, or the SPDR S&P 500 ETF Trust. Upon selection, a live chart of the stock’s bid prices, calculated every five seconds, will appear.

What are the pieces that go into making this work? The source of the data is the Alpaca Market Data API. We’ll hook up a Kafka producer to the websocket stream and send data to a Kafka topic in Confluent Cloud. Then we’ll use Flink SQL within Confluent Cloud’s Flink SQL workspace to tumble an average bid price every five seconds. Finally, we’ll use a Kafka consumer to receive that data and populate it to a Streamlit component in real time. This frontend component will be deployed on Streamlit as well.
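The posts run this in Confluent Cloud’s Flink SQL workspace; purely as an illustration of the kind of tumbling-window statement involved, here is a rough open-source-Flink rendering with hypothetical table and column names:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TumblingBidAverage {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Assumes a 'quotes' table (hypothetical name) is already registered over
        // the Kafka topic, with a 'bid_price' column and an event-time column 'ts'
        // declared as its watermark.
        tEnv.executeSql(
            "SELECT window_start, window_end, AVG(bid_price) AS avg_bid " +
            "FROM TABLE(TUMBLE(TABLE quotes, DESCRIPTOR(ts), INTERVAL '5' SECOND)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```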

Part 2 then closes the trap:

In part one of this series, we walked through how to use Streamlit, Apache Kafka®, and Apache Flink® to create a live data-driven user interface for a market data application to select a stock (e.g., SPY) and discussed the structure of the app at a high level. First, data with information on stock bid prices is moved via an Alpaca websocket, then, it’s produced to a Kafka topic in Confluent Cloud where it is also processed with Flink SQL. 

Now comes the tricky part: running the Kafka consumer and producer in the same application.

Click through for a good demonstration of a practical solution. Lucia also has a GitHub repo with all of the code, a demo of the site in action, and some links to additional resources.


Dual-Write Issues and Kafka

Wade Waldron solves a common but difficult problem:

However, the dual-write problem isn’t unique to event-driven systems or Kafka. It occurs in many situations involving different technologies and architectures.

When I started building event-driven systems, I encountered the dual-write problem almost immediately. I eventually learned effective ways to solve it but tripped over some anti-patterns along the way.

I want to break down the details of the dual-write problem so you can understand how it occurs and avoid making the same mistakes I did. I’ll outline a few anti-patterns that might look promising, but don’t solve the problem. Finally, we’ll look at accepted solutions that eliminate the dual-write problem.
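To make the failure mode concrete (a generic illustration, not code from Wade’s post): the naive approach commits to a database and then publishes to Kafka as two separate, non-atomic writes, so a crash between them leaves the two systems disagreeing.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Generic illustration of the dual-write problem, not code from the linked post.
public class OrderService {

    // Hypothetical repository abstraction standing in for a relational database.
    interface OrderRepository {
        void save(String orderId, String payload);
    }

    private final OrderRepository repository;
    private final KafkaProducer<String, String> producer;

    OrderService(OrderRepository repository, KafkaProducer<String, String> producer) {
        this.repository = repository;
        this.producer = producer;
    }

    public void placeOrder(String orderId, String payload) {
        // Write #1: the order is committed to the database.
        repository.save(orderId, payload);

        // If the process crashes here, the row exists but no event is ever
        // published, and downstream consumers silently fall out of sync:
        // that gap is the dual-write problem.

        // Write #2: the event is published to Kafka, outside the database transaction.
        producer.send(new ProducerRecord<>("orders", orderId, payload));
    }
}
```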

Read on for a few techniques that will not work (assuming you are using Apache Kafka to flow events into some external systems) and some that will.
