
Category: Kafka / Flink

Tips for Scaling Apache Kafka

Narendra Lakshmana Gowda tunes a Kafka cluster:

Apache Kafka is known for its ability to process a huge quantity of events in real time. However, to handle millions of events, we need to follow certain best practices while implementing both Kafka producer services and consumer services.

Before you start using Kafka in your projects, let’s understand when to use it:

Much of the advice is pretty standard for performance tuning in Kafka, like setting batch size and linger time on the producer or managing consumers in a consumer group.
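
To ground the producer side of that advice, here is a minimal Java sketch of a producer with batching, lingering, and compression dialed up. The broker address, topic name, and specific values are illustrative placeholders, not recommendations from the article.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Batch up to 64 KB per partition and wait up to 20 ms to fill each batch;
        // compressing whole batches amplifies the benefit. These values are starting points to tune.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello")); // "events" is a made-up topic
        }
    }
}

The trade-off is a few milliseconds of extra latency per record in exchange for fewer, larger requests, which is usually the right deal when you’re pushing millions of events.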


Handling Errors in Apache Flink Apps

Alexis Tekin and Jeremy Ber handle an error:

Data streaming applications continuously process incoming data, much like a never-ending query against a database. Unlike traditional database queries where you request data one time and receive a single response, streaming data applications constantly receive new data in real time. This introduces some complexity, particularly around error handling. This post discusses the strategies for handling errors in Apache Flink applications. However, the general principles discussed here apply to stream processing applications at large.

Read on to see how this all works when you’re hosting a Flink application. The discussion centers on Flink applications that live in AWS, though very little in the article is AWS-specific.
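
One pattern those general principles often lead to is routing records that fail processing to a side output (a dead-letter-queue style of handling) instead of letting the job crash. Below is a minimal Java DataStream sketch of that idea; the element values and tag name are my own illustration, not the post’s implementation.

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class DlqSketch {
    // Tag identifying the side output that collects records we failed to parse.
    private static final OutputTag<String> ERRORS = new OutputTag<String>("parse-errors") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        SingleOutputStreamOperator<Integer> parsed = env
            .fromElements("1", "2", "not-a-number", "4")
            .process(new ProcessFunction<String, Integer>() {
                @Override
                public void processElement(String value, Context ctx, Collector<Integer> out) {
                    try {
                        out.collect(Integer.parseInt(value));  // happy path
                    } catch (NumberFormatException e) {
                        ctx.output(ERRORS, value);             // route the bad record to the side output
                    }
                }
            });

        parsed.print();                        // good records
        parsed.getSideOutput(ERRORS).print();  // bad records
        env.execute("error-handling-sketch");
    }
}

In a real pipeline, the side output would typically be sunk to a dedicated error topic for later inspection or replay.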


Musings on the State of Apache Kafka and Apache Flink

Adron Hall shares some thoughts:

I’ve worked with (** references at end of article) a number of Apache projects over the years, often pretty closely: Apache Cassandra, Apache Flink, Apache Kafka, Apache ZooKeeper, and numerous others. But for the last few years I’ve not been immediately hands-on with the technology. A few questions popped up recently that, fortunately, I was able to answer from existing knowledge, but they made me really curious about what the SITREP (situational report) is for the Apache Kafka and Flink projects today, i.e., rolling into 2025! The following is a quick dive into the history and then the latest details (and drama?) around Apache Kafka, Flink, and, tangentially, some other projects (ZooKeeper?).

Click through to see how the pieces fit together.


Securing a Kafka Ecosystem

Riya has a breakdown of how to protect your Apache Kafka installation and resources around it:

Apache Kafka is the backbone of many real-time data pipelines, making security an essential aspect of its deployment. Protecting your Kafka ecosystem involves implementing encryption to safeguard data, authentication to verify user identities, and authorization to control access. This guide provides a comprehensive overview of these three pillars of securing Kafka, complete with code examples to help you implement best practices.

Click through for demonstrations of encryption, authentication, and authorization.
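
As a rough sketch of what the client-facing half of those pillars can look like, here is a Java configuration for a consumer that encrypts traffic with TLS and authenticates with SASL/SCRAM. The host name, file paths, and credentials are placeholders; authorization is enforced on the broker side (for example, via ACLs granted to the authenticated principal), so it doesn’t appear in the client config.

import java.util.Properties;

import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class SecureClientConfigSketch {
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9093"); // hypothetical host

        // Encryption in transit plus SASL authentication over the same listener.
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");

        // SCRAM credentials; the username and password here are placeholders.
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"app-user\" password=\"app-secret\";");

        props.put(ConsumerConfig.GROUP_ID_CONFIG, "secure-reader");
        return props;
    }
}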


Handling a Consumer Fetch Request in Kafka

Multiple Confluent employees (who apparently don’t get to have names this time around) wrap up a series:

It’s been a long time coming, but we’ve finally arrived at the fourth and final installment of our blog series. In this series, we’ve been peeling back the layers of Apache Kafka® to get a deeper understanding of how best to interact with the cluster using producer and consumer clients.

Read on for the final part, as well as links to previous parts if you missed them.
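
To put some of the fetch mechanics the series covers into code, here is a small Java consumer sketch with the settings that shape how a broker answers fetch requests: how much data it accumulates and how long it waits before responding. The values are illustrative starting points, not figures from the series.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FetchTuningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fetch-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        // The broker holds a fetch request until at least 1 KB is available
        // or 500 ms has elapsed, whichever comes first (illustrative values).
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024);
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        // Upper bound on the data returned per partition per fetch.
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // made-up topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s-%d@%d: %s%n",
                    record.topic(), record.partition(), record.offset(), record.value());
            }
        }
    }
}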


Apache Kafka 3.9 Now Available

Colin McCabe announces Apache Kafka 3.9:

We are proud to announce the release of Apache Kafka 3.9.0. This is a major release, the final one in the 3.x line. This will also be the final major release to feature the deprecated Apache ZooKeeper® mode. Starting in 4.0 and later, Kafka will always run without ZooKeeper.

That’s a pretty big change, but there are also quite a few other significant changes here to check out.


Minimizing Latency in Kafka Streaming Applications using APIs

Abhishek Goswami doesn’t want to slow down the stream:

Kafka is widely adopted for building real-time streaming applications due to its fault tolerance, scalability, and ability to process large volumes of data. In general, though, Kafka streaming consumers work best in an environment where they do not have to call external APIs or databases. When a Kafka consumer must make a synchronous database or API call, the latency introduced by network hops and I/O operations accumulates quickly (especially when the streaming pipeline is performing an initial load of a large volume of data before starting CDC). This can significantly slow down the pipeline and exhaust system resources, hurting throughput. In extreme situations it becomes unsustainable: consumers cannot commit offsets before the next poll because of the added latency, so the broker continuously rebalances them, and they end up processing practically nothing while consuming more and more system resources over time.

This is a real problem faced by many streaming applications. In this article, we’ll explore some effective strategies to minimize latency in Kafka streaming applications where external API or database calls are inevitable. We’ll also compare these strategies with the alternative approach of separating out the parts of the pipeline that require these external interactions into a separate publish/subscribe-based consumer.

Read on to understand the causes of this latency and several patterns you can use to limit it.
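
One pattern that frequently comes up for this problem is pausing fetches while a slow external call completes, so the consumer keeps polling (and keeps its group membership) without pulling in more records. The Java sketch below illustrates that idea under my own assumptions; the topic name and the callExternalApi helper are hypothetical, and this is not the article’s implementation.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PauseResumeSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "enrichment");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // made-up topic
            CompletableFuture<Void> inFlight = CompletableFuture.completedFuture(null);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                if (!records.isEmpty()) {
                    // Pause fetching so we keep calling poll() without pulling more data
                    // while the slow external calls for this batch run in the background.
                    consumer.pause(consumer.assignment());
                    inFlight = CompletableFuture.runAsync(() -> {
                        for (ConsumerRecord<String, String> record : records) {
                            callExternalApi(record.value()); // hypothetical slow I/O
                        }
                    });
                }
                if (inFlight.isDone() && !consumer.paused().isEmpty()) {
                    consumer.commitSync();               // commit only after the batch finished
                    consumer.resume(consumer.paused());
                }
            }
        }
    }

    private static void callExternalApi(String payload) {
        // Placeholder for a synchronous REST or database call.
    }
}

Raising max.poll.interval.ms is another lever, but keeping the poll loop spinning means the consumer can still respond promptly to rebalances.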


Preparing a Fetch Operation in a Kafka Consumer

Danica Fine continues a series on Kafka internals:

Welcome back to the third installment of our blog series where we’re diving into the beautiful black box that is Apache Kafka® to better understand how we interact with the cluster through producer and consumer clients.

Earlier in the series, we took a look at the Kafka producer to see how the client works before following a produce request as it’s processed by the cluster.

In this post, we’ll switch our attention to Kafka Consumer clients to see how consumers interact with the brokers, coordinate their partitions, and send requests to read data from your Kafka topics.

Read on to see what it takes for a consumer to operate in Apache Kafka.
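
To make the coordination piece concrete, here is a minimal Java sketch that subscribes with a rebalance listener, so you can watch the group coordinator hand partition ownership to the consumer. The topic and group names are made up for the example.

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class GroupCoordinationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "readers"); // members of this group split the topic's partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribing (rather than assigning partitions manually) hands partition
            // coordination to the group coordinator; the listener shows ownership changes.
            consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    System.out.println("Giving up: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Now owning: " + partitions);
                }
            });
            consumer.poll(Duration.ofSeconds(1)); // the first poll joins the group and triggers assignment
            System.out.println("Assigned partitions: " + consumer.assignment());
        }
    }
}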


Working with the Apache Flink Table API

Martijn Visser takes us through the Flink Table API:

Apache Flink® offers a variety of APIs that provide users with significant flexibility in processing data streams. Among these, the Table API stands out as one of the most popular options. Its user-friendly design allows developers to express complex data processing logic in a clear and declarative manner, making it particularly appealing for those who want to efficiently manipulate data without getting bogged down in intricate implementation details.

At this year’s Current, we introduced support for the Flink Table API in Confluent Cloud for Apache Flink® to enable customers to use Java and Python for their stream processing workloads. The Flink Table API is also supported in Confluent Platform for Apache Flink®, which launched in limited availability and supports all Flink APIs out of the box.

This introduction highlights its capabilities, how it integrates with other Flink APIs, and provides practical examples to help you get started. Whether you are working with real-time data streams or static datasets, the Table API simplifies your workflow while maintaining high performance and flexibility. If you want to go deeper into the details of how Table API works, we encourage you to check out our Table API developer course.

Read on to learn more information about how the Table API works in comparison to other interfaces.
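
For a taste of that declarative style, here is a small, self-contained Java Table API example that aggregates a handful of in-memory rows. It is a generic Flink sketch, not something specific to Confluent Cloud, and the table and column names are invented for the illustration.

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.row;

public class TableApiSketch {
    public static void main(String[] args) {
        // Streaming-mode table environment; the same program also works in batch mode.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // A tiny in-memory table standing in for a real Kafka-backed source.
        Table orders = tEnv.fromValues(
            DataTypes.ROW(
                DataTypes.FIELD("customer", DataTypes.STRING()),
                DataTypes.FIELD("amount", DataTypes.DOUBLE())),
            row("alice", 12.50),
            row("bob", 7.25),
            row("alice", 3.00));

        // Declarative aggregation: total spend per customer.
        Table totals = orders
            .groupBy($("customer"))
            .select($("customer"), $("amount").sum().as("total"));

        totals.execute().print();
    }
}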


Tracking Airport Traffic with Flink, Kafka, and NiFi

Tim Spann builds an app:

The above link utilizes the standard REST link and enhances it by setting the beginning date using NiFi’s Expression Language to get the current time in UNIX format in seconds. In this example, I am looking at the last week of data for the airport departures and arrivals in the second URL.

We iterate through a list of the largest airports in the United States, pulling both departures and arrivals since they use the same format.

Read the article to learn more about how you can tie it all together. You can also check out Tim’s GitHub repo to grab the code.
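
If you’re curious what that timestamp handling looks like outside of NiFi, here is a tiny Java equivalent of the “last week, in UNIX seconds” calculation. The URL shape and airport code are hypothetical stand-ins; the real endpoints and airport list are in Tim’s article and repo.

import java.time.Duration;
import java.time.Instant;

public class DepartureWindowSketch {
    public static void main(String[] args) {
        // Compute "now" and "one week ago" as UNIX timestamps in seconds,
        // which is what the NiFi Expression Language does in the flow.
        long end = Instant.now().getEpochSecond();
        long begin = Instant.now().minus(Duration.ofDays(7)).getEpochSecond();

        // Hypothetical REST call shape; the real endpoint comes from the article.
        String url = String.format(
            "https://api.example.com/flights/departure?airport=%s&begin=%d&end=%d",
            "KATL", begin, end);
        System.out.println(url);
    }
}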
