Press "Enter" to skip to content

Category: Streaming

Optimizing Async Sinks in Flink

Hong Liang Teoh speeds things up:

When designing a Flink data processing job, one of the key concerns is maximising job throughput. Sink throughput is a crucial factor because it can determine the entire job’s throughput. We generally want the highest possible write rate in the sink without overloading the destination. However, since the factors impacting a destination’s performance are variable over the job’s lifetime, the sink needs to adjust its write rate dynamically. Depending on the sink’s destination, it helps to tune the write rate using a different RateLimitingStrategy.

This post explains how you can optimise sink throughput by configuring a custom RateLimitingStrategy on a connector that builds on the AsyncSinkBase (FLIP-171). In the sections below, we cover the design logic behind the AsyncSinkBase and the RateLimitingStrategy, then we take you through two example implementations of rate limiting strategies, specifically the CongestionControlRateLimitingStrategy and TokenBucketRateLimitingStrategy.

Read on for some tips on creating a rate limiting strategy for a sink.
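
To make the strategies concrete, below is a minimal, self-contained Java sketch of the additive-increase/multiplicative-decrease (AIMD) logic that a congestion-control style strategy applies. It deliberately does not implement Flink’s actual RateLimitingStrategy interface; the class name, constants, and method names are illustrative assumptions.

```java
// Illustrative AIMD (additive-increase / multiplicative-decrease) rate
// limiter, the core idea behind a congestion-control rate limiting strategy.
// This is NOT Flink's RateLimitingStrategy interface; all names and defaults
// here are assumptions for the sketch.
public class AimdRateLimiter {
    private final int increaseStep;       // additive step after a successful batch
    private final double decreaseFactor;  // multiplicative cut after a throttled batch
    private final int maxRateLimit;       // hard ceiling on in-flight messages
    private int currentRateLimit;         // current allowance of in-flight messages

    public AimdRateLimiter(int initialRateLimit, int increaseStep,
                           double decreaseFactor, int maxRateLimit) {
        this.currentRateLimit = initialRateLimit;
        this.increaseStep = increaseStep;
        this.decreaseFactor = decreaseFactor;
        this.maxRateLimit = maxRateLimit;
    }

    // Destination accepted the batch: probe for more throughput.
    public void registerSuccess() {
        currentRateLimit = Math.min(currentRateLimit + increaseStep, maxRateLimit);
    }

    // Destination throttled or failed the batch: back off sharply.
    public void registerFailure() {
        currentRateLimit = Math.max(1, (int) (currentRateLimit * decreaseFactor));
    }

    // The sink consults this before dispatching another batch.
    public boolean shouldBlock(int inFlightMessages) {
        return inFlightMessages >= currentRateLimit;
    }
}
```

A token-bucket strategy would instead refill a fixed allowance at a constant rate, which suits destinations with a known, hard quota rather than one the sink must discover by probing.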

Motion Detecting and Alerting with Kafka and ksqlDB

Wei Rui and Yinsidi Jiao take us through a scenario:

Managing IoT (Internet of Things) devices and the data or events they produce can be a challenge. On one hand, IoT devices usually generate massive amounts of data. On the other hand, IoT hardware has many limitations on processing that data, such as cost, physical size, efficiency, and availability. You need a back-end system with high scalability and availability to process the growing volume of data. Things become more challenging when dealing with numerous devices and events in real time, and when considering the availability, latency, scalability, and agility required for different usages and scenarios.

For Confluent Hackathon 2022, we built an end-to-end motion detection and alerting system, which currently acts as a home surveillance system, on top of Apache Kafka® and ksqlDB to demonstrate how easy it is to build IoT solutions by leveraging Confluent Cloud.

Read on to see how it works.

Apache Flink 1.16 Released

Godfrey He makes an announcement:

To reduce the cost of migrating Hive to Flink, we introduce HiveServer2 Endpoint and Hive Syntax Improvements in this version:

The HiveServer2 Endpoint allows users to interact with the SQL Gateway via Hive JDBC/Beeline and brings Flink into the Hive ecosystem (DBeaver, Apache Superset, Apache DolphinScheduler, and Apache Zeppelin). When users connect to the HiveServer2 endpoint, the SQL Gateway registers the Hive Catalog, switches to the Hive dialect, and uses batch execution mode to execute jobs. With these steps, users can have the same experience as with HiveServer2.

Read on for a pretty large hit list.
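
To give a feel for that experience, the sketch below connects to the gateway’s HiveServer2 endpoint with nothing more than the standard Hive JDBC driver. The host, port, and table name are assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Querying Flink's SQL Gateway through its HiveServer2 endpoint using the
// plain Hive JDBC driver (hive-jdbc on the classpath). The host, the port
// (10000 is HiveServer2's conventional default), and the table name are
// illustrative assumptions.
public class HiveEndpointExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM orders LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```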

Debugging Stream Table Joins

Philip Schmitt dives into a problem:

Joining two topics to aggregate the data is one of the fundamental operations in stream processing. But that’s not to say that it’s simple. Let me show you what can go wrong! This article chronicles my journey to join two Apache Kafka topics—stumbling into and out of various pitfalls. I’m going to show you…

– How to debug co-partitioning with kcat (formerly kafkacat)

– How to avoid the number one pitfall of using kcat

– Stream–table join semantics in action

There’s a lot of useful information in this post.
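
For context, the happy path being debugged looks roughly like the Kafka Streams sketch below. Topic names and the String value types are assumptions, and the join quietly depends on both topics being co-partitioned: written with the same key and the same number of partitions.

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

// Rough shape of a stream–table join in Kafka Streams. Topic names and value
// types are illustrative assumptions; serdes come from the streams config.
public class StreamTableJoinSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> orders = builder.stream("orders");
        KTable<String, String> customers = builder.table("customers");

        // Each order is enriched with the latest customer record for its key.
        // This only behaves correctly if "orders" and "customers" are
        // co-partitioned (same key, same partition count).
        KStream<String, String> enriched =
                orders.join(customers, (order, customer) -> order + " | " + customer);

        enriched.to("orders-enriched");
        // builder.build() would then be handed to new KafkaStreams(topology, props).
    }
}
```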

Designing Event Streams for Kafka

Dave Shook announces a new course:

Properly designing your events and event streams is essential for any event-driven architecture. Precisely how you design and implement them will significantly affect not only what you can do today, but what you can do tomorrow. For such a critical part of any data infrastructure, most event streaming tutorials gloss over event design.

In the new course on Confluent Developer, events and event streams are put front and center. We’re going to look at the dimensions of event and event stream design and how to apply them to real-world problems. But dimensions and theory are nothing without best practices, so we are also going to take a look at these to help keep you clear of pitfalls and set you up for success. This course also includes hands-on exercises, during which you will work through use cases related to the different dimensions of event design and event streaming.

Click through to learn more about what’s in the course and to check it out; it is free, after all.

Thoughts on Current 2022

Markos Sfikas, et al., recap the Current 2022 conference:

Throughout the conference, the theme of Batch vs. Streaming was apparent. Discussions covered how they can be unified, how batch processing’s performance can / must be improved for real-time applications, and more. There was even a dedicated panel discussion with Adi Polak, Amy Chen, Eric Sammer, and Tyler Akidau discussing the state of streaming adoption today and debating whether streaming will ever fully replace batch. You can view some interesting points from the panel discussion in the Twitter thread from Robin Moffatt.

Click through for the full recap.

Streaming Datasets in Power BI

Reza Rad needs data in real time:

Datasets in Power BI can have connection types such as Import, DirectQuery or Live Connection. However, there is also one specific type of dataset which is different. This type of dataset is called Streaming Dataset. A streaming dataset is for a real-time dashboard and comes with various setups and configurations. In this video and article, we’ll talk about this type of dataset.

Reza includes a video as well as a very helpful walkthrough.
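
For a sense of the mechanics: a push-enabled streaming dataset exposes a push URL (with an embedded key), and any HTTP client can post rows to it. A hedged Java sketch follows; the URL placeholders and field names are stand-ins you would take from your own dataset’s API info page.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Pushing one row into a Power BI streaming dataset over its push URL. The
// URL (including the key) comes from the dataset's API info in Power BI;
// the placeholders and the field names below are illustrative assumptions.
public class PowerBiPushExample {
    public static void main(String[] args) throws Exception {
        String pushUrl = "https://api.powerbi.com/beta/WORKSPACE_ID/datasets/DATASET_ID/rows?key=YOUR_KEY";
        String row = "[{\"sensorId\": \"device-01\", \"temperature\": 21.5, \"readingTime\": \"2022-11-01T12:00:00Z\"}]";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(pushUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(row))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode()); // 200 indicates the row was accepted
    }
}
```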

Real-Time Streaming ETL with Kafka and Debezium

Dursun Koc doesn’t have time for batched ETL:

Debezium does not extract data using SQL. It uses database log files to track changes in the database, so it has minimal effect on the source system. For more information about Debezium, please visit their website.

After the data is extracted, we need Kafka Connect to stream it into Apache Kafka so we can work with it and reshape it as required. We will use ksqlDB to reshape the raw data into the form the target system requires. Let’s consider a simple ordering system database in which we have a customer table, a product table, and an orders table, as shown below.

Read on for an overview as well as a link to the GitHub repo where you can try this all out.
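
Once the Debezium connector is writing change events to Kafka, everything downstream is an ordinary consumer. A minimal Java sketch of reading the raw change stream, assuming a Debezium-style topic name such as server1.inventory.orders and a local broker:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Reading Debezium change events from Kafka as a plain consumer. The topic
// follows Debezium's <server>.<schema>.<table> naming convention; the name
// and broker address below are illustrative assumptions.
public class ChangeEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-etl");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("server1.inventory.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a JSON change event carrying "before" and
                    // "after" row images plus source metadata.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```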

Apache Flink Updates

Danny Cranmer announces Flink 1.15.2:

The Apache Flink Community is pleased to announce the second bug fix release of the Flink 1.15 series.

This release includes 30 bug fixes, vulnerability fixes, and minor improvements for Flink 1.15. Below you will find a list of all bug fixes and improvements (excluding improvements to the build infrastructure and build stability). For a complete list of all changes, see JIRA.

We highly recommend all users upgrade to Flink 1.15.2.

In addition to that, Jingsong Lee announces Flink Table Store 0.2.0:

Flink Table Store is data lake storage for ingesting streaming changelogs (updates and deletes) and serving high-performance queries in real time.

As a new type of updatable data lake, Flink Table Store has the following features:

– High-throughput data ingestion while offering good query performance.

– High-performance queries with primary key filters, as fast as 100 ms.

– Streaming reads are available on lake storage, which can also be integrated with Kafka to provide second-level streaming reads.

Read on for the changes in both platforms.

Watermarking in Spark Structured Streaming

Max Fisher takes us through an important feature for Spark streaming:

When building real-time pipelines, one of the realities that teams have to work with is that distributed data ingestion is inherently unordered. Additionally, in the context of stateful streaming operations, teams need to be able to properly track event time progress in the stream of data they are ingesting for the proper calculation of time-window aggregations and other stateful operations. We can solve for all of this using Structured Streaming.

For example, let’s say we are a team working on building a pipeline to help our company do proactive maintenance on our mining machines that we lease to our customers. These machines always need to be running in top condition so we monitor them in real-time. We will need to perform stateful aggregations on the streaming data to understand and identify problems in the machines.

This is where we need to leverage Structured Streaming and Watermarking to produce the necessary stateful aggregations that will help inform decisions around predictive maintenance and more for these machines.

Read on to see how watermarking works in various scenarios, including when you join together streams.
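
The core of the pattern is a watermark plus a windowed aggregation. A compact Java sketch under stated assumptions: the rate source stands in for real machine telemetry, and the column names and 10-minute lateness threshold are illustrative.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

// Watermarked, windowed aggregation in Spark Structured Streaming. The rate
// source, column names, and thresholds are illustrative assumptions.
public class WatermarkExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("watermark-example")
                .getOrCreate();

        // Stand-in source; a real pipeline would read from Kafka or files.
        Dataset<Row> readings = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", "10")
                .load()
                .withColumnRenamed("timestamp", "eventTime");

        // Events arriving more than 10 minutes behind the max observed
        // eventTime are dropped, letting Spark expire old window state.
        Dataset<Row> counts = readings
                .withWatermark("eventTime", "10 minutes")
                .groupBy(window(col("eventTime"), "5 minutes"))
                .count();

        StreamingQuery query = counts.writeStream()
                .outputMode("append")
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```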
