Press "Enter" to skip to content

Category: Streaming

Visualizing Kafka Stream Lineage

David Araujo and Julia Peng show off stream lineage in Confluent Cloud:

Stream Lineage is a tool Confluent built to address the lack of data visibility in Kafka and event-driven architectures. Confluent’s Stream Lineage provides an interactive map of all your data flows that enables users to:

1. Understand what data flows are running now or at any point in the past

2. Trace where each data flow originated from

3. Track how data is transformed along its journey

4. Observe where each data flow ends up

Read on to see how it works.


Automating Parallelism Decisions in Flink Batch Jobs

Lijie Wang and Zhu Zhu describe Apache Flink’s batch scheduler:

Deciding the proper parallelism of operators is not easy work for many users. For batch jobs, a small parallelism may result in long execution times and large failover regressions, while an unnecessarily large parallelism may result in wasted resources and more overhead in task deployment and network shuffling.

To decide a proper parallelism, one needs to know how much data each operator needs to process. However, it can be hard to predict the data volume a job will process because it can differ from day to day, and it can be harder or even impossible (due to complex operators or UDFs) to predict the data volume each operator will process.

To solve this problem, we introduced the adaptive batch scheduler in Flink 1.15. The adaptive batch scheduler can automatically decide the parallelism of an operator according to the size of the datasets it consumes.

Read on to see some of the benefits of using the adaptive batch scheduler, as well as some of the decision points it uses along the way.
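
If you want to try it, the scheduler is enabled through cluster configuration rather than code. Here is a minimal, hedged sketch: the configuration key names in the comments are as I recall them from the Flink 1.15 documentation (verify them for your version), and the file path is a placeholder.

# Sketch: let Flink's adaptive batch scheduler pick operator parallelism.
# Assumes a Flink 1.15+ cluster whose flink-conf.yaml contains roughly:
#   jobmanager.scheduler: AdaptiveBatch
#   jobmanager.adaptive-batch-scheduler.max-parallelism: 128
#   jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task: 1g
#   parallelism.default: -1      # -1 means "let the scheduler decide"
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# No explicit parallelism is set anywhere in the job, so each operator's
# parallelism is derived from the size of the datasets it actually consumes.
t_env.execute_sql("""
    CREATE TABLE orders (
        id     BIGINT,
        amount DOUBLE
    ) WITH (
        'connector' = 'filesystem',
        'path' = 'file:///data/orders',   -- placeholder path
        'format' = 'csv'
    )
""")

t_env.execute_sql(
    "SELECT COUNT(*) AS order_count, SUM(amount) AS total FROM orders"
).print()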


Request-Response and CQRS in Kafka

Kai Waehner compares two message exchange patterns:

How can I do request-response communication with Apache Kafka? That’s one of the most common questions I get regularly. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

Read on to learn more.
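
Kafka itself has no request-response primitive, so the pattern is typically emulated with a request topic, a reply topic, and a correlation ID carried in message headers. Below is a minimal sketch of that shape using the confluent-kafka Python client; the broker address, topic names, and the one-consumer-group-per-request shortcut are illustrative assumptions, not anything prescribed by the post.

# Minimal request-reply emulation over Kafka: send to a request topic with a
# correlation ID header, then poll a reply topic until a message with the same
# correlation ID shows up. Broker address and topic names are placeholders.
import time
import uuid
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"  # assumption: local broker


def send_request(payload: bytes) -> str:
    corr_id = str(uuid.uuid4())
    producer = Producer({"bootstrap.servers": BOOTSTRAP})
    producer.produce(
        "requests",
        value=payload,
        headers={"correlation_id": corr_id, "reply_topic": "replies"},
    )
    producer.flush()
    return corr_id


def await_reply(corr_id: str, timeout_s: float = 30.0) -> bytes | None:
    # One consumer group per request keeps the sketch simple; a real service
    # would reuse a consumer and dispatch replies to waiting callers.
    consumer = Consumer({
        "bootstrap.servers": BOOTSTRAP,
        "group.id": f"reply-waiter-{corr_id}",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["replies"])
    deadline = time.monotonic() + timeout_s
    try:
        while time.monotonic() < deadline:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            headers = dict(msg.headers() or [])
            if headers.get("correlation_id", b"").decode() == corr_id:
                return msg.value()
    finally:
        consumer.close()
    return None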


Ingesting Event Hub Telemetry Data with PySpark Streaming

Charles Chukwudozie shows how to read from Event Hubs in Databricks with Python:

Ingesting, storing, and processing millions of telemetry events from a plethora of remote IoT devices and sensors has become commonplace. One of the primary cloud services used to process streaming telemetry events at scale is Azure Event Hubs.

Most documented implementations of Azure Databricks ingestion from Azure Event Hubs data are based on Scala.

So, in this post, I outline how to use PySpark on Azure Databricks to ingest and process telemetry data from an Azure Event Hub instance configured without Event Capture.

Click through for the process.
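
For a feel of the end result, here is a compressed sketch of the usual PySpark shape with the azure-event-hubs-spark connector. The secret scope, event schema, and storage paths are placeholders, and spark, sc, and dbutils are the globals Databricks notebooks provide; Charles's post walks through the full setup, including installing the connector library.

# Read telemetry from an Event Hub with Structured Streaming and land it in Delta.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Placeholder secret scope/key holding the Event Hub connection string.
conn_str = dbutils.secrets.get("telemetry-scope", "eventhub-connection-string")
eh_conf = {
    # The connector expects the connection string to be encrypted with its helper.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Event Hub payloads arrive in a binary "body" column; cast and parse them.
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])
telemetry = (raw
             .select(from_json(col("body").cast("string"), schema).alias("t"))
             .select("t.*"))

(telemetry.writeStream
 .format("delta")
 .option("checkpointLocation", "/mnt/checkpoints/telemetry")  # placeholder
 .outputMode("append")
 .start("/mnt/delta/telemetry"))                              # placeholder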


A Transaction Log in Apache Flink

Roman Khachatryan and Yuan Mei deal with transaction log issues:

State backends don’t start any snapshotting work until the task receives at least one checkpoint barrier, increasing the effective checkpoint duration. This is suboptimal if the upload time is comparable to the checkpoint interval; instead, a snapshot could be uploaded continuously throughout the interval.

This work discusses the mechanism introduced in Flink 1.15 to address the above cases by continuously persisting state changes on non-volatile storage while performing materialization in the background. The basic idea is described in the following section, and then important implementation details are highlighted. Subsequent sections discuss benchmarking results, limitations, and future work.

Read on to see what they did.
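
If you want to experiment with it, the mechanism ships as the "changelog state backend" and is switched on through configuration. A minimal sketch follows; the key names in the comments are from the Flink 1.15 documentation as I recall them, and the storage path is a placeholder.

# Sketch: run a job against a cluster that wraps its state backend with the
# changelog, so state changes are continuously persisted to durable storage
# while materialization happens in the background. Assumed flink-conf.yaml keys
# (verify against your Flink version):
#   state.backend.changelog.enabled: true
#   state.backend.changelog.storage: filesystem
#   dstl.dfs.base-path: s3://my-bucket/changelog    # placeholder path
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# The job itself only needs ordinary checkpointing; with the changelog enabled,
# most of the upload work has already happened by the time a barrier arrives.
env.enable_checkpointing(10_000)  # checkpoint every 10 seconds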


Low-Latency Flink

Jun Qin and Nico Kruber have started a series on low-latency streaming in Apache Flink. The first two posts of the series are up, starting with the overview:

Latency can refer to different things. LatencyMarkers in Flink measure the time it takes for the markers to travel from each source operator to each downstream operator. As LatencyMarkers bypass user functions in operators, the measured latencies do not reflect the entire end-to-end latency but only a part of it. Flink also supports tracking the state access latency, which measures the response latency when state is read/written. One can also manually measure the time taken by some operators, or get this data with profilers. However, what users usually care about is the end-to-end latency, including the time spent in user-defined functions, in the stream processing framework, and when state is accessed. End-to-end latency is what we will focus on.

Part 2 discusses direct latency optimization techniques:

When interacting with external systems (e.g., RDBMS, object stores, web services) in a Flink job for data enrichment, the latency in getting responses from external systems often dominates the overall latency of the job. With Flink’s Async I/O API (e.g., AsyncDataStream.unorderedWait() or AsyncDataStream.orderedWait()), a single parallel function instance can handle many requests concurrently and receive responses asynchronously. This reduces latencies because the waiting time for responses is amortized over multiple requests.

Stay tuned for more posts in the series.
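
Flink's Async I/O operator lives in the Java/Scala DataStream API (PyFlink does not expose it), so the snippet below is only a conceptual illustration of the effect described above: when lookups run concurrently, their waiting times overlap instead of adding up.

# Ten simulated external lookups: issued one at a time they cost roughly the sum
# of their latencies; issued concurrently they cost roughly the slowest one.
import asyncio
import random
import time


async def enrich(record: str) -> str:
    await asyncio.sleep(random.uniform(0.05, 0.15))  # simulated round trip
    return f"{record}:enriched"


async def main() -> None:
    records = [f"r{i}" for i in range(10)]

    start = time.perf_counter()
    for r in records:                                     # one request at a time
        await enrich(r)
    print(f"sequential: {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    await asyncio.gather(*(enrich(r) for r in records))   # overlapped waits
    print(f"concurrent: {time.perf_counter() - start:.2f}s")


asyncio.run(main())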


Apache Flink Table Store

Jingsong Lee and Jiangjie Qin have an announcement:

As of now it is quite common that people deploy a few storage systems to work with Flink for different purposes. A typical setup is a message queue for stream processing, a scannable file system / object store for batch processing and ad-hoc queries, and a K-V store for lookups. Such an architecture poses challenges in data quality and system maintenance due to its complexity and heterogeneity. This is becoming a major issue that hurts the end-to-end user experience of the streaming and batch unification brought by Apache Flink.

The goal of Flink Table Store is to address the above issues. This is an important step for the project: it extends Flink’s capability from the computing domain to the storage domain, so we can provide a better end-to-end experience to users.

Click through to see how table storage works.
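
As a rough illustration of the usage model, here is a hedged Flink SQL sketch run through PyFlink, in the style of the early Table Store quickstarts; the catalog options and warehouse path are assumptions that may well differ between Table Store versions.

# Sketch: one Table Store table that a streaming job writes to and batch or
# ad-hoc sessions read from. Option names follow the early quickstart docs and
# may have changed; the warehouse path is a placeholder.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE CATALOG ts_catalog WITH (
        'type' = 'table-store',
        'warehouse' = 'file:///tmp/table_store'
    )
""")
t_env.execute_sql("USE CATALOG ts_catalog")

# The same table definition serves both the streaming write path and batch reads.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS word_counts (
        word STRING,
        cnt  BIGINT,
        PRIMARY KEY (word) NOT ENFORCED
    )
""")

# A streaming job keeps it up to date (INSERT INTO word_counts SELECT ...),
# while a batch session can simply SELECT * FROM word_counts for ad-hoc queries.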


Flink 1.15 Released

Joe Moser and Yun Gao announce Apache Flink 1.15:

Thanks to our well-organized and open community, Apache Flink continues to grow as a technology and remain one of the most active projects in the Apache community. With the release of Flink 1.15, we are proud to announce a number of exciting changes.

One of the main concepts that makes Apache Flink stand out is the unification of batch (aka bounded) and stream (aka unbounded) data processing, which helps reduce the complexity of development. A lot of effort went into this unification in the previous releases, and you can expect more effort in this direction.

Apache Flink is not only growing in terms of contributions and users, but is also growing beyond its original use cases. We are seeing a trend towards more business/analytics use cases implemented in low-code/no-code. Flink SQL is the feature in the Flink ecosystem that enables such use cases, and this is why its popularity continues to grow.

Flink SQL is Feasel’s Law in action.


Streaming Data into Synapse Dedicated SQL Pool

Lionel Penuchot loads some data:

This article reviews a common pattern of streaming data (i.e., real-time message ingestion) into a Synapse dedicated SQL pool. It opens a discussion on the simple, standard way to implement this, as well as the challenges and drawbacks. It then presents an alternate solution that enables optimal performance and greatly reduces maintenance tasks when using clustered columnstore indexes. This is aimed at developers, DBAs, architects, and anyone who works with streams of data that are captured in real time.

I’d probably avoid the MERGE statement in there because of how many problems there are with it. That said, this is a useful pattern for trickle-loading columnstore tables.
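
To make the shape of such a pattern concrete, here is a generic, hedged sketch (not necessarily Lionel's exact design): land each micro-batch in a staging table, then append to the columnstore target with a single set-based INSERT...SELECT rather than MERGE. The server, credentials, and table names are all placeholders.

# Trickle-load sketch: micro-batches go into a staging table first, then are
# appended to the clustered columnstore table in one set-based statement.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;"   # placeholder server
    "DATABASE=sqlpool01;UID=loader;PWD=secret"   # placeholder credentials
)
cur = conn.cursor()
cur.fast_executemany = True


def load_batch(rows):
    """rows: iterable of (device_id, reading, event_time) tuples."""
    # 1) Land the micro-batch in a staging table.
    cur.execute("TRUNCATE TABLE stg.telemetry;")
    cur.executemany(
        "INSERT INTO stg.telemetry (device_id, reading, event_time) VALUES (?, ?, ?);",
        list(rows),
    )
    # 2) Append to the columnstore target in bulk, avoiding MERGE entirely.
    cur.execute("""
        INSERT INTO dbo.telemetry (device_id, reading, event_time)
        SELECT device_id, reading, event_time FROM stg.telemetry;
    """)
    conn.commit()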


From Confluent Cloud into Azure Synapse Analytics

Jacob Bogie and Dustin Vannoy show how to integrate Kafka in Confluent Cloud with pools in Azure Synapse Analytics:

Just released this fall is the fully managed Synapse Connector. Azure Synapse Analytics provides a platform for data analysts and data scientists to analyze and combine data from multiple sources. Within Confluent Cloud, data can be synced to dedicated SQL pools via the fully managed Synapse sink connector and attached to a Synapse Analytics workspace. Once added to the workspace, analysts can perform advanced analytics and reporting on data in the Confluent pipeline. The ability to access event-level data enables event-level analytics and data exploration.

Click through for two examples, one of loading data into a dedicated SQL pool and one of streaming data into Spark Streaming running on (naturally) a Spark pool.
