Category: Streaming

Watermarking in Spark Structured Streaming

Published 2022-08-26 by Kevin Feasel

Max Fisher takes us through an important feature for Spark streaming:

When building real-time pipelines, one of the realities that teams have to work with is that distributed data ingestion is inherently unordered. Additionally, in the context of stateful streaming operations, teams need to be able to properly track event time progress in the stream of data they are ingesting for the proper calculation of time-window aggregations and other stateful operations. We can solve for all of this using Structured Streaming.
For example, let’s say we are a team working on building a pipeline to help our company do proactive maintenance on our mining machines that we lease to our customers. These machines always need to be running in top condition so we monitor them in real-time. We will need to perform stateful aggregations on the streaming data to understand and identify problems in the machines.
This is where we need to leverage Structured Streaming and Watermarking to produce the necessary stateful aggregations that will help inform decisions around predictive maintenance and more for these machines.

Read on to see how watermarking works in various scenarios, including when you join together streams.

Comments closed

Securing Kafka Streams

Published 2022-08-24 by Kevin Feasel

Amani Newton gives us a primer on Apache Kafka security:

The largest companies in the world use Apache Kafka^® for their real-time streaming data pipelines and applications. Kafka is the basis for the real-time fraud text alerts from your bank and the network-connected medical devices used in your local hospital. Securing customer or patient data as it flows through the Kafka system is crucial. However, out of the box, Kafka has relatively little security enabled. This blog post previews the free Confluent Developer course that teaches the basics of securing your Apache Kafka-based system.

Click through for the overview.

Comments closed

From Kafka to Azure Data Explorer with Protobuf Data

Published 2022-08-16 by Kevin Feasel

Anshul Sharma and Ramachandran G do a bit of converting:

Kafka is increasingly become a popular choice of scalable message queueing for large data processing workloads. This makes it very popular in IoT based ecosystem where there is large ingress in data before data processing (or) data storage. Azure Data Explorer is a very powerful time series and analytics database that suits IoT scale data ingestion and data querying.
Kafka supports ingestion of data in multiple formats including JSON, Avro, Protobuf and String. ADX supports ingestion of data from Kafka into ADX in all these formats. Due to excellent schema support, extensibility to various platforms and compression, [protobuf](https://developers.google.com/protocol-buffers) is increasingly becoming a data exchange choice in IoT based systems. The ADX Kafka sink connector leverages the Kafka Connect framework and provides an adapter to ingest data from Kafka in all these formats.
The following section aims to provide configuration to support ingestion of protobuf data from Kafka to ADX.

Click through for the high-level architecture and a deeper dive into the process.

Comments closed

Visualizing Kafka Stream Lineage

Published 2022-07-07 by Kevin Feasel

David Araujo and Julia Peng show off stream lineage in Confluent Cloud:

Stream Lineage is a tool Confluent built to address the lack of data visibility in Kafka and event-driven architectures. Confluent’s Stream Lineage provides an interactive map of all your data flows that enable users to:
1. Understand what data flows are running both now or at any point in the past
2. Trace where each data flow originated from
3. Track how data is transformed along its journey
4. Observe where each data flow ends up

Read on to see how it works.

Comments closed

Automating Parallelism Decisions in Flink Batch Jobs

Published 2022-06-17 by Kevin Feasel

Lijie Wang and Zhu Zhu describe Apache Flink’s batch scheduler:

Deciding proper parallelisms of operators is not an easy work for many users. For batch jobs, a small parallelism may result in long execution time and big failover regression. While an unnecessary large parallelism may result in resource waste and more overhead cost in task deployment and network shuffling.
To decide a proper parallelism, one needs to know how much data each operator needs to process. However, It can be hard to predict data volume to be processed by a job because it can be different everyday. And it can be harder or even impossible (due to complex operators or UDFs) to predict data volume to be processed by each operator.
To solve this problem, we introduced the adaptive batch scheduler in Flink 1.15. The adaptive batch scheduler can automatically decide parallelism of an operator according to the size of its consumed datasets.

Read on to see some of the benefits of using the adaptive batch scheduler, as well as some of the decision points it uses along the way.

Comments closed

Request-Response and CQRS in Kafka

Published 2022-06-07 by Kevin Feasel

Kai Waehner compares two message exchange patterns:

How can I do request-response communication with Apache Kafka? That’s one of the most common questions I get regularly. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

Read on to learn more.

Comments closed

Ingesting Event Hub Telemetry Data with PySpark Streaming

Published 2022-06-06 by Kevin Feasel

Charles Chukwudozie shows how to read from Event Hubs in Databricks with Python:

Ingesting, storing, and processing millions of telemetry data from a plethora of remote IoT devices and Sensors has become common place. One of the primary Cloud services used to process streaming telemetry events at scale is Azure Event Hub.
Most documented implementations of Azure Databricks Ingestion from Azure Event Hub Data are based on Scala.
So, in this post, I outline how to use PySpark on Azure Databricks to ingest and process telemetry data from an Azure Event Hub instance configured without Event Capture.

Click through for the process.

Comments closed

A Transaction Log in Apache Flink

Published 2022-06-03 by Kevin Feasel

Roman Khachatryan and Yuan Mei deal with transaction log issues:

State backends don’t start any snapshotting work until the task receives at least one checkpoint barrier, increasing the effective checkpoint duration. This is suboptimal if the upload time is comparable to the checkpoint interval; instead, a snapshot could be uploaded continuously throughout the interval.
This work discusses the mechanism introduced in Flink 1.15 to address the above cases by continuously persisting state changes on non-volatile storage while performing materialization in the background. The basic idea is described in the following section, and then important implementation details are highlighted. Subsequent sections discuss benchmarking results, limitations, and future work.

Read on to see what they did.

Comments closed

Low-Latency Flink

Published 2022-05-24 by Kevin Feasel

Jun Qin and Nico Kruber have started a series on low-latency streaming in Apache Flink. The first two posts of the series are up, starting with the overview:

Latency can refer to different things. LatencyMarkers in Flink measure the time it takes for the markers to travel from each source operator to each downstream operator. As LatencyMarkers bypass user functions in operators, the measured latencies do not reflect the entire end-to-end latency but only a part of it. Flink also supports tracking the state access latency, which measures the response latency when state is read/written. One can also manually measure the time taken by some operators, or get this data with profilers. However, what users usually care about is the end-to-end latency, including the time spent in user-defined functions, in the stream processing framework, and when state is accessed. End-to-end latency is what we will focus on.

Part 2 discusses direct latency optimization techniques:

When interacting with external systems (e.g., RDBMS, object stores, web services) in a Flink job for data enrichment, the latency in getting responses from external systems often dominates the overall latency of the job. With Flink’s Async I/O API (e.g., AsyncDataStream.unorderedWait() or AsyncDataStream.orderedWait()), a single parallel function instance can handle many requests concurrently and receive responses asynchronously. This reduces latencies because the waiting time for responses is amortized over multiple requests.

Stay tuned for more posts in the series.

Comments closed

Apache Flink Table Store

Published 2022-05-19 by Kevin Feasel

Jingsong Lee and Jiangjie Qin have an announcement:

As of now it is quite common that people deploy a few storage systems to work with Flink for different purposes. A typical setup is a message queue for stream processing, a scannable file system / object store for batch processing and ad-hoc queries, and a K-V store for lookups. Such an architecture posts challenge in data quality and system maintenance, due to its complexity and heterogeneity. This is becoming a major issue that hurts the end-to-end user experience of streaming and batch unification brought by Apache Flink.
The goal of Flink table store is to address the above issues. This is an important step of the project. It extends Flink’s capability from computing to the storage domain. So we can provide a better end-to-end experience to the users.

Click through to see how table storage works.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31