Category: Streaming

Sort-Based Blocking Shuffle in Flink

Published 2021-11-05 by Kevin Feasel

Yingjie Cao and Daisy Tsang have a multi-part series on sort-based blocking shuffles in Apache Flink. Part 1 acts as an overview:

The hash-based blocking shuffle has been supported in Flink for a long time. However, compared to the sort-based approach, it can have several weaknesses:
1. Stability: For batch jobs with high parallelism (tens of thousands of subtasks), the hash-based approach opens many files concurrently while writing or reading data, which can give high pressure to the file system (i.e. maintenance of too many file metas, exhaustion of inodes or file descriptors). We have encountered many stability issues when running large-scale batch jobs via the hash-based blocking shuffle.
2. Performance: For large-scale batch jobs, the hash-based approach can produce too many small files: for each data shuffle (or connection), the number of output files is (producer parallelism) * (consumer parallelism) and the average size of each file is (shuffle data size) / (number of files). The random IO caused by writing/reading these fragmented files can influence the shuffle performance a lot, especially on spinning disks. See the benchmark results section for more information.
By introducing the sort-based blocking shuffle implementation, fewer data files will be created and opened, and more sequential reads are done. As a result, better stability and performance can be achieved.

Part 2 provides some design considerations:

As discussed above, the hash-based blocking shuffle would produce too many small files for large-scale batch jobs. Producing fewer files can help to improve both stability and performance. The sort-merge approach has been widely adopted to solve this problem. By first writing to the in-memory buffer and then sorting and spilling the data into a file after the in-memory buffer is full, the number of output files can be reduced, which becomes (total data size) / (in-memory buffer size). Then by merging the produced files together, the number of files can be further reduced and larger data blocks can provide better sequential reads.
Flink’s sort-based blocking shuffle adopts a similar logic. A core difference is that data spilling will always append data to the same file so only one file will be spilled for each output, which means fewer files are produced.

Check it out for a behind-the-scenes look at

Comments closed

Combining Hazelcast and Kibana

Published 2021-11-03 by Kevin Feasel

Nicolas Fraenkel shows off data from Hazelcast in Kibana:

Hazelcast data pipelines work by regularly polling the source. With an HTTP endpoint, that’s straightforward, but with SSE, not so much as SSE relies on subscription. Hence, we need to implement a custom Source and design it around an internal queue to store the changes as they arrive, while polling will dequeue and send them further down the pipeline.

Read on for code and explanation.

Comments closed

A Primer on Kafka Streams

Published 2021-10-29 by Kevin Feasel

Bill Bejeck has an introduction to Kafka Streams:

Kafka Streams is an abstraction over Apache Kafka^® producers and consumers that lets you forget about low-level details and focus on processing your Kafka data. You could of course write your own code to process your data using the vanilla Kafka clients, but the Kafka Streams equivalent will have far fewer lines, because it’s declarative rather than imperative. As a library, Kafka Streams lets you create a standalone application that can be run anywhere that can connect to a Kafka broker, whether that’s a laptop or a hefty cloud server. You just need to provide it with the host and port name of a broker. Combining Kafka Streams with Confluent Cloud grants you even more processing power with very little code investment.

Click through for a description as well as a whole series of embedded videos.

Comments closed

Real-Time Change Detection via Cumulative Sums

Published 2021-10-22 by Kevin Feasel

Nithin Sankar tracks deviations with cumulative sums:

With the advent of Internet of Things (IOT) and the proliferation of connected devices, comes the challenge of monitoring parts for maintenance before they break down. A common approach revolves around getting data from connected devices and performing a statistical test to determine the likelihood of the device failing. While this common approach is robust, it typically involves a significant time investment in exploratory data analysis, feature engineering, training, and testing to build a predictive model. It, therefore, often lacks the agility required to keep up with the monitoring demands of increasingly time-sensitive initiatives.
In this context, the question becomes: how can we ensure a similar degree of rigor, but also improve the timeliness and responsiveness of being able to perform predictive maintenance?

Click through for the process, as well as an example using Azure Stream Analytics and Power BI.

Comments closed

Session Windows in Spark Structured Streaming

Published 2021-10-19 by Kevin Feasel

Jungtaek Lim, et al, announce support for session windows in Spark Structured Streaming:

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals. An input can only be bound to a single window.
Sliding windows are similar to the tumbling windows from the point of being “fixed-sized”, but windows can overlap if the duration of the slide is smaller than the duration of the window, and in this case, an input can be bound to the multiple windows.
Session windows have a different characteristic compared to the previous two types. Session window has a dynamic size of the window length, depending on the inputs. A session window starts with an input and expands itself if the following input has been received within the gap duration. A session window closes when there’s no input received within the gap duration after receiving the latest input. This enables you to group events until there are no new events for a specified time duration (inactivity).

Click through for more details. You could implement session windows when querying existing data using a gaps and islands approach (where you increment the island count when you have a lagged difference greater than the cutoff point), but for streaming scenarios, it’s very handy to have this as a native window type.

Comments closed

Apache Flink 1.14.0 Released

Published 2021-10-04 by Kevin Feasel

Stephan Ewen and Johannes Moser have aa round-up of the latest Apache Flink updates:

The Apache Software Foundation recently released its annual report and Apache Flink once again made it on the list of the top 5 most active projects! This remarkable activity also shows in the new 1.14.0 release. Once again, more than 200 contributors worked on over 1,000 issues. We are proud of how this community is consistently moving the project forward.
This release brings many new features and improvements in areas such as the SQL API, more connector support, checkpointing, and PyFlink. A major area of changes in this release is the integrated streaming & batch experience. We believe that, in practice, unbounded stream processing goes hand-in-hand with bounded- and batch processing tasks, because many use cases require processing historic data from various sources alongside streaming data. Examples are data exploration when developing new applications, bootstrapping state for new applications, training models to be applied in a streaming application, or re-processing data after fixes/upgrades.

Read on for the list of changes.

Comments closed

Where Kafka Connect Fits

Published 2021-09-17 by Kevin Feasel

Shivani Sarthi explains the value of Kafka Connect:

Kafka connect is not just a free, open source component of Apache Kafka. But it also works as a centralised data hub for simple data integration between databases, key-value stores etc. The fundamental components include-
– Connectors
– Tasks
– Workers
– Converters
– Transforms
– Dead letter Queue
Moreover it is a framework to stream data in and out of Apache Kafka. In addition, the confluent platform comes with many built-in connectors,used for streaming data to and from different data sources.

Click through for information on each component.

Comments closed

Data Lakehouse Point-of-Sale Analytics Demo

Published 2021-09-10 by Kevin Feasel

Bryan Smith and Rob Saker share a pattern:

Disruptions in the supply chain – from reduced product supply and diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless omnichannel experiences are driving retailers to rethink how they use data to manage their operations. Prior to the pandemic, 71% of retailers named lack of real-time visibility into inventory as a top obstacle to achieving their omnichannel goals. The pandemic only increased demand for integrated online and in-store experiences, placing even more pressure on retailers to present accurate product availability and manage order changes on the fly. Better access to real-time information is the key to meeting consumer demands in the new normal.
In this blog, we’ll address the need for real-time data in retail, and how to overcome the challenges of moving real-time streaming of point-of-sale data at scale with a data lakehouse.

It’s a cool scenario, at the least.

Comments closed

Change Data Capture with Kafka Connect and Cassandra

Published 2021-08-13 by Kevin Feasel

Paul Brebner picks up where a series left off:

We introduced the Debezium architecture and its use of Kafka Connect and explored how the Debezium Cassandra Connector (on the source side of the CDC pipeline) emits change events to Kafka for different database operations.
In the second part of this blog series, we examine how Kafka sink connectors can use the change data, discover that Debezium also propagates database schema changes (in different ways), and summarize our experiences with the Debezium Cassandra Connector used for customer deployment.

Read on for information on some of the concepts, as well as experiences working with the Debezium Cassandra connector.

Comments closed

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Published 2021-07-28 by Kevin Feasel

Vasyl Sarzhynskyi has some techniques to make rebalancing in Kafka less of a big deal:

Kafka Rebalance happens when a new consumer is either added (joined) into the consumer group or removed (left). It becomes dramatic during application service deployment rollout, as multiple instances restarted at the same time, and rebalance latency significantly increasing. During rebalance, consumers stop processing messages for some period of time, and, as a result, processing of events from a topic happens with some delay. Some business cases could tolerate rebalancing, meanwhile, others require real-time event processing and it’s painful to have delays in more than a few seconds. Here we will try to figure out how to decrease rebalance for Kafka-Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Read on for an example of the problem, as well as several practical tips for mitigating the issue.

Comments closed