Category: Streaming

Optimizing Kafka Streams Apps

Published 2019-05-06 by Kevin Feasel

Bill Bejeck and Guozhang Wang give us an idea of some Kafka Streams internals:

At a high level, when you use the Streams DSL, it auto-creates the processor nodes as well as state stores if needed, and connects them to construct the processor topology. To dig a little deeper, let’s take an example and focus on stateful operators in this section.
An important observation regarding the Streams DSL is that most stateful operations are keyed operations (e.g., joins are based on record keys, and aggregations are based on grouped-by keys), and the computation for each key is independent of all the other keys. These computational patterns fall under the term data parallelism in the distributed computing world. The straightforward way to execute data parallelism at scale is to just partition the incoming data streams by key, and work on each partition independently and in parallel. Kafka Streams leans heavily on this technique in order to achieve scalability in a distributed computing environment.

They then use that info to show you how you can make your Streams apps faster.

Comments closed

Flink: Batch as a Special Case of Streaming

Published 2019-04-23 by Kevin Feasel

Fabian Hueske and Aljoscha Krettek describe streaming versus batch processing in Apache Flink:

The Apache Flink project has followed the philosophy of taking a unified approach to batch and stream data processing, building on the core paradigm of “continuous processing of unbounded data streams” for a long time. If you think about it, carrying out offline processing of bounded data sets naturally fits the paradigm: these are just streams of recorded data that happen to end at some point in time.
Flink is not alone in this: there are other projects in the open source community that embrace “streaming first, with batch as a special case of streaming,” such as Apache Beam; and this philosophy has often been cited as a powerful way to greatly reduce the complexity of data infrastructures by building data applications that generalize across real-time and offline processing.

Check it out. At the end, the authors also describe Blink, a fork of Flink being (slowly) merged back in and which supports this paradigm.

Comments closed

Apache Flink 1.8.0 Released

Published 2019-04-11 by Kevin Feasel

Aljoscha Krettek announces the general availablity of Apache Flink version 1.8.0:

SQL pattern detection with user-defined functions and aggregations: The support of the MATCH_RECOGNIZE clause has been extended by multiple features. The addition of user-defined functions allows for custom logic during pattern detection (FLINK-10597), while adding aggregations allows for more complex CEP definitions, such as the following (FLINK-7599).

There are several really nice changes. I pointed out this one to get people to vote up Itzik Ben-Gan’s feedback item to get row pattern recognition in SQL Server.

Comments closed

Dynamic Routing with Kafka Streams

Published 2019-04-09 by Kevin Feasel

Yeva Byzek explains how you can use Kafka Streams to perform dynamic routing of messages:

A cleaner way is to provide the service with a separate stream that contains only the relevant subset of events that the microservice cares about. To achieve this, a streaming application can branch the original event stream into different substreams using the method KStream#branch(). This results in new Kafka topics, so then the microservice can subscribe to one of the branched streams directly.
For example, in the finance domain, consider a fraud remediation microservice that should process only the subset of events suspected of being fraudulent. As shown below, the original stream of events is branched into two new streams: one for suspicious events and one for validated events. This enables the fraud remediation microservice to process just the stream of suspicious events, without ever seeing the validated events.

Read on to learn more.

Comments closed

Confluent Platform 5.2 Released

Published 2019-04-04 by Kevin Feasel

Mau Barra announces Confluent Platform 5.2:

Confluent Platform 5.2 represents a significant milestone in our efforts across three key dimensions:
1. It allows you to use the entire Confluent Platform free forever in single-broker Kafka clusters, so you are freer than ever to start building new event streaming applications right away. We are also bringing librdkafka 1.0 in order to bring our C/C++, Python, Go and .NET clients closer to parity with the Java client.
2. It adds critical enhancements to Confluent Control Center that will help you meet your event streaming SLAs in distributed Apache Kafka environments at greater scale.
3. With our latest version of Confluent Replicator, you can now seamlessly stream events across on-prem and public cloud deployments.

The top item is quite interesting: a free developer license and not just a 30-day trial.

Comments closed

Kafka Streams: Streams and Tables

Published 2019-04-02 by Kevin Feasel

Neha Bhardwaj explains a couple of the key abstractions in Kafka Streams:

In this blog, we’ll move one step forward to get an understanding of the Dual streaming model to see what abstractions does KSQL use to process the data.
All the data that we are working on with KSQL is produced to Kafka topics by some client. This client can be any Application, Kafka connectors etc., which produces continuous never-ending data to the topics.
KSQL does not directly interact with these topics, it rather introduces a couple of abstractions in between to process the data, which are known as Streams and Tables.

Read on to learn what these are and why it’s useful to think in these terms.

Comments closed

Request: Add Support for Row Pattern Recognition

Published 2019-04-02 by Kevin Feasel

Itzik Ben-Gan would like to see Row Pattern Recognition make it into T-SQL:

The ISO/IEC 9075:2016 standard (aka SQL:2016) introduces support for Row Pattern Recognition (RPR) in SQL. Similar to using regular expressions to identify patterns in a string, RPR allows you to use regular expressions to identify patterns in a sequence of rows.
To me, it’s the next step in the evolution of window functions. If you think that window functions are profound and useful, RPR is really going to bake your noodle.
RPR has limitless practical applications, including identifying patterns in stock market activity, handling time series, fraud detection, material handling, shipping applications, DNA sequencing, gaps and islands, top N per group, and many others.

I’ve voted it up and recommend you do so too. This is a great way to think of streams of data sitting in a database. If you have business use cases where this could help, adding those as comments would be great too.

Comments closed

Flink and Stateful Streaming

Published 2019-03-25 by Kevin Feasel

Himanshu Gupta explains some of the benefits Apache Flink offers for stateful streaming applicatons:

The 2 main types of stream processing done are:
1. Stateless: Where every event is handled completely independent from the preceding events.
2. Stateful: Where a “state” is shared between events and therefore past events can influence the way current events are processed.
Stateless stream processing is easy to scale up because events are processed independently. But Stateful stream processing is difficult to scale up because the “state” needs to be shared across the events.

Himanshu does point out alternatives, but this isn’t a comparison exercise.

Comments closed

Building a Power BI Dashboard on Streaming Data

Published 2019-03-25 by Kevin Feasel

Annie Xu shows us how to build a Power BI dashboard on a streaming data source in Azure:

This post is about something new I have tried last week. The goal was to create simulated streaming data source, feed it into Power BI as a streaming dataset, create a report out of the streaming dataset, and then embed it to an web application. With proper directions provided by my teammates, I finished the implementation from end to end within 1.5 hours. I was super impressed by how awesome it is and how easy it is to implement so that I want to share those directions to you.

The source data is simulated but the process is the same with real data sets.

Comments closed

Using the StreamSets Snowflake Destination

Published 2019-03-19 by Kevin Feasel

Dash Desai shows how you can use StreamSets to write data into SnowflakeDB:

In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically.
To illustrate, let’s take HTTP web server logs generated by Apache web server (for example) as our main source of data. Here’s what a typical log line looks like:
150.47.54.136 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

Click through for the demonstration.

Comments closed