Category: Streaming

Custom Windows in Apache Flink

Alexander Fedulov walks us through window options with Apache Flink:

In the previous articles of the series, we described how you can achieve flexible stream partitioning based on dynamically-updated configurations (a set of fraud-detection rules) and how you can utilize Flink’s Broadcast mechanism to distribute processing configuration at runtime among the relevant operators. 

Following up directly where we left the discussion of the end-to-end solution last time, in this article we will describe how you can use the “Swiss knife” of Flink – the Process Function to create an implementation that is tailor-made to match your streaming business logic requirements. Our discussion will continue in the context of the Fraud Detection engine. We will also demonstrate how you can implement your own custom replacement for time windows for cases where the out-of-the-box windowing available from the DataStream API does not satisfy your requirements. In particular, we will look at the trade-offs that you can make when designing a solution which requires low-latency reactions to individual events.

This article will describe some high-level concepts that can be applied independently, but it is recommended that you review the material in part one and part two of the series as well as check out the code base in order to make it easier to follow along.

It’s worth giving this a careful read.
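
If you want a rough feel for the technique before clicking through, here is a minimal sketch of my own (not the article's fraud-detection code): a KeyedProcessFunction emulating a tumbling window by keeping a per-key running sum in state and registering an event-time timer for the end of the window.

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Illustrative event type; not from the article.
case class Txn(accountId: String, amount: Double, ts: Long)

// Emits (accountId, sum of amounts) once per tumbling window of windowSizeMs.
class TumblingSumFunction(windowSizeMs: Long)
    extends KeyedProcessFunction[String, Txn, (String, Double)] {

  private lazy val sumState: ValueState[java.lang.Double] =
    getRuntimeContext.getState(
      new ValueStateDescriptor[java.lang.Double]("sum", classOf[java.lang.Double]))

  override def processElement(
      txn: Txn,
      ctx: KeyedProcessFunction[String, Txn, (String, Double)]#Context,
      out: Collector[(String, Double)]): Unit = {
    val current = Option(sumState.value()).map(_.doubleValue()).getOrElse(0.0)
    sumState.update(current + txn.amount)

    // Fire when the tumbling window containing this event ends.
    val windowEnd = txn.ts - (txn.ts % windowSizeMs) + windowSizeMs
    ctx.timerService().registerEventTimeTimer(windowEnd)
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, Txn, (String, Double)]#OnTimerContext,
      out: Collector[(String, Double)]): Unit = {
    out.collect((ctx.getCurrentKey, Option(sumState.value()).map(_.doubleValue()).getOrElse(0.0)))
    sumState.clear()
  }
}
```

You would apply it with something like transactions.keyBy(_.accountId).process(new TumblingSumFunction(60 * 60 * 1000)), assuming timestamps and watermarks are assigned upstream. The payoff over the built-in windows is that you can react to individual events, or fire early, however your business logic demands.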

Building an End-to-End Streaming App with Flink SQL

Jark Wu lays down the gauntlet:

Apache Flink 1.11 has released many exciting new features, including many developments in Flink SQL which is evolving at a fast pace. This article takes a closer look at how to quickly build streaming applications with Flink SQL from a practical point of view.

In the following sections, we describe how to integrate Kafka, MySQL, Elasticsearch, and Kibana with Flink SQL to analyze e-commerce user behavior in real-time. All exercises in this blogpost are performed in the Flink SQL CLI, and the entire process uses standard SQL syntax, without a single line of Java/Scala code or IDE installation.

Read on for a demo using only bash and Flink SQL.
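
If you would rather stay in code than the SQL CLI, roughly the same DDL can be issued through the Table API. Here is a hedged sketch of the Kafka-to-Elasticsearch leg; the topic, columns, and hosts are placeholders of mine rather than the article's exact setup.

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

object UserBehaviorJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tEnv = StreamTableEnvironment.create(env)

    // Kafka source table with an event-time watermark.
    tEnv.executeSql(
      """CREATE TABLE user_behavior (
        |  user_id BIGINT,
        |  behavior STRING,
        |  ts TIMESTAMP(3),
        |  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'user_behavior',
        |  'properties.bootstrap.servers' = 'localhost:9092',
        |  'properties.group.id' = 'flink-demo',
        |  'scan.startup.mode' = 'earliest-offset',
        |  'format' = 'json'
        |)""".stripMargin)

    // Elasticsearch sink table that Kibana can chart.
    tEnv.executeSql(
      """CREATE TABLE buy_cnt_per_hour (
        |  hour_of_day BIGINT,
        |  buy_cnt BIGINT
        |) WITH (
        |  'connector' = 'elasticsearch-7',
        |  'hosts' = 'http://localhost:9200',
        |  'index' = 'buy_cnt_per_hour'
        |)""".stripMargin)

    // Hourly purchase counts, continuously maintained.
    tEnv.executeSql(
      """INSERT INTO buy_cnt_per_hour
        |SELECT HOUR(TUMBLE_START(ts, INTERVAL '1' HOUR)), COUNT(*)
        |FROM user_behavior
        |WHERE behavior = 'buy'
        |GROUP BY TUMBLE(ts, INTERVAL '1' HOUR)""".stripMargin)
  }
}
```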

Exactly-Once Semantics in Kafka

Boyang Chen and Bob Barrett look at some changes to exactly-once semantics in Apache Kafka:

When using EOS, the producer and broker both have logic to determine whether it is safe for a producer to continue to send data without violating the exactly-once guarantees. Prior to Kafka 2.5, if either the producer or broker was ever not able to make this determination, the producer would enter a fatal error state. The only way to continue processing was to close the producer and create a new one. This process is generally very disruptive to client applications. For example, if a producer fails in Kafka Streams, then the associated task needs to be migrated, which causes a rebalance of the full workload. This results in throughput drop until the rebalance is complete.

To address this issue, KIP-360 added a mechanism for producers to automatically recover when they encounter these cases and continue processing. To better understand how it works, the following describes some of the situations that can cause fatal errors.

There have been several improvements to the process. Though to be honest, when I hear someone mention exactly-once in a distributed system, it sets off my spidey senses.
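
For a sense of what this looks like from the client side, here is a sketch of the standard transactional producer loop (broker address, topic, and transactional.id are placeholders). The fatal branch is the one the KIP-360 work shrinks: fewer situations now force you to throw the producer away.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.KafkaException
import org.apache.kafka.common.errors.{AuthorizationException, OutOfOrderSequenceException, ProducerFencedException}
import org.apache.kafka.common.serialization.StringSerializer

object EosProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "orders-producer-1")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()

    try {
      producer.beginTransaction()
      producer.send(new ProducerRecord("orders", "order-1", "payload"))
      producer.commitTransaction()
    } catch {
      case _: ProducerFencedException | _: OutOfOrderSequenceException | _: AuthorizationException =>
        // Fatal errors: the producer cannot recover, so close it and create a new one.
        // The KIP-360 changes described above reduce how often you land here.
        producer.close()
      case _: KafkaException =>
        // Non-fatal: abort the transaction and retry with the same producer.
        producer.abortTransaction()
    }
    producer.close()
  }
}
```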

Creating a Power BI Streaming Dataset

Rob Farley takes us through the process of creating and using a Power BI streaming dataset:

Real-time Power BI datasets are a really useful feature, and there’s a good description of them at https://docs.microsoft.com/en-us/power-bi/connect-data/service-real-time-streaming. I thought I’d do a quick walkthrough specifically around the Push side, and show you how it’s done – including the odd gotcha that you might not have noticed.

To create a dataset that you want to push data into, you need to go to the Power BI service, go to your Workspace, and create a Streaming dataset. Even if you’re not wanting to use it with a streaming service, this is the one you need.

Rob has plenty of animated GIFs to walk you through the process, as well as a couple of caveats if you want to play along at home.
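
The push itself is just an HTTP POST of a JSON array of rows to the dataset's push URL (the one shown under the dataset's API Info). A quick sketch, with a placeholder URL and a made-up schema, looks something like this:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object PowerBiPushSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder push URL: copy the real one from your streaming dataset in the Power BI service.
    val pushUrl = "https://api.powerbi.com/beta/WORKSPACE_ID/datasets/DATASET_ID/rows?key=KEY"

    // The JSON fields must match the schema you defined when creating the dataset.
    val rows = """[ { "EventTime": "2020-07-14T10:00:00Z", "Value": 42.0 } ]"""

    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
      .uri(URI.create(pushUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(rows))
      .build()

    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(s"Status: ${response.statusCode()}") // 200 means the rows were accepted
  }
}
```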

Apache Flink 1.11.0 Released

Marta Paes announces Apache Flink version 1.11:

Change Data Capture (CDC) has become a popular pattern to capture committed changes from a database and propagate those changes to downstream consumers, for example to keep multiple datastores in sync and avoid common pitfalls such as dual writes. Being able to easily ingest and interpret these changelogs into the Table API/SQL has been a highly demanded feature in the Flink community — and it’s now possible with Flink 1.11.

Click through for the full list of updates.
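
To make the CDC headliner concrete, here is a small sketch of my own (not from the release notes) of what ingesting a Debezium changelog topic looks like with the new debezium-json format; the topic, columns, and servers are made up.

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

object CdcTableSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tEnv = StreamTableEnvironment.create(env)

    // Interpret a Debezium changelog topic as a continuously updating table.
    tEnv.executeSql(
      """CREATE TABLE products (
        |  id INT,
        |  name STRING,
        |  price DECIMAL(10, 2)
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'mysql.inventory.products',
        |  'properties.bootstrap.servers' = 'localhost:9092',
        |  'properties.group.id' = 'cdc-demo',
        |  'scan.startup.mode' = 'earliest-offset',
        |  'format' = 'debezium-json'
        |)""".stripMargin)

    // From here, products can be joined or aggregated like any other table;
    // inserts, updates, and deletes from the upstream database flow through.
  }
}
```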

FlinkSQL in Cloudera Streaming Analytics

Marton Balassi announces support for FlinkSQL in Cloudera Streaming Analytics:

Our 1.2.0.0 release of Cloudera Streaming Analytics Powered by Apache Flink brings a wide range of new functionality, including support for lineage and metadata tracking via Apache Atlas, support for connecting to Apache Kudu and the first iteration of the much-awaited FlinkSQL API.

Flink’s SQL interface democratizes stream processing, as it caters to a much larger community than the currently widely used Java and Scala APIs focusing on the Data Engineering crowd. Generalizing SQL to stream processing and streaming analytics use cases poses a set of challenges: we have to tackle expressing infinite streams and timeliness of records. 

All is happening as Feasel’s Law foretold.

The Basics of Spark Streaming

Muskan Gupta gives us an introduction to Spark Streaming:

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It was added to Apache Spark in 2013. We can get data from many sources such as Kafka, Flume, etc., and process it using functions such as map and reduce. After processing, we can push data to filesystems, databases, and even live dashboards.

In Spark Streaming we work on near real-time data. It divides the received input stream into batches. The Spark engine processes the batches and generates the final output in batches.

Read on to understand the key mechanisms behind Spark Streaming.
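
If you have not seen the micro-batch model in code, the canonical word count makes it concrete (a socket source and a 5-second batch interval, chosen purely for illustration):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // batch interval of 5 seconds

    // Read lines from a socket, split into words, and count per micro-batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print() // emits one result per micro-batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```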

Using Apache Flink in Zeppelin Notebooks

Jeff Zhang walks us through reviewing data streamed through Apache Flink in an Apache Zeppelin notebook:

In this post, we explained how the redesigned Flink interpreter works in Zeppelin 0.9.0 and provided some examples for performing streaming ETL jobs with Flink and Zeppelin. In the next post, I will talk about how to do streaming data visualization via Flink on Zeppelin. Besides that, you can find an additional tutorial for batch processing with Flink on Zeppelin, as well as using Flink on Zeppelin for more advanced operations like resource isolation, job concurrency & parallelism, multiple Hadoop & Hive environments, and more, in our series of posts on Medium. And here’s a list of Flink on Zeppelin tutorial videos for your reference.

Click through for the demo, and stay tuned for part 2.

Smoothing Out Write Behavior in Apache Flink

Dmitry Tolpeko solves an interesting problem:

It would be nice to smooth S3 write operations between two checkpoints. How to do that?

You may have already noticed there are 3 single PUT operations above, made at 37:02, 37:06 and 37:09, before the checkpoint. The write size gives you a clue: each is a single part of a multipart upload to S3.

Some data sets were quite large, so their data spilled before the checkpoint. Note that this is an internal spill in S3; the data will not be visible until it is committed upon a successful Flink checkpoint.

So how can we force more writes to happen before the checkpoint so we can smooth IOPS and probably reduce the overall checkpoint latency? 

Read on for the answer.
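
I will not spoil the answer, but for context, this is roughly the kind of sink setup under discussion: a StreamingFileSink writing row-encoded files to S3, where the rolling policy decides how large an in-progress part grows before it is handed off, and parts only become visible once a checkpoint commits them. This is my own illustration of the standard API, not the article's code; the bucket and thresholds are made up.

```scala
import java.util.concurrent.TimeUnit
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy

object S3SinkSketch {
  // Row-encoded sink to S3: smaller part sizes and rollover intervals spread the
  // uploads out between checkpoints, but the data still only commits on checkpoint.
  val sink: StreamingFileSink[String] = StreamingFileSink
    .forRowFormat(new Path("s3://my-bucket/events"), new SimpleStringEncoder[String]("UTF-8"))
    .withRollingPolicy(
      DefaultRollingPolicy.builder()
        .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
        .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
        .withMaxPartSize(128L * 1024 * 1024) // 128 MB parts
        .build())
    .build()
}
```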
