Press "Enter" to skip to content

Category: Streaming

Apache Flink 1.12.0 Released

Marta Paes and Aljoscha Krettek announce a new release of Apache Flink:

– The community has added support for efficient batch execution in the DataStream API. This is the next major milestone towards achieving a truly unified runtime for both batch and stream processing.

Kubernetes-based High Availability (HA) was implemented as an alternative to ZooKeeper for highly available production setups.

– The Kafka SQL connector has been extended to work in upsert mode, supported by the ability to handle connector metadata in SQL DDL. Temporal table joins can now also be fully expressed in SQL, no longer depending on the Table API.

– Support for the DataStream API in PyFlink expands its usage to more complex scenarios that require fine-grained control over state and time, and it’s now possible to deploy PyFlink jobs natively on Kubernetes.

Read on for more details on these as well as other changes.

Comments closed

Joining Data Streams in Flink

Kundan Kumarr crosses the streams:

Apache Flink offers rich sources of API and operators which makes Flink application developers productive in terms of dealing with the multiple data streams. Flink provides many multi streams operations like UnionJoin, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window.

Click through for an example of the fluent API approach. It’s not as nice as proper SQL, but it does the job.

Comments closed

Intrusion Detection using ksqldb

Maxime Ribera and Geraud Duge de Beronville take us through an interesting tutorial:

Apache Kafka® is a distributed real-time processing platform that allows for the ingestion of huge volumes of data. ksqlDB is part of the Kafka ecosystem and offers a SQL-like language to query and process large-scale, real-time data. This blog post demonstrates how to quickly process network activity for detection intrusion using both Kafka and ksqlDB.

For testing purposes (and to avoid being banned from the enterprise network), a virtualized environment through Vagrant is used.

Click through for the scenario.

Comments closed

Stream Processing with ksqldb

Michael Drogalis takes us through how stream processing works with ksqldb:

ksqlDB, the event streaming database, is becoming one of the most popular ways to work with Apache Kafka®. Every day, we answer many questions about the project, but here’s a question with an answer that we are always trying to improve: How does ksqlDB work?

The mechanics behind stream processing can be challenging to grasp. The concepts are abstract, and many of them involve motion—two things that are hard for the mind’s eye to visualize. Let’s pop open the hood of ksqlDB to explore its essential concepts, how each works, and how it all relates to Kafka.

Click through for a demo with animations.

Comments closed

The Session Window in Flink

Kundan Kumarr continues a series on windows in Apache Flink:

In the real world, all the work that we do online- Visiting a website, Clicking around the website, do online transactions, and so on are in sessions. We might just go to an e-commerce website like amazon, looking for products, clicking around for a bit, and then stop. All is done within a session. There is a use case where these websites may want to track pages that we visited in a single session. For that, it needs to group all clicks together which are streaming in, based on a session. These streaming use cases can be implemented easily by Flink Session window.

The Session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time. The number of entities within a session window is not fixed. Because it is a user who defines typically how long the session would be. A session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. For example, once we have been idle on the amazon website let say for 1 minute that is the end of the previous session and if go back to the site after 1 sec it will start a new session. The way it would determine the session is the pause between one click and another click.

Click through for a depiction and an example.

Comments closed

The Count Window in Flink

Kundan Kumarr takes us through an example of the count window type in Apache Flink:

In the blog, we learned about Tumbling and Sliding windows which is based on time. In this blog, we are going to learn to define Flink’s windows on other properties i.e Count window. As the name suggests, count window is evaluated when the number of records received, hits the threshold.

Count window set the window size based on how many entities exist within that window. For example, if we fixed the count as 4, every window will have exactly 4 entities. It doesn’t matter whats the size of the window in terms of time. Window size will be different but the number of entities in that window will always be the same. Count windows can have overlapping windows or non-overlapping, both are possible. The count window in Flink is applied to keyed streams means there is already a logical grouping of the stream based on all values associated with a certain key. So the entity count will apply on a per-key basis.

I’m curious if there’s a combination of count + time, triggering when you hit X elements or Y seconds, whichever comes first.

Comments closed

Tumbling and Sliding Windows in Flink

Kundan Kumarr takes us through two different window types in Apache Flink:

In the previous blog, we talked about Flink’s windows operator, a heart of processing infinite streams. Generally in Flink, after specifying that the stream is keyed or non keyed, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. Flink provides some useful predefined window assigners like Tumbling windowsSliding windowsSession windows, Count windows, and Global windows. We can use any of them as per our use case or even we can create custom window assigners in Flink.

In this blog, we will learn about the first two window assigners i.e., Tumbling and sliding windows. These two window assigners, assign elements to windows based on time, which can either be processing time or event time.

Click through for a description of each.

Comments closed

Apache Flink 1.10.2

Zhu Zhu announces Apache Flink 1.10.2:

The Apache Flink community released the second bugfix version of the Apache Flink 1.10 series.

This release includes 73 fixes and minor improvements for Flink 1.10.1. The list below includes a detailed list of all fixes and improvements.

We highly recommend all users to upgrade to Flink 1.10.2.

There are a lot of bugfixes in this release.

Comments closed

A Quick Demo: Kafka to Spark Streaming to Cassandra

Kundan Kumarr walks us through a simple data pipeline:

Spark Structured Streaming is a component of Apache Spark framework that enables scalable, high throughput, fault tolerant processing of data streams.
Apache Kafka is a scalable, high performance, low latency platform that allows reading and writing streams of data like a messaging system.
Apache Cassandra is a distributed and wide-column NoSQL data store.

As I’m reading through this, I enjoyed just how straightforward the whole process was.

Comments closed