Press "Enter" to skip to content

Category: Streaming

The Session Window in Flink

Kundan Kumarr continues a series on windows in Apache Flink:

In the real world, all the work that we do online- Visiting a website, Clicking around the website, do online transactions, and so on are in sessions. We might just go to an e-commerce website like amazon, looking for products, clicking around for a bit, and then stop. All is done within a session. There is a use case where these websites may want to track pages that we visited in a single session. For that, it needs to group all clicks together which are streaming in, based on a session. These streaming use cases can be implemented easily by Flink Session window.

The Session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time. The number of entities within a session window is not fixed. Because it is a user who defines typically how long the session would be. A session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. For example, once we have been idle on the amazon website let say for 1 minute that is the end of the previous session and if go back to the site after 1 sec it will start a new session. The way it would determine the session is the pause between one click and another click.

Click through for a depiction and an example.

Leave a Comment

The Count Window in Flink

Kundan Kumarr takes us through an example of the count window type in Apache Flink:

In the blog, we learned about Tumbling and Sliding windows which is based on time. In this blog, we are going to learn to define Flink’s windows on other properties i.e Count window. As the name suggests, count window is evaluated when the number of records received, hits the threshold.

Count window set the window size based on how many entities exist within that window. For example, if we fixed the count as 4, every window will have exactly 4 entities. It doesn’t matter whats the size of the window in terms of time. Window size will be different but the number of entities in that window will always be the same. Count windows can have overlapping windows or non-overlapping, both are possible. The count window in Flink is applied to keyed streams means there is already a logical grouping of the stream based on all values associated with a certain key. So the entity count will apply on a per-key basis.

I’m curious if there’s a combination of count + time, triggering when you hit X elements or Y seconds, whichever comes first.

Leave a Comment

Tumbling and Sliding Windows in Flink

Kundan Kumarr takes us through two different window types in Apache Flink:

In the previous blog, we talked about Flink’s windows operator, a heart of processing infinite streams. Generally in Flink, after specifying that the stream is keyed or non keyed, the next step is to define a window assigner. The window assigner defines how elements are assigned to windows. Flink provides some useful predefined window assigners like Tumbling windowsSliding windowsSession windows, Count windows, and Global windows. We can use any of them as per our use case or even we can create custom window assigners in Flink.

In this blog, we will learn about the first two window assigners i.e., Tumbling and sliding windows. These two window assigners, assign elements to windows based on time, which can either be processing time or event time.

Click through for a description of each.

Leave a Comment

Apache Flink 1.10.2

Zhu Zhu announces Apache Flink 1.10.2:

The Apache Flink community released the second bugfix version of the Apache Flink 1.10 series.

This release includes 73 fixes and minor improvements for Flink 1.10.1. The list below includes a detailed list of all fixes and improvements.

We highly recommend all users to upgrade to Flink 1.10.2.

There are a lot of bugfixes in this release.

Comments closed

A Quick Demo: Kafka to Spark Streaming to Cassandra

Kundan Kumarr walks us through a simple data pipeline:

Spark Structured Streaming is a component of Apache Spark framework that enables scalable, high throughput, fault tolerant processing of data streams.
Apache Kafka is a scalable, high performance, low latency platform that allows reading and writing streams of data like a messaging system.
Apache Cassandra is a distributed and wide-column NoSQL data store.

As I’m reading through this, I enjoyed just how straightforward the whole process was.

Comments closed

Using Flink Stateful Functions

Igal Shilman gives us an example of where Apache Flink’s Stateful Functions come in handy:

In this blog post, we’ll take a look at a class of use cases that is a natural fit for Flink Stateful Functions: monitoring and controlling networks of connected devices (often called the “Internet of Things” (IoT)).

IoT networks are composed of many individual, but interconnected components, which makes getting some kind of high-level insight into the status, problems, or optimization opportunities in these networks not trivial. Each individual device “sees” only its own state, which means that the status of groups of devices, or even the network as a whole, is often a complex aggregation of the individual devices’ state. Diagnosing, controlling, or optimizing these groups of devices thus requires distributed logic that analyzes the “bigger picture” and then acts upon it.

A powerful approach to implement this is using digital twins: each device has a corresponding virtual entity (i.e. the digital twin), which also captures their relationships and interactions. The digital twins track the status of their corresponding devices and send updates to other twins, representing groups (such as geographical regions) of devices. Those, in turn, handle the logic to obtain the network’s aggregated view, or this “bigger picture” we mentioned before.

Read on to see where stateful functions come into play.

Comments closed

Late-Arriving Data with Spark Streaming

Sarfaraz Hussain continues a series on Spark streaming:

The size of the State (discussed in the previous blog) will continue to increase indefinitely as we really don’t know when a bucket can be closed.

But practically a query is not going to receive data 1 week late or in that matter such late-arriving data is of no use to us.

So, to specify the information when to stop considering older buckets for the streaming query we use Watermark.

Read on to see how you can design and implement a watermark.

Comments closed

Stateful Stremaing with Spark

Sarfaraz Hussain continues a series on Spark Streaming:

Structured Streaming does processing under the hood as micro-batches (default nature).

state is versioned between micro-batches while the streaming query runs. So as the series of incremental execution plans are generated (discussed in Part 2), each execution plan knows what version of the state it needs to read from.

Each micro-batch reads the previous version of the state data i.e. the previous running count then updates it and creates a new version. Each of these versions gets check-pointed into the same check-point location that we have provided in the query.

Read on to understand the implications of this and what it allows you to do.

Comments closed

An Introduction to Spark Streaming

Sarfaraz Hussain has started a series on Spark Streaming. The first post gives an introduction to the topic:

The philosophy behind the development of Structured Streaming is that,

We as end user should not have to reason about streaming”.

What that means is that we as end-user should only write batch like queries and its Spark’s job to figure out how to run it on a continuous stream of data and continuously update the result as new data flows in.

Sarfaraz then follows this up with a bit on the structure of a streaming query:

So, as new data comes in Spark breaks it into micro batches (based on the Processing Trigger) and processes it and writes it out to the Parquet file.

It is Spark’s job to figure out, whether the query we have written is executed on batch data or streaming data. Since, in this case, we are reading data from a Kafka topic, so Spark will automatically figure out how to run the query incrementally on the streaming data.

Check them both out.

Comments closed