Press "Enter" to skip to content

Category: Streaming

Calculations in Power BI Streaming Datasets

Reza Rad has a workaround for us:

If you use a streaming dataset in Power BI, you cannot download the Power BI file, and you cannot open it using Power BI Desktop. This means you cannot use calculations in a streaming dataset. However, there is a small trick you can use which can be helpful. I will show you that in this article and video.

Click through for the article, which includes the video.


Answering NiFi Questions

Pierre Villard has a few answers to questions about Apache NiFi:

Over the last few weeks, I delivered four live NiFi demo sessions, showing how to use NiFi connectors and processors to connect to various systems, with 1000 attendees in different geographic regions. I want to thank you all for joining and attending these events! Interactive demo sessions and live Q&A are what we all need these days, when working remotely from home is now the norm. If you have not seen my live demo session, you can catch up by watching it here.

I received hundreds of questions during these events, and my colleagues and I tried to answer as many as we could. As promised, here are my answers to some of the most frequently asked questions. 

Click through for the questions and answers.


Spark Streaming in a Databricks Notebook

Tomaz Kastrun shows off Spark Streaming in a Databricks notebook:

Spark Streaming is a process that can analyse not only batches of data but also streams of data in near real-time. It enables powerful interactive and analytical applications across both hot and cold data (streaming data and historical data). Spark Streaming is a fault-tolerant system: thanks to the lineage of operations, Spark always remembers where you stopped, and in case of a worker error, another worker can recreate all of the data transformations from the partitioned RDD (assuming that all of the RDD transformations are deterministic).
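
To make that concrete, here is a minimal Structured Streaming sketch in Scala of the sort you could run in a Databricks notebook. It is my own illustration rather than Tomaz's demo; the rate source, window length, and checkpoint path are all placeholder choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In a Databricks notebook, `spark` already exists; this makes the sketch standalone.
val spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()
import spark.implicits._

// The built-in "rate" source emits (timestamp, value) rows; handy for demos.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Count events per one-minute event-time window, tolerating two minutes of lateness.
val counts = stream
  .withWatermark("timestamp", "2 minutes")
  .groupBy(window($"timestamp", "1 minute"))
  .count()

// The checkpoint is what gives the fault tolerance described above: on failure,
// Spark replays from the last checkpointed offsets and recomputes via lineage.
counts.writeStream
  .outputMode("update")
  .option("checkpointLocation", "/tmp/stream-checkpoint")
  .format("console")
  .start()
```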

Click through for the demo.


Moving Away from the Lambda Architecture

Xiang Zhang and Jingyu Zhu talk about migrating a project away from the Lambda architecture:

The Lambda architecture has become a popular architectural style that promises both speed and accuracy in data processing by using a hybrid approach of both batch processing and stream processing methods. But it also has some drawbacks, such as complexity and additional development/operational overheads. One of our features for Premium members on LinkedIn, Who Viewed Your Profile (WVYP), relied on a Lambda architecture for some time. The backend system supporting this feature had gone through a few architectural iterations in the past years: it started as a Kafka client processing a single Kafka topic, and eventually evolved to a Lambda architecture with more complicated processing logic. However, in an effort to pursue faster product iteration and lower operational overheads, we recently underwent a transition to make it Lambda-less. In this blog post, we’ll share some of the lessons learned in operating this system in the Lambda architecture, the decisions made in transitioning to Lambda-less, and the shifts necessary to undergo this transition.

When Lambda was first proposed back in 2015, it was intended as a compromise architecture trying to solve several important problems with the tools available in 2015 (well, 2013 and 2014—it was in a book, after all). I could definitely see the architecture fall into disuse within the next decade, not because it was at all bad, but because the world around it changed to the point that there is a better compromise available.


Apache Flink 1.12.0 Released

Marta Paes and Aljoscha Krettek announce a new release of Apache Flink:

– The community has added support for efficient batch execution in the DataStream API. This is the next major milestone towards achieving a truly unified runtime for both batch and stream processing.

– Kubernetes-based High Availability (HA) was implemented as an alternative to ZooKeeper for highly available production setups.

– The Kafka SQL connector has been extended to work in upsert mode, supported by the ability to handle connector metadata in SQL DDL. Temporal table joins can now also be fully expressed in SQL, no longer depending on the Table API.

– Support for the DataStream API in PyFlink expands its usage to more complex scenarios that require fine-grained control over state and time, and it’s now possible to deploy PyFlink jobs natively on Kubernetes.
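
As a small taste of the first item in that list, here is a sketch of my own (not from the announcement) of opting a bounded DataStream job into the new batch execution mode:

```scala
import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.scala._

object BatchModeSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // BATCH mode lets Flink use blocking shuffles and skip the state/watermark
    // machinery a streaming job would need, since all inputs are bounded.
    env.setRuntimeMode(RuntimeExecutionMode.BATCH)

    env.fromElements("a", "b", "a")
      .map(word => (word, 1))
      .keyBy(_._1)
      .sum(1)
      .print()

    env.execute("batch-mode sketch")
  }
}
```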

Read on for more details on these as well as other changes.


Joining Data Streams in Flink

Kundan Kumarr crosses the streams:

Apache Flink offers a rich set of APIs and operators, which makes Flink application developers productive when dealing with multiple data streams. Flink provides many multi-stream operations like Union, Join, and so on. In this blog, we will explore the Window Join operator in Flink with an example. It joins two data streams on a given key and a common window.
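
As a taste of what that looks like, here is a minimal sketch of my own; the streams and values are illustrative, and with a finite source like this a processing-time window may not fire before the job ends, so a real job would use unbounded sources.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Two (key, value) streams; in a real job these would be unbounded sources.
val left: DataStream[(String, Int)]  = env.fromElements(("a", 1), ("b", 2))
val right: DataStream[(String, Int)] = env.fromElements(("a", 10), ("b", 20))

// Join on the first tuple field within a shared five-second tumbling window.
val joined = left.join(right)
  .where(_._1)
  .equalTo(_._1)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
  .apply((l, r) => (l._1, l._2 + r._2))

joined.print()
env.execute("window join sketch")
```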

Click through for an example of the fluent API approach. It’s not as nice as proper SQL, but it does the job.


Intrusion Detection using ksqlDB

Maxime Ribera and Geraud Duge de Beronville take us through an interesting tutorial:

Apache Kafka® is a distributed real-time processing platform that allows for the ingestion of huge volumes of data. ksqlDB is part of the Kafka ecosystem and offers a SQL-like language to query and process large-scale, real-time data. This blog post demonstrates how to quickly process network activity for intrusion detection using both Kafka and ksqlDB.

For testing purposes (and to avoid being banned from the enterprise network), a virtualized environment through Vagrant is used.

Click through for the scenario.


Stream Processing with ksqlDB

Michael Drogalis takes us through how stream processing works with ksqlDB:

ksqlDB, the event streaming database, is becoming one of the most popular ways to work with Apache Kafka®. Every day, we answer many questions about the project, but here’s a question with an answer that we are always trying to improve: How does ksqlDB work?

The mechanics behind stream processing can be challenging to grasp. The concepts are abstract, and many of them involve motion—two things that are hard for the mind’s eye to visualize. Let’s pop open the hood of ksqlDB to explore its essential concepts, how each works, and how it all relates to Kafka.

Click through for a demo with animations.


The Session Window in Flink

Kundan Kumarr continues a series on windows in Apache Flink:

In the real world, the things we do online (visiting a website, clicking around it, performing transactions, and so on) happen in sessions. We might go to an e-commerce website like Amazon, look for products, click around for a bit, and then stop, all within a single session. These websites may want to track the pages we visited in one session. To do that, they need to group together all of the clicks streaming in, based on a session. These streaming use cases can be implemented easily with Flink's session window.

The session window assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, and the number of entities within a session window is not fixed, because the user's behavior typically determines how long the session lasts. A session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurs. For example, once we have been idle on the Amazon website for, say, one minute, the previous session ends, and if we come back to the site, a new session starts. The pause between one click and the next is what determines the session.
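
To see what that looks like in code, here is a minimal sketch of my own (the socket source, key, and one-minute gap are illustrative) counting clicks per user per session:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Pretend each line arriving on the socket is a user ID for one click.
val clicks: DataStream[(String, Int)] = env
  .socketTextStream("localhost", 9999)
  .map(userId => (userId, 1))

// A session for a given key closes after one minute with no new events.
val clicksPerSession = clicks
  .keyBy(_._1)
  .window(ProcessingTimeSessionWindows.withGap(Time.minutes(1)))
  .sum(1)

clicksPerSession.print()
env.execute("session window sketch")
```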

Click through for a depiction and an example.


The Count Window in Flink

Kundan Kumarr takes us through an example of the count window type in Apache Flink:

In previous blogs, we learned about tumbling and sliding windows, which are based on time. In this blog, we are going to learn to define Flink's windows on other properties, i.e., the count window. As the name suggests, a count window is evaluated when the number of records received hits the threshold.

A count window sets the window size based on how many entities exist within that window. For example, if we fix the count at 4, every window will have exactly 4 entities; the windows will differ in duration, but the number of entities in each will always be the same. Count windows can be overlapping or non-overlapping; both are possible. The count window in Flink is applied to keyed streams, meaning there is already a logical grouping of the stream based on all values associated with a certain key, so the entity count applies on a per-key basis.
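
Here is a minimal sketch of my own showing both flavors (keys and values are illustrative):

```scala
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[(String, Int)] =
  env.fromElements(("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 1))

// Tumbling count window: fires once a given key has accumulated 4 elements.
val sums = events
  .keyBy(_._1)
  .countWindow(4)
  .sum(1)

// countWindow(4, 2) would instead be a sliding count window,
// firing every 2 elements over the most recent 4.
sums.print()
env.execute("count window sketch")
```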

I’m curious if there’s a combination of count + time, triggering when you hit X elements or Y seconds, whichever comes first.
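You can build that yourself with a custom trigger over a global window. Here is a rough sketch of the idea, entirely my own (class and parameter names are illustrative, and production code would also track and delete stale timers): fire when a key has seen maxCount elements or timeoutMs milliseconds have passed since the first element of the round, whichever comes first.

```scala
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.ReducingStateDescriptor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

class CountOrTimeTrigger(maxCount: Long, timeoutMs: Long)
    extends Trigger[Any, GlobalWindow] {

  // Per-key, per-window element count.
  private val countDesc = new ReducingStateDescriptor[java.lang.Long](
    "count",
    new ReduceFunction[java.lang.Long] {
      override def reduce(a: java.lang.Long, b: java.lang.Long): java.lang.Long = a + b
    },
    classOf[java.lang.Long])

  override def onElement(element: Any, timestamp: Long, window: GlobalWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    val count = ctx.getPartitionedState(countDesc)
    if (count.get() == null) {
      // First element of this round: start the timeout clock.
      ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime + timeoutMs)
    }
    count.add(1L)
    if (count.get() >= maxCount) {
      count.clear()
      TriggerResult.FIRE_AND_PURGE  // hit the element threshold first
    } else {
      TriggerResult.CONTINUE
    }
  }

  override def onProcessingTime(time: Long, window: GlobalWindow,
                                ctx: Trigger.TriggerContext): TriggerResult = {
    ctx.getPartitionedState(countDesc).clear()
    TriggerResult.FIRE_AND_PURGE  // hit the timeout first
  }

  override def onEventTime(time: Long, window: GlobalWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit =
    ctx.getPartitionedState(countDesc).clear()
}

// Usage: sum values per key, emitting every 4 elements or every 10 seconds.
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.fromElements(("a", 1), ("a", 2), ("b", 3))
  .keyBy(_._1)
  .window(GlobalWindows.create())
  .trigger(new CountOrTimeTrigger(4, 10000L))
  .sum(1)
  .print()
env.execute("count-or-time sketch")
```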
