Press "Enter" to skip to content

Category: Streaming

Streaming Datasets in Power BI

Reza Rad needs data in real time:

Datasets in Power BI can have connection types such as Import, DirectQuery, or Live Connection. However, there is also one specific type of dataset that is different: the streaming dataset. A streaming dataset is built for real-time dashboards and comes with its own setup and configuration options. In this video and article, we’ll talk about this type of dataset.

Reza includes a video as well as a very helpful walkthrough.
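
If you want to push a few test rows into one of these datasets yourself, here is a minimal Python sketch, assuming an API-type streaming dataset whose push URL and column names you substitute for the placeholders below:

```python
import json
from datetime import datetime, timezone

import requests

# Placeholder push URL: Power BI generates this when you create an API streaming
# dataset (see the dataset's "API Info" page in the service).
PUSH_URL = "https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<key>"

# Column names must match the schema defined on the streaming dataset.
rows = [{
    "sensor_id": "sensor-01",
    "temperature": 72.4,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}]

response = requests.post(
    PUSH_URL,
    data=json.dumps(rows),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()
```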

Comments closed

Real-Time Streaming ETL with Kafka and Debezium

Dursun Koc doesn’t have time for batched ETL:

Debezium does not extract data using SQL. It uses database log files to track changes in the database, so it has minimal effect on the source system. For more information about Debezium, please visit their website.

After the data is extracted, we need Kafka Connect to stream it into Apache Kafka so we can work with it and reshape it as required. We will then use ksqlDB to reshape the raw data into the form the target system requires. Let’s consider a simple ordering-system database in which we have a customer table, a product table, and an orders table.

Read on for an overview as well as a link to the GitHub repo where you can try this all out.
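
To give a flavor of the moving parts before you dive into the repo, here is a hedged Python sketch of registering a Debezium source connector through the Kafka Connect REST API. A MySQL source is assumed, and the hostnames, credentials, and table list are placeholders rather than the repo’s actual configuration:

```python
import json

import requests

# Illustrative Debezium MySQL source connector; adjust the connector class,
# hosts, credentials, and table list for your environment.
connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        "database.server.id": "184054",
        "database.server.name": "orderdb",  # prefix for the change topics
        "database.include.list": "orderdb",
        "table.include.list": "orderdb.customers,orderdb.products,orderdb.orders",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.orderdb",
    },
}

# Kafka Connect exposes a REST API (port 8083 by default) for managing connectors.
resp = requests.post(
    "http://localhost:8083/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```

Once the connector is running, each committed insert, update, or delete on those tables lands as a change event in its own Kafka topic, ready for ksqlDB to reshape.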

Comments closed

Apache Flink Updates

Danny Cranmer announces Flink 1.15.2:

The Apache Flink Community is pleased to announce the second bug fix release of the Flink 1.15 series.

This release includes 30 bug fixes, vulnerability fixes, and minor improvements for Flink 1.15. Below you will find a list of all bugfixes and improvements (excluding improvements to the build infrastructure and build stability). For a complete list of all changes see: JIRA.

We highly recommend all users upgrade to Flink 1.15.2.

In addition to that, Jingsong Lee announces Flink Table Store 0.2.0:

Flink Table Store is data lake storage for ingesting streaming changelogs (updates and deletes) and serving high-performance queries in real time.

As a new type of updatable data lake, Flink Table Store has the following features:

– High-throughput data ingestion while offering good query performance.

– High-performance queries with primary key filters, returning in as little as 100ms.

– Streaming reads are available on lake storage; lake storage can also be integrated with Kafka to provide second-level streaming reads.

Read on for the changes in both platforms.

Comments closed

Watermarking in Spark Structured Streaming

Max Fisher takes us through an important feature for Spark streaming:

When building real-time pipelines, one of the realities that teams have to work with is that distributed data ingestion is inherently unordered. Additionally, in the context of stateful streaming operations, teams need to be able to properly track event time progress in the stream of data they are ingesting for the proper calculation of time-window aggregations and other stateful operations. We can solve for all of this using Structured Streaming.

For example, let’s say we are a team working on building a pipeline to help our company do proactive maintenance on the mining machines that we lease to our customers. These machines always need to be running in top condition, so we monitor them in real time. We will need to perform stateful aggregations on the streaming data to understand and identify problems in the machines.

This is where we need to leverage Structured Streaming and Watermarking to produce the necessary stateful aggregations that will help inform decisions around predictive maintenance and more for these machines.

Read on to see how watermarking works in various scenarios, including when you join together streams.
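
As a rough illustration of the pattern (not Max’s exact code), a PySpark job that applies a watermark before a windowed aggregation might look like this; the source and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("machine-telemetry").getOrCreate()

# Hypothetical streaming source; in practice this would be Kafka, Event Hubs, etc.
events = (
    spark.readStream.format("rate").load()
    .withColumn("machine_id", F.concat(F.lit("machine-"), (F.col("value") % 10).cast("string")))
    .withColumn("event_time", F.col("timestamp"))
    .withColumn("temperature", F.rand() * 100)
)

# The watermark tells Spark how long to wait for late events (10 minutes here)
# before finalizing each 5-minute window and dropping its state.
avg_temps = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "machine_id")
    .agg(F.avg("temperature").alias("avg_temperature"))
)

query = (
    avg_temps.writeStream
    .outputMode("append")   # emits a window only once the watermark passes its end
    .format("console")
    .start()
)
```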

Comments closed

Securing Kafka Streams

Amani Newton gives us a primer on Apache Kafka security:

The largest companies in the world use Apache Kafka® for their real-time streaming data pipelines and applications. Kafka is the basis for the real-time fraud text alerts from your bank and the network-connected medical devices used in your local hospital. Securing customer or patient data as it flows through the Kafka system is crucial. However, out of the box, Kafka has relatively little security enabled. This blog post previews the free Confluent Developer course that teaches the basics of securing your Apache Kafka-based system.

Click through for the overview.
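
As a taste of what the client side of a secured cluster can look like, here is a hedged sketch using the confluent-kafka Python client with SASL_SSL; the broker address, credentials, and certificate path are placeholders:

```python
from confluent_kafka import Consumer

# Illustrative client-side security settings: TLS for encryption in transit,
# SASL/PLAIN for authentication. Broker, credentials, and CA path are placeholders.
consumer = Consumer({
    "bootstrap.servers": "broker.example.com:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",               # or SCRAM-SHA-512, OAUTHBEARER, ...
    "sasl.username": "svc-orders",
    "sasl.password": "change-me",
    "ssl.ca.location": "/etc/ssl/certs/ca.pem",
    "group.id": "orders-consumer",
    "auto.offset.reset": "earliest",
})

consumer.subscribe(["orders"])
msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```

Authorization (ACLs or role bindings on the broker side) is the other half of the story.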

Comments closed

From Kafka to Azure Data Explorer with Protobuf Data

Anshul Sharma and Ramachandran G do a bit of converting:

Kafka is increasingly becoming a popular choice for scalable message queueing in large data processing workloads. This makes it very popular in IoT-based ecosystems, where there is a large ingress of data before data processing or data storage. Azure Data Explorer is a very powerful time-series and analytics database that suits IoT-scale data ingestion and data querying.

Kafka supports ingestion of data in multiple formats, including JSON, Avro, Protobuf, and String. ADX supports ingestion from Kafka in all of these formats. Due to its excellent schema support, extensibility to various platforms, and compression, protobuf (https://developers.google.com/protocol-buffers) is increasingly becoming a data-exchange choice in IoT-based systems. The ADX Kafka sink connector leverages the Kafka Connect framework and provides an adapter to ingest data from Kafka in all these formats.

The following section provides the configuration needed to support ingestion of protobuf data from Kafka into ADX.

Click through for the high-level architecture and a deeper dive into the process.
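
On the producing side, a hedged Python sketch of publishing protobuf-encoded telemetry to the topic the sink connector reads might look like this; the telemetry_pb2 module is hypothetical, generated from a .proto file with protoc:

```python
from confluent_kafka import Producer

# Hypothetical module generated by `protoc --python_out=. telemetry.proto`,
# defining a Telemetry message with device_id, temperature, and ts fields.
import telemetry_pb2

producer = Producer({"bootstrap.servers": "localhost:9092"})

msg = telemetry_pb2.Telemetry(device_id="device-42", temperature=21.7, ts=1693526400)

# The topic name and the message schema must line up with the sink connector's
# configuration and the ADX table's ingestion mapping.
producer.produce("iot-telemetry", key=msg.device_id, value=msg.SerializeToString())
producer.flush()
```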

Comments closed

Visualizing Kafka Stream Lineage

David Araujo and Julia Peng show off stream lineage in Confluent Cloud:

Stream Lineage is a tool Confluent built to address the lack of data visibility in Kafka and event-driven architectures. Confluent’s Stream Lineage provides an interactive map of all your data flows that enables users to:

1. Understand what data flows are running, both now and at any point in the past

2. Trace where each data flow originated from

3. Track how data is transformed along its journey

4. Observe where each data flow ends up

Read on to see how it works.

Comments closed

Automating Parallelism Decisions in Flink Batch Jobs

Lijie Wang and Zhu Zhu describe Apache Flink’s batch scheduler:

Deciding on proper parallelisms for operators is not easy for many users. For batch jobs, a small parallelism may result in long execution times and large failover regressions, while an unnecessarily large parallelism may result in wasted resources and more overhead in task deployment and network shuffling.

To decide on a proper parallelism, one needs to know how much data each operator needs to process. However, it can be hard to predict the data volume a job will process because it can differ from day to day, and it can be harder or even impossible (due to complex operators or UDFs) to predict the data volume each operator will process.

To solve this problem, we introduced the adaptive batch scheduler in Flink 1.15. The adaptive batch scheduler can automatically decide the parallelism of an operator according to the size of the datasets it consumes.

Read on to see some of the benefits of using the adaptive batch scheduler, as well as some of the decision points it uses along the way.
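
For reference, here is a hedged sketch of the relevant configuration. The option names follow the Flink 1.15 documentation as best I recall and may have changed in later releases, so verify them against your version; they normally live in flink-conf.yaml or are passed with -D at submission, and are only collected here with PyFlink’s Configuration object for readability:

```python
from pyflink.common import Configuration

conf = Configuration()
# Switch the job manager to the adaptive batch scheduler (batch jobs only).
conf.set_string("jobmanager.scheduler", "AdaptiveBatch")
conf.set_string("execution.runtime-mode", "BATCH")
# Leave operator parallelism undecided so the scheduler can pick it per stage.
conf.set_string("parallelism.default", "-1")
# Bounds and sizing hints the scheduler uses when deriving parallelism.
conf.set_string("jobmanager.adaptive-batch-scheduler.min-parallelism", "1")
conf.set_string("jobmanager.adaptive-batch-scheduler.max-parallelism", "128")
conf.set_string("jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task", "8gb")
conf.set_string("jobmanager.adaptive-batch-scheduler.default-source-parallelism", "4")
```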

Comments closed

Request-Response and CQRS in Kafka

Kai Waehner compares two message exchange patterns:

How can I do request-response communication with Apache Kafka? That’s one of the most common questions I get regularly. This blog post explores when (not) to use this message exchange pattern, the differences between synchronous and asynchronous communication, the pros and cons compared to CQRS and event sourcing, and how to implement request-response within the data streaming infrastructure.

Read on to learn more.
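
As a rough sketch of the pattern (not Kai’s implementation), the requester can attach a correlation ID and a reply-to header to the request, then wait for a matching message on the reply topic:

```python
import uuid

from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"        # placeholder broker and topic names
REQUEST_TOPIC = "orders.requests"
REPLY_TOPIC = "orders.replies"

producer = Producer({"bootstrap.servers": BOOTSTRAP})
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": f"requester-{uuid.uuid4()}",  # private group: each requester reads all replies
    "auto.offset.reset": "earliest",
})
consumer.subscribe([REPLY_TOPIC])

# Send the request with a correlation ID and a reply-to header the responder echoes back.
correlation_id = str(uuid.uuid4())
producer.produce(
    REQUEST_TOPIC,
    value=b'{"order_id": 42}',
    headers={"correlation_id": correlation_id, "reply_to": REPLY_TOPIC},
)
producer.flush()

# Block (with a timeout) until a reply carrying our correlation ID arrives.
while True:
    msg = consumer.poll(timeout=30.0)
    if msg is None:
        raise TimeoutError("no reply received")
    if msg.error():
        continue
    headers = dict(msg.headers() or [])
    if headers.get("correlation_id", b"").decode() == correlation_id:
        print("reply:", msg.value())
        break
consumer.close()
```

That blocking wait at the end is the synchronous behavior whose trade-offs the article explores.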

Comments closed

Ingesting Event Hub Telemetry Data with PySpark Streaming

Charles Chukwudozie shows how to read from Event Hubs in Databricks with Python:

Ingesting, storing, and processing millions of telemetry events from a plethora of remote IoT devices and sensors has become commonplace. One of the primary cloud services used to process streaming telemetry events at scale is Azure Event Hubs.

Most documented implementations of Azure Databricks ingestion from Azure Event Hubs are based on Scala.

So, in this post, I outline how to use PySpark on Azure Databricks to ingest and process telemetry data from an Azure Event Hub instance configured without Event Capture.

Click through for the process.
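
A hedged sketch of the usual PySpark pattern with the Azure Event Hubs Spark connector follows (the azure-eventhubs-spark library must be installed on the cluster; the connection string and payload schema below are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Placeholder Event Hubs connection string (EntityPath points at the hub).
connection_string = (
    "Endpoint=sb://<namespace>.servicebus.windows.net/;"
    "SharedAccessKeyName=<name>;SharedAccessKey=<key>;EntityPath=<hub>"
)

# The connector expects the connection string to be encrypted with its helper class.
eh_conf = {
    "eventhubs.connectionString": spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# The Event Hubs source exposes a binary `body` column plus metadata such as enqueuedTime.
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Hypothetical telemetry payload schema.
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])

telemetry = (
    raw
    .withColumn("body", F.col("body").cast("string"))
    .withColumn("payload", F.from_json("body", schema))
    .select("enqueuedTime", "payload.*")
)

query = telemetry.writeStream.format("console").outputMode("append").start()
```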

Comments closed