Press "Enter" to skip to content

Category: Streaming

Using Data Contracts in Confluent Schema Registry

Robert Yokota shows us how to generate data contracts for streaming solutions:

A data contract is a formal agreement between an upstream component and a downstream component on the structure and semantics of data that is in motion. The upstream component enforces the data contract, while the downstream component can assume that the data it receives conforms to the data contract. Data contracts are important because they provide transparency over dependencies and data usage in a streaming architecture. They help to ensure the consistency, reliability, and quality of the data in event streams, and they provide a single source of truth for understanding the data in motion.

Click through for a sample application that uses data contracts.

Comments closed

Running Apache Flink Jobs from HDInsight

Sairam Yeturi builds a streaming job:

Could you already complete creating your first Apache Flink® cluster and submit your streaming job on it with HDInsight on AKS?

Well, if you are yet to do that – Let me help you get started.

Click through for a step-by-step walkthrough on how to create a Flink-centric HDInsight cluster on Azure Kubernetes Service and how to create a new job, assuming you have the Jarfile for that job already.

Comments closed

Data Activator in Microsoft Fabric

Toby Smith looks at the current state of Data Activator in Microsoft Fabric:

Fabric is the newest all-in-one analytics solution from Microsoft. It combines multiple components (some existing, some new) into a single integrated environment. One of these new components is Data Activator. As Data Activator is still in development, there is still more functionality to be added. This blog shares some of the current abilities and uses for Data Activator, along with ideas for how you can use it in your own business situations.

One of the biggest challenges with big data is understanding it. With tools like Power BI, we are now able to understand and analyse data better than ever before. But when do we act on it? Do we have to manually look at these reports daily just to check everything is going ok? This is where Data Activator comes in. Data activator is a no-code tool that automatically takes actions when certain conditions are met in the data. These actions can vary from alerts in Microsoft Teams, calling stored procedures, triggering other fabric items like a pipeline, or even retraining AI models.

This is a feature which has enormous potential for near-real-time alerting and automating workflows. But do read on to learn about some of the limitations currently in the product.

Comments closed

Building a Flink Application in Java

Wade Waldron talks about a (free) new course:

Recently, I got my hands dirty working with Apache Flink®. The experience was a little overwhelming. I have spent years working with streaming technologies but Flink was new to me and the resources online were rarely what I needed. Thankfully, I had access to some of the best Flink experts in the business to provide me with first-class advice, but not everyone has access to an expert when they need one. 

To share what I learned, I created the Building Flink Applications in Java course on Confluent Developer. It provides you with hands-on experience in building a Flink application from the ground up. I also wrote this blog post to walk through an example of how to do dataflow programming with Flink. I hope these two resources will make the experience less overwhelming for others.

Click through for the blog post and check out the full course if you’re so inclined.

Comments closed

An Overview of Flink SQL

Martijn Visser continues a series on Kafka and Flink:

In the first two parts of our Inside Flink blog series, we explored the benefits of stream processing with Flink and common Flink use cases for which teams are choosing to leverage the popular framework to unlock the full potential of streaming. Specifically, we broke down the key reasons why developers are choosing Apache Flink® as their stream processing framework, as well as the ways in which they are putting it into practice. These range from streaming data pipelines to train ML models, to real-time inventory management in retail and predictive maintenance in manufacturing.

Next, we’ll dive into Flink SQL, which is a powerful data processing engine that allows developers to process and analyze large volumes of data in real time. We’ll cover how Flink SQL relates to the other Flink APIs and showcase some of its built-in functions and operations with syntax examples.

I’m naturally predisposed to blog posts which validate Feasel’s Law, so of course I was going to pick this one to recommend.

Comments closed

Tuning Kafka Connect Source Connectors

Catalin Pop makes things faster:

Kafka Connect is an open source data integration tool that simplifies the process of streaming data between Apache Kafka® and other systems. Kafka Connect has two types of connectors: source connectors and sink connectors. Source connectors allow you to read data from various sources and write it to Kafka topics. Sink connectors send data from the topics to another endpoint. This blog post discusses how to tune your source connectors to help you get the best throughput out of your compute resources. 

This includes which elements are tunable, metrics you’ll want to pay attention to along the way, and a detailed example.

Comments closed

Flink Streaming Use Cases for Kafka Users

Jean-Sebastien Brunner gives us some use cases:

In Part One of our “Inside Flink” blog series, we explored the critical role of stream processing and why developers are increasingly choosing Apache Flink® over other frameworks. 

In this second installment, we’ll showcase how innovative teams across every industry and size are putting stream processing into practice – from streaming data pipelines to train ML models or more timely analytics to fraud detection in finance and real-time inventory management in retail. We’ll also discuss how Flink is uniquely suited to support a wide spectrum of use cases and helps teams uncover immediate insights in their data streams and react to events in real time.

This article stays more at the “art of the possible” level rather than drilling into how we can do it.

Comments closed

Versioned State Store in Kafka Streams

Victoria Xia announces new functionality in Apache Kafka 3.5:

Since the introduction of stream processing, there have been three certainties in life: death, taxes, and out-of-order data. As a stream processing library built for Apache Kafka, Kafka Streams processes data in offset order. When out-of-order data is present, offset order differs from timestamp order and care must be taken to ensure that processing results respect timestamp order where appropriate. The introduction of versioned state stores to Kafka Streams in the Apache Kafka 3.5 release is a huge milestone in this direction.

In this blog post, I’ll address the what, why, and how of versioned stores in Kafka Streams, including what they are, why you might like to use them, how to get started, and a couple of things to watch out for when upgrading.

Read on to see what this entails and how you can try it out yourself.

Comments closed

Stream Processing with Flink and Kafka

Konstantin Knauf starts a new series:

There was a huge amount of buzz about Apache Flink® at this year’s Kafka Summit London. From an action-packed keynote to standing-room only breakout sessions, it’s clear that the Apache Kafka® community is hungry to learn more about Flink and how the stream processing framework fits into the modern data streaming stack.

That’s why we’re excited to introduce our new “Inside Flink” blog series that takes a deeper look at why developers and organizations everywhere are shifting their stream processing technologies to Flink. Our first blog post explains what Flink is and how it can enhance your streaming use cases running on Kafka. Future topics will include common Flink use cases, an inside look at Flink SQL, and much more.

Click through for the first post in the series, which covers what Flink is and how the two products can interoperate.

Comments closed

Contrasting Spark and Flink for Streaming Use Cases

Deepthi Mohan and Karthi Thyagarajan contrast two products:

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to handle processing of data streams in real-time and low-latency stateful computations. Both support a variety of programming languages, scalable solutions for handling large amounts of data, and a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.

Read on for an analysis of the two products.

Comments closed