Press "Enter" to skip to content

Category: Streaming

Processing GitHub Data with Kafka Streams

Lucia Cerchie hits the GitHub API:

GitHub’s data sources (REST + GraphQL APIs) are not only developer-friendly, but a goldmine of interesting statistics on the health of developer communities. Companies like OpenSauced, linearb, and TideLift can measure the impact of developers and their projects using statistics gleaned from GitHub’s APIs. The results of GitHub analysis can change both day-to-day and over time.

Apache Kafka is a large and active open source project with nearly a million lines of code. It also happens to be an event streaming platform. So why not use Apache Kafka to, well, monitor itself? And learn a bit about Kafka Streams along the way?  

Click through for the full article, including a demonstration.
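To give a flavor of what such a topology looks like, here is a minimal Kafka Streams sketch in Java. This is my own illustration, not code from the article: the github-events topic is hypothetical, and it assumes the event type (e.g., "PullRequestEvent") has already been extracted as the record key.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class GitHubEventCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "github-event-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic fed by a producer polling the GitHub API;
        // key = event type, value = raw event JSON.
        KStream<String, String> events = builder.stream("github-events");

        // Running count of events per type, written to an output topic.
        // The count is a Long, so we override the default String value serde.
        events.groupByKey()
              .count()
              .toStream()
              .to("github-event-counts",
                  Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```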


Exposing Kafka Data in Iceberg using Tableflow

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this erroneous, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.


Combining Kafka and Flink

Gautam Goswami shares some thoughts:

In short, the process of collecting data in real-time as streams of events from event sources such as databases, sensors, and software applications is known as event streaming. With real-time data processing and analytics in mind, Apache Flink is a potent open-source program. For situations where quick insights and minimal processing latency are critical, it offers a consistent and effective platform for managing continuous streams of data. 

I’ve found it interesting that Confluent people have spent a lot of time over the past several months talking up Apache Flink and Kafka+Flink combinations.
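For a sense of how compact that combination can be, here is a minimal Java sketch using Flink's Kafka connector. This is my own illustration, not from the post; the sensor-readings topic and localhost broker are placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaFlinkPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read string values from a (placeholder) Kafka topic.
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("sensor-readings")
            .setGroupId("flink-demo")
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // A trivial transformation stands in for real processing logic.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .filter(line -> !line.isEmpty())
           .print();

        env.execute("kafka-flink-pipeline");
    }
}
```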


Generating Synthetic Data for Streaming in Microsoft Fabric

Sandeep Pawar builds out some data:

If you want to learn or demo Real Time Analytics in Microsoft Fabric, you will need a streaming data source. You can use the built-in samples to get started. But there are several data generators which you can use to create custom streaming sample datasets, Azure Stream Analytics data generator being one of them. You can see them here. In this blog, I will show how to set one up to use with Fabric Eventstream.

Read on for a step-by-step guide.
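If you would rather generate the synthetic events from code, here is one possible generator in Java. It is a sketch built on the assumption that your Eventstream custom endpoint exposes Kafka-protocol connection details (as Event Hubs-backed endpoints do); the namespace, connection string, and topic are placeholders you would copy from the endpoint's details pane.

```java
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SyntheticEventGenerator {
    public static void main(String[] args) throws Exception {
        // Placeholders: copy these from the Eventstream custom endpoint.
        String bootstrap = "<namespace>.servicebus.windows.net:9093";
        String connectionString = "<event-hub-connection-string>";
        String topic = "<eventstream-entity-name>";

        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrap);
        // Event Hubs Kafka endpoints authenticate via SASL PLAIN,
        // with the connection string as the password.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"$ConnectionString\" password=\"" + connectionString + "\";");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            Random rng = new Random();
            for (int i = 0; i < 100; i++) {
                // Fabricated IoT-style reading as JSON.
                String event = String.format(
                    "{\"deviceId\":\"dev-%d\",\"temperature\":%.1f}",
                    rng.nextInt(10), 15 + rng.nextDouble() * 20);
                producer.send(new ProducerRecord<>(topic, event));
                Thread.sleep(500); // roughly two events per second
            }
        }
    }
}
```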


The Data Streaming Landscape in 2024

Kai Waehner gives us an overview of where data streaming technologies are at:

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. This blog post explores the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook on potential new entrants in 2025.

Kai is Kafka-centric, but this is a good overview of the industry and worth taking the time to read.


Using Data Contracts in Confluent Schema Registry

Robert Yokota shows us how to generate data contracts for streaming solutions:

A data contract is a formal agreement between an upstream component and a downstream component on the structure and semantics of data that is in motion. The upstream component enforces the data contract, while the downstream component can assume that the data it receives conforms to the data contract. Data contracts are important because they provide transparency over dependencies and data usage in a streaming architecture. They help to ensure the consistency, reliability, and quality of the data in event streams, and they provide a single source of truth for understanding the data in motion.

Click through for a sample application that uses data contracts.
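As a minimal illustration of the upstream side, here is how you might register a schema under a subject using the Schema Registry Java client (recent versions, where register takes a ParsedSchema). The URL and subject are placeholders, and a full data contract would layer metadata and rules on top of the bare schema.

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class RegisterSchema {
    public static void main(String[] args) throws Exception {
        // Placeholder registry URL; the int caps the local schema cache.
        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // A small Avro record schema, defined inline for the example.
        String avro = "{\"type\":\"record\",\"name\":\"Order\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}";

        // The upstream (producer) side registers the contract under a subject;
        // downstream consumers can then rely on it.
        int id = client.register("orders-value", new AvroSchema(avro));
        System.out.println("Registered schema id: " + id);
    }
}
```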


Running Apache Flink Jobs from HDInsight

Sairam Yeturi builds a streaming job:

Have you already created your first Apache Flink® cluster and submitted your streaming job on it with HDInsight on AKS?

Well, if you are yet to do that – Let me help you get started.

Click through for a step-by-step walkthrough on how to create a Flink-centric HDInsight cluster on Azure Kubernetes Service and how to create a new job, assuming you already have the JAR file for that job.
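If you don't have a JAR handy yet, even a trivial self-contained job is enough for a first submission. Here is a minimal sketch in Java (my own illustration, nothing HDInsight-specific); package it with your build tool and submit the resulting JAR.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HelloFlinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A bounded, self-contained source, so the job runs anywhere
        // without external dependencies.
        env.fromElements(1L, 2L, 3L, 4L, 5L)
           .map(n -> n * n)  // square each element
           .print();         // results land in the TaskManager logs

        env.execute("hello-flink-job");
    }
}
```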


Data Activator in Microsoft Fabric

Toby Smith looks at the current state of Data Activator in Microsoft Fabric:

Fabric is the newest all-in-one analytics solution from Microsoft. It combines multiple components (some existing, some new) into a single integrated environment. One of these new components is Data Activator. As Data Activator is still in development, there is still more functionality to be added. This blog shares some of the current abilities and uses for Data Activator, along with ideas for how you can use it in your own business situations.

One of the biggest challenges with big data is understanding it. With tools like Power BI, we are now able to understand and analyse data better than ever before. But when do we act on it? Do we have to manually look at these reports daily just to check everything is going OK? This is where Data Activator comes in. Data Activator is a no-code tool that automatically takes actions when certain conditions are met in the data. These actions range from alerts in Microsoft Teams to calling stored procedures, triggering other Fabric items like a pipeline, or even retraining AI models.

This is a feature which has enormous potential for near-real-time alerting and automating workflows. But do read on to learn about some of the limitations currently in the product.


Building a Flink Application in Java

Wade Waldron talks about a (free) new course:

Recently, I got my hands dirty working with Apache Flink®. The experience was a little overwhelming. I have spent years working with streaming technologies, but Flink was new to me, and the resources online were rarely what I needed. Thankfully, I had access to some of the best Flink experts in the business to provide me with first-class advice, but not everyone has access to an expert when they need one.

To share what I learned, I created the Building Flink Applications in Java course on Confluent Developer. It provides you with hands-on experience in building a Flink application from the ground up. I also wrote this blog post to walk through an example of how to do dataflow programming with Flink. I hope these two resources will make the experience less overwhelming for others.

Click through for the blog post and check out the full course if you’re so inclined.
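As a small taste of the dataflow style in question, here is a keyed word count of my own, not course material. Note the returns() call: because Java erases the Tuple2 type parameters, lambdas need an explicit type hint.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordCountDataflow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("kafka", "flink", "kafka", "streams", "flink", "kafka")
           // Step 1: transform each element into (word, 1).
           .map(word -> Tuple2.of(word, 1))
           .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint for the lambda
           // Step 2: partition the stream by word.
           .keyBy(t -> t.f0)
           // Step 3: maintain a running count per key.
           .sum(1)
           .print();

        env.execute("word-count-dataflow");
    }
}
```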


An Overview of Flink SQL

Martijn Visser continues a series on Kafka and Flink:

In the first two parts of our Inside Flink blog series, we explored the benefits of stream processing with Flink and common Flink use cases for which teams are choosing to leverage the popular framework to unlock the full potential of streaming. Specifically, we broke down the key reasons why developers are choosing Apache Flink® as their stream processing framework, as well as the ways in which they are putting it into practice. These range from streaming data pipelines to train ML models, to real-time inventory management in retail and predictive maintenance in manufacturing.

Next, we’ll dive into Flink SQL, which is a powerful data processing engine that allows developers to process and analyze large volumes of data in real time. We’ll cover how Flink SQL relates to the other Flink APIs and showcase some of its built-in functions and operations with syntax examples.

I’m naturally predisposed to blog posts which validate Feasel’s Law, so of course I was going to pick this one to recommend.
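To give a sense of the API, here is a minimal, self-contained Flink SQL example embedded in Java, using the built-in datagen connector to produce random rows. This is my own sketch, not taken from the article.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlDemo {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.inStreamingMode());

        // 'datagen' is Flink's built-in random-data connector,
        // handy for experiments with no external systems.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id BIGINT," +
            "  amount DOUBLE" +
            ") WITH (" +
            "  'connector' = 'datagen'," +
            "  'rows-per-second' = '5'" +
            ")");

        // A continuous aggregation over the unbounded stream:
        // print() emits changelog rows until the job is canceled.
        tEnv.executeSql(
            "SELECT COUNT(*) AS order_count, SUM(amount) AS total " +
            "FROM orders").print();
    }
}
```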
