Category: Streaming

Kafka is widely adopted for building real-time streaming applications due to its fault tolerance, scalability, and ability to process large volumes of data. However, in general, Kafka streaming consumers work best only in an environment where they do not have to call external APIs or databases. In a situation when a Kafka consumer must make a synchronous database or API call, the latency introduced by network hops or I/O operations adds up and accumulates easily (especially when the streaming pipeline is performing an initial load of a large volume of data before starting CDC). This can significantly slow down the streaming pipeline and result in the blowing of system resources impacting the throughput of the pipeline. In extreme situations, this may even become unsustainable as Kafka consumers may not be able to commit offsets due to increased latency before the next polling call and get continuously rebalanced by the broker, practically not processing anything yet incrementally consuming more system resources as time passes.

This is a real problem faced by many streaming applications. In this article, we’ll explore some effective strategies to minimize latency in Kafka streaming applications where external API or database calls are inevitable. We’ll also compare these strategies with the alternative approach of separating out the parts of the pipeline that require these external interactions into a separate publish/subscribe-based consumer.

Read on to understand the causes of this latency and several patterns you can use to limit it.

Comments closed

Tracking Airport Traffic with Flink, Kafka, and NiFi

Published 2024-09-30 by Kevin Feasel

Tim Spann builds an app:

The above link utilizes the standard REST link and enhances it by setting the beginning date using NiFi’s Expression language to get the current time in UNIX format in seconds. In this example, I am looking at the last week of data for the airport departures and arrivals in the second URL.

We iterate through a list of the largest airports in the United States doing both departures and arrivals since they use the same format.

Read the article to learn more about how you can tie it all together. You can also check out Tim’s GitHub repo to grab the code.

Comments closed

Real-Time Streaming in Azure

Published 2024-09-06 by Kevin Feasel

Temidayo Omoniyi takes us through an architecture:

In today’s world, billions of data are generated daily from messaging applications like WhatsApp, financial data like the New York Stock Exchange, or video streaming platforms like YouTube. As a data engineer or solution architect, you are tasked to design a real-time streaming platform that captures the data as they are generated and stored in the necessary storage for decision-making.

This does a great job of going into detail, not only at the architectural level, but also setup and practical implementation.

Comments closed

Telegraf Performance Optimization

Published 2024-07-23 by Kevin Feasel

Riya shares a few tips on making Telegraf stream data more efficiently:

As businesses grow and their infrastructures become more complex, monitoring becomes a critical component of maintaining system health and performance. Telegraf, an open-source server agent for collecting and sending metrics and events from databases, systems, and IoT sensors, is widely used for this purpose. However, handling high volumes of metrics can strain resources and degrade performance. This blog will explore strategies for optimizing Telegraf’s performance when dealing with high-volume metrics.

Click through for an architectural overview and five things you can do to optimize performance.

Comments closed

Transforming a REST API into a Data Stream

Published 2024-06-18 by Kevin Feasel

Lucia Cerchie and Dave Troiano build a stream:

In the space of APIs for consuming up-to-date data (say, events or state available within an hour of occurring) many API paradigms exist. There are file- or object-based paradigms, e.g., S3 access. There’s database access, e.g., direct Snowflake access. Last, we have decoupled client-server APIs, e.g., REST APIs, gRPC, webhooks, and streaming APIs. In this context, “decoupled” means that the client usually communicates with the server over a language-agnostic standard network protocol like HTTP/S, usually receives data in a standard format like JSON, and, in contrast to direct database access, typically doesn’t know what data store backs the API.

Of the above styles, more often than not, API developers settle on HTTP-based REST APIs for a number of reasons. They are incredibly popular. More developers know how to use REST APIs and are using them in production compared to other API technologies. For example, Rapid API’s 2022 State of APIs reports 69.3% of survey respondents using REST APIs in production, well above the percentage using alternatives like gRPC (8.2%), GraphQL (18.6%), or webhooks (34.6%).

Click through for a demonstration of how to take an existing REST API and build a data stream out of it using Apache Kafka and Apache Flink.

Comments closed

Combining Flink SQL, Streamlit, and Kafka

Published 2024-06-13 by Kevin Feasel

Lucia Cerchie has a pair of posts. First up, Lucia sets the stage:

n part 1 of this series, we’ll make an app, hosted on Streamlit, that allows a user to select a stock, in this case SPY, or the SPDR S&P 500 ETF Trust. Upon selection, a live chart of the stock’s bid prices, calculated every five seconds, will appear.

What are the pieces that go into making this work? The source of the data is the Alpaca Market Data API. We’ll hook up a Kafka producer to the websocket stream and send data to a Kafka topic in Confluent Cloud. Then we’ll use Flink SQL within Confluent Cloud’s Flink SQL workspace to tumble an average bid price every five seconds. Finally, we’ll use a Kafka consumer to receive that data and populate it to a Streamlit component in real time. This frontend component will be deployed on Streamlit as well.

Part 2 then closes the trap:

In part one of this series, we walked through how to use Streamlit, Apache Kafka®, and Apache Flink® to create a live data-driven user interface for a market data application to select a stock (e.g., SPY) and discussed the structure of the app at a high level. First, data with information on stock bid prices is moved via an Alpaca websocket, then, it’s produced to a Kafka topic in Confluent Cloud where it is also processed with Flink SQL.

Now comes the tricky part: running the Kafka consumer and producer in the same application.

Click through for a good demonstration of a practical solution. Lucia also has a GitHub repo with all of the code, a demo of the site in action, and some links to additional resources.

Comments closed

Processing GitHub Data with Kafka Streams

Published 2024-03-28 by Kevin Feasel

Lucia Cerchie hits the GItHub API:

GitHub’s data sources (REST + GraphQL APIs) are not only developer-friendly, but a goldmine of interesting statistics on the health of developer communities. Companies like OpenSauced, linearb, and TideLift can measure the impact of developers and their projects using statistics gleaned from GitHub’s APIs. The results of GitHub analysis can change both day-to-day and over time.

Apache Kafka is a large and active open source project with nearly a million lines of code. It also happens to be an event streaming platform. So why not use Apache Kafka to, well, monitor itself? And learn a bit about Kafka Streams along the way?

Click through for the full article, including a demonstration.

Comments closed

Exposing Kafka Data in Iceberg using Tableflow

Published 2024-03-20 by Kevin Feasel

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this erroneous, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to our Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.

Comments closed

Combining Kafka and Flink

Published 2024-02-20 by Kevin Feasel

Gautam Goswami shares some thoughts:

In short, the process of collecting data in real-time as streams of events from event sources such as databases, sensors, and software applications is known as event streaming. With real-time data processing and analytics in mind, Apache Flink is a potent open-source program. For situations where quick insights and minimal processing latency are critical, it offers a consistent and effective platform for managing continuous streams of data.

I’ve found it interesting that Confluent people have spent a lot of time the past several months talking up Apache Flink and Kafka+Flink combinations.

Comments closed

Generating Synthetic Data for Streaming in Microsoft Fabric

Published 2024-02-08 by Kevin Feasel

Sandeep Pawar builds out some data:

If you want to learn or demo Real Time Analytics in Microsoft Fabric, you will need a streaming data source. You can use the built-in samples to get started. But there are several data generators which you can use to create custom streaming sample datasets, Azure Stream Analytics data generator being one of them. You can see them here. In this blog, I will show how to set one up to use with Fabric Eventstream.

Read on for a step-by-step guide.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31