Streaming – Page 2 – Curated SQL

Telegraf Performance Optimization

Published 2024-07-23 by Kevin Feasel

Riya shares a few tips on making Telegraf stream data more efficiently:

As businesses grow and their infrastructures become more complex, monitoring becomes a critical component of maintaining system health and performance. Telegraf, an open-source server agent for collecting and sending metrics and events from databases, systems, and IoT sensors, is widely used for this purpose. However, handling high volumes of metrics can strain resources and degrade performance. This blog will explore strategies for optimizing Telegraf’s performance when dealing with high-volume metrics.

Click through for an architectural overview and five things you can do to optimize performance.

Comments closed

Transforming a REST API into a Data Stream

Published 2024-06-18 by Kevin Feasel

Lucia Cerchie and Dave Troiano build a stream:

In the space of APIs for consuming up-to-date data (say, events or state available within an hour of occurring) many API paradigms exist. There are file- or object-based paradigms, e.g., S3 access. There’s database access, e.g., direct Snowflake access. Last, we have decoupled client-server APIs, e.g., REST APIs, gRPC, webhooks, and streaming APIs. In this context, “decoupled” means that the client usually communicates with the server over a language-agnostic standard network protocol like HTTP/S, usually receives data in a standard format like JSON, and, in contrast to direct database access, typically doesn’t know what data store backs the API.

Of the above styles, more often than not, API developers settle on HTTP-based REST APIs for a number of reasons. They are incredibly popular. More developers know how to use REST APIs and are using them in production compared to other API technologies. For example, Rapid API’s 2022 State of APIs reports 69.3% of survey respondents using REST APIs in production, well above the percentage using alternatives like gRPC (8.2%), GraphQL (18.6%), or webhooks (34.6%).

Click through for a demonstration of how to take an existing REST API and build a data stream out of it using Apache Kafka and Apache Flink.

Comments closed

Combining Flink SQL, Streamlit, and Kafka

Published 2024-06-13 by Kevin Feasel

Lucia Cerchie has a pair of posts. First up, Lucia sets the stage:

n part 1 of this series, we’ll make an app, hosted on Streamlit, that allows a user to select a stock, in this case SPY, or the SPDR S&P 500 ETF Trust. Upon selection, a live chart of the stock’s bid prices, calculated every five seconds, will appear.

What are the pieces that go into making this work? The source of the data is the Alpaca Market Data API. We’ll hook up a Kafka producer to the websocket stream and send data to a Kafka topic in Confluent Cloud. Then we’ll use Flink SQL within Confluent Cloud’s Flink SQL workspace to tumble an average bid price every five seconds. Finally, we’ll use a Kafka consumer to receive that data and populate it to a Streamlit component in real time. This frontend component will be deployed on Streamlit as well.

Part 2 then closes the trap:

In part one of this series, we walked through how to use Streamlit, Apache Kafka®, and Apache Flink® to create a live data-driven user interface for a market data application to select a stock (e.g., SPY) and discussed the structure of the app at a high level. First, data with information on stock bid prices is moved via an Alpaca websocket, then, it’s produced to a Kafka topic in Confluent Cloud where it is also processed with Flink SQL.

Now comes the tricky part: running the Kafka consumer and producer in the same application.

Click through for a good demonstration of a practical solution. Lucia also has a GitHub repo with all of the code, a demo of the site in action, and some links to additional resources.

Comments closed

Processing GitHub Data with Kafka Streams

Published 2024-03-28 by Kevin Feasel

Lucia Cerchie hits the GItHub API:

GitHub’s data sources (REST + GraphQL APIs) are not only developer-friendly, but a goldmine of interesting statistics on the health of developer communities. Companies like OpenSauced, linearb, and TideLift can measure the impact of developers and their projects using statistics gleaned from GitHub’s APIs. The results of GitHub analysis can change both day-to-day and over time.

Apache Kafka is a large and active open source project with nearly a million lines of code. It also happens to be an event streaming platform. So why not use Apache Kafka to, well, monitor itself? And learn a bit about Kafka Streams along the way?

Click through for the full article, including a demonstration.

Comments closed

Exposing Kafka Data in Iceberg using Tableflow

Published 2024-03-20 by Kevin Feasel

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this erroneous, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to our Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.

Comments closed

Combining Kafka and Flink

Published 2024-02-20 by Kevin Feasel

Gautam Goswami shares some thoughts:

In short, the process of collecting data in real-time as streams of events from event sources such as databases, sensors, and software applications is known as event streaming. With real-time data processing and analytics in mind, Apache Flink is a potent open-source program. For situations where quick insights and minimal processing latency are critical, it offers a consistent and effective platform for managing continuous streams of data.

I’ve found it interesting that Confluent people have spent a lot of time the past several months talking up Apache Flink and Kafka+Flink combinations.

Comments closed

Generating Synthetic Data for Streaming in Microsoft Fabric

Published 2024-02-08 by Kevin Feasel

Sandeep Pawar builds out some data:

If you want to learn or demo Real Time Analytics in Microsoft Fabric, you will need a streaming data source. You can use the built-in samples to get started. But there are several data generators which you can use to create custom streaming sample datasets, Azure Stream Analytics data generator being one of them. You can see them here. In this blog, I will show how to set one up to use with Fabric Eventstream.

Read on for a step-by-step guide.

Comments closed

The Data Streaming Landscape in 2024

Published 2023-12-22 by Kevin Feasel

Kai Waehner gives us an overview of where data streaming technologies are at:

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink and related cloud offerings emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to get market share leveraging the Kafka protocol. This blog post explores the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook to potential new entrants in 2025.

Kai is Kafka-centric but this is a good overview of the industry and worth taking the time to read.

Comments closed

Using Data Contracts in Confluent Schema Registry

Published 2023-10-19 by Kevin Feasel

Robert Yokota shows us how to generate data contracts for streaming solutions:

A data contract is a formal agreement between an upstream component and a downstream component on the structure and semantics of data that is in motion. The upstream component enforces the data contract, while the downstream component can assume that the data it receives conforms to the data contract. Data contracts are important because they provide transparency over dependencies and data usage in a streaming architecture. They help to ensure the consistency, reliability, and quality of the data in event streams, and they provide a single source of truth for understanding the data in motion.

Click through for a sample application that uses data contracts.

Comments closed

Running Apache Flink Jobs from HDInsight

Published 2023-10-18 by Kevin Feasel

Sairam Yeturi builds a streaming job:

Could you already complete creating your first Apache Flink® cluster and submit your streaming job on it with HDInsight on AKS?

Well, if you are yet to do that – Let me help you get started.

Click through for a step-by-step walkthrough on how to create a Flink-centric HDInsight cluster on Azure Kubernetes Service and how to create a new job, assuming you have the Jarfile for that job already.

Comments closed

Category: Streaming