Press "Enter" to skip to content

Category: Streaming

Two Ways to Access Kafka Topics from R

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.

In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.

Comments closed

Using Kafka for Security Situational Awareness

Kai Waehner continues a series on using Apache Kafka on security teams:

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Click through for the high-level discussion.

Comments closed

Event Streaming for Security

Kai Waehner has a new series, and part 1 is all about using Apache Kafka as the backbone for a cybersecurity infrastructure:

This introductory post explored the basics of cybersecurity and how it relates respectively why it requires data in motion powered by Apache Kafka. The rest of the series will go deeper into specific topics that partly rely on each other.

Threat intelligence is only possible with situational awareness. Forensics is complementary. Deployments differ depending on security, safety, and compliance requirements.

Click through for the article.

Comments closed

Streaming Foreign Key Joins in Kafka Streams

John Roesler and Adam Bellemare take us in depth on a feature:

Before 2.4.0, the absence of foreign-key joins in Kafka Streams was palpable. As soon as you have a KTable abstraction, you start to think of relational-DB-esque things that you’d like to do with it, and joining two tables is near the top of the list. In addition, Kafka users often started out by implementing change data capture (CDC) of their main database tables, resulting in the production of normalized record streams reflecting the database model. These records often contain foreign-key references, requiring you to either denormalize entirely within your source database (which can be quite expensive), or handle them downstream in your consumer. The ability to compute denormalization on the fly is exactly in the sweet spot of use cases for Kafka Streams.

In versions prior to 2.4, there were workarounds available to compute a foreign-key join, using the ability to transform the table, filter it, aggregate on properties, and join on primary keys. But these workarounds were complex, prone to bugs, and not very efficient. A concrete plan to implement first-class support for this crucial operation was first put together when Jan Filipiak proposed KIP-213 in 2017. Adam Bellemare took over driving the proposal in 2018 and brought it to a conclusion in time for the 2.4.0 release.

Click through for examples of how it all works, as well as how you might optimize foreign key joins.

Comments closed

Identifying Backpressure in Apache Flink

Piotr Nowojski explains an important concept in streaming (and ELT/ETL) products:

The backpressure topic was tackled from different angles over the last couple of years. However, when it comes to identifying and analyzing sources of backpressure, things have changed quite a bit in the recent Flink releases (especially with new additions to metrics and the web UI in Flink 1.13). This post will try to clarify some of these changes and go into more detail about how to track down the source of backpressure, but first…

Read on for the full story, including a review of the concept and its importance.

Comments closed

Consistency and Completeness in Kafka Streams

Guozhang Wang announces a whitepaper:

Recently, however, some streaming engines, such as Apache Kafka® and its ecosystem component Kafka Streams, have been able to claim strong correctness guarantees, with the primary dual metrics being consistency, a guarantee that a stream processing application can recover from failures to a consistent state such that final results will not contain duplicates or lose any data, and completeness, a guarantee that a stream processing application does not generate incomplete partial outputs as final results even when input stream records may arrive out of order.

Click through for more details and a link to the paper itself. It’s good to understand as much as you can about the distributed system you use, especially because many times, the claims for consistency should come with large asterisks.

Comments closed

Data Pipeline Error Handling with Apache NiFi

Pieter Humphrey gives us a few techniques for handling data pipeline errors when running Apache NiFi:

The more complex the model, the more possible sources of problems exist. Forecasting every single potential problem is, of course, impossible. Identifying the most important ones and providing self-solving solutions can greatly reduce the operational uncertainty of our NiFi pipeline and improve its robustness.

To see how to do this analysis, we will consider four possible strategies: one external and three internal. They certainly do not cover all potential error scenarios, they are just examples that we can extrapolate from, and inform how to handle other potential failure domains.

Click through for an overview of the topic as well as those four techniques.

Comments closed

Recent Apache NiFi Updates

Pierre Villard has some news for us around Apache NiFi:

Cloudera released a lot of things around Apache NiFi recently! We just released Cloudera Flow Management (CFM) 2.1.1 that provides Apache NiFi on top of Cloudera Data Platform (CDP) 7.1.6. This major release provides the latest and greatest of Apache NiFi as it includes Apache NiFi 1.13.2 and additional improvements, bug fixes, components, etc. Cloudera also released CDP 7.2.9 on all three major cloud platforms, and it also brings Flow Management on DataHub with Apache NiFi 1.13.2 and more.  Let’s have a look at the main highlights of these releases.

Click through to see what’s included.

Comments closed

Batch Execution Mode in Flink’s DataStream API

Dawid Wysakowicz takes us through batch execution mode in a streaming solution:

Flink has been following the mantra that Batch is a Special Case of Streaming since the very early days. As the project evolved to address specific uses cases, different core APIs ended up being implemented for batch (DataSet API) and streaming execution (DataStream API), but the higher-level Table API/SQL was subsequently designed following this mantra of unification. With Flink 1.12, the community worked on bringing a similarly unified behaviour to the DataStream API, and took the first steps towards enabling efficient batch execution in the DataStream API.

Read on to see the progress they’ve achieved so far.

Comments closed

Updates to Message Keys in ksqlDB

Victoria Xia announces an improvement to ksqlDB:

One of the most highly requested enhancements to ksqlDB is here! Apache Kafka® messages may contain data in message keys as well as message values. Until now, ksqlDB could only read limited kinds of data from the key position. ksqlDB’s latest release—ksqlDB 0.15—adds support for many more types of data in messages keys, including message keys with multiple columns. Users of Confluent Cloud ksqlDB already have access to these new features as Confluent Cloud always runs the latest release of ksqlDB.

Read on for more information on this, as well as some of the ramifications of this change.

Comments closed