Press "Enter" to skip to content

Category: Streaming

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Vasyl Sarzhynskyi has some techniques to make rebalancing in Kafka less of a big deal:

Kafka Rebalance happens when a new consumer is either added (joined) into the consumer group or removed (left). It becomes dramatic during application service deployment rollout, as multiple instances restarted at the same time, and rebalance latency significantly increasing. During rebalance, consumers stop processing messages for some period of time, and, as a result, processing of events from a topic happens with some delay. Some business cases could tolerate rebalancing, meanwhile, others require real-time event processing and it’s painful to have delays in more than a few seconds. Here we will try to figure out how to decrease rebalance for Kafka-Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Read on for an example of the problem, as well as several practical tips for mitigating the issue.

Leave a Comment

Digital Forensics with Apache Kafka

Kai Waehner continues a series on using Apache Kafka as the backbone for computer security:

Storing data long-term in Kafka is possible since the beginning. Each Kafka topic gets a retention time. Many use cases use a retention time of a few hours or days as the data is only processed and stored in another system (like a database or data warehouse). However, more and more projects use a retention time of a few years or even -1 (= forever) for some Kafka topics (e.g., due to compliance reasons or to store transactional data).

The drawback of using Kafka for forensics is the huge volume of historical data and its related high cost and scalability issues. This gets pretty expensive as Kafka uses regular HDDs or SDDS as the disk storage. Additionally, data rebalancing between brokers (e.g., if a new broker is added to a cluster) takes a long time for huge volumes of data sets. Hence, rebalancing takes hours can impact scalability and reliability.

But there is a solution to these challenges: Tiered Storage.

Click through to learn more.

Leave a Comment

Threat Intelligence and Kafka

Kai Waehner continues a series on using Apache Kafka as the foundation for a security solution:

Threat intelligence, or cyber threat intelligence, reduces harm by improving decision-making before, during, and after cybersecurity incidents reducing operational mean time to recovery, and reducing adversary dwell time for information technology environments.

Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications, and action-oriented advice about an existing or emerging menace or hazard to assets. This intelligence can be used to inform decisions regarding the subject’s response to that menace or hazard.

Threat intelligence solutions gather raw data about emerging or existing threat actors & threats from various sources. This data is then analyzed and filtered to produce threat intel feeds and management reports that contain information that automated security control solutions can use.

Threat intelligence keeps organizations informed of the risks of advanced persistent threats, zero-day threats and exploits, and how to protect against them.

Read the whole thing.

1 Comment

ksqldb 0.19.0 Released

Tom Nguyen announces a new version of ksqldb:

ksqlDB 0.19.0 adds support for foreign-key joins between tables. Data decomposition into multiple tables (i.e., schema normalization) is a key strength of the relational data model and often requires joining tables based on a foreign key. So far, we have been able to provide tools for normalizing data, provided the rows in each of the tables followed a one-to-one relationship (i.e., have the same primary key).

Providing built-in support for foreign-key joins, which was previously only possible to do through workarounds, unlocks many new use cases where you’d like to have a many-to-one relationship between your tables. This is a highly demanded feature, and we are excited to finally make it available.

Click through to see what else they’ve included.

Leave a Comment

Feeding Data from Kafka into Splunk

Guy Shilo performs a bit of data migration:

Kafka connect is a framework that uses Kafka topics for collecting data from various sources and distributing it to different sinks. It comes bundled with Kafka installation but can run independently from Kafka brokers and access them remotely. Here is an explanation about what Kafka connect is and it’s architecture. It is also a good candidate for running on Kubernetes since it only uses outgoing communication.

The framework uses plugins to be able to talk to different sources and sinks. There are many ready plugins for a variety of systems. Some of them are free and some are licensed to companies like Confluent or Debezium. many of them can be found here. Some systems can be a source of data, some can be a sink and some can be both. Basically a source adapter polls the source system for changes, pulls the data and publish it in a Kafka topic. A sink adapter subscribes to a Kafka topic, gets incoming events and exports them into the target system.

As I mentioned, there are several dozens of supported adapters. Just for the demonstration we will capture events from kafka topic and store them in splunk for visualization and investigation.

Click through to see how it all fits together.

Leave a Comment

Two Ways to Access Kafka Topics from R

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.

In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.

Leave a Comment

Using Kafka for Security Situational Awareness

Kai Waehner continues a series on using Apache Kafka on security teams:

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Click through for the high-level discussion.

Leave a Comment

Event Streaming for Security

Kai Waehner has a new series, and part 1 is all about using Apache Kafka as the backbone for a cybersecurity infrastructure:

This introductory post explored the basics of cybersecurity and how it relates respectively why it requires data in motion powered by Apache Kafka. The rest of the series will go deeper into specific topics that partly rely on each other.

Threat intelligence is only possible with situational awareness. Forensics is complementary. Deployments differ depending on security, safety, and compliance requirements.

Click through for the article.

Leave a Comment

Streaming Foreign Key Joins in Kafka Streams

John Roesler and Adam Bellemare take us in depth on a feature:

Before 2.4.0, the absence of foreign-key joins in Kafka Streams was palpable. As soon as you have a KTable abstraction, you start to think of relational-DB-esque things that you’d like to do with it, and joining two tables is near the top of the list. In addition, Kafka users often started out by implementing change data capture (CDC) of their main database tables, resulting in the production of normalized record streams reflecting the database model. These records often contain foreign-key references, requiring you to either denormalize entirely within your source database (which can be quite expensive), or handle them downstream in your consumer. The ability to compute denormalization on the fly is exactly in the sweet spot of use cases for Kafka Streams.

In versions prior to 2.4, there were workarounds available to compute a foreign-key join, using the ability to transform the table, filter it, aggregate on properties, and join on primary keys. But these workarounds were complex, prone to bugs, and not very efficient. A concrete plan to implement first-class support for this crucial operation was first put together when Jan Filipiak proposed KIP-213 in 2017. Adam Bellemare took over driving the proposal in 2018 and brought it to a conclusion in time for the 2.4.0 release.

Click through for examples of how it all works, as well as how you might optimize foreign key joins.

Comments closed

Identifying Backpressure in Apache Flink

Piotr Nowojski explains an important concept in streaming (and ELT/ETL) products:

The backpressure topic was tackled from different angles over the last couple of years. However, when it comes to identifying and analyzing sources of backpressure, things have changed quite a bit in the recent Flink releases (especially with new additions to metrics and the web UI in Flink 1.13). This post will try to clarify some of these changes and go into more detail about how to track down the source of backpressure, but first…

Read on for the full story, including a review of the concept and its importance.

Comments closed