Category: Streaming

Where Kafka Connect Fits

Published 2021-09-17 by Kevin Feasel

Shivani Sarthi explains the value of Kafka Connect:

Kafka Connect is not just a free, open source component of Apache Kafka; it also works as a centralised data hub for simple data integration between databases, key-value stores, etc. The fundamental components include:
– Connectors
– Tasks
– Workers
– Converters
– Transforms
– Dead letter queue
Moreover, it is a framework to stream data in and out of Apache Kafka. In addition, the Confluent Platform comes with many built-in connectors, used for streaming data to and from different data sources.

Click through for information on each component.
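
To make the component list a little more concrete, here is a minimal sketch in Python that builds the kind of JSON definition a Connect worker accepts over its REST API (`POST /connectors`). The connector name, topic names, file path, and transform alias are all made up for illustration; the configuration keys and class names are standard Kafka Connect ones, though whether the file sink plugin is available depends on your installation.

```python
import json

# Hypothetical sink connector definition. The name, topics, file path, and
# the "addSource" alias are illustrative assumptions, not from the article.
connector = {
    "name": "orders-file-sink",
    "config": {
        # Connector: the plugin that moves the data (Kafka's bundled file sink).
        "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
        # Tasks: upper bound on the units of work that workers will run.
        "tasks.max": "2",
        "topics": "orders",
        "file": "/tmp/orders.txt",
        # Converters: how record keys and values are (de)serialized.
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
        # Transforms: a single message transform that adds a static field.
        "transforms": "addSource",
        "transforms.addSource.type": "org.apache.kafka.connect.transforms.InsertField$Value",
        "transforms.addSource.static.field": "source",
        "transforms.addSource.static.value": "orders-topic",
        # Dead letter queue: route records that fail to a separate topic.
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": "orders-dlq",
        "errors.deadletterqueue.context.headers.enable": "true",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Workers do not appear in the definition itself: they are the processes that receive this payload and run the connector's tasks, and they are configured separately.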

Data Lakehouse Point-of-Sale Analytics Demo

Published 2021-09-10 by Kevin Feasel

Bryan Smith and Rob Saker share a pattern:

Disruptions in the supply chain – from reduced product supply and diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless omnichannel experiences are driving retailers to rethink how they use data to manage their operations. Prior to the pandemic, 71% of retailers named lack of real-time visibility into inventory as a top obstacle to achieving their omnichannel goals. The pandemic only increased demand for integrated online and in-store experiences, placing even more pressure on retailers to present accurate product availability and manage order changes on the fly. Better access to real-time information is the key to meeting consumer demands in the new normal.
In this blog, we’ll address the need for real-time data in retail, and how to overcome the challenges of moving real-time streaming of point-of-sale data at scale with a data lakehouse.

It’s a cool scenario, at the least.

Change Data Capture with Kafka Connect and Cassandra

Published 2021-08-13 by Kevin Feasel

Paul Brebner picks up where a series left off:

We introduced the Debezium architecture and its use of Kafka Connect and explored how the Debezium Cassandra Connector (on the source side of the CDC pipeline) emits change events to Kafka for different database operations.
In the second part of this blog series, we examine how Kafka sink connectors can use the change data, discover that Debezium also propagates database schema changes (in different ways), and summarize our experiences with the Debezium Cassandra Connector used for customer deployment.

Read on for information on some of the concepts, as well as experiences working with the Debezium Cassandra connector.

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Published 2021-07-28 by Kevin Feasel

Vasyl Sarzhynskyi has some techniques to make rebalancing in Kafka less of a big deal:

Kafka rebalance happens when a new consumer is either added to (joins) the consumer group or removed from it (leaves). It becomes dramatic during application service deployment rollout, as multiple instances restart at the same time and rebalance latency increases significantly. During rebalance, consumers stop processing messages for some period of time, and, as a result, processing of events from a topic happens with some delay. Some business cases can tolerate rebalancing; others require real-time event processing, and delays of more than a few seconds are painful. Here we will try to figure out how to decrease the impact of rebalancing for Kafka Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Read on for an example of the problem, as well as several practical tips for mitigating the issue.

Digital Forensics with Apache Kafka

Published 2021-07-26 by Kevin Feasel

Kai Waehner continues a series on using Apache Kafka as the backbone for computer security:

Storing data long-term in Kafka has been possible since the beginning. Each Kafka topic gets a retention time. Many use cases use a retention time of a few hours or days as the data is only processed and stored in another system (like a database or data warehouse). However, more and more projects use a retention time of a few years or even -1 (= forever) for some Kafka topics (e.g., due to compliance reasons or to store transactional data).
The drawback of using Kafka for forensics is the huge volume of historical data and its related high cost and scalability issues. This gets pretty expensive as Kafka uses regular HDDs or SSDs as the disk storage. Additionally, data rebalancing between brokers (e.g., if a new broker is added to a cluster) takes a long time for huge volumes of data. Hence, rebalancing that takes hours can impact scalability and reliability.
But there is a solution to these challenges: Tiered Storage.

Click through to learn more.
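
As a rough illustration of the retention settings the excerpt describes, here is a small Python sketch that computes values for the topic-level `retention.ms` setting, where -1 turns off the time limit. The topic names and the helper function are hypothetical, and the sketch only builds the settings; it does not talk to a cluster.

```python
from datetime import timedelta


def retention_ms(period=None):
    """Return a retention.ms value; None means keep data forever (-1)."""
    if period is None:
        return -1
    return int(period.total_seconds() * 1000)


# Hypothetical topics covering the three cases in the excerpt:
# a few days, a few years, and forever.
topic_configs = {
    "clickstream-raw": {"retention.ms": str(retention_ms(timedelta(days=3)))},
    "payments": {"retention.ms": str(retention_ms(timedelta(days=730)))},
    "audit-events": {"retention.ms": str(retention_ms())},
}

for topic, config in topic_configs.items():
    print(topic, config)
```

Long or unlimited retention is exactly where the storage cost and broker rebalancing concerns from the excerpt start to bite.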

Threat Intelligence and Kafka

Published 2021-07-23 by Kevin Feasel

Kai Waehner continues a series on using Apache Kafka as the foundation for a security solution:

Threat intelligence, or cyber threat intelligence, reduces harm by improving decision-making before, during, and after cybersecurity incidents, reducing operational mean time to recovery, and reducing adversary dwell time for information technology environments.
Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications, and action-oriented advice about an existing or emerging menace or hazard to assets. This intelligence can be used to inform decisions regarding the subject’s response to that menace or hazard.
Threat intelligence solutions gather raw data about emerging or existing threat actors & threats from various sources. This data is then analyzed and filtered to produce threat intel feeds and management reports that contain information that automated security control solutions can use.
Threat intelligence keeps organizations informed of the risks of advanced persistent threats, zero-day threats and exploits, and how to protect against them.

Read the whole thing.

ksqlDB 0.19.0 Released

Published 2021-07-22 by Kevin Feasel

Tom Nguyen announces a new version of ksqlDB:

ksqlDB 0.19.0 adds support for foreign-key joins between tables. Data decomposition into multiple tables (i.e., schema normalization) is a key strength of the relational data model and often requires joining tables based on a foreign key. So far, we have been able to provide tools for normalizing data, provided the rows in each of the tables followed a one-to-one relationship (i.e., have the same primary key).
Providing built-in support for foreign-key joins, which was previously only possible to do through workarounds, unlocks many new use cases where you’d like to have a many-to-one relationship between your tables. This is a highly demanded feature, and we are excited to finally make it available.

Click through to see what else they’ve included.
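
For anyone who wants the many-to-one idea spelled out, here is a plain Python sketch of what a foreign-key join between two tables means. It is not ksqlDB syntax and does not model streaming updates; the tables, column names, and the `foreign_key_join` helper are all hypothetical.

```python
# Two hypothetical tables, each keyed by its own primary key.
customers = {
    "c1": {"name": "Ada"},
    "c2": {"name": "Grace"},
}
orders = {
    "o1": {"customer_id": "c1", "total": 30},
    "o2": {"customer_id": "c1", "total": 12},
    "o3": {"customer_id": "c2", "total": 99},
    "o4": {"customer_id": "c9", "total": 5},  # no matching customer
}


def foreign_key_join(left, right, fk):
    """Inner join: each left row looks up one right row via its foreign key."""
    joined = {}
    for pk, row in left.items():
        match = right.get(row[fk])
        if match is not None:
            # The result keeps the left table's primary key.
            joined[pk] = {**row, **match}
    return joined


orders_with_customer = foreign_key_join(orders, customers, "customer_id")
for order_id, row in orders_with_customer.items():
    print(order_id, row)
```

Orders `o1` and `o2` both pick up the same customer row, which is the many-to-one relationship that a primary-key-to-primary-key join cannot express.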

Feeding Data from Kafka into Splunk

Published 2021-07-22 by Kevin Feasel

Guy Shilo performs a bit of data migration:

Kafka Connect is a framework that uses Kafka topics for collecting data from various sources and distributing it to different sinks. It comes bundled with the Kafka installation but can run independently from Kafka brokers and access them remotely. Here is an explanation about what Kafka Connect is and its architecture. It is also a good candidate for running on Kubernetes since it only uses outgoing communication.
The framework uses plugins to be able to talk to different sources and sinks. There are many ready plugins for a variety of systems. Some of them are free and some are licensed to companies like Confluent or Debezium. Many of them can be found here. Some systems can be a source of data, some can be a sink and some can be both. Basically, a source adapter polls the source system for changes, pulls the data and publishes it to a Kafka topic. A sink adapter subscribes to a Kafka topic, gets incoming events and exports them into the target system.
As I mentioned, there are several dozen supported adapters. Just for the demonstration, we will capture events from a Kafka topic and store them in Splunk for visualization and investigation.

Click through to see how it all fits together.
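
The source-and-sink flow in the excerpt can be sketched in a few lines of Python. This is a toy, in-memory stand-in and not the Kafka Connect API: the `InMemoryBroker`, `SourceAdapter`, and `SinkAdapter` classes, the topic name, and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for Kafka: each topic is an append-only list of events."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, event):
        self.topics[topic].append(event)

    def read_from(self, topic, offset):
        return self.topics[topic][offset:]


class SourceAdapter:
    """Polls a source system for new rows and publishes them to a topic."""

    def __init__(self, broker, topic, source_rows):
        self.broker = broker
        self.topic = topic
        self.source_rows = source_rows
        self.position = 0  # how many source rows have been published so far

    def poll(self):
        new_rows = self.source_rows[self.position:]
        for row in new_rows:
            self.broker.publish(self.topic, row)
        self.position += len(new_rows)
        return len(new_rows)


class SinkAdapter:
    """Reads new events from a topic and exports them to a target system."""

    def __init__(self, broker, topic, target):
        self.broker = broker
        self.topic = topic
        self.target = target
        self.offset = 0  # next topic position to read from

    def poll(self):
        events = self.broker.read_from(self.topic, self.offset)
        self.target.extend(events)
        self.offset += len(events)
        return len(events)


broker = InMemoryBroker()
source_system = [{"id": 1, "msg": "login"}, {"id": 2, "msg": "logout"}]
search_index = []  # plays the role of the target system

source = SourceAdapter(broker, "events", source_system)
sink = SinkAdapter(broker, "events", search_index)

source.poll()
sink.poll()
source_system.append({"id": 3, "msg": "login"})  # a later change at the source
source.poll()
sink.poll()
print(search_index)
```

Each adapter tracks its own position, so a later poll picks up only what is new; the source and the target never talk to each other directly, only through the topic.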

Two Ways to Access Kafka Topics from R

Published 2021-07-21 by Kevin Feasel

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.
In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.

Using Kafka for Security Situational Awareness

Published 2021-07-15 by Kevin Feasel

Kai Waehner continues a series on using Apache Kafka on security teams:

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Click through for the high-level discussion.
