Press "Enter" to skip to content

Category: Hadoop

Change Data Capture with Kafka Connect and Cassandra

Paul Brebner picks up where a series left off:

We introduced the Debezium architecture and its use of Kafka Connect and explored how the Debezium Cassandra Connector (on the source side of the CDC pipeline) emits change events to Kafka for different database operations. 

In the second part of this blog series, we examine how Kafka sink connectors can use the change data, discover that Debezium also propagates database schema changes (in different ways), and summarize our experiences with the Debezium Cassandra Connector used for customer deployment. 

Read on for information on some of the concepts, as well as experiences working with the Debezium Cassandra connector.

Comments closed

Moving Data from Confluent Cloud to Cosmos DB

Nathan Ham announces the Azure Cosmos DB sink connector in Confluent Cloud:

Today, Confluent is announcing the general availability (GA) of the fully managed Azure Cosmos DB Sink Connector within Confluent Cloud. Now, with just a few simple clicks, you can link the power of Apache Kafka® together with Azure Cosmos DB to set your data in motion.

Click through for a marketing-heavy look at how this works.

Comments closed

Generating Artificial Data with Databricks Generator

Ust Oldfield shows off a new tool:

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.

1 Comment

Kafka and SIEM/SOAR Tools

Kai Waehner wraps up a series on Apache Kafka and network security:

SIEM combines security information management (SIM) and security event management (SEM). They provide analysis of security alerts generated by applications and network hardware. Vendors sell SIEM as software, as appliances, or as managed services; these products are also used for logging security data and generating reports for compliance purposes.

SOAR tools automate security incident management investigations via a workflow automation workbook. The cyber intelligence API enables the playbook to automate research related to the ticket (lookup potential phishing URL, suspicious hash, etc.). The first responder determines the criticality of the event. At this level, it is either a normal or an escalation event. SOAR includes security incident response platforms (SIRPs), Security orchestration and automation (SOA), and threat intelligence platforms (TIPs).

In summary, SIEM and SOAR are key pieces of a modern cybersecurity infrastructure. The capabilities, use cases, and architectures are different for every company.

Click through to see where Kafka can fit in all of this.

Comments closed

How Spark Determines Task Numbers and Parallelism

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks and number of tasks to execute in parallel in a job.

Let’s see how Spark decides on the number of tasks with the below set of instructions.

[… instructions]

Let’s also assume dataset_Y has 10 partitions and dataset_Y has 5 partitions.

Click through for the full explanation.

Comments closed

Helpful Tools for Apache Kafka Developers

Dave Klein has a few tools to make working with Apache Kafka a little easier:

We like to save the best for last, but this tool is too good to wait. So, we’ll start off by covering kafkacat.

kafkacat is a fast and flexible command line Kafka producer, consumer, and more. Magnus Edenhill, the author of the librdkafka C/C++ library for Kafka, developed it. kafkacat is great for quickly producing and consuming data to and from a topic. In fact, the same command will do both, depending on the context. Check this out:

Read on for more information on this tool, as well as several others.

Comments closed

A Warning: VPCs and Distributed Database Platforms

Wade Trimmer takes us through a reason why you might not want to use VPC endpoints to separate applications from distributed database platforms:

AWS PrivateLink (also known as a VPC endpoint) is a technology that allows the user to securely access services using a private IP address. It is not recommended to configure an AWS PrivateLink connection with Apache Kafka or Apache Cassandra mainly due to a single entry point problem. PrivateLink only exposes a single IP to the user and requires a load balancer between the user and the service. Realistically, the user would need to have an individual VPC endpoint per node, which is expensive and may not work. 

Using PrivateLink, it is impossible by design to contact specific IPs within a VPC in the same way you can with VPC peering. VPC peering allows a connection between two VPCs, while PrivateLink publishes an endpoint that others can connect to from their own VPC.

Read on to understand how this affects platforms like Cassandra and Kafka.

Comments closed

Tips for Decreasing the Impact of Rebalancing in Kafka Streams

Vasyl Sarzhynskyi has some techniques to make rebalancing in Kafka less of a big deal:

Kafka Rebalance happens when a new consumer is either added (joined) into the consumer group or removed (left). It becomes dramatic during application service deployment rollout, as multiple instances restarted at the same time, and rebalance latency significantly increasing. During rebalance, consumers stop processing messages for some period of time, and, as a result, processing of events from a topic happens with some delay. Some business cases could tolerate rebalancing, meanwhile, others require real-time event processing and it’s painful to have delays in more than a few seconds. Here we will try to figure out how to decrease rebalance for Kafka-Streams clients (even though some tips will be useful for other Kafka consumer clients as well).

Read on for an example of the problem, as well as several practical tips for mitigating the issue.

Comments closed