Hadoop – Page 28 – Curated SQL

Apache Kafka has the default ability to allow a topic to be created on a broker when a message is written to it and when a topic with the name the message is attempting to be written to does not exist. This can be very helpful in early development or prototyping where code, topic names, and schemas are in flux. However, past that early stage, it is recommended that Kafka be configured to disable the auto-creation of topics from messages for a few reasons. In this article I am going to touch on two of these reasons that are also core principles of Kafka partitions and Kafka topic replication.

Read on to understand what partitions and replication factor have to do with all of this.

Comments closed

RDD vs Dataframe vs Dataset in Spark

Published 2021-08-18 by Kevin Feasel

The Hadoop in Real World team disambiguates three APIs:

RDD, Dataframe and Dataset are all Spark APIs introduced in Spark at different points in time. The goal of these API is to help us work with large datasets in a distributed fashion in Spark with performance in mind.

Click through for the comparison.

Comments closed

Fixing Hadoop Namenodes in Safe Mode

Published 2021-08-16 by Kevin Feasel

The Hadoop in Real World team doesn’t need to play it safe:

When namenode is started or restarted, namenode will be in safemode for a period of time. At this time you will not be able to use your Hadoop cluster fully. Write operations to HDFS will fail and because of that your MapReduce jobs will also fail.

Read on for other reasons why the namenode might be stuck in safe mode and what you can do to fix it.

Comments closed

Change Data Capture with Kafka Connect and Cassandra

Published 2021-08-13 by Kevin Feasel

Paul Brebner picks up where a series left off:

We introduced the Debezium architecture and its use of Kafka Connect and explored how the Debezium Cassandra Connector (on the source side of the CDC pipeline) emits change events to Kafka for different database operations.
In the second part of this blog series, we examine how Kafka sink connectors can use the change data, discover that Debezium also propagates database schema changes (in different ways), and summarize our experiences with the Debezium Cassandra Connector used for customer deployment.

Read on for information on some of the concepts, as well as experiences working with the Debezium Cassandra connector.

Comments closed

Moving Data from Confluent Cloud to Cosmos DB

Published 2021-08-13 by Kevin Feasel

Nathan Ham announces the Azure Cosmos DB sink connector in Confluent Cloud:

Today, Confluent is announcing the general availability (GA) of the fully managed Azure Cosmos DB Sink Connector within Confluent Cloud. Now, with just a few simple clicks, you can link the power of Apache Kafka^® together with Azure Cosmos DB to set your data in motion.

Click through for a marketing-heavy look at how this works.

Comments closed

Generating Artificial Data with Databricks Generator

Published 2021-08-13 by Kevin Feasel

Ust Oldfield shows off a new tool:

Databricks Labs is a relatively new offering from Databricks which showcases what their teams have been creating in the field to help their customers. As a Consultant, this makes my life a lot easier as I don’t have to re-invent the wheel and I can use it to demonstrate value in partnering with Databricks. There’s plenty of use cases that I’ll be using, and extending, with my client but the one I want to focus on in this post is the Data Generator.

Read on for an example of how this works. Something not in Ust’s post but worth mentioning is that you can control the distribution of random numeric features. That’s a piece of functionality you often don’t see in data generators.

1 Comment

Kafka and SIEM/SOAR Tools

Published 2021-08-12 by Kevin Feasel

Kai Waehner wraps up a series on Apache Kafka and network security:

SIEM combines security information management (SIM) and security event management (SEM). They provide analysis of security alerts generated by applications and network hardware. Vendors sell SIEM as software, as appliances, or as managed services; these products are also used for logging security data and generating reports for compliance purposes.
SOAR tools automate security incident management investigations via a workflow automation workbook. The cyber intelligence API enables the playbook to automate research related to the ticket (lookup potential phishing URL, suspicious hash, etc.). The first responder determines the criticality of the event. At this level, it is either a normal or an escalation event. SOAR includes security incident response platforms (SIRPs), Security orchestration and automation (SOA), and threat intelligence platforms (TIPs).
In summary, SIEM and SOAR are key pieces of a modern cybersecurity infrastructure. The capabilities, use cases, and architectures are different for every company.

Click through to see where Kafka can fit in all of this.

Comments closed

Exporting a Hive Table to CSV

Published 2021-08-12 by Kevin Feasel

The Hadoop in Real World team shows how you can export data from a Hive table specifically into a file using comma-separated values:

It is a pretty common use case to export the contents of a Hive table into a CSV file. It’s pretty simple if you are using a recent version of Hive. In this post, we will see who to achieve this with both newer and older versions of Hive.

Read on to see both versions of the answer.

Comments closed

How Spark Determines Task Numbers and Parallelism

Published 2021-08-06 by Kevin Feasel

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks and number of tasks to execute in parallel in a job.
Let’s see how Spark decides on the number of tasks with the below set of instructions.
[… instructions]
Let’s also assume dataset_Y has 10 partitions and dataset_Y has 5 partitions.

Click through for the full explanation.

Comments closed

Helpful Tools for Apache Kafka Developers

Published 2021-08-06 by Kevin Feasel

Dave Klein has a few tools to make working with Apache Kafka a little easier:

We like to save the best for last, but this tool is too good to wait. So, we’ll start off by covering kafkacat.
kafkacat is a fast and flexible command line Kafka producer, consumer, and more. Magnus Edenhill, the author of the librdkafka C/C++ library for Kafka, developed it. kafkacat is great for quickly producing and consuming data to and from a topic. In fact, the same command will do both, depending on the context. Check this out:

Read on for more information on this tool, as well as several others.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop

Thoughts on Topic Replication in Kafka

RDD vs Dataframe vs Dataset in Spark

Fixing Hadoop Namenodes in Safe Mode

Change Data Capture with Kafka Connect and Cassandra

Moving Data from Confluent Cloud to Cosmos DB

Generating Artificial Data with Databricks Generator

Kafka and SIEM/SOAR Tools

Exporting a Hive Table to CSV

How Spark Determines Task Numbers and Parallelism

Helpful Tools for Apache Kafka Developers