Press "Enter" to skip to content

Category: Hadoop

Resetting a Consumer Offset in Kafka

The Hadoop in Real World team shows how to update the consumer offset in Kafka:

In some scenarios, consumers reading messages from a Kafka partition can hit errors, leaving consumption incomplete. When that happens, you may need to re-consume messages which were previously consumed, and that means resetting the consumer offset to an earlier offset.

This is one of the big advantages of a log-based message broker versus a queue: if you find a bug in a downstream consumer, you can fix the bug and replay the log to generate correct results, something which can be much harder to do when consumed messages are gone.
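
As a quick, hedged sketch: the kafka-consumer-groups tool that ships with Kafka can rewind a group's committed offset (the group, topic, and broker address below are placeholders, and the group must have no active members):

```bash
# Preview a rewind to the earliest offset; swap --dry-run for --execute to apply it.
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --group my-consumer-group \
  --topic my-topic \
  --reset-offsets --to-earliest \
  --dry-run
```

The tool also supports targets such as --to-offset, --to-datetime, and --shift-by if you only need to rewind part of the way.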


Building a Kafka Test Environment with Kafdrop

Diogo Souza walks us through an interesting project:

From a daily life standpoint, it’s challenging to manage Kafka brokers, partitions, topics, producers, and consumers all via command line. An interface would be quite helpful.

There are plenty of web UI options available for managing your Kafka brokers. Perhaps Confluent's version is one of the most complete, although it is part of a paid bundle aimed mostly at enterprises.

Amongst the myriad of open-source options, Kafdrop stands out for being simple, fast, and easy to use. It is an open-source web project that allows you to view information from Kafka brokers, such as existing topics, consumers, and even the content of messages sent.

This article explores creating a more flexible test environment to work alongside the .NET app built in the previous article. This way, you’ll have more powerful tools to understand what’s happening with your topics.

Read on to learn how you can install and use Kafdrop.
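
If you want to kick the tires first, Kafdrop publishes a Docker image; a minimal sketch, assuming a broker already listening on port 9092, looks like this (the broker address is a placeholder):

```bash
# Point Kafdrop at an existing broker and expose the UI on port 9000.
docker run -d --rm -p 9000:9000 \
  -e KAFKA_BROKERCONNECT=host.docker.internal:9092 \
  obsidiandynamics/kafdrop
```

Browsing to http://localhost:9000 then shows the cluster's topics, consumers, and messages.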


Securing Amazon Managed Streaming for Kafka

Stephane Maarek has some security advice for us:

AWS launched IAM Access Control for Amazon MSK, a security option offered at no additional cost that simplifies cluster authentication and Apache Kafka API authorization, using AWS Identity and Access Management (IAM) roles or user policies to control access. This eliminates the need for administrators to run an unfamiliar system to control access to Apache Kafka on Amazon MSK, or to learn the intricate details and specific commands needed to manage Apache Kafka access control lists (ACLs).

This is a game-changer from a security perspective for AWS customers who use Apache Kafka: I recommend Amazon MSK customers use IAM Access Control unless they have a specific need for using mutual TLS or SASL/SCRAM authN/Z.

Read on to see how it works.
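
For Java-family clients, IAM authentication is wired in through the aws-msk-iam-auth library; a hedged sketch of the client properties looks roughly like this (the library must be on the classpath, and the calling IAM role needs the matching kafka-cluster:* permissions):

```properties
# Authenticate to MSK with the caller's IAM credentials over TLS.
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```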


Surviving a Kafka Outage

Jakub Korab walks us through availability features in Kafka as well as what to expect if your brokers are unavailable:

In the case of an outage, you have to ensure that these messages can be processed eventually. Keeping unsent messages around and retrying indefinitely in the hope that the outage will rectify itself may eventually result in your application running out of memory. This is a crucial consideration in high-throughput applications.

If business functions are performed by systems downstream of Kafka, and the sending application only acts as an ingestion point, the situation is slightly more relaxed. If Kafka is unavailable to send messages to, then no external activity has taken place. For these systems, a Kafka outage might mean that you do not accept new transactions. In such a case, it may be reasonable to return an error message and allow the external third party to retry later. Retail applications typically fall into this category.

Read the whole thing.
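
To make the first point concrete, here is a minimal Scala sketch, with illustrative topic and limit values, of a producer configured to fail fast instead of buffering without bound while brokers are down:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, (32 * 1024 * 1024).toString) // cap unsent data at 32 MB
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "10000")          // send() blocks at most 10s on a full buffer
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000")  // give up on a record after 2 minutes of retries

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("orders", "order-42", "payload"), (metadata, err) => {
  if (err != null) {
    // The outage outlasted our budget: surface the failure (e.g., reject the
    // incoming transaction) instead of queueing unbounded state in memory.
    println(s"Send failed: ${err.getMessage}")
  }
})
producer.close()
```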


Recent Apache NiFi Updates

Pierre Villard has some news for us around Apache NiFi:

Cloudera released a lot of things around Apache NiFi recently! We just released Cloudera Flow Management (CFM) 2.1.1, which provides Apache NiFi on top of Cloudera Data Platform (CDP) 7.1.6. This major release provides the latest and greatest of Apache NiFi, as it includes Apache NiFi 1.13.2 plus additional improvements, bug fixes, components, and more. Cloudera also released CDP 7.2.9 on all three major cloud platforms, which brings Flow Management on DataHub with Apache NiFi 1.13.2 and more. Let's have a look at the main highlights of these releases.

Click through to see what’s included.


HDFS Data Encryption at Rest

Arun Kumar Natva takes us through the process of encrypting data at rest in Cloudera Data Platform:

HDFS Encryption delivers transparent end-to-end encryption of data at rest and is an integral part of HDFS. End-to-end encryption means that the data is only encrypted and decrypted by the client. In other words, data remains encrypted until it reaches the HDFS client.

Each HDFS file is encrypted using an encryption key. To prevent the management of these keys (which can run in the millions) from becoming a performance bottleneck, the encryption key itself is stored in the file metadata. To add another layer of security, the file encryption key is stored in encrypted form, using another “encryption zone key”.

Read on to learn more and to see how it all fits together.
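
As a rough sketch of the moving parts (the key name and path are placeholders, and a running KMS is assumed), creating an encryption zone looks like this:

```bash
# Create a zone key in the KMS, then mark an empty HDFS directory as an encryption zone.
hadoop key create my-zone-key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName my-zone-key -path /secure

# Confirm the zone; files written under /secure are now encrypted transparently.
hdfs crypto -listZones
```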


reduceByKey and aggregateByKey in Spark

The Hadoop in Real World team compares two functions against RDDs in Spark:

Let's examine the aggregateByKey call below. The first parameter, 0, is the initial value; it also indicates the type of the output.

The first _+_ function is the map-side combine and the second _+_ function is the reduce-side combine. Both functions are the same in this case.

This is a demo-driven post, so check it out.
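
For context, a minimal Scala sketch along the same lines (assuming a SparkContext named sc) shows the two calls producing the same sums:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey: one function serves as both the map-side and reduce-side combine.
val reduced = pairs.reduceByKey(_ + _)

// aggregateByKey: 0 is the initial value and fixes the output type; the first
// _ + _ combines within a partition (map side), the second merges partial
// results across partitions (reduce side).
val aggregated = pairs.aggregateByKey(0)(_ + _, _ + _)

aggregated.collect() // Array((a,4), (b,2))
```

aggregateByKey earns its keep when the output type differs from the value type, something reduceByKey cannot express.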


Apache Kafka 2.8 Released

John Roesler announces Apache Kafka 2.8:

We are excited to announce that 2.8 introduces an early-access look at Kafka without ZooKeeper! The implementation is not yet feature complete and should not be used in production, but it is possible to start new clusters without ZooKeeper and go through basic produce and consume use cases.

At a high level, KIP-500 works by moving topic metadata and configurations out of ZooKeeper and into a new internal topic named @metadata. This topic is managed by an internal Raft quorum of “controllers” and is replicated to all brokers in the cluster. The leader of the Raft quorum serves the same role as the controller in clusters today. A node in the KIP-500 world can serve as a controller, a broker, or both, depending on the new process.roles configuration. See the README for quickstart instructions and additional details.

In addition to the headline item, there are plenty of other bugfixes and additions as well.
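
Per the KRaft README in the 2.8 distribution, standing up a ZooKeeper-free test cluster looks roughly like this (paths follow the stock distribution layout):

```bash
# Generate a cluster ID and format the storage directories for KRaft mode.
CLUSTER_ID=$(bin/kafka-storage.sh random-uuid)
bin/kafka-storage.sh format -t $CLUSTER_ID -c config/kraft/server.properties

# Start a node that acts as both controller and broker, with no ZooKeeper involved.
bin/kafka-server-start.sh config/kraft/server.properties
```

The sample config/kraft/server.properties sets process.roles=broker,controller, the combined mode mentioned above.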


Querying Serverless SQL Pools from Spark Notebooks in Scala

Jovan Popovic shows off one integration point between the data services in Azure Synapse Analytics:

Azure Synapse Analytics provides multiple query runtimes that you can use to query in-database or external data. You have the choice of T-SQL queries using a serverless Synapse SQL pool or notebooks in Apache Spark for Synapse to analyze your data.

You can also connect these runtimes and run the queries from Spark notebooks on a dedicated SQL pool.

In this post, you will see how to create Scala code in a Spark notebook that executes a T-SQL query on a serverless SQL pool.

Read on to see how. You can also query Spark pool and dedicated SQL pool tables from serverless SQL pools.
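
As a hedged sketch of the general shape (the workspace name, database, table, and credentials below are all placeholders; the article walks through the exact connection details), the serverless endpoint is reachable over JDBC from a Spark notebook:

```scala
// Serverless SQL pool endpoints follow the <workspace>-ondemand naming pattern.
val url = "jdbc:sqlserver://myworkspace-ondemand.sql.azuresynapse.net:1433;database=mydb"

val df = spark.read
  .format("jdbc")
  .option("url", url)
  .option("query", "SELECT TOP 10 * FROM dbo.MyExternalTable")
  .option("user", "sqladminuser")
  .option("password", "<password>") // or use token-based authentication
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .load()

df.show()
```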


Removing a Node from a Hadoop Cluster

The Hadoop in Real World team shows us the proper way to remove a node from a Hadoop cluster:

This post will list out the steps to properly remove a node from a Hadoop cluster. It is not advisable to just shut down the node abruptly.

Node exclusions should be properly recorded in a file that is referenced by the property dfs.hosts.exclude. This property doesn't have a default value, so in the absence of a file location and a file, the Hadoop cluster will not exclude any nodes.

Read on for more information, including what happens if you simply turn off the node.
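
The gist, as a sketch (the hostname and file path are placeholders), is to decommission through the excludes file rather than powering the machine off:

```bash
# 1. Add the host to the file referenced by dfs.hosts.exclude in hdfs-site.xml.
echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. Tell the NameNode to re-read the file and start decommissioning.
hdfs dfsadmin -refreshNodes

# 3. Wait until the node reports "Decommissioned" before shutting it down.
hdfs dfsadmin -report
```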
