Press "Enter" to skip to content

Category: Hadoop

Error Handling Patterns in Kafka

Gerardo Villeda gives a few options for handling errors in an Apache Kafka topic:

Apache Kafka® applications run in a distributed manner across multiple containers or machines. And in the world of distributed systems, what can go wrong often goes wrong. This blog post covers different ways to handle errors and retries in your event streaming applications. The nature of your process, and more importantly your business requirements, determine which patterns apply.

This blog post provides a quick guide to some of those patterns and expands on a common, specific use case where events need to be retried in their original order. It illustrates a scenario of an application that consumes events from one topic, transforms them, and produces output to a target topic, covering approaches that gradually increase in complexity.

Click through for the list. Each explanation is pretty short, but opens the door for further analysis.
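One of the simplest of those patterns to sketch is parking failed events on a dead-letter topic so the partition keeps moving. Here’s a minimal, hypothetical illustration using the confluent-kafka Python client; the broker address, topic names, and transform are all stand-ins, not the article’s own code:

```python
# Hypothetical sketch: consume, transform, produce, and park failures
# on a dead-letter topic. Names and the transform are placeholders.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "transformer",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        transformed = msg.value().upper()  # stand-in for the real transform
        producer.produce("orders-transformed", transformed)
    except Exception:
        # Park the poison event instead of blocking the partition.
        producer.produce("orders-dlq", msg.value())
    producer.flush()  # simplistic per-message flush, for clarity
    consumer.commit(message=msg, asynchronous=False)  # commit only after handling
```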


Azure Synapse Analytics Supports Apache Spark 3.0

Euan Garden has some great news for us:

Starting today, the Apache Spark 3.0 runtime is now available in Azure Synapse. This version builds on top of existing open-source and Microsoft-specific enhancements to include additional unique improvements. The combination of these enhancements results in significantly faster processing than open-source Spark 3.0.2 and 2.4.

The public preview announced today starts with the foundation based on the open-source Apache Spark 3.0 branch with subsequent updates leading up to a Generally Available version derived from the latest 3.1 branch.

It still won’t be as fast as Databricks, but it should be a good bit faster than the Spark 2 they were running.


Broadcast Variables in Apache Spark

The Hadoop in Real World team explains the notion of broadcast variables in Apache Spark:

Broadcast variables are variables which are available in all executors executing the Spark application. These variables are already cached and ready to be used by tasks executing as part of the application. Broadcast variables are sent to the executors only once, and they are available to all tasks executing in the executors.

Read on to understand when they are useful and, just as importantly, when not to use them. They seem like the type of thing which a newer developer could easily misuse.
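To make that concrete, here’s a minimal PySpark sketch (my own, with a made-up lookup table, assuming a local session). The dictionary is shipped to each executor once rather than being serialized into every task:

```python
# Minimal broadcast variable demo; the lookup data is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# Small lookup table sent to each executor once, then cached there.
country_lookup = sc.broadcast({"US": "United States", "DE": "Germany"})

rdd = sc.parallelize([("US", 3), ("DE", 5), ("US", 1)])
expanded = rdd.map(lambda kv: (country_lookup.value.get(kv[0], "Unknown"), kv[1]))
print(expanded.collect())  # [('United States', 3), ('Germany', 5), ('United States', 1)]

spark.stop()
```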


Understanding Consumer Lag in Apache Kafka

Loretta Jones takes us through the notion of consumer lag in an Apache Kafka topic:

Amongst the various metrics that Kafka monitoring includes, consumer lag is perhaps the most important of them all. In this post, we will explore potential reasons for Kafka consumer lag and what you can do when you experience lag.

This post is fairly high-level, and it does a good job of explaining the notion to someone without much familiarity with Kafka.
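For reference, lag on a partition is just the gap between the log-end offset (the high watermark) and the group’s committed offset. Here’s a hedged sketch of measuring it with the confluent-kafka Python client; the broker, group, topic, and partition numbers are assumptions:

```python
# Hypothetical lag check: lag = high watermark - committed offset.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({"bootstrap.servers": "localhost:9092", "group.id": "lag-check"})
partitions = [TopicPartition("orders", p) for p in (0, 1, 2)]
committed = consumer.committed(partitions, timeout=10)

for tp in committed:
    lo, hi = consumer.get_watermark_offsets(tp, timeout=10)
    # An offset below zero means the group has no committed offset yet.
    lag = hi - tp.offset if tp.offset >= 0 else hi - lo
    print(f"partition {tp.partition}: lag={lag}")

consumer.close()
```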


Data Pipeline Error Handling with Apache NiFi

Pieter Humphrey gives us a few techniques for handling data pipeline errors when running Apache NiFi:

The more complex the model, the more possible sources of problems exist. Forecasting every single potential problem is, of course, impossible. Identifying the most important ones and providing self-solving solutions can greatly reduce the operational uncertainty of our NiFi pipeline and improve its robustness.

To see how to do this analysis, we will consider four possible strategies: one external and three internal. They certainly do not cover all potential error scenarios; they are just examples that we can extrapolate from, informing how to handle other potential failure domains.

Click through for an overview of the topic as well as those four techniques.


Learning the Basics of Kafka via Notebook

Francesco Tisiot shares a way to learn about the basics of Apache Kafka using Jupyter notebooks:

One of the best ways to learn a new technology is to try it within an assisted environment that anybody can replicate and get working within a few minutes. Notebooks excel in this field by allowing people to share and use pre-built content which includes written descriptions, media, and executable code in a single page.

This blog post aims to teach you the basics of Apache Kafka producers and consumers by building an interactive notebook in Python. If you want to browse a full, ready-made solution instead, check out our dedicated GitHub repository.

The classic tutorials tend to use a couple of command prompts and the built-in producer and consumer shell scripts. I like this approach as a way of being able to review the code and results later as a refresher.
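If you’d rather not spin up a notebook just yet, the basic round trip those tutorials cover fits in a few lines. This is a hypothetical sketch using the confluent-kafka Python client against a local broker, not the post’s own code:

```python
# Hypothetical producer/consumer round trip; topic and group names are invented.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("greetings", key="en", value="hello")
producer.flush()  # block until the message is delivered

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "notebook-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["greetings"])
msg = consumer.poll(10.0)  # wait up to ten seconds for the message
if msg is not None and not msg.error():
    print(msg.key(), msg.value())  # b'en' b'hello'

consumer.close()
```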


Resetting a Consumer Offset in Kafka

The Hadoop in Real World team shows how to update the consumer offset in Kafka:

In some scenarios, consumers reading messages from a Kafka partition could run into errors, leaving consumption incomplete. In such cases of consumption failure, you may need to re-consume the messages which were previously consumed, and that means resetting the consumer offset to an earlier offset.

This is one of the big advantages of a log-based message broker versus a queue: if you find a bug in a downstream consumer, it’s easy to generate correct results after fixing the bug, something which can be much harder to do otherwise.
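The usual tool for this is the kafka-consumer-groups shell script with --reset-offsets, but you can also do it programmatically. Here’s a hedged sketch with the confluent-kafka Python client; the broker, group, topic, and partition are assumptions, and the group should be inactive while you rewind it:

```python
# Hypothetical offset reset: rewind one partition back to the earliest offset.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({"bootstrap.servers": "localhost:9092", "group.id": "transformer"})
tp = TopicPartition("orders", 0)

earliest, _ = consumer.get_watermark_offsets(tp, timeout=10)
tp.offset = earliest
consumer.commit(offsets=[tp], asynchronous=False)  # store the rewound offset

consumer.close()
```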


Building a Kafka Test Environment with Kafdrop

Diogo Souza walks us through an interesting project:

From a daily life standpoint, it’s challenging to manage Kafka brokers, partitions, topics, producers, and consumers all via command line. An interface would be quite helpful.

There are a ton of options available for managing your Kafka brokers through web UI applications. Perhaps Confluent’s version is one of the most complete, although it is part of a paid bundle aimed mostly at enterprise use.

Amongst the myriad of open-source options, Kafdrop stands out for being simple, fast, and easy to use. It is an open-source web project that allows you to view information from Kafka brokers such as existing topics, consumers, and even the content of messages sent.

This article explores creating a more flexible test environment to work alongside the .NET app built in the previous article. This way, you’ll have more powerful tools to understand what’s happening with your topics.

Read on to learn how you can install and use Kafdrop.
