Press "Enter" to skip to content

Category: Hadoop

Capturing a TCP Dump in an Azure Databricks Notebook

Stithi Panigrahi does some troubleshooting:

Due to the potential impact on performance and storage costs, Azure Databricks clusters don’t capture networking logs by default. Follow the instructions below if you need to capture a tcpdump to investigate networking issues related to the cluster. These steps will capture a TCP dump on each cluster node, both driver and workers, during the entire lifetime of the cluster.

Click through for an init script, which generates the actual capture script, which in turn generates the TCP dumps.
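The post has its own scripts, so treat the following only as a rough sketch of the general pattern, with an assumed DBFS path and illustrative tcpdump options: from a notebook, you can write a cluster-scoped init script with dbutils.fs.put that starts a rotating capture on every node at boot.

```python
# Sketch only: the paths, file names, and tcpdump options below are assumptions,
# not the article's actual script.
init_script = """#!/bin/bash
# Make sure tcpdump is available on the node.
apt-get update -qq && apt-get install -y -qq tcpdump

# Capture on all interfaces, rotating to a new local file every hour
# so no single pcap grows without bound.
mkdir -p /tmp/tcpdumps
nohup tcpdump -i any -G 3600 -w /tmp/tcpdumps/$(hostname)-%Y%m%d-%H%M%S.pcap > /tmp/tcpdump.log 2>&1 &
"""

# dbutils is available inside a Databricks notebook; the destination path is
# an assumption for this sketch.
dbutils.fs.put("dbfs:/databricks/init-scripts/capture-tcpdump.sh", init_script, True)
```

The cluster then has to reference that script in its init script configuration, and something still needs to copy the rotated pcap files off local disk before the cluster terminates, which is part of what the linked post handles.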


Using Data Contracts in Confluent Schema Registry

Robert Yokota shows us how to generate data contracts for streaming solutions:

A data contract is a formal agreement between an upstream component and a downstream component on the structure and semantics of data that is in motion. The upstream component enforces the data contract, while the downstream component can assume that the data it receives conforms to the data contract. Data contracts are important because they provide transparency over dependencies and data usage in a streaming architecture. They help to ensure the consistency, reliability, and quality of the data in event streams, and they provide a single source of truth for understanding the data in motion.

Click through for a sample application that uses data contracts.
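As a very small illustration of the structural half of that agreement (the article goes further, layering metadata and rule sets on top), registering a schema for a subject with the Python confluent-kafka Schema Registry client might look like this; the registry URL, subject name, and schema are placeholders:

```python
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

# Placeholder registry URL for this sketch.
registry = SchemaRegistryClient({"url": "http://localhost:8081"})

# The structural part of a data contract: the shape the upstream producer
# promises to honor. Semantic rules and metadata are layered on top of this
# in the linked article.
order_schema = Schema(
    """
    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }
    """,
    schema_type="AVRO",
)

schema_id = registry.register_schema("orders-value", order_schema)
print(f"Registered schema id: {schema_id}")
```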


Running Apache Flink Jobs from HDInsight

Sairam Yeturi builds a streaming job:

Have you already created your first Apache Flink® cluster and submitted your streaming job on it with HDInsight on AKS?

Well, if you have yet to do that, let me help you get started.

Click through for a step-by-step walkthrough on how to create a Flink-centric HDInsight cluster on Azure Kubernetes Service and how to create a new job, assuming you already have the JAR file for that job.
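The walkthrough drives this through the HDInsight on AKS experience, but for a sense of what job submission amounts to once you have the JAR, here is a hedged sketch against Flink's generic REST API; the endpoint URL, JAR name, and entry class are placeholders, and the real endpoint sits behind whatever ingress your HDInsight on AKS cluster exposes:

```python
import requests

# Placeholder: wherever your Flink cluster's REST endpoint is exposed.
FLINK_REST = "https://<your-flink-endpoint>"

# 1. Upload the job JAR to the JobManager.
with open("my-streaming-job.jar", "rb") as jar:
    upload = requests.post(f"{FLINK_REST}/jars/upload", files={"jarfile": jar})
upload.raise_for_status()
jar_id = upload.json()["filename"].split("/")[-1]

# 2. Run it, pointing Flink at the job's entry class.
run = requests.post(
    f"{FLINK_REST}/jars/{jar_id}/run",
    json={"entryClass": "com.example.MyStreamingJob"},
)
run.raise_for_status()
print("Submitted job:", run.json()["jobid"])
```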


Killing a Running Apache Spark Application

The Big Data in Real World team pulls the plug on an application:

Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for various reasons, such as if the application is stuck, consuming too many resources, or taking too long to complete. In this post, we will discuss how to kill a running Spark application.

Click through to see how you can do this.
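The details depend on your cluster manager; as one common approach, on a YARN-backed cluster you ask the resource manager to kill the application, which stops the driver and releases its executors. A sketch with a placeholder application ID:

```python
import subprocess

# Placeholder application ID; find the real one with `yarn application -list`
# or from the Spark / YARN resource manager UI.
app_id = "application_1690000000000_0042"

# Ask YARN to terminate the application: the driver stops and all of its
# executors are released.
subprocess.run(["yarn", "application", "-kill", app_id], check=True)
```

Standalone and Kubernetes deployments have their own equivalents (for example, spark-submit --kill for standalone cluster mode, or deleting the driver pod on Kubernetes).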


Apache Kafka 3.6 Released

Satish Duggana announces what’s new in Apache Kafka 3.6:

The ability to migrate Kafka clusters from a ZooKeeper metadata system to a KRaft metadata system is now ready for usage in production environments. See the ZooKeeper to KRaft migration operations documentation for details. Note that support for JBOD is still not available for KRaft clusters, so clusters utilizing JBOD cannot be migrated. See KIP-858 for details regarding KRaft and JBOD.

Support for Delegation Tokens in KRaft (KAFKA-15219) was completed in 3.6, further reducing the gap of features between ZooKeeper-based Kafka clusters and KRaft. Migration of delegation tokens from ZooKeeper to KRaft is also included in 3.6.

Tiered Storage is an early access feature. It is currently only suitable for testing in non-production environments. See the Early Access Release Notes for more details.

Read on for more details around what’s new in Apache Kafka.


Apache Spark Execution Plan Analysis

Karthik Penikalapati digs into Spark SQL explain plans:

In this blog post, we will explore how the Explain Plan can be your secret weapon for debugging and optimizing Spark applications. We’ll dive into the basics and provide clear examples in Spark Scala to help you understand how to leverage this valuable tool.

All I’m saying is, if some company wants to create a SQL Sentry Plan Explorer for Apache Spark, I’d be down with it. The lack of an intuitive and powerful graphical interface for execution plans is definitely a point of friction when working with Apache Spark and Spark SQL.
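In the meantime, the textual plan is what we get. The post's examples are in Spark Scala; the same call from PySpark, with a made-up DataFrame just to have something to plan, looks like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-demo").getOrCreate()

# A small, made-up DataFrame purely for illustration.
orders = spark.createDataFrame(
    [(1, "espresso", 3.50), (2, "latte", 4.75), (3, "latte", 4.75)],
    ["order_id", "item", "price"],
)

revenue = orders.groupBy("item").agg(F.sum("price").alias("revenue"))

# "formatted" (Spark 3.0+) prints a readable operator tree plus per-operator
# details; other modes include "simple", "extended", "codegen", and "cost".
revenue.explain(mode="formatted")
```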


An Intro to Databricks Asset Bundles

Dustin Vannoy covers one technique for CI/CD in Databricks:

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, the REST API, the Databricks command-line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.

Click through for a video and some sample scripts.


How Kafka Consumers Keep Track of Position

The Big Data in Real World team explains:

Let’s say you have a consumer group which has 3 consumers at the moment consuming messages from a topic. Assume that you had to shut down all 3 consumers in the consumer group for some reason. Now, when you restart the consumers in the consumer group, how do the consumers know from which offset they should read from the topic, so that they avoid re-reading the messages they had already read before they went down?

Read on for the answer.
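As background for the sketch below: a consumer group records its progress as committed offsets in the internal __consumer_offsets topic, and a restarted member resumes from the group's last committed position for each partition. A minimal confluent-kafka example; the broker address, topic, group ID, and the process() handler are placeholders:

```python
from confluent_kafka import Consumer

# Placeholder connection details for this sketch.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",
    # Only consulted when the group has no committed offset yet
    # (e.g. a brand-new group): start from the earliest message.
    "auto.offset.reset": "earliest",
    # Commit explicitly so we control when progress is recorded.
    "enable.auto.commit": False,
})

consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        process(msg)          # hypothetical message handler
        consumer.commit(msg)  # record the position in __consumer_offsets
finally:
    consumer.close()
```

Restart the consumers and, because the group's committed offsets survive in Kafka, they pick up where they left off rather than re-reading the topic from the beginning.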


Apache Kafka Consumer Group Strategy

Lucia Cerchie gives us some advice:

Ever dealt with a misbehaving consumer group? Imbalanced broker load? This could be due to your consumer group and partitioning strategy! 

Once, on a dark and stormy night, I set myself up for this error. I was creating an application to demonstrate how you can use Apache Kafka® to decouple microservices. The function of my “microservices” was to create latte objects for a restaurant ordering service. It was set up a little like this:

I wanted to implement this in Kafka by using consumers, each reading from a common coffee topic, but with their own partition. Now this was a naive approach. Why? 

Click through to learn the reason, as well as some of the mechanics of how consumer groups work.
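To make the two approaches she contrasts concrete, here is a hedged sketch with the Python confluent-kafka client (broker address and group IDs are placeholders; the coffee topic comes from the excerpt): group-managed assignment lets the coordinator spread partitions across members, while manual assignment pins a consumer to a specific partition and opts out of rebalancing.

```python
from confluent_kafka import Consumer, TopicPartition

conf = {"bootstrap.servers": "localhost:9092", "group.id": "latte-makers"}

# Group-managed assignment: every member subscribes to the topic and the
# group coordinator divides its partitions among them, rebalancing as
# consumers join or leave.
grouped = Consumer(conf)
grouped.subscribe(["coffee"])

# Manual assignment (closer to the "each consumer gets its own partition"
# idea in the excerpt): the consumer is pinned to partition 0 and never
# participates in a rebalance.
pinned = Consumer({**conf, "group.id": "latte-maker-p0"})
pinned.assign([TopicPartition("coffee", 0)])
```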


Tiered Storage in Apache Kafka

Matthew de Detrich explains the value behind tiered storage in Apache Kafka:

Tiered Storage is arguably one of the most sought-after features of Kafka 3.6, allowing Kafka’s core data to be stored in other locations, such as object storage, in addition to hard disks in a transparent manner, without any changes to Kafka’s producers or consumers. The Kafka brokers control whether the data is stored on local disks, fast but expensive and limited, or in alternative storage places, such as Amazon S3. When Tiered Storage is properly configured, it means you can have the best of both worlds: recent data is stored on local fast disks (as is currently), and older, less frequently accessed data can be stored elsewhere where it’s cheaper and space requirements are less of a concern (sometimes unlimited!) 

Read on to learn more about the official version of tiered storage, as well as a forward port of two prior implementation attempts to Kafka version 3.3.
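Configuration-wise, tiering is opt-in at both the broker and topic level. As a rough sketch using Kafka 3.6 property names (broker address, topic name, and retention values are placeholders, and the brokers are assumed to already have remote.log.storage.system.enable=true plus a remote storage plugin configured), a topic can be created with remote storage enabled and a short local retention:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder broker address for this sketch.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topic = NewTopic(
    "clickstream",
    num_partitions=6,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",   # tier this topic's older segments
        "local.retention.ms": "86400000",  # keep roughly one day on local disk
        "retention.ms": "2592000000",      # about thirty days in total
    },
)

futures = admin.create_topics([topic])
futures["clickstream"].result()  # raises if topic creation failed
```

Given the early-access status noted above, this kind of setup belongs in a test environment for now.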
