Hadoop – Page 2 – Curated SQL

An Overview of Data Lake Operations with Apache NiFi

Published 2023-11-02 by Kevin Feasel

In the world of data-driven decision-making, ETL (Extract, Transform, Load) processes play a pivotal role. The effective management and transformation of data are essential to ensure that businesses can make informed choices based on accurate and relevant information. Data lakes have emerged as a powerful way to store and analyze massive amounts of data, and Apache NiFi is a robust tool for streamlining ETL processes in a data lake environment.

Read on for a brief primer on NiFi and how some of its capabilities can assist in ETL and ELT processing.

Apache Zookeeper Vulnerability

Published 2023-10-24 by Kevin Feasel

The Instaclustr team reviews an announcement:

On October 11, 2023, the Apache ZooKeeper™ project announced that a security vulnerability has been identified in Apache ZooKeeper, CVE-2023-44981. The Apache ZooKeeper project has classified the severity of this CVE as critical. The CVSS (Common Vulnerability Scoring System) 3.x severity rating for this vulnerability by the NVD (National Vulnerability Database) is base score 9.1 Critical.

That’s a rather high base score, and it applies if you have the setting quorum.auth.enableSasl=true. Updating to ZooKeeper 3.7.2 or later, 3.8.3 or later, or 3.9.1 or later will fix this vulnerability.
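
As a rough way to triage a node, here is a minimal sketch that flags a ZooKeeper server as exposed when SASL quorum authentication is enabled and the version predates the fixed releases (3.7.2, 3.8.3, and 3.9.1, as I read the advisory). The function names and the properties-style parsing are illustrative assumptions on my part, not anything from ZooKeeper itself or the Instaclustr write-up.

```python
import re

# First fixed release per branch, as listed for CVE-2023-44981.
FIXED = {(3, 7): (3, 7, 2), (3, 8): (3, 8, 3), (3, 9): (3, 9, 1)}


def parse_version(text):
    """Turn '3.8.2' into (3, 8, 2). Hypothetical helper."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", text.strip())
    if not match:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def sasl_quorum_enabled(zoo_cfg_text):
    """Return True if quorum.auth.enableSasl is set to true in zoo.cfg-style text."""
    for line in zoo_cfg_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "quorum.auth.enableSasl":
            return value.strip().lower() == "true"
    return False


def is_exposed(version_text, zoo_cfg_text):
    """Exposed means: the setting is on AND the version lacks the fix."""
    version = parse_version(version_text)
    if not sasl_quorum_enabled(zoo_cfg_text):
        return False
    fixed = FIXED.get(version[:2])
    if fixed is None:
        # Branches before 3.7 have no fixed release; later branches include the fix.
        return version < (3, 7, 0)
    return version < fixed
```

For example, `is_exposed("3.8.2", "quorum.auth.enableSasl=true")` reports an exposed node, while the same configuration on 3.8.3 does not.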

Capturing a TCP Dump in an Azure Databricks Notebook

Published 2023-10-20 by Kevin Feasel

Stithi Panigrahi does some troubleshooting:

Due to the potential impact on performance and storage costs, Azure Databricks clusters don’t capture networking logs by default. Follow the below instructions if you need to capture tcpdump to investigate multiple networking issues related to the cluster. These steps will capture a TCP dump on each cluster node–both driver and workers during the entire lifetime of the cluster.

Click through for an initiation script, which generates the actual script, which itself generates the TCP dumps.

Using Data Contracts in Confluent Schema Registry

Published 2023-10-19 by Kevin Feasel

Robert Yokota shows us how to generate data contracts for streaming solutions:

A data contract is a formal agreement between an upstream component and a downstream component on the structure and semantics of data that is in motion. The upstream component enforces the data contract, while the downstream component can assume that the data it receives conforms to the data contract. Data contracts are important because they provide transparency over dependencies and data usage in a streaming architecture. They help to ensure the consistency, reliability, and quality of the data in event streams, and they provide a single source of truth for understanding the data in motion.

Click through for a sample application that uses data contracts.
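
To make the upstream-enforces, downstream-assumes idea concrete, here is a minimal sketch in plain Python. It does not use the Confluent Schema Registry API; `DataContract`, `publish`, and the list standing in for a topic are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class DataContract:
    """Hypothetical contract: required field types (structure) plus rules (semantics)."""
    fields: Dict[str, type]
    rules: Dict[str, Callable[[Dict[str, Any]], bool]] = field(default_factory=dict)

    def violations(self, record: Dict[str, Any]) -> List[str]:
        problems = []
        for name, expected in self.fields.items():
            if name not in record:
                problems.append(f"missing field: {name}")
            elif not isinstance(record[name], expected):
                problems.append(f"wrong type for {name}")
        if not problems:
            # Only evaluate semantic rules once the structure is valid.
            for rule_name, check in self.rules.items():
                if not check(record):
                    problems.append(f"rule failed: {rule_name}")
        return problems


def publish(record: Dict[str, Any], contract: DataContract, sink: List[Dict[str, Any]]) -> None:
    """Upstream side: reject the record before it ever reaches the stream."""
    problems = contract.violations(record)
    if problems:
        raise ValueError("; ".join(problems))
    sink.append(record)
```

Because `publish` refuses anything that violates the contract, a downstream reader of `sink` can assume every record has the agreed fields and satisfies the agreed rules.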

Running Apache Flink Jobs from HDInsight

Published 2023-10-18 by Kevin Feasel

Sairam Yeturi builds a streaming job:

Could you already complete creating your first Apache Flink® cluster and submit your streaming job on it with HDInsight on AKS?

Well, if you are yet to do that – Let me help you get started.

Click through for a step-by-step walkthrough on how to create a Flink-centric HDInsight cluster on Azure Kubernetes Service and how to create a new job, assuming you already have the JAR file for that job.

Killing a Running Apache Spark Application

Published 2023-10-13 by Kevin Feasel

The Big Data in Real World team pulls the plug on an application:

Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for various reasons, such as if the application is stuck, consuming too many resources, or taking too long to complete. In this post, we will discuss how to kill a running Spark application.

Click through to see how you can do this.
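
As one illustration, and assuming the application runs on YARN (the linked post may cover other approaches), the `yarn application -kill` command does the job given an application ID. The wrapper below is a small sketch; the function names and the dry-run default are my own choices rather than anything from the post.

```python
import re
import subprocess

# YARN application IDs look like application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")


def build_kill_command(app_id):
    """Build the argument list for killing a YARN application."""
    if not APP_ID_PATTERN.match(app_id):
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ["yarn", "application", "-kill", app_id]


def kill_application(app_id, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    command = build_kill_command(app_id)
    if not dry_run:
        # Requires the yarn CLI on PATH and permission to kill the application.
        subprocess.run(command, check=True)
    return command
```

Defaulting to a dry run lets you confirm the exact command before terminating anything; pass `dry_run=False` to execute it.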

Apache Kafka 3.6 Released

Published 2023-10-12 by Kevin Feasel

Satish Duggana announces what’s new in Apache Kafka 3.6:

The ability to migrate Kafka clusters from a ZooKeeper metadata system to a KRaft metadata system is now ready for usage in production environments. See the ZooKeeper to KRaft migration operations documentation for details. Note that support for JBOD is still not available for KRaft clusters, therefore clusters utilizing JBOD can not be migrated. See KIP-858 for details regarding KRaft and JBOD.

Support for Delegation Tokens in KRaft (KAFKA-15219) was completed in 3.6, further reducing the gap of features between ZooKeeper-based Kafka clusters and KRaft. Migration of delegation tokens from ZooKeeper to KRaft is also included in 3.6.

Tiered Storage is an early access feature. It is currently only suitable for testing in non-production environments. See the Early Access Release Notes for more details.

Read on for more details around what’s new in Apache Kafka.

Apache Spark Execution Plan Analysis

Published 2023-10-12 by Kevin Feasel

Karthik Penikalapati digs into Spark SQL explain plans:

In this blog post, we will explore how the Explain Plan can be your secret weapon for debugging and optimizing Spark applications. We’ll dive into the basics and provide clear examples in Spark Scala to help you understand how to leverage this valuable tool.

All I’m saying is, if some company wants to create SQL Sentry Plan Explorer for Apache Spark, I’d be down with it. The lack of an intuitive and powerful graphical interface for execution plans is definitely a point of friction when working with Apache Spark and Spark SQL.

An Intro to Databricks Asset Bundles

Published 2023-10-05 by Kevin Feasel

Dustin Vannoy covers one technique for CI/CD in Databricks:

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams setup CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, REST API, Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.

Click through for a video and some sample scripts.

How Kafka Consumers Keep Track of Position

Published 2023-10-05 by Kevin Feasel

The Big Data in Real World team explains:

Let’s say you have a consumer group which has 3 consumers at the moment consuming messages from a topic. Assume that you had to shut down all 3 consumers in the consumer group for some reason. Now when you restart the consumers in the consumer group, how do the consumers know from which offset they should read from the topic to avoid reading the same messages all over again which were already read before the consumers went down?

Read on for the answer.
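
The general mechanism is that consumers commit offsets per group, topic, and partition, and Kafka keeps those commits in the internal `__consumer_offsets` topic. Here is a toy in-memory model of that bookkeeping. It is not the Kafka client API, and `OffsetStore` and `consume` are hypothetical names for illustration only.

```python
class OffsetStore:
    """Toy stand-in for committed offsets, keyed by (group, topic, partition)."""

    def __init__(self):
        self._committed = {}

    def commit(self, group, topic, partition, next_offset):
        # As in Kafka, the committed value is the NEXT offset to read.
        self._committed[(group, topic, partition)] = next_offset

    def position(self, group, topic, partition, default=0):
        # default=0 mimics starting from the earliest message when a group
        # has no commit yet; real consumers follow their auto.offset.reset setting.
        return self._committed.get((group, topic, partition), default)


def consume(store, group, topic, partition, log, max_messages):
    """Read up to max_messages from the committed position, then commit."""
    start = store.position(group, topic, partition)
    batch = log[start:start + max_messages]
    store.commit(group, topic, partition, start + len(batch))
    return batch
```

Since the position lives in the store rather than in the consumer, a consumer that restarts and calls `consume` again picks up where the group left off instead of rereading earlier messages.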

Category: Hadoop