Press "Enter" to skip to content

Category: Hadoop

Writing a Single JSON File in Databricks

Falek Miah performs a surprisingly difficult task:

When writing to a JSON destination using the DataFrameWriter, the dataset is split into multiple files to reflect the number of RDD partitions in the dataframe when in memory – this is the most efficient way for Spark to write data out.

However, this creates a directory containing the data files, as well as Spark metadata files…but what if you just wanted a single JSON file? It’s a scenario that comes up a lot with our clients and, despite it not being the most efficient way to use Spark, we need to implement it all the same.

Click through to see how to do this, including the removal of all metadata files (committed, started, and success files).
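
As a rough sketch of the pattern (in a Databricks notebook where dbutils is available; the paths here are hypothetical, and Falek's post has the full details):

```python
# Collapse the DataFrame to a single partition so Spark emits one part file,
# then move that file out of the directory Spark creates and clean everything up.
tmp_dir = "/tmp/single_json_out"      # hypothetical staging directory
target = "/mnt/output/result.json"    # hypothetical final single-file path

df.coalesce(1).write.mode("overwrite").json(tmp_dir)

# Spark writes exactly one part-*.json file into tmp_dir; find it
part_file = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-")][0]

# Copy it to the target path, then drop the directory – which also removes
# the _committed_*, _started_*, and _SUCCESS metadata files
dbutils.fs.cp(part_file, target)
dbutils.fs.rm(tmp_dir, recurse=True)
```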

Comments closed

Intelligent Cache for Spark in Synapse

Avinanda Chattapadday makes an announcement:

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, it must be manually set and released to minimize the latency and improve overall performance. It can also result in queries of stale data if the underlying data changes. This is where the intelligent cache in Azure Synapse can simplify the process; by automatically detecting changes to the underlying files and automatically refreshing them in the cache, you ensure you have access to the most recent data. When the cache reaches its size limit, it will automatically release the least-read data to make space for more recent data.

Click through to see how you can enable this, as well as a few more details on the process.
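
For contrast, the manual caching the excerpt refers to looks something like this in PySpark – you have to cache and release it yourself, and the cache won't notice if the underlying files change (the path below is hypothetical):

```python
# Manually pin a frequently-queried DataFrame in the Spark cache
df = spark.read.parquet("abfss://data@mystorage.dfs.core.windows.net/sales/")

df.cache()       # mark it for caching
df.count()       # materialize the cache with an action

# ... run repeated queries against df ...

df.unpersist()   # release the cache manually when finished
```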

Comments closed

Replacing Zookeeper in Kafka

Guozhang Wang explains the decision-making behind a major change in Apache Kafka:

Why replace ZooKeeper with an internal log for Apache Kafka® metadata management? This post explores the rationale behind the replacement, examines why a quorum-based consensus protocol like Raft was utilized and altered to become KRaft, and describes the new Quorum Controller built on top of KRaft protocols.

Click through for the reasoning, which includes a considerably faster shutdown in large environments.

Comments closed

Making Kafka Clients Faster

Yeva Byzek has a few whitepaper recommendations:

Over the years, incredible technical content has been written about data plane performance, general principles and tradeoffs, cloud-native architectures, etc. These writings describe how you can get low latency and high throughput without compromising on a mature and reliable platform that provides persistence, no data loss, audit logs, processing logs, and more—all the things that enable you to go from proof of concept to production. This blog post highlights the top five reading recommendations to help you gain a deeper understanding of what makes applications that run on Confluent Cloud so fast. They cover the key concepts and provide concrete examples of how we do it, and how you can do it too, with specific benchmark testing and configuration guidelines.

Click through for links to those resources.

Comments closed

De-Scalafication in Flink

Seth Wiesman has a post leaving me feeling a little bittersweet:

If you have worked with a JVM-based application, you have probably heard the term classpath. The classpath defines where the JVM will search for a given classfile when it needs to be loaded. There may only be one instance of a classfile on each classpath, forcing any dependency Flink exposes onto users. That is why the Flink community works hard to keep our classpath “clean” – or free of unnecessary dependencies. We achieve this through a combination of shaded dependencies, child-first class loading, and a plugins abstraction for optional components.

The Apache Flink runtime is primarily written in Java but contains critical components that forced Scala on the default classpath. And because Scala does not maintain binary compatibility across minor releases, this historically required cross-building components for all versions of Scala. But due to many reasons – breaking changes in the compiler, a new standard library, and a reworked macro system – this was easier said than done.

They did it, which means less Scala in the code base. But it also means that you aren’t tied to a particular version of Scala in your own code. I’m happy about it on the whole but it does expose a frustrating pain point with Scala.

Comments closed

Getting Started with the Databricks Feature Store

Gavita Regunath gives us an introduction to a useful Databricks feature:

Databricks announced the launch of the Databricks Feature Store last year, in May 2021. It is the first of its kind that has been co-designed with Delta Lake and MLflow to accelerate ML deployments.

In this article, we will leverage Databricks Feature Store to store features, create a training dataset by looking up relevant features, and subsequently train an ML model. Follow this step-by-step guide to get started on Databricks Feature Store.

Click through to learn more.
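
As a hedged sketch of that flow (table, key, and feature names are made up; the databricks.feature_store client ships with the Databricks ML runtime):

```python
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

# Register a feature table keyed on customer_id (features_df is a Spark DataFrame)
fs.create_table(
    name="ml.customer_features",          # hypothetical feature table name
    primary_keys=["customer_id"],
    df=features_df,
    description="Aggregated customer features",
)

# Build a training set by looking up features for each row of a labels DataFrame
training_set = fs.create_training_set(
    labels_df,                            # hypothetical DataFrame with customer_id and a label column
    feature_lookups=[
        FeatureLookup(
            table_name="ml.customer_features",
            lookup_key="customer_id",
            feature_names=["total_spend", "days_since_last_order"],
        )
    ],
    label="churned",
)
training_df = training_set.load_df()      # Spark DataFrame ready for model training
```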

Comments closed

Apache Flink and Delta Lake

Max Fisher and Dylan Gessner use Flink to load data in Delta Lake:

As with all parts of our platform, we are constantly raising the bar and adding new features to enhance developers’ abilities to build the applications that will make their Lakehouse a reality. Building real-time applications on Databricks is no exception. Features like asynchronous checkpointing, session windows, and Delta Live Tables allow organizations to build even more powerful, real-time pipelines on Databricks using Delta Lake as the foundation for all the data that flows through the Lakehouse.

However, for organizations that leverage Flink for real-time transformations, it might appear that they are unable to take advantage of some of the great Delta Lake and Databricks features, but that is not the case. In this blog we will explore how Flink developers can build pipelines to integrate their Flink applications into the broader Lakehouse architecture.

Click through for two methods of doing so.
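
If you land Flink output as files in cloud storage and ingest it on the Databricks side, a minimal Auto Loader sketch might look like the following (paths, schema, and table name are hypothetical; the post covers the details of both approaches, including the direct Flink/Delta connector):

```python
# Incrementally pick up Parquet files written by the Flink job and append them to a Delta table
(spark.readStream
      .format("cloudFiles")                        # Databricks Auto Loader
      .option("cloudFiles.format", "parquet")
      .schema(flink_output_schema)                 # hypothetical schema of the Flink output
      .load("s3://landing-zone/flink-output/")     # hypothetical landing path
      .writeStream
      .option("checkpointLocation", "s3://landing-zone/_checkpoints/flink_to_delta")
      .toTable("lakehouse.flink_events"))          # hypothetical target Delta table
```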

Comments closed