Category: Hadoop

Spark Query Optimization in Synapse

Daniel Coelho lays out a few optimizations in Azure Synapse Analytics Spark pools:

The Azure Synapse Analytics team has prominent engineers enhancing and contributing back to the Apache Spark project. One of our focus areas is Spark query optimization techniques, where Microsoft has decades of experience and is making significant contributions to the Apache Spark open source engine.

The attachment at the bottom of this blog post will be presented at the 48th International Conference on Very Large Databases (#VLDB2022) and covers the latest developments in query optimization for Apache Spark 3. Those optimizations were developed by Microsoft engineers and are available today in the Azure Synapse runtime for Apache Spark versions 3.1 and 3.2.

Check out the high-level updates as well as a complete technical paper laying out the changes.

Flink and Akka

Chesnay Schepler responds to an announcement:

On September 7th Lightbend announced a license change for the Akka project, the TL;DR being that you will need a commercial license to use future versions of Akka (2.7+) in production if you exceed a certain revenue threshold.

Within a few hours of the announcement several people reached out to the Flink project, worrying about the impact this has on Flink, as we use Akka internally.

The purpose of this blogpost is to clarify our position on the matter.

Read on for what this means for Apache Flink.

Replacing ZooKeeper with Kafka Raft

Dave Shook shows off the Kafka Raft protocol:

Apache Kafka® Raft (KRaft) is the consensus protocol that was introduced to remove Apache Kafka’s dependency on ZooKeeper™ for metadata management. This greatly simplifies Kafka’s architecture by consolidating responsibility for metadata into Kafka itself, rather than splitting it between two different systems: ZooKeeper and Kafka. KRaft mode is available in the Apache Kafka 3.1 release, though it is not yet ready for use in production environments. Refer to KIP-833 to learn more about when KRaft will be marked as production ready.

Below are several key resources to help you learn everything you need to know about the ins and outs of KRaft, its role in the Kafka architecture, and how you can get started in trying it out. These resources are followed by two others that let you “do stuff” with KRaft, i.e., run a KRaft-mode cluster and observe how it works when various cluster controller-related operations take place.

Eliminating the ZooKeeper dependency has been a goal of the Kafka team for several years. It’s not quite production-ready yet, but I’ll be interested to see how that migration works for companies.
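
If you want to kick the tires on a single node, here is a minimal sketch of the combined broker-plus-controller settings for a Kafka 3.x sandbox. Every value below is my own placeholder, not something from Dave's material:

```python
# A minimal sketch of combined broker+controller KRaft mode (Kafka 3.x).
# All values are single-node sandbox assumptions, not production settings.
kraft_props = """\
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
log.dirs=/tmp/kraft-combined-logs
"""

with open("server.properties", "w") as f:
    f.write(kraft_props)

# The storage directory must be formatted once before the broker starts:
#   bin/kafka-storage.sh random-uuid
#   bin/kafka-storage.sh format -t <that-uuid> -c server.properties
#   bin/kafka-server-start.sh server.properties
```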

Creating Multiple Output Files per Spark Task

Dmitry Tolpeko has a quick but helpful post:

It is highly recommended that you try to evenly distribute the work among multiple tasks so every task produces a single output file and the job completes in parallel.

But sometimes it may still be useful for a task to generate multiple output files with a limited number of records in each file […]

I had to cut it off right there to keep from spilling the beans here. Click through to Dmitry’s post to see which setting controls records per file, allowing you to keep opening those Spark output files in Excel.
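
For the impatient: the knob in question is presumably maxRecordsPerFile, which recent Spark releases expose both as a write option and as a session config. A quick PySpark sketch, with placeholder paths and row counts of my own:

```python
# Minimal PySpark sketch: cap the number of records each task writes per file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("records-per-file").getOrCreate()
df = spark.range(1_000_000)  # placeholder dataset

# Per-write option: each task rolls to a new output file every 100,000 rows.
(df.write
   .option("maxRecordsPerFile", 100_000)
   .mode("overwrite")
   .parquet("/tmp/output"))

# Or session-wide:
spark.conf.set("spark.sql.files.maxRecordsPerFile", 100_000)
```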

Apache Flink Updates

Danny Cranmer announces Flink 1.15.2:

The Apache Flink Community is pleased to announce the second bug fix release of the Flink 1.15 series.

This release includes 30 bug fixes, vulnerability fixes, and minor improvements for Flink 1.15. Below you will find a list of all bugfixes and improvements (excluding improvements to the build infrastructure and build stability). For a complete list of all changes see: JIRA.

We highly recommend all users upgrade to Flink 1.15.2.

In addition to that, Jingsong Lee announces Flink Table Store 0.2.0:

Flink Table Store is a data lake storage for streaming updates/deletes changelog ingestion and high-performance queries in real time.

As a new type of updatable data lake, Flink Table Store has the following features:

– High-throughput data ingestion while offering good query performance.

– High-performance queries with primary key filters, as fast as 100 ms.

– Streaming reads are available on lake storage; lake storage can also be integrated with Kafka to provide second-level streaming reads.

Read on for the changes in both platforms.

Overly Large Executors in ElasticMapReduce

Dmitry Tolpeko notes a change to Amazon ElasticMapReduce:

So 50 executors were initially requested with the required 22528 MB of memory and 4 vcores each, as expected, but 9 executors were actually created with 112640 MB of memory and 20 cores, which is 5x larger. It should have created 10 executors, but my cluster does not have the resources to run more containers.

Note: the second log row specifies allocated vCores:5. That is because my YARN cluster uses DefaultResourceCalculator, which ignores CPU and considers memory only. Do not pay attention to this; the Spark executor will still use 20 cores, as reported in the third log record above.

Click through for the reason.
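
For context, here is a PySpark sketch of the kind of request Dmitry describes: 50 executors at 4 cores and 20 GB apiece, where 20 GB of heap plus the default 10% overhead yields the 22528 MB container size in his logs. Why EMR quintuples it is the part you click through for:

```python
# Sketch of the request: 50 executors at 4 cores and 20 GB each.
# 20480 MB heap + default 10% memoryOverhead (2048 MB) = 22528 MB per YARN container.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor-sizing")
         .config("spark.executor.instances", "50")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "20g")
         .getOrCreate())
```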

Watermarking in Spark Structured Streaming

Max Fisher takes us through an important feature for Spark streaming:

When building real-time pipelines, one of the realities that teams have to work with is that distributed data ingestion is inherently unordered. Additionally, in the context of stateful streaming operations, teams need to be able to properly track event time progress in the stream of data they are ingesting for the proper calculation of time-window aggregations and other stateful operations. We can solve for all of this using Structured Streaming.

For example, let’s say we are a team working on building a pipeline to help our company do proactive maintenance on our mining machines that we lease to our customers. These machines always need to be running in top condition so we monitor them in real-time. We will need to perform stateful aggregations on the streaming data to understand and identify problems in the machines.

This is where we need to leverage Structured Streaming and Watermarking to produce the necessary stateful aggregations that will help inform decisions around predictive maintenance and more for these machines.

Read on to see how watermarking works in various scenarios, including when you join together streams.
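
As a concrete anchor for the concept, here is a minimal PySpark sketch that uses the built-in rate source as a stand-in for machine telemetry; the machine_id and event_time columns are inventions of mine, not from the post:

```python
# A minimal sketch of watermarking against the built-in rate source.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = (spark.readStream
          .format("rate")                  # test source: emits (timestamp, value)
          .option("rowsPerSecond", 10)
          .load()
          .withColumn("machine_id", F.col("value") % 5)
          .withColumnRenamed("timestamp", "event_time"))

# Events arriving more than 10 minutes behind the max event time seen so far
# are dropped, letting Spark discard old window state instead of keeping it forever.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "machine_id")
          .count())

(counts.writeStream
       .outputMode("update")
       .format("console")
       .start()
       .awaitTermination())
```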

Securing Kafka Streams

Amani Newton gives us a primer on Apache Kafka security:

The largest companies in the world use Apache Kafka® for their real-time streaming data pipelines and applications. Kafka is the basis for the real-time fraud text alerts from your bank and the network-connected medical devices used in your local hospital. Securing customer or patient data as it flows through the Kafka system is crucial. However, out of the box, Kafka has relatively little security enabled. This blog post previews the free Confluent Developer course that teaches the basics of securing your Apache Kafka-based system.

Click through for the overview.
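
As a small taste of what client-side security configuration looks like, here is a sketch using the confluent-kafka Python client against a listener secured with SASL/SCRAM over TLS. The broker address, mechanism, and credentials are all placeholder assumptions:

```python
# Hedged sketch: minimum client-side settings for a SASL_SSL-secured listener.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9093",       # assumed secured listener
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "app-user",              # placeholder credentials
    "sasl.password": "app-secret",
    "ssl.ca.location": "/etc/kafka/ca.pem",   # CA that signed the broker cert
})

producer.produce("secured-topic", b"hello")
producer.flush()
```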

Useful Design Patterns for Apache Spark Projects

Alexander Eleseev applies some design patterns:

When I participated in a big data project, I needed to program Spark applications to move and transform data from/to relational and distributed databases, like Apache Hive. I found such applications to have a number of pitfalls, so problems like “hard-to-read code” and “methods too large to fit on a single screen” need to be avoided so that we can focus on deeper issues. Also, Spark jobs are similar: data is loaded from one or more databases, gets transformed, then is saved to one or more databases. So it seems reasonable to try to use GoF patterns to program Spark applications.

Specifically, this covers Spark code written in Java (or Python). I’d argue that Scala-based code would profit from following a different set of functional patterns rather than Gang of Four object-oriented design patterns.
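
As one illustration of the idea, here is my own sketch (not code from the article) of a GoF Template Method that fixes the load-transform-save skeleton these jobs share:

```python
# Template Method sketch: the run() skeleton is fixed; each job fills in the steps.
from abc import ABC, abstractmethod
from pyspark.sql import SparkSession, DataFrame


class SparkJob(ABC):
    def __init__(self, spark: SparkSession):
        self.spark = spark

    def run(self) -> None:
        # The invariant pipeline shape lives in one place.
        self.save(self.transform(self.load()))

    @abstractmethod
    def load(self) -> DataFrame: ...

    @abstractmethod
    def transform(self, df: DataFrame) -> DataFrame: ...

    @abstractmethod
    def save(self, df: DataFrame) -> None: ...


class HiveToParquetJob(SparkJob):
    """Hypothetical concrete job: Hive table in, cleaned Parquet out."""

    def load(self) -> DataFrame:
        return self.spark.table("source_db.events")  # placeholder Hive table

    def transform(self, df: DataFrame) -> DataFrame:
        return df.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")

    def save(self, df: DataFrame) -> None:
        df.write.mode("overwrite").parquet("/data/clean/events")
```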

From Kafka to Azure Data Explorer with Protobuf Data

Anshul Sharma and Ramachandran G do a bit of converting:

Kafka has increasingly become a popular choice for scalable message queueing in large data processing workloads. This makes it very popular in IoT-based ecosystems, where there is a large ingress of data before data processing or data storage. Azure Data Explorer is a very powerful time series and analytics database that suits IoT-scale data ingestion and data querying.

Kafka supports ingestion of data in multiple formats, including JSON, Avro, Protobuf, and String. ADX supports ingesting data from Kafka in all of these formats. Due to its excellent schema support, extensibility to various platforms, and compression, protobuf (https://developers.google.com/protocol-buffers) is increasingly becoming the data exchange format of choice in IoT-based systems. The ADX Kafka sink connector leverages the Kafka Connect framework and provides an adapter to ingest data from Kafka in all these formats.

The following section aims to provide configuration to support ingestion of protobuf data from Kafka to ADX. 

Click through for the high-level architecture and a deeper dive into the process.
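
The post itself covers the connector-side configuration; as a companion sketch, here is one way the producing side might put schema-registry-encoded protobuf onto a Kafka topic with the confluent-kafka Python client. The telemetry_pb2 module, topic, and field names are hypothetical:

```python
# Hedged sketch: produce protobuf records to Kafka for downstream ADX ingestion.
# Requires confluent-kafka and a telemetry_pb2 module generated by protoc.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.protobuf import ProtobufSerializer
from confluent_kafka.serialization import StringSerializer

import telemetry_pb2  # hypothetical protoc-generated module

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = ProtobufSerializer(telemetry_pb2.Reading, registry,
                                {"use.deprecated.format": False})

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": serializer,
})

# Hypothetical message type and topic name.
reading = telemetry_pb2.Reading(device_id="sensor-42", temperature=21.5)
producer.produce(topic="iot-readings", key="sensor-42", value=reading)
producer.flush()
```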
