Press "Enter" to skip to content

Category: Hadoop

Spark RDD Transformations

Meenakshi Goyal walks us through the transformation functions available to you when using a Spark RDD:

The role of a transformation in Spark is to create a new dataset from an existing one. Transformations are lazy: they are computed only when an action requires a result to be returned to the driver program.

Because transformations are inherently lazy, they are not carried out right away; they execute only when we call an action. Two of the most common transformations are map() and filter().
The resulting RDD is always distinct from its parent RDD after the transformation. It can be smaller (e.g., filter(), distinct(), sample()), bigger (e.g., flatMap(), union(), cartesian()), or the same size (e.g., map()).

Read on to learn more about transformations, including examples of how each works. Even if you’re using the DataFrames API for Spark, it’s still important to understand that transformations are lazy.
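
As a quick illustration of that laziness, here is a minimal Scala sketch, assuming a running SparkSession named spark: the two transformations only record lineage, and nothing executes until the count() action at the end.

    // Build an RDD; the transformations below only record lineage.
    val numbers = spark.sparkContext.parallelize(1 to 100)

    val doubled = numbers.map(_ * 2)      // same size as the parent RDD
    val bigOnly = doubled.filter(_ > 150) // smaller than the parent RDD

    // Only this action triggers execution and returns a result to the driver.
    println(bigOnly.count()) // 25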


REST APIs for Synapse Spark Pools

Abid Nazir Guroo looks at some endpoints:

Azure Synapse Analytics Representational State Transfer (REST) APIs are secure HTTP service endpoints that support creating and managing Azure Synapse resources using Azure Resource Manager and Azure Synapse web endpoints. This article provides instructions on how to set up and use Synapse REST endpoints and describes the Apache Spark pool operations supported by the REST APIs.

Read on to see some of the Spark pool management options available to you via the REST API.
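
To give a flavor of the API, here is a hedged Scala sketch that lists the Spark pools in a workspace via Azure Resource Manager. The bigDataPools resource path and the api-version are my assumptions based on ARM conventions (check the REST reference for current values), and the subscription, resource group, workspace, and token are placeholders.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Placeholders -- substitute real values; the token can come from
    // `az account get-access-token`, for example.
    val subscription = "00000000-0000-0000-0000-000000000000"
    val resourceGroup = "my-resource-group"
    val workspace = "my-synapse-workspace"
    val token = sys.env("AZURE_BEARER_TOKEN")

    // Spark pools surface as "bigDataPools" under the Microsoft.Synapse
    // provider (assumed path and api-version).
    val uri = URI.create(
      s"https://management.azure.com/subscriptions/$subscription" +
      s"/resourceGroups/$resourceGroup/providers/Microsoft.Synapse" +
      s"/workspaces/$workspace/bigDataPools?api-version=2021-06-01")

    val request = HttpRequest.newBuilder(uri)
      .header("Authorization", s"Bearer $token")
      .GET()
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // JSON description of the workspace's Spark pools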


Diagnosing Customer Rebalance Time Issues in Kafka

Danica Fine and Nikoleta Verbeck continue a series on Kafka performance troubleshooting:

On the surface, rebalancing seems simple. The number of consumers in the consumer group is changing, so the subscribed topic-partitions must be redistributed, right? Yes, but there’s a bit more going on under the hood, and this changes depending on what kind of rebalancing is taking place.

Read on to learn more about how Kafka performs rebalancing and what might affect performance.
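
A few consumer settings have an outsized effect on rebalancing. Here is a minimal Scala sketch of a consumer that opts into static membership and cooperative rebalancing; the broker address, group id, and instance id are placeholders.

    import java.util.Properties
    import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group")              // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringDeserializer")

    // Settings that shape when and how rebalances happen:
    props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000")     // how long a silent consumer stays in the group
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000")  // max gap between poll() calls before eviction
    props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "consumer-1") // static membership: restarts avoid full rebalances
    props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,   // cooperative instead of stop-the-world rebalancing
      "org.apache.kafka.clients.consumer.CooperativeStickyAssignor")

    val consumer = new KafkaConsumer[String, String](props)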


Increased Request Rate and Response Time in Kafka

Danica Fine and Nikoleta Verbeck troubleshoot another common Apache Kafka issue:

It can be easy to go about life without thinking about them, but requests are an important part of Kafka; they form the basis of how clients (both producers and consumers) interact with data as it moves into and out of Kafka topics, and, in certain cases, too many requests can have a negative impact on your brokers. To understand how requests can affect the brokers, it’s important to be familiar with what happens under the hood when a request is made. 

Read on to see how the process works under the covers, what kinds of metrics you can use to determine how well things are going, and what might be going wrong if you see certain symptoms.
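
If you want to watch request behavior from the client side, the Java clients expose their metrics programmatically. A small Scala sketch, assuming an existing KafkaProducer named producer (a hypothetical handle); the metric names come from the standard producer-metrics group.

    import scala.jdk.CollectionConverters._

    // Print the request-level metrics the producer tracks.
    val interesting = Set("request-rate", "request-latency-avg", "request-size-avg")
    producer.metrics().asScala.foreach { case (name, metric) =>
      if (name.group == "producer-metrics" && interesting.contains(name.name))
        println(s"${name.name} = ${metric.metricValue()}")
    }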


Time Travel with Delta Tables in Synapse

Liliam Leme reverses the clock:

Scenario

While I was working with a customer, they had a requirement to restore modified files to a specific point in time. They had built their architecture on top of a data lake.

Looking for options

While working on this scenario, we explored some of the storage options available without any additional customization, for example Soft delete for blobs – Azure Storage | Microsoft Docs.

Read on to see what they landed on.
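
Since the title gives it away, here is what Delta time travel looks like in practice: a minimal Scala sketch that reads a table as of an earlier version or timestamp. The path and timestamp are placeholders, and spark is an existing SparkSession.

    // Read the table as it existed at an earlier version.
    val v0 = spark.read.format("delta")
      .option("versionAsOf", "0")
      .load("abfss://container@account.dfs.core.windows.net/delta/events")

    // Or as it existed at a point in time.
    val before = spark.read.format("delta")
      .option("timestampAsOf", "2022-10-01 00:00:00")
      .load("abfss://container@account.dfs.core.windows.net/delta/events")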


Practical Results of a ZooKeeper-less Kafka

Paul Brebner does the math:

The Kafka cluster metadata is now stored only in the Kafka cluster itself, making metadata update operations faster and more scalable. The metadata is also replicated to all the brokers, making failover faster too. Finally, the active Kafka controller is now the Quorum Leader, using Raft for leader election.

The motivation for giving Kafka a “brain transplant” (replacing ZooKeeper with KRaft) was to fix scalability and performance issues, enable more topics and partitions, and eliminate the need to run an Apache ZooKeeper cluster alongside every Kafka cluster.

Read on for some initial testing of KRaft versus ZooKeeper.
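
For context, the configuration difference is small: instead of pointing at a ZooKeeper ensemble via zookeeper.connect, a KRaft node declares its roles and its controller quorum in server.properties. A rough sketch of a combined broker-and-controller node follows; all values are placeholder assumptions, so check the Kafka documentation for your version.

    # Sketch of server.properties for a single combined broker+controller node.
    process.roles=broker,controller
    node.id=1
    controller.quorum.voters=1@localhost:9093
    listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
    controller.listener.names=CONTROLLER

One KRaft-specific wrinkle: the storage directory must be formatted with the kafka-storage.sh tool before the node starts for the first time, a step ZooKeeper-based clusters did not have.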


Diagnosing Kafka Message Throughput Reductions

Danica Fine and Nikoleta Verbeck troubleshoot an issue:

One of the greatest advantages of Kafka is its ability to maintain high throughput of data. Unsurprisingly, high throughput starts with the producers. Prior to sending messages off to the brokers, individual records destined for the same topic-partition are batched together as a single compressed collection of bytes. These batches are then further aggregated before being sent to the destination broker.

Batching is a great thing, and we (generally) want it. But how do you know when it’s working well and when it’s not?

This first post covers message throughput, but there will be several other topics in the series as well.
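
The batching behavior described above is governed by a few producer settings. A minimal Scala sketch follows; the broker address and the specific values are placeholders, not recommendations.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    // The settings that most directly shape batching:
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536")     // max bytes per per-partition batch
    props.put(ProducerConfig.LINGER_MS_CONFIG, "20")         // wait up to 20 ms to fill a batch before sending
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4") // compress each batch as a unit

    val producer = new KafkaProducer[String, String](props)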


Securing a Kafka Cluster

Dan Weston aims to secure an Apache Kafka cluster:

As part of our educational resources, Confluent Developer now offers a course designed to help you apply Confluent Cloud’s security features to meet the privacy and security needs of your organization. This blog post explores the need to implement security for your Apache Kafka® cluster, then briefly reviews the security features and advantages of using Confluent Cloud.

Click through for an overview. The course itself is free, as well.
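
For a taste of the client side of this, here is a hedged Scala sketch of connecting to a SASL_SSL-secured cluster with an API key and secret; all values are placeholders.

    import java.util.Properties

    val props = new Properties()
    props.put("bootstrap.servers", "BOOTSTRAP_SERVER:9092") // placeholder
    props.put("security.protocol", "SASL_SSL")
    props.put("sasl.mechanism", "PLAIN")
    // The API key and secret act as the username and password (placeholders).
    props.put("sasl.jaas.config",
      "org.apache.kafka.common.security.plain.PlainLoginModule required " +
      "username=\"API_KEY\" password=\"API_SECRET\";")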


Azure Synapse Analytics R Language Support

Ryan Majidimehr has a short list of updates for Azure Synapse Analytics but it includes a big one:

Azure Synapse Analytics provides built-in R support for Apache Spark. As part of this, data scientists can leverage Azure Synapse Analytics notebooks to write and run their R code. This also includes support for SparkR and sparklyr, which allows users to interact with Spark using familiar Spark or R interfaces. To learn more, read the official how-to Use R for Apache Spark with Azure Synapse Analytics (Preview).

That it took this long for R support was a bit weird, but I’m glad it’s there now.


Motion Detecting and Alerting with Kafka and ksqlDB

Wei Rui and Yinsidi Jiao take us through a scenario:

Managing IoT (Internet of Things) devices and the data or events they produce can be a challenge. On one hand, IoT devices usually generate massive amounts of data. On the other hand, IoT hardware has many limitations on processing the data it generates, such as cost, physical size, efficiency, and availability. You need a back-end system with high scalability and availability to process the growing volume of data. Things become more challenging when dealing with numerous devices and events in real time, and when considering the availability, latency, scalability, and agility required for different use cases and scenarios.

For Confluent Hackathon 2022, we built an end-to-end motion detection and alerting system, which currently acts as a home surveillance system, on top of Apache Kafka® and ksqlDB to demonstrate how easy it is to build IoT solutions by leveraging Confluent Cloud.

Read on to see how it works.
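
To make the ksqlDB half of that pipeline concrete, here is a hypothetical sketch in ksqlDB's SQL; the stream, table, and column names are inventions for illustration, not the authors' actual schema.

    -- A stream over a raw topic of motion events (hypothetical schema).
    CREATE STREAM motion_events (device_id VARCHAR, detected_at BIGINT, motion BOOLEAN)
      WITH (KAFKA_TOPIC = 'motion-events', VALUE_FORMAT = 'JSON');

    -- Continuously count detections per device over one-minute windows.
    CREATE TABLE motion_alerts AS
      SELECT device_id, COUNT(*) AS detections
      FROM motion_events
      WINDOW TUMBLING (SIZE 1 MINUTE)
      WHERE motion = TRUE
      GROUP BY device_id
      EMIT CHANGES;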
