Hadoop – Page 3 – Curated SQL

Apache Kafka Consumer Group Strategy

Published 2023-10-03 by Kevin Feasel

Ever dealt with a misbehaving consumer group? Imbalanced broker load? This could be due to your consumer group and partitioning strategy!

Once, on a dark and stormy night, I set myself up for this error. I was creating an application to demonstrate how you can use Apache Kafka® to decouple microservices. The function of my “microservices” was to create latte objects for a restaurant ordering service. It was set up a little like this:

I wanted to implement this in Kafka by using consumers, each reading from a common coffee topic, but with their own partition. Now this was a naive approach. Why?

Click through to learn the reason, as well as some of the mechanics of how consumer groups work.

Comments closed

Tiered Storage in Apache Kafka

Published 2023-10-02 by Kevin Feasel

Matthew de Detrich explains the value behind tiered storage in Apache Kafka:

Tiered Storage is arguably one of the most sought-after features of Kafka 3.6, allowing Kafka’s core data to be stored in other locations, such as object storage, in addition to hard disks in a transparent manner, without any changes to Kafka’s producers or consumers. The Kafka brokers control whether the data is stored on local disks, fast but expensive and limited, or in alternative storage places, such as Amazon S3. When Tiered Storage is properly configured, it means you can have the best of both worlds: recent data is stored on local fast disks (as is currently), and older, less frequently accessed data can be stored elsewhere where it’s cheaper and space requirements are less of a concern (sometimes unlimited!)

Read on to learn more about the official version of tiered storage, as well as a forward port of two prior implementation attempts to Kafka version 3.3.

Comments closed

Parameterizing Databricks Notebooks with Widgets

Published 2023-09-29 by Kevin Feasel

Meagan Longoria adds some widgets:

Widgets provide a way to parameterize notebooks in Databricks. If you need to call the same process for different values, you can create widgets to allow you to pass the variable values into the notebook, making your notebook code more reusable. You can then refer to those values throughout the notebook.

Click through to learn more about the four types of widgets and how they work.

Comments closed

SparkSQL CONCAT vs T-SQL CONCAT

Published 2023-09-27 by Kevin Feasel

Bill Fellows has a public service announcement:

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

The point of difference is so important that Bill busted out the marquee HTML tag. Which now leads me to wonder, was marquee or blink the bigger evil in the mid-to-late ’90s web?

Comments closed

Setting a Spark Compute Pool Size in Microsoft Fabric

Published 2023-09-26 by Kevin Feasel

Reitse Eskens manages compute pools:

This next blog won’t be a long one and will probably serve most as a reminder for myself where to find the settings for the Spark compute pool.

When you create a workspace, you get the default starter pool and it has taken me way longer than I care to admit to find where to find the setting and, more importantly, how to change it.

Read on to learn more about how to create a Spark pool of the size you desire. The sizing method is essentially the same as with Azure Synapse Analytics.

Comments closed

Kafka Message Compression

Published 2023-09-22 by Kevin Feasel

Lucia Cerchie and Dave Troiano give us the rundown on compression of individual messages in Apache Kafka:

Topic partitions are the main “unit of parallelism” in Kafka. What’s a unit of parallelism? It’s like having multiple cashiers in the same store instead of one. Multiple purchases can be made at once, which increases the overall amount of purchases made in the same amount of time (this is like throughput). In this case, the cashier is the unit of parallelism.

In Kafka, each partition leader can live on a different broker in a cluster, and a producer can send multiple messages, each with a different destination topic partition; that is, a producer can send them in parallel. While this is the main reason Kafka enables high throughput, compression can also be a tool to help improve throughput and efficiency by reducing network traffic due to smaller messages. A well-executed compression strategy also means better disk utilization in Kafka, since stored messages on disk are smaller.

Click through for the various options and some guidance on using each.

Comments closed

Spark Defaults for Core Count and Memory

Published 2023-09-22 by Kevin Feasel

The Big Data in Real World team gives us the defaults:

spark.executor.cores controls the number of cores available for the executors.

[…]

spark.executor.memory controls the amount of memory allocated for each executor.

I did helpfully take out the first answer, so you’ll have to click through to the post in order to see the answers., as well as how cluster mode vs client mode can change things.

Comments closed

Building a Flink Application in Java

Published 2023-09-20 by Kevin Feasel

Wade Waldron talks about a (free) new course:

Recently, I got my hands dirty working with Apache Flink®. The experience was a little overwhelming. I have spent years working with streaming technologies but Flink was new to me and the resources online were rarely what I needed. Thankfully, I had access to some of the best Flink experts in the business to provide me with first-class advice, but not everyone has access to an expert when they need one.

To share what I learned, I created the Building Flink Applications in Java course on Confluent Developer. It provides you with hands-on experience in building a Flink application from the ground up. I also wrote this blog post to walk through an example of how to do dataflow programming with Flink. I hope these two resources will make the experience less overwhelming for others.

Click through for the blog post and check out the full course if you’re so inclined.

Comments closed

An Overview of Flink SQL

Published 2023-09-18 by Kevin Feasel

Martijn Visser continues a series on Kafka and Flink:

In the first two parts of our Inside Flink blog series, we explored the benefits of stream processing with Flink and common Flink use cases for which teams are choosing to leverage the popular framework to unlock the full potential of streaming. Specifically, we broke down the key reasons why developers are choosing Apache Flink® as their stream processing framework, as well as the ways in which they are putting it into practice. These range from streaming data pipelines to train ML models, to real-time inventory management in retail and predictive maintenance in manufacturing.

Next, we’ll dive into Flink SQL, which is a powerful data processing engine that allows developers to process and analyze large volumes of data in real time. We’ll cover how Flink SQL relates to the other Flink APIs and showcase some of its built-in functions and operations with syntax examples.

I’m naturally predisposed to blog posts which validate Feasel’s Law, so of course I was going to pick this one to recommend.

Comments closed

Generating Tables from Files in Microsoft Fabric via Notebook

Published 2023-09-14 by Kevin Feasel

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained on the article about lakehouse and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

Let’s make a step-by-step on this blog about how to use notebooks and table partitioning.

Click through to see how it all works.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop