Hadoop – Page 6 – Curated SQL

Killing Multiple YARN Applications at Once

Published 2023-06-07 by Kevin Feasel

The Big Data in Real World team doesn’t have time to mess around:

If you work with Apache Hadoop, you may find yourself needing to kill multiple YARN applications at once. While you can kill them one by one using the yarn application -kill command, this can be a tedious and time-consuming process. Fortunately, there is a faster way to kill multiple YARN applications at once using the yarn application command in combination with awk.

Click through to see how. I will say, though, remembering some of these sed+grep+awk solutions I’ve written in the past makes me happy that Powershell is object-based…

Comments closed

Listing Topics in Kafka without Zookeeper

Published 2023-06-01 by Kevin Feasel

The BIg Data in Real World team has a quick one for us:

Kafka uses Zookeeper to manage it’s internal state. So it is not possible to run Kafka without Zookeeper. Even if you don’t have access to Zookeeper in your organization, there is a Zookeeper cluster running which your Kafka cluster connects to.

So, how to list topics and execute other commands if we don’t have access to Zookeeper?

Eventually, this won’t even be a question, as Kafka already has production versions using KRaft, and by Kafka 4.0, there won’t be a Zookeeper to kick around anymore.

Comments closed

Databricks SQL Performance Tuning

Published 2023-05-30 by Kevin Feasel

Katie Cummiskey provides some tips for us:

We previously discussed how to use Power BI on top of Databricks Lakehouse efficiently. However, the well-designed and efficient Lakehouse itself is the basement for overall performance and good user experience. We will discuss recommendations for physical layout of Delta tables, data modeling, as well as recommendations for Databricks SQL Warehouses.

These tips and techniques proved to be efficient based on our field experience. We hope you will find them relevant for your Lakehouse implementations too.

Read on for these tips.

Comments closed

Adding Count to a Grouped DataFrame in Spark

Published 2023-05-25 by Kevin Feasel

The Big Data in Real World team does some counting:

We want to group the dataset by Name and get a count to see the employee and the number of projects they are assigned to. In addition to that sub count, we also want to add a column with a total count like below.

One important thing to remember about Spark transformations is that they’re lazy: just because you ran df.groupBy(...).agg(...) doesn’t mean the new DataFrame exists yet, so until you call the show() action (or whatever), the original data is still there for the taking, which is how you can reference it again later in the chained statement.

Comments closed

Contrasting Kafka and Pulsar

Published 2023-05-25 by Kevin Feasel

Tessa Burk perform a comparson:

Apache Kafka® and Apache Pulsar™ are 2 popular message broker software options. Although they share certain similarities, there are big differences between them that impact their suitability for various projects.

In this comparison guide, we will explore the functionality of Kafka and Pulsar, explain the differences between the software, who would use them, and why.

Click through for that comparison. I haven’t used Pulsar before, so it’s interesting to get this sort of a functionality and community comparison.

Comments closed

Building Your First Spark SQL Application

Published 2023-05-22 by Kevin Feasel

Dustin Vannoy has a new video for us:

Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.

Click through for the video and sample code.

Comments closed

Query Snowflake Data from Spark

Published 2023-05-18 by Kevin Feasel

The Big Data in Real World team crosses data platforms:

If your organization is working with lots of data you might be leveraging Spark to compute distribution. You could also potentially have some or all your data in a Snowflake data warehouse.

In a situation like this, you might have to expose data in Snowflake to the processes that run on Spark. This is made possible using the Spark Connector for Snowflake.

In this post, we will see what is Spark connector for Snowflake and how to use it from Spark to connect to Snowflake and access data from Snowflake in your Spark cluster.

Read on for a high-level architecture of how it works and the configuration you’ll need to do to get it running.

Comments closed

Databricks SQL in VSCode

Published 2023-05-15 by Kevin Feasel

Falek Miah tries out an extension:

Recently, I had the opportunity to explore the Databricks SQL extension for VSCode, and I was thoroughly impressed.

In December 2022, Databricks launched the Databricks Driver for SQLTools extension, and although it is still in preview, the features are already good and useful.

For data analysts, report developers and data engineers, having the ability to execute SQL queries against Databricks workspace objects is crucial for streamlining workflows and making data analysis activities much more efficient and quicker. The Databricks SQL extension for VSCode provides just that, with a simple and intuitive interface, this extension makes it easy to connect to Databricks workspace and run SQL queries directly from VSCode.

Click through for Falek’s thoughts. And if Databricks SQL is brand new to you, Falek also has a primer on it.

Comments closed

Kafka Broker Not Available on 127.0.0.1

Published 2023-05-09 by Kevin Feasel

The Big Data in Real World team expands beyond localhost:

You might encounter the below error message when your Kafka consumers or clients connect to Kafka broker for ingestion or consumption.
Connection to node -1 (/127.0.0.1:9092) could not be established. Broker may not be available.

Read on to see what the problem is and how you can fix it.

Comments closed

Speeding Up a Slow Kafka Consumer with Parallelism

Published 2023-05-08 by Kevin Feasel

Paul Brebner continues a series on Kafka consumers:

In Part 1 of this series, we had a look at Kafka concurrency and throughput work, recapped some earlier approaches I used to improve Kafka performance, and introduced the Kafka Parallel Consumer and supported ordering options (Partition, Key, and Unordered). In this second part we continue our investigations with some example code, a trace of a “slow consumer” example, how to achieve 1 million TPS in theory, some experimental results, what else do we know about the Kafka Parallel Consumer, and finally, if you should use it in production.

Read on to see what Paul has to say about the topic.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Hadoop