Press "Enter" to skip to content

Category: Spark

Azure Databricks Initialization Scripts

Alex Crampton explains how initialization scripts work in Azure Databricks:

This blog will demonstrate the use of cluster-scoped initialisation scripts for Azure Databricks. An example will run through how to configure an initialisation script to install libraries that are not included in the Azure Databricks runtime environment onto a cluster. It will cover how to do this firstly using the Databricks UI, followed by how to include it in your CI/CD solutions.

Read on for some examples.
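To give a flavour of what this looks like in practice, here is a minimal sketch of my own (not Alex's exact example): write a small shell script to DBFS from a notebook, then point the cluster's "Init Scripts" configuration at that path. The script path and the library being installed are placeholders.

```python
# Minimal sketch (not the post's exact script), run from a Databricks notebook:
# stage a cluster-scoped init script in DBFS, then reference
# dbfs:/databricks/scripts/install-libs.sh in the cluster's "Init Scripts" settings.
dbutils.fs.mkdirs("dbfs:/databricks/scripts/")

dbutils.fs.put(
    "dbfs:/databricks/scripts/install-libs.sh",
    """#!/bin/bash
# Install an example library that is not part of the Databricks runtime
/databricks/python/bin/pip install great-expectations
""",
    True,  # overwrite if the script already exists
)
```

Every node in the cluster runs the script at startup, so the library is available before any notebook or job attaches.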


Databricks Extension for VSCode at 1.0

Gerhard Brueckl shares the good news:

As you probably know from my previous posts, my colleagues at paiqo.com and I are constantly working to improve our VSCode extension for Databricks. Almost every month we silently release a new version to the VSCode gallery so you get the latest features. However, as this is a special release, I am also writing a dedicated blog post for it.

There’s a lot of cool stuff in here, so check it out.


Feeding Synapse Spark Info to On-Prem Kafka Clusters

Bhadreshkumar Shiyal finds a solution:

Microsoft’s official documentation for Azure Data Factory contains a tutorial which explains how to access an On-Premises SQL Server from Azure Data Factory which is inside a Managed Vnet. You can go through that article here: Access on-premises SQL Server from Data Factory Managed Vnet using Private Endpoint – Azure Data Fac….

Although based upon the article’s solution, to meet our requirements we needed to substitute on-prem Apache Kafka for on-prem SQL Server, and instead of an Azure Data Factory inside a Managed VNet, we used a Synapse workspace inside a Managed VNet. The “Forwarding Vnet” concept explained in the above tutorial remains as-is in our approach.

As soon as you turn on Data Exfiltration Protection (DEP), the lockdown is real. Click through to see what the process of exfiltrating data through an approved mechanism looks like.
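Once that networking is in place, the Spark side is just the standard Kafka sink. As a hedged sketch of my own (broker address, topic, and checkpoint path are placeholders, and the Kafka connector library is assumed to be available on the Synapse Spark pool):

```python
# Hedged sketch: with the managed VNet / private endpoint plumbing from the
# article in place, writing from a Synapse Spark pool to an on-prem Kafka
# cluster is the ordinary Kafka sink. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()

# Demo source; in practice this would be the stream you want to forward.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.select(to_json(struct("timestamp", "value")).alias("value"))
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "onprem-kafka.internal.contoso.com:9092")
    .option("topic", "synapse-events")
    .option("checkpointLocation",
            "abfss://checkpoints@mydatalake.dfs.core.windows.net/kafka-sink")
    .start()
)
```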


“Warming Up” Databricks Clusters

Ust Oldfield needs that cluster to be up:

Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 mins through to almost 10 mins. For some workloads and users, this waiting time can be frustrating if not unacceptable.
For this use case, we had streaming clusters that needed to be available for when streams started at 07:00 and to be turned off when streams stopped being sent at 21:00. Similarly, there was also a need from business users for their SQL Warehouse clusters to be available for when business started trading so that their BI reports didn’t time out waiting for the clusters to start.

Read on to see one way to solve this problem without having a cluster run 24/7.
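One hedged sketch of the "warm up" idea (not necessarily Ust's exact approach): call the Databricks clusters REST API from something scheduled shortly before 07:00, such as an Azure Function or a pipeline. The workspace URL, token handling, and cluster ID below are placeholders.

```python
# Hedged sketch: start an interactive cluster ahead of the 07:00 window via the
# Databricks REST API. Workspace URL, token source, and cluster ID are placeholders.
import os
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]                                # supply securely
CLUSTER_ID = "0601-182128-abcd1234"                                   # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/start",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID},
    timeout=30,
)
resp.raise_for_status()
# A matching scheduled call to /api/2.0/clusters/delete can terminate it at 21:00.
print("Start requested; the cluster should be warm before streams begin.")
```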


Python UDFs in Databricks SQL

Martin Grund, et al, announce a new preview feature in Databricks:

To define the Python UDF, all you have to do is write a CREATE FUNCTION SQL statement. This statement defines a function name, input parameters and types, specifies the language as PYTHON, and provides the function body between $$.

The function body of a Python UDF in Databricks SQL is equivalent to a regular Python function, with the UDF itself returning the computation’s final value. Dependencies from the Python standard library and Databricks Runtime 10.4, such as the json package in the above example, can be imported and used in your code. You can also define nested functions inside your UDF to encapsulate code to build or reuse complex logic.

I think my biggest concern here would be performance, though I say that without having used the feature.
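For a sense of the syntax being described, here is a hedged sketch of such a statement. The function name and logic are mine, this assumes a workspace where the preview is enabled, and the statement could equally be pasted into the SQL editor rather than run through spark.sql.

```python
# Hedged sketch of the CREATE FUNCTION syntax described above; the function name
# and body are illustrative, not the announcement's example. Assumes Python UDFs
# in Databricks SQL are available in the workspace.
spark.sql("""
CREATE OR REPLACE FUNCTION redact_email(raw STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
import re
# Keep the domain, mask the local part: 'someone@example.com' -> '***@example.com'
return re.sub(r'^[^@]+', '***', raw) if raw else raw
$$
""")

spark.sql("SELECT redact_email('someone@example.com') AS masked").show()
```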


Pre-Processing Data Explorer Data with Spark

Hauke Mallow does some data engineering:

We often see customer scenarios where historical data has to be migrated to Azure Data Explorer (ADX). Although ADX has very powerful data-transformation capabilities via update policies, sometimes more or less complex data engineering tasks must be done upfront. This happens if the original data structure is too complex or single data elements are too big, hitting Data Explorer’s limits of 1 MB for dynamic columns or the 1 GB maximum ingest file size for uncompressed data (see also Comparing ingestion methods and tools).

Let’s think about an Industrial Internet-of-Things (IIoT) use-case where you get data from several production lines. In the production line several devices read humidity, pressure, etc. The following example shows a scenario where a one-to-many relationship is implemented within an array. With this you might get very large columns (with millions of device readings per production line) that might exceed the limit of 1 MB in Azure Data Explorer for dynamic columns. In this case you need to do some pre-processing.

Click through to see how you can do this with an Azure Synapse Analytics Spark pool prior to ingesting it with a Data Explorer pool.
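As a hedged sketch of the kind of pre-processing involved (column names and paths are placeholders), exploding the array turns one row per production line into one row per device reading, keeping any single dynamic column well under the 1 MB limit:

```python
# Hedged sketch: flatten the one-to-many array of device readings into one row
# per reading before ingestion into Data Explorer. Column names and ADLS paths
# are placeholders, not the article's exact schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.getOrCreate()

raw = spark.read.json("abfss://iiot@mydatalake.dfs.core.windows.net/raw/production-lines/")

flattened = (
    raw.select("productionLineId", "timestamp",
               explode(col("deviceReadings")).alias("reading"))
       .select(
           "productionLineId",
           "timestamp",
           col("reading.deviceId").alias("deviceId"),
           col("reading.humidity").alias("humidity"),
           col("reading.pressure").alias("pressure"),
       )
)

# Write back out (e.g. as Parquet) for ingestion, or hand the DataFrame to the
# Spark-to-ADX connector.
flattened.write.mode("overwrite").parquet(
    "abfss://iiot@mydatalake.dfs.core.windows.net/curated/readings/")
```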


Data Sharing and Secure Cleanrooms in Databricks

Craig Porteous reviews a couple of announcements from Data + AI Summit:

Having worked with many organisations across different industries and sectors, I find the sharing of data with partners and vendors is always a pain point, and one that all too often results in both parties not quite getting what they want or need. This isn’t restricted to my experience, however, which is why Databricks announced Delta Sharing back at Data + AI Summit 2021.

Coming to this year’s conference, Delta Sharing has been established as the foundation for many new features, with the announcements of Databricks Marketplace and Cleanrooms, for example, both built upon the Delta Sharing protocol. We’ll explore Cleanrooms below and I’ll look at the Databricks Marketplace in its own post.

Read on for Craig’s thoughts on two of the bigger announcements at this year’s summit.
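For context on the protocol underneath these announcements, here is a hedged sketch of what consuming a Delta Share looks like with the open delta-sharing Python connector; the profile file and share/schema/table coordinates are placeholders supplied by the data provider.

```python
# Hedged sketch of the consumer side of Delta Sharing (the protocol Marketplace
# and Cleanrooms build on). The profile file and share/schema/table names are
# placeholders that a provider would supply.
import delta_sharing

profile = "/dbfs/FileStore/shares/open-datasets.share"    # credential file from the provider
table_url = f"{profile}#retail_share.sales.daily_orders"  # <profile>#<share>.<schema>.<table>

# Small tables can be pulled straight into pandas...
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# ...or loaded as a Spark DataFrame when running on a cluster with the
# connector installed:
# spark_df = delta_sharing.load_as_spark(table_url)
```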


Mounting Data Lake Storage from a Spark Pool

Kamil Nowinski runs into some trouble:

Last weekend, I played a bit with Azure Synapse, mounting Azure Data Lake Storage (ADLS) Gen2 in a Synapse notebook via the API in the Microsoft Spark Utilities (MSSparkUtils) package. I wanted to just do a simple test, hence I followed the documentation from Microsoft: How to use file mount/unmount API in Synapse.
Having an ADLS Account already created in a subscription – should be easy peasy, right?

Read on to understand when things might be a little more complicated than they seem. And more frustrating, once you see the cause of the problem.
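For reference, here is a hedged sketch of the mount API from that documentation; the storage account, container, and linked service names are placeholders.

```python
# Hedged sketch of the MSSparkUtils mount API, run inside a Synapse notebook.
# Storage account, container, mount point, and linked service are placeholders.
from notebookutils import mssparkutils

mssparkutils.fs.mount(
    "abfss://data@mydatalakegen2.dfs.core.windows.net",
    "/adls_mount",
    {"linkedService": "MyADLSLinkedService"},  # or {"accountKey": "..."} / {"sasToken": "..."}
)

# Mounted paths are addressed through the job-scoped synfs scheme:
job_id = mssparkutils.env.getJobId()
df = spark.read.parquet(f"synfs:/{job_id}/adls_mount/some_folder")
```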


Summarizing Data + AI Summit Announcements

Zach Stagers hits the high notes:

One of the biggest cheers of the keynote was that Delta is being fully open sourced! Databricks continue to share their incredible work to help drive our industry forward. Delta already has wide adoption, but with the open sourced version now being levelled up to the same standard as the ‘proprietary’ one, this should help cement it as the default choice for lake-based storage.

There were some announcements of things to come with Delta too, such as optimised deletes and updates that remove single rows instead of having to completely rewrite the file. It’ll be really interesting to see how this works, and just how much it boosts performance.

Read on for more notes on several big announcements.


Data Lakehouse Cleanrooms in Databricks

Matei Zaharia, et al, announce an interesting idea:

We are excited to announce data cleanrooms for the Lakehouse, allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language – Python, R, SQL, Java, and Scala – on the data while maintaining data privacy.

With the demand for external data greater than ever, organizations are looking for ways to securely exchange their data and consume external data to foster data-driven innovations. Historically, organizations have leveraged data sharing solutions to share data with their partners and relied on mutual trust to preserve data privacy. But the organizations relinquish control over the data once it is shared and have little to no visibility into how data is consumed by their partners across various platforms. This exposes them to potential data misuse and data privacy breaches. With stringent data privacy regulations, it is imperative for organizations to have control and visibility into how their sensitive data is consumed. As a result, organizations need a secure, controlled and private way to collaborate on data, and this is where data cleanrooms come into the picture.

Read on to learn more about how this all works. It’s definitely a lot better than sending off a bunch of CSVs…
