Spark – Page 11 – Curated SQL

Adding Count to a Grouped DataFrame in Spark

Published 2023-05-25 by Kevin Feasel

The Big Data in Real World team does some counting:

We want to group the dataset by Name and get a count to see the employee and the number of projects they are assigned to. In addition to that sub count, we also want to add a column with a total count like below.

One important thing to remember about Spark transformations is that they’re lazy: just because you ran df.groupBy(...).agg(...) doesn’t mean the new DataFrame exists yet, so until you call the show() action (or whatever), the original data is still there for the taking, which is how you can reference it again later in the chained statement.

Comments closed

Building Your First Spark SQL Application

Published 2023-05-22 by Kevin Feasel

Dustin Vannoy has a new video for us:

Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.

Click through for the video and sample code.

Comments closed

Query Snowflake Data from Spark

Published 2023-05-18 by Kevin Feasel

The Big Data in Real World team crosses data platforms:

If your organization is working with lots of data you might be leveraging Spark to compute distribution. You could also potentially have some or all your data in a Snowflake data warehouse.

In a situation like this, you might have to expose data in Snowflake to the processes that run on Spark. This is made possible using the Spark Connector for Snowflake.

In this post, we will see what is Spark connector for Snowflake and how to use it from Spark to connect to Snowflake and access data from Snowflake in your Spark cluster.

Read on for a high-level architecture of how it works and the configuration you’ll need to do to get it running.

Comments closed

Databricks SQL in VSCode

Published 2023-05-15 by Kevin Feasel

Falek Miah tries out an extension:

Recently, I had the opportunity to explore the Databricks SQL extension for VSCode, and I was thoroughly impressed.

In December 2022, Databricks launched the Databricks Driver for SQLTools extension, and although it is still in preview, the features are already good and useful.

For data analysts, report developers and data engineers, having the ability to execute SQL queries against Databricks workspace objects is crucial for streamlining workflows and making data analysis activities much more efficient and quicker. The Databricks SQL extension for VSCode provides just that, with a simple and intuitive interface, this extension makes it easy to connect to Databricks workspace and run SQL queries directly from VSCode.

Click through for Falek’s thoughts. And if Databricks SQL is brand new to you, Falek also has a primer on it.

Comments closed

Creating Your First PySpark Application

Published 2023-05-02 by Kevin Feasel

Dustin Vannoy gives us a primer on Apache Spark:

Get hands on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own PySpark application. Pro tip: Search for the Spark equivalent of functions you use in other programming languages (including SQL). Many will exist in the pyspark.sql.functions module.

In addition to the code listing, Dustin has a video walking us through the process.

Comments closed

Transposing a Spark DataFrame

Published 2023-05-01 by Kevin Feasel

The Big Data in Real World team makes a pivot:

Unfortunately there is no built in function to transpose a DataFrame from columns to rows in Spark. In this post we will show an easy way to transpose a DataFrame from columns to rows in Spark.

Click through for the process.

Comments closed

Common Challenges Implementing PySpark Code

Published 2023-04-27 by Kevin Feasel

Amlan Patnaik looks at some common implementation problems:

Pyspark has become one of the most popular tools for data processing and data engineering applications. It is a fast and efficient tool that can handle large volumes of data and provide scalable data processing capabilities. However, Pyspark applications also come with their own set of challenges that data engineers face on a day-to-day basis. In this article, we will discuss some of the common challenges faced by data engineers in Pyspark applications and the possible solutions to overcome these challenges.

Read on for five such challenges.

Comments closed

Map vs MapValues in Spark

Published 2023-04-18 by Kevin Feasel

The Big Data in Real World team explain how a couple of Spark functions work:

In this post we will look at the differences between map and mapValues functions and when it is appropriate to use either one.

We have a small made up dataset with Day and temperature in Fahrenheit. Let’s use both map() and mapValues() to convert them to Celsius.

Click through to learn the difference between these two functions.

Comments closed

Spark ELT in Synapse Notebooks

Published 2023-04-10 by Kevin Feasel

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.

Comments closed

Generating Artificial Data in Databricks

Published 2023-03-29 by Kevin Feasel

Ben Hazan needs some fake data:

While attending the SQLBits 2023, I took part in André Kamman’s session about “Generate test data quick, easy and lots of it with the Databricks Labs Data Generator”.

In this blog, I will share with you my insights about the DataBricks Data Generator library and I’ll give an example.

Synthetic data is a valuable resource for data scientists, engineers, and analysts who need to test, benchmark, or demonstrate their solutions without compromising sensitive or confidential information. However, generating realistic and relevant synthetic data can be challenging and time-consuming.

That’s why Databricks Labs has developed a Python library called dbldatagen that can help you create large-scale synthetic data sets using Spark.

Click through to learn more about the library and see how you can use to to generate arbitrary amounts of artificial data following certain constraints.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Spark