Press "Enter" to skip to content

Category: Spark

Databricks SQL Performance Tuning

Katie Cummiskey provides some tips for us:

We previously discussed how to use Power BI on top of Databricks Lakehouse efficiently. However, the well-designed and efficient Lakehouse itself is the basement for overall performance and good user experience.  We will discuss recommendations for physical layout of Delta tables, data modeling, as well as recommendations for Databricks SQL Warehouses.

These tips and techniques proved to be efficient based on our field experience. We hope you will find them relevant for your Lakehouse implementations too.

Read on for these tips.

Comments closed

Adding Count to a Grouped DataFrame in Spark

The Big Data in Real World team does some counting:

We want to group the dataset by Name and get a count to see the employee and the number of projects they are assigned to. In addition to that sub count, we also want to add a column with a total count like below.

One important thing to remember about Spark transformations is that they’re lazy: just because you ran df.groupBy(...).agg(...) doesn’t mean the new DataFrame exists yet, so until you call the show() action (or whatever), the original data is still there for the taking, which is how you can reference it again later in the chained statement.

Comments closed

Building Your First Spark SQL Application

Dustin Vannoy has a new video for us:

Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.

Click through for the video and sample code.

Comments closed

Query Snowflake Data from Spark

The Big Data in Real World team crosses data platforms:

If your organization is working with lots of data you might be leveraging Spark to compute distribution. You could also potentially have some or all your data in a Snowflake data warehouse.

In a situation like this, you might have to expose data in Snowflake to the processes that run on Spark. This is made possible using the Spark Connector for Snowflake.

In this post, we will see what is Spark connector for Snowflake and how to use it from Spark to connect to Snowflake and access data from Snowflake in your Spark cluster.

Read on for a high-level architecture of how it works and the configuration you’ll need to do to get it running.

Comments closed

Databricks SQL in VSCode

Falek Miah tries out an extension:

Recently, I had the opportunity to explore the Databricks SQL extension for VSCode, and I was thoroughly impressed.

In December 2022, Databricks launched the Databricks Driver for SQLTools extension, and although it is still in preview, the features are already good and useful.

For data analysts, report developers and data engineers, having the ability to execute SQL queries against Databricks workspace objects is crucial for streamlining workflows and making data analysis activities much more efficient and quicker. The Databricks SQL extension for VSCode provides just that, with a simple and intuitive interface, this extension makes it easy to connect to Databricks workspace and run SQL queries directly from VSCode.

Click through for Falek’s thoughts. And if Databricks SQL is brand new to you, Falek also has a primer on it.

Comments closed

Creating Your First PySpark Application

Dustin Vannoy gives us a primer on Apache Spark:

Get hands on with Python and PySpark to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset which can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own PySpark application. Pro tip: Search for the Spark equivalent of functions you use in other programming languages (including SQL). Many will exist in the pyspark.sql.functions module.

In addition to the code listing, Dustin has a video walking us through the process.

Comments closed

Common Challenges Implementing PySpark Code

Amlan Patnaik looks at some common implementation problems:

Pyspark has become one of the most popular tools for data processing and data engineering applications. It is a fast and efficient tool that can handle large volumes of data and provide scalable data processing capabilities. However, Pyspark applications also come with their own set of challenges that data engineers face on a day-to-day basis. In this article, we will discuss some of the common challenges faced by data engineers in Pyspark applications and the possible solutions to overcome these challenges.

Read on for five such challenges.

Comments closed

Spark ELT in Synapse Notebooks

Liliam Leme performs some data movement:

I often receive various requests from customers while working on FastTrack projects, and I have compiled some examples to help you build your solution on top of a data lake using useful tips. Most of the examples in this post use pandas, and I hope they will be helpful for you as they were for me.

Please note that all examples in this post use pyspark.

In my scenario, I exported multiple tables from SQLDB to a folder using a notebook and ran the requests in parallel.

Read on for the examples and some of the things you can do with Spark notebooks in Azure Synapse Analytics.

Comments closed