Press "Enter" to skip to content

Category: Spark

Killing a Running Apache Spark Application

The Big Data in Real World team pulls the plug on an application:

Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for various reasons, such as if the application is stuck, consuming too many resources, or taking too long to complete. In this post, we will discuss how to kill a running Spark application.

Click through to see how you can do this.
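
As a rough sketch of the idea (not necessarily the commands from the linked post), here is how you might kill an application running under YARN, assuming you already have the application ID from `yarn application -list` or the Spark UI:

```python
# A minimal sketch, assuming the Spark application runs on YARN.
# The application ID below is a hypothetical placeholder.
import subprocess

def kill_spark_app(app_id: str) -> None:
    """Ask the YARN resource manager to kill the running application."""
    subprocess.run(["yarn", "application", "-kill", app_id], check=True)

kill_spark_app("application_1700000000000_0042")
```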

Apache Spark Execution Plan Analysis

Karthik Penikalapati digs into Spark SQL explain plans:

In this blog post, we will explore how the Explain Plan can be your secret weapon for debugging and optimizing Spark applications. We’ll dive into the basics and provide clear examples in Spark Scala to help you understand how to leverage this valuable tool.

All I’m saying is, if some company wants to create a SQL Sentry Plan Explorer for Apache Spark, I’d be down with it. The lack of an intuitive and powerful graphical interface for execution plans is definitely a point of friction when working with Apache Spark and Spark SQL.
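
If you want to poke at this yourself, here is a quick sketch in PySpark (the linked post uses Scala, but the explain modes behave the same) showing the different plan outputs you can request; the DataFrame itself is made up:

```python
# A small, hypothetical example of the Spark 3.x explain() modes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

agg.explain()                  # physical plan only
agg.explain(mode="extended")   # parsed, analyzed, optimized, and physical plans
agg.explain(mode="formatted")  # physical plan plus per-operator details
```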

An Intro to Databricks Asset Bundles

Dustin Vannoy covers one technique for CI/CD in Databricks:

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, REST API, Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.

Click through for a video and some sample scripts.

Parameterizing Databricks Notebooks with Widgets

Meagan Longoria adds some widgets:

Widgets provide a way to parameterize notebooks in Databricks. If you need to call the same process for different values, you can create widgets to allow you to pass the variable values into the notebook, making your notebook code more reusable. You can then refer to those values throughout the notebook.

Click through to learn more about the four types of widgets and how they work.
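
As a quick sketch of what that looks like in a Databricks notebook (the widget names, defaults, and choices here are made up), one widget of each of the four types:

```python
# A minimal sketch, assuming a Databricks notebook where dbutils is available.
dbutils.widgets.text("source_path", "/mnt/raw/sales", "Source path")
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"], "Environment")
dbutils.widgets.combobox("country", "US", ["US", "CA", "MX"], "Country")
dbutils.widgets.multiselect("regions", "East", ["East", "West", "North", "South"], "Regions")

# Read the values back anywhere in the notebook.
source_path = dbutils.widgets.get("source_path")
environment = dbutils.widgets.get("environment")
```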

SparkSQL CONCAT vs T-SQL CONCAT

Bill Fellows has a public service announcement:

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

The point of difference is so important that Bill busted out the marquee HTML tag. Which now leads me to wonder, was marquee or blink the bigger evil in the mid-to-late ’90s web?
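
To see the NULL-handling difference Bill calls out, here is a quick sketch you can run in any Spark session:

```python
# A minimal sketch: Spark SQL's concat returns NULL if any argument is NULL,
# while T-SQL's CONCAT treats NULL arguments as empty strings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT concat('abc', NULL, 'def') AS result").show()
# Spark returns NULL here; in SQL Server, SELECT CONCAT('abc', NULL, 'def')
# returns 'abcdef'.
```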

Setting a Spark Compute Pool Size in Microsoft Fabric

Reitse Eskens manages compute pools:

This next blog won’t be a long one and will probably serve mostly as a reminder to myself of where to find the settings for the Spark compute pool.

When you create a workspace, you get the default starter pool, and it has taken me way longer than I care to admit to find where that setting lives and, more importantly, how to change it.

Read on to learn more about how to create a Spark pool of the size you desire. The sizing method is essentially the same as with Azure Synapse Analytics.

Generating Tables from Files in Microsoft Fabric via Notebook

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained in the article about lakehouses and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

In this blog post, let’s walk step by step through how to use notebooks and table partitioning.

Click through to see how it all works.
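
As a rough sketch of the notebook approach (not necessarily Dennes’s exact code; the path, table name, and partition column are hypothetical):

```python
# A minimal sketch, assuming a Fabric notebook attached to a lakehouse, where
# the spark session is already available.
df = spark.read.option("header", "true").csv("Files/sales/*.csv")

(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("sale_year")           # hypothetical partition column
   .saveAsTable("sales_partitioned"))  # lands as a partitioned lakehouse table
```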

TINYINT Casts in Spark SQL vs T-SQL

Bill Fellows runs into an interesting oddity:

Yet another thing that has bitten me working in SparkSQL in Databricks—this time it’s data types.

In SQL Server, a tinyint ranges from 0 to 255, while Spark SQL’s tinyint ranges from -128 to 127, but both of them allow for 256 total values. If you attempt to cast a value that doesn’t fit in that range, you’re going to raise an error.

SQL Server’s TINYINT data type is an unsigned one-byte number, whereas TINYINT in Spark SQL is a signed one-byte number. But that’s not the biggest difference Bill finds, so check out the post to learn more.
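
A quick sketch of the signed-versus-unsigned part, assuming a notebook where the spark session is already defined (and noting that your runtime’s default ANSI setting may differ):

```python
# A minimal sketch: 200 fits in SQL Server's unsigned TINYINT (0 to 255) but
# not in Spark SQL's signed TINYINT (-128 to 127).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(200 AS TINYINT) AS v").show()  # wraps around to -56, no error

# With ANSI mode enabled, the same cast raises an overflow error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT CAST(200 AS TINYINT) AS v").show()  # would fail here
```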

Querying the Power BI REST API from Fabric Spark

Gerhard Brueckl makes the call:

Microsoft Fabric has a lot of different components which usually work very well together. However, even though Power BI is a fundamental part of Fabric, there is not really a tight integration between the Data Engineering components and Power BI. In this blog post I will show you an easy and reusable way to query the Power BI REST API via Fabric Spark SQL in a very straightforward way. The extracted data can then be stored in the data lake, e.g. to create a history of your dataset refreshes, the state of your workspaces, or any other information that is provided by the REST API.

Click through for a list of operations, followed by the code you’ll need to pull this off.
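
Gerhard has his own reusable approach, so do read the post; purely as a generic sketch of the idea (the token audience, workspace ID, and endpoint here are my assumptions, not his code):

```python
# A minimal sketch, assuming a Fabric notebook where mssparkutils and spark are
# available and the executing identity has Power BI API permissions.
import requests

token = mssparkutils.credentials.getToken("pbi")   # assumed audience name
headers = {"Authorization": f"Bearer {token}"}

# Hypothetical endpoint: list the datasets in one workspace.
url = "https://api.powerbi.com/v1.0/myorg/groups/<workspace-id>/datasets"
resp = requests.get(url, headers=headers)
resp.raise_for_status()

# Land the results in a Spark DataFrame so they can be written to the lakehouse.
datasets = spark.createDataFrame(resp.json()["value"])
display(datasets)
```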
