Press "Enter" to skip to content

Category: Spark

Changing the Timeout of a Spark Session in Microsoft Fabric

Koen Verbeeck doesn’t have time to wait:

You might know the feeling: you're writing code in a Notebook in Microsoft Fabric and suddenly you have to leave your workstation for a while. Someone rang the doorbell (you're working from home and a parcel is being delivered), or you took a coffee break with some colleagues. When you return to your notebook, the Spark session has timed out, and when you run a cell you have to wait for the damn thing to restart. The agony of waiting 2-3 minutes for the session to start, and only after that can the actual code begin running.

Read on to see how you can set the timeout to a custom value, assuming you’re okay with paying for the Spark cluster to sit around until it times out.


Retrieving Spark Session Config Variables from Microsoft Fabric

Koen Verbeeck gets some settings:

I was trying some stuff out in a notebook on top of a Microsoft Fabric Lakehouse. I was wondering what some of the default values are of the configuration variables, and if there’s an easy way to retrieve them all. Luckily there is. In the code, I’m using Scala because it has a nice GetAll() function.

Click through for an example of how to use this. And bonus points for using Scala instead of Python here.
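If you'd rather stay in PySpark, there is a rough counterpart you can try; this is my own sketch rather than anything from Koen's post, and it surfaces the SparkConf-level settings rather than the full runtime configuration that Scala's spark.conf.getAll returns:

```python
from pyspark.sql import SparkSession

# In a Fabric notebook the `spark` session already exists; getOrCreate() simply
# picks it up (or builds a local session if you run this elsewhere).
spark = SparkSession.builder.getOrCreate()

# Rough PySpark counterpart: dump the SparkConf-level settings of the session.
# Not identical to Scala's spark.conf.getAll, which returns the runtime SQL
# configuration, but a quick way to see what is set.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key} = {value}")
```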


Using Apache Spark in Microsoft Fabric

Ginger Grant gives us an overview of where we can use Apache Spark in Microsoft Fabric:

If you have used Spark in Azure Synapse, prepare to be pleasantly surprised with the compute experience in Microsoft Fabric, as Spark compute starts a lot faster because the underlying technology has changed. The Data Engineering and Data Science Fabric experiences include a managed Spark compute, which, like previous Spark compute, charges you when it is in use. The difference is that the nodes are reserved for you rather than allocated when you start the compute, which results in compute starting in 30 seconds or less, versus the 4 minutes of waiting it takes for Azure Synapse compute to start. If you have capacity needs that the default managed Spark compute will not provide, you can always create a custom pool. Custom pools are created in a specific workspace, so you will need Administrator permissions on the workspace to create them. You can choose to make the new pool your default pool as well, so it will be what starts in the workspace.

Read on for more of Ginger’s thoughts on the matter, including how you can use Copilot in Microsoft Fabric (if you pay for it) to help generate Spark code.


Renaming Multiple Columns in a PySpark Notebook

Gilbert Quevauvilliers wants one rename to rule them all:

Following on from my previous blog post, in this post I'm going to demonstrate how to bulk rename column names in a single step instead of having to rename them individually.

The reason this came about is that I had a set of data where the column names had square brackets, which I wanted to remove.

As shown below, I have highlighted two column names with the square brackets.

Read on to see how you can perform somewhat-generic rename operations in Spark notebooks.
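As a rough sketch of the idea (my own illustration, not necessarily Gilbert's exact code), here is one way to strip the brackets from every column in a single pass; the sample data and column names are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample data with bracketed column names, mimicking the scenario.
df = spark.createDataFrame(
    [(1, "2024-01-01"), (2, "2024-01-02")],
    ["[Sales Amount]", "[Order Date]"],
)

# Build the cleaned names once, then rename every column in a single toDF call.
cleaned_names = [c.replace("[", "").replace("]", "") for c in df.columns]
renamed = df.toDF(*cleaned_names)

renamed.printSchema()  # columns are now "Sales Amount" and "Order Date"
```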


Using Databricks System Tables

Dustin Vannoy has a primer on system tables in Databricks:

Monitoring is important, so I've covered the topic a few times in the past. I've talked about collecting your Spark application logs and Spark metrics. These are a good way to track what is happening and what is going wrong as your code runs. In the video related to this post, I focus on a different side of monitoring: the evolving capabilities offered by Databricks system tables. I have some sample queries and links to help you get started and begin to get value from system tables. This will need to be updated (I'll try) as new tables go into public preview status. So let's discuss the questions I had when I first started researching this feature:
1) What do the Databricks system tables offer me for monitoring?
2) How much does this overlap with the application logs and metrics?

Click through for a video and a walkthrough.
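If you want something to poke at while you watch, a starter query along these lines is one way in. The table and column names (system.billing.usage, usage_date, sku_name, usage_quantity) are my assumptions about the preview schema, so check what your workspace actually exposes:

```python
from pyspark.sql import SparkSession

# In a Databricks notebook this picks up the existing session; system tables
# also require Unity Catalog and the system schemas to be enabled.
spark = SparkSession.builder.getOrCreate()

# Starter query against the billing system table (schema assumed; verify the
# table and columns in your workspace before relying on this).
usage_by_sku = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS usage_quantity
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name
    ORDER BY usage_date, sku_name
""")
usage_by_sku.show(truncate=False)
```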


Building Functions with Spark Connect and .NET

Ed Elliott continues a series on Spark Connect:

I'm pretty much going to leave the code as-is from the previous post but will move things about a bit and add a SparkSession and a DataFrame class. Also, instead of passing the session id and client around, I'm going to wrap them in the SparkSession so that we can just pass a single object, and also use it to construct the DataFrame so we don't even have to worry about passing it around.

The first thing is to take all of that gRPC connection stuff and shove it into SparkSession so it is hidden from the callers:

Read on for the end state that Ed is headed toward and how to get closer to that state.


Exploring the gRPC API in Spark Connect with .NET

Ed Elliott continues a series on Spark Connect. First, Ed builds out something DataFrame API-ish:

So there are two goals of this post: the first is to take a look at Apache Arrow and how we can do things like show the output from DataFrame.Show; the second is to start to create objects that look more familiar to us, i.e. the DataFrame API.

If that’s not enough for you, Ed then shows how you can analyze a plan:

In this post we will continue looking at the gRPC API and the AnalyzePlan method which takes a plan and analyzes it. To be honest I expected this to be longer but decided just to do the AnalyzePlan method.

This has been a really fun series so far from Ed, so check these out. The only downside is that the people demand more F#. And by “the people,” I mostly mean that I would love to see F# examples.


Using the Spark Connect GRPC API

Ed Elliott digs into API details:

In the first two posts, we looked at how to run some Spark code, firstly against a local Spark Connect server and then against a Databricks cluster. In this post, we will look more at the actual gRPC API itself, namely ExecutePlan, Config, and AddArtifacts/ArtifactsStatus.

Click through to see how it all works, with plenty of C# code to guide you along the way.


Running Spark Jobs on Databricks with Spark Connect and .NET

Ed Elliott runs a Databricks job:

This post aims to show how we can create a .NET application, deploy it to Databricks, and then run a Databricks job that calls our .NET code, which uses Spark Connect to run a Spark job on the Databricks job cluster to write some data out to Azure storage.

In the previous post, I showed how to use the Range command to create a Spark DataFrame and then save it locally as a parquet file. In this post, we will use the Sql command, which will return a DataFrame or, in our world, a Relation. We will then pass that relation to a WriteOperation command, which will write the results of the Sql out to Azure storage.

The code is available HERE

Read on for the description of how everything works.
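To get a feel for what that Sql-plus-WriteOperation round trip amounts to, here is roughly the same flow through the official PySpark Spark Connect client (my sketch; the endpoint address and storage path are placeholders, and it needs a recent PySpark with the connect extras installed). Ed's post builds the equivalent gRPC messages by hand from .NET:

```python
from pyspark.sql import SparkSession

# Connect to a Spark Connect endpoint (placeholder address; a Databricks
# connection string carries the workspace host, token, and cluster id instead).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The Sql command gives back a relation, which surfaces as a DataFrame here...
df = spark.sql("SELECT id, id * 2 AS doubled FROM range(10)")

# ...and the write is sent to the server as a WriteOperation. The abfss://
# container and path below are made up for illustration.
df.write.mode("overwrite").parquet(
    "abfss://data@examplestorage.dfs.core.windows.net/spark_connect_demo"
)
```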


Using Spark Connect from .NET

Ed Elliott keeps the hope alive:

Over the past couple of decades working in IT, I have found a particular interest in protocols. When I was learning how MSSQL worked, I spent a while figuring out how to read data from disk via backups rather than via the database server (MS Tape Format, if anyone cared). I spent more time than anyone should learning how to parse TDS (before the [MS-TDS] documentation was a thing)—having my head buried in a set of network traces and a pencil and pen has given me more pleasure than I can tell you.

This intersection of protocols and Spark piqued my interest in using Spark Connect to connect to Spark and run jobs from .NET rather than Python or Scala.

There's a whole lot more ceremony involved than with the Microsoft .NET for Apache Spark project, but read on to see how it all works. Also, I hereby officially chastise Ed for having examples in C# and VB.NET but not the greatest .NET language of them all: F#. Chastisement aside, I appreciate the work Ed put into this to bring Spark Connect to the .NET masses.
