Day: January 30, 2024

Exploring the gRPC API in Spark Connect with .NET

Ed Elliott continues a series on Spark Connect. First, Ed builds out something DataFrame API-ish:

So there are two goals of this post, the first is to take a look at Apache Arrow and how we can do things like show the output from DataFrame.Show, the second is to start to create objects that look more familiar to us, i.e. the DataFrame API.

If that’s not enough for you, Ed then shows how you can analyze a plan:

In this post we will continue looking at the gRPC API and the AnalyzePlan method which takes a plan and analyzes it. To be honest I expected this to be longer but decided just to do the AnalyzePlan method.

This has been a really fun series so far from Ed, so check these out. The only downside is that the people demand more F#. And by “the people,” I mostly mean that I would love to see F# examples.

Parallelizing Notebook Runs in Microsoft Fabric via Python

Sandeep Pawar kicks off multiple notebooks at once:

The notebook class in mssparkutils has two methods to run notebooks – run and runMultiple. run allows you to trigger a notebook run for one single notebook. Mim wrote a nice blog to show how to use it and its usefulness.

runMultiple, on the other hand, allows you to create a Directed Acyclic Graph (DAG) of notebooks to execute notebooks in parallel and in specified order, similar to a pipeline run except in a notebook.

Read on to learn more about the advantages of this latter approach as well as how you can do it.
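
To make the DAG idea concrete, here is a minimal sketch in Python. The notebook names are hypothetical, and the dictionary layout (an "activities" list whose entries carry "name", "path", and "dependencies") is my reading of the Fabric documentation rather than something taken from Sandeep's post, so check it against the current docs. The helper uses the standard library to show which notebooks could run side by side; the actual runMultiple call appears only as a comment because it works only inside a Fabric notebook session.

```python
from graphlib import TopologicalSorter

# Hypothetical notebook names; swap in notebooks from your own workspace.
# Assumed layout: each activity lists the activities it must wait for.
dag = {
    "activities": [
        {"name": "load_customers", "path": "load_customers"},
        {"name": "load_orders", "path": "load_orders"},
        {
            "name": "build_report",
            "path": "build_report",
            "dependencies": ["load_customers", "load_orders"],
        },
    ]
}


def execution_waves(dag):
    """Group activities into waves whose members have no unmet dependencies."""
    sorter = TopologicalSorter(
        {a["name"]: set(a.get("dependencies", [])) for a in dag["activities"]}
    )
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dag)
for number, names in enumerate(waves, start=1):
    print(f"wave {number}: {names}")

# Inside a Fabric notebook, the same dictionary would be handed to:
# mssparkutils.notebook.runMultiple(dag)
```

The two load notebooks land in the first wave and can run in parallel, while build_report waits for both, which is the pipeline-like ordering the excerpt describes.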

Replacing DISTINCT with EXISTS

Andy Brownsword makes a switch:

The DISTINCT clause in a query can help us quickly remove duplicates from our results. Sometimes it can be beneficial to stop and ask why. Why do we need to use the clause, why are we receiving duplicates from our data?

I see this typically due to a JOIN being used where we don’t really want all of those results. This could be a ‘does something exist’ check such as if a customer has ever ordered before. The issue comes when there are multiple rows returned like a frequent customer in this example.

As an alternative to this, Andy shows how you can use the EXISTS clause to find records matching some criterion.
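
A small sketch of the pattern, using Python's built-in sqlite3 module as a stand-in for SQL Server. The customers and orders tables and their contents are made up for illustration and are not taken from Andy's post. The join repeats a frequent customer once per order, DISTINCT collapses those repeats after the fact, and EXISTS asks the yes-or-no question directly so the duplicates never appear.

```python
import sqlite3

# Illustrative schema and data (assumptions, not from the original post).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Frequent'), (2, 'Once'), (3, 'Never');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1), (13, 2);
    """
)

# A plain join returns one row per matching order, so 'Frequent' repeats.
join_rows = conn.execute(
    """
    SELECT c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.name
    """
).fetchall()

# DISTINCT removes the duplicates the join created.
distinct_rows = conn.execute(
    """
    SELECT DISTINCT c.name
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.name
    """
).fetchall()

# EXISTS checks whether any order exists, yielding one row per customer.
exists_rows = conn.execute(
    """
    SELECT c.name
    FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.name
    """
).fetchall()

print(len(join_rows), distinct_rows, exists_rows)
conn.close()
```

Both rewritten queries return the same two customers here; how much the EXISTS form helps performance depends on the engine and the data, which is the part Andy's post covers.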
