Category: Spark

Parameterizing Databricks Notebooks with Widgets

Meagan Longoria adds some widgets:

Widgets provide a way to parameterize notebooks in Databricks. If you need to call the same process for different values, you can create widgets to allow you to pass the variable values into the notebook, making your notebook code more reusable. You can then refer to those values throughout the notebook.

Click through to learn more about the four types of widgets and how they work.
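
For a quick mental model, a minimal sketch covering all four widget types might look like this (the widget names, defaults, and choices are mine for illustration, not from Meagan's post):

```python
# dbutils is the utility object Databricks provides in every notebook
# Create one widget of each of the four types
dbutils.widgets.text("run_date", "2024-01-01", "Run date")
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"], "Environment")
dbutils.widgets.combobox("region", "us-east", ["us-east", "us-west"], "Region")
dbutils.widgets.multiselect("departments", "sales", ["sales", "finance", "hr"], "Departments")

# Read a widget's value anywhere in the notebook
run_date = dbutils.widgets.get("run_date")
print(f"Processing data for {run_date}")
```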

SparkSQL CONCAT vs T-SQL CONCAT

Bill Fellows has a public service announcement:

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to downcast to an n/var/char type as well as check for NULL before appending strings via the plus sign.

The point of difference is so important that Bill busted out the marquee HTML tag. Which now leads me to wonder, was marquee or blink the bigger evil in the mid-to-late ’90s web?
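
If you want to see the behavioral difference for yourself, here is a minimal PySpark sketch (the data is made up): Spark's concat returns NULL as soon as any argument is NULL, while concat_ws skips NULL arguments and comes closer to T-SQL's CONCAT.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Bill", "S", "Fellows"), ("Ada", None, "Lovelace")],
    ["first_name", "middle_name", "last_name"])

# concat returns NULL for the second row because middle_name is NULL
df.select(F.concat("first_name", "middle_name", "last_name").alias("c")).show()

# concat_ws skips NULL arguments, which is closer to SQL Server's CONCAT
df.select(F.concat_ws(" ", "first_name", "middle_name", "last_name").alias("c")).show()
```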

Setting a Spark Compute Pool Size in Microsoft Fabric

Reitse Eskens manages compute pools:

This next blog won't be a long one and will probably serve mostly as a reminder for myself of where to find the settings for the Spark compute pool.

When you create a workspace, you get the default starter pool, and it has taken me way longer than I care to admit to figure out where to find the setting and, more importantly, how to change it.

Read on to learn more about how to create a Spark pool of the size you desire. The sizing method is essentially the same as with Azure Synapse Analytics.

Generating Tables from Files in Microsoft Fabric via Notebook

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained in the article about the lakehouse and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

In this blog post, let's walk step by step through how to use notebooks and table partitioning.

Click through to see how it all works.
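
As a rough sketch of the notebook approach (the file path, partition column, and table name below are made up for illustration; Dennes's post has the real walkthrough), loading a file and writing it as a partitioned Delta table looks something like this:

```python
# `spark` is the SparkSession pre-created in a Fabric notebook

# Read a file from the lakehouse Files area
df = spark.read.option("header", "true").csv("Files/sales/sales.csv")

# Write it to the Tables area as a Delta table, partitioned by a column
# we assume exists in the source data
(df.write
   .mode("overwrite")
   .format("delta")
   .partitionBy("SaleYear")
   .saveAsTable("sales_partitioned"))
```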

TINYINT Casts in Spark SQL vs T-SQL

Bill Fellows runs into an interesting oddity:

Yet another thing that has bitten me working in SparkSQL in Databricks—this time it’s data types.

In SQL Server, a tinyint ranges from 0 to 255; in Spark SQL, a tinyint ranges from -128 to 127, but both allow for 256 total values. In SQL Server, if you attempt to cast a value that doesn't fit in that range, you're going to raise an error.

SQL Server’s TINYINT data type is an unsigned one-byte number, whereas TINYINT in Spark SQL is a signed one-byte number. But that’s not the biggest difference Bill finds, so check out the post to learn more.
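
As a quick illustration of the signed range (treat this as a sketch: the overflow behavior depends on whether ANSI mode is enabled, and the exact values here are my example, not Bill's):

```python
# TINYINT in Spark SQL is a signed byte: -128 to 127
spark.sql("SELECT CAST(-128 AS TINYINT) AS low, CAST(127 AS TINYINT) AS high").show()

# With ANSI mode off, a value outside the range silently wraps around
# (200 becomes -56) instead of raising an error; with ANSI mode on,
# the cast fails, which is closer to SQL Server's behavior
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(200 AS TINYINT) AS wrapped").show()
```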

Querying the Power BI REST API from Fabric Spark

Gerhard Brueckl makes the call:

Microsoft Fabric has a lot of different components which usually work very well together. However, even though Power BI is a fundamental part of Fabric, there is not really a tight integration between the Data Engineering components and Power BI. In this blog post, I will show you an easy and reusable way to query the Power BI REST API from Fabric Spark in a very straightforward way. The extracted data can then be stored in the data lake, e.g., to create a history of your dataset refreshes, the state of your workspaces, or any other information the REST API provides.

Click through for a list of operations, followed by the code you’ll need to pull this off.
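
Stripped down to the essentials, the general shape of such a call looks roughly like this (the token acquisition is a placeholder; Gerhard's post shows the full, reusable approach):

```python
import requests

# Placeholder: in a Fabric notebook you would acquire a Power BI-scoped
# Azure AD token (e.g. via the notebook utilities or MSAL); hard-coding
# is only for illustration
token = "<access-token>"

# List workspaces ("groups") via the Power BI REST API
resp = requests.get(
    "https://api.powerbi.com/v1.0/myorg/groups",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Land the result in a Spark DataFrame so it can be written to the lakehouse
# (`spark` is the SparkSession pre-created in a Fabric notebook)
workspaces = spark.createDataFrame(resp.json()["value"])
workspaces.show()
```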

Adaptive Query Execution in Spark 3.0

The Big Data in Real World team talks on-the-fly execution plan changes:

Apache Spark, the popular distributed computing framework, has been widely adopted for processing large-scale data. With the release of Apache Spark 3.0, a groundbreaking feature called Adaptive Query Execution (AQE) was introduced. AQE addresses the limitations of traditional static execution plans by dynamically optimizing query execution based on runtime statistics. In this blog post, we will explore how AQE works and how it significantly improves the performance of Spark applications.

This sounds pretty similar to adaptive query processing in SQL Server, though a look at the Spark documentation shows that there are some practical differences in implementation versus what SQL Server does.
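
AQE became enabled by default in Spark 3.2; on Spark 3.0 and 3.1 you turn it (and its sub-features) on via configuration, roughly like this:

```python
# Enable Adaptive Query Execution
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Coalesce small shuffle partitions after a shuffle, based on runtime statistics
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Split skewed partitions in sort-merge joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Use local shuffle readers when joins are converted to broadcast joins at runtime
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
```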

Microsoft Fabric Notebooks and Compute Limits

Reitse Eskens hits a wall:

In this case, my notebook threw an error at me but the command seemed to finish without any issue. Sounds vague? It did to me. The notebook cell I tried to run had a lot of stuff happening at the same time.

As you can see in the above screenshot, the status shows green checkmarks but there's an error as well. The error message was not really clear to me, but that can really be my lack of deep-level experience. So, I logged a call with Microsoft Support to see what they could come up with.

I've had enough experience with Spark to see the issue and figure out the response, but click through for the screenshot and what Reitse did to resolve the issue.

Creating a Simple Date Dimension in Databricks

Chen Hirsh builds a table:

A date dimension is extremely useful and is required by most BI applications. This kind of dimension has a key at some time grain (day, month, etc.) and attributes that describe it, such as year, month, etc. In your BI model, you join this dimension to facts on their date fields to aggregate from the day level up to week, month, and year.

In this post, I will demonstrate how to create a date dimension on Azure Databricks using Python. A link to the complete Databricks notebook is at the end of the post.

Check out the code, as well as explanation, in that post.
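
For a flavor of what that looks like (the date range, column names, and table name here are mine for illustration, not from Chen's notebook), Spark's sequence function generates one row per day and everything else is derived columns:

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession pre-created in a Databricks notebook
# Generate one row per day over an arbitrary range
dates = spark.sql("""
    SELECT explode(sequence(to_date('2020-01-01'),
                            to_date('2030-12-31'),
                            interval 1 day)) AS calendar_date
""")

# Derive the usual attributes and write the dimension as a table
dim_date = (dates
    .withColumn("date_key", F.date_format("calendar_date", "yyyyMMdd").cast("int"))
    .withColumn("year", F.year("calendar_date"))
    .withColumn("month", F.month("calendar_date"))
    .withColumn("month_name", F.date_format("calendar_date", "MMMM"))
    .withColumn("day", F.dayofmonth("calendar_date"))
    .withColumn("day_of_week", F.dayofweek("calendar_date")))

dim_date.write.mode("overwrite").saveAsTable("dim_date")
```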
