Press "Enter" to skip to content

Category: Spark

Killing a Running Apache Spark Application

The Big Data in Real World team pulls the plug on an application:

Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for various reasons, such as if the application is stuck, consuming too many resources, or taking too long to complete. In this post, we will discuss how to kill a running Spark application.

Click through to see how you can do this.
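
As a rough sketch of the idea (not necessarily the commands from the linked post), here is how you might kill an application running under YARN, assuming you already have the application ID from `yarn application -list` or the Spark UI:

```python
# A minimal sketch, assuming the Spark application runs on YARN.
# The application ID below is a hypothetical placeholder.
import subprocess

def kill_spark_app(app_id: str) -> None:
    """Ask the YARN resource manager to kill the running application."""
    subprocess.run(["yarn", "application", "-kill", app_id], check=True)

kill_spark_app("application_1700000000000_0042")
```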

Apache Spark Execution Plan Analysis

Karthik Penikalapati digs into Spark SQL explain plans:

In this blog post, we will explore how the Explain Plan can be your secret weapon for debugging and optimizing Spark applications. We’ll dive into the basics and provide clear examples in Spark Scala to help you understand how to leverage this valuable tool.

All I’m saying is, if some company wants to create a SQL Sentry Plan Explorer for Apache Spark, I’d be down with it. The lack of an intuitive and powerful graphical interface for execution plans is definitely a point of friction when working with Apache Spark and Spark SQL.
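
If you want to poke at this yourself, here is a quick sketch in PySpark (the linked post uses Scala, but the explain modes behave the same) showing the different plan outputs you can request; the DataFrame itself is made up:

```python
# A small, hypothetical example of the Spark 3.x explain() modes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

agg.explain()                  # physical plan only
agg.explain(mode="extended")   # parsed, analyzed, optimized, and physical plans
agg.explain(mode="formatted")  # physical plan plus per-operator details
```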

An Intro to Databricks Asset Bundles

Dustin Vannoy covers one technique for CI/CD in Databricks:

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams set up CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, REST API, Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.

Click through for a video and some sample scripts.

Parameterizing Databricks Notebooks with Widgets

Meagan Longoria adds some widgets:

Widgets provide a way to parameterize notebooks in Databricks. If you need to call the same process for different values, you can create widgets to allow you to pass the variable values into the notebook, making your notebook code more reusable. You can then refer to those values throughout the notebook.

Click through to learn more about the four types of widgets and how they work.
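
As a quick sketch of what that looks like in a Databricks notebook (the widget names, defaults, and choices here are made up), one widget of each of the four types:

```python
# A minimal sketch, assuming a Databricks notebook where dbutils is available.
dbutils.widgets.text("source_path", "/mnt/raw/sales", "Source path")
dbutils.widgets.dropdown("environment", "dev", ["dev", "test", "prod"], "Environment")
dbutils.widgets.combobox("country", "US", ["US", "CA", "MX"], "Country")
dbutils.widgets.multiselect("regions", "East", ["East", "West", "North", "South"], "Regions")

# Read the values back anywhere in the notebook.
source_path = dbutils.widgets.get("source_path")
environment = dbutils.widgets.get("environment")
```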

SparkSQL CONCAT vs T-SQL CONCAT

Bill Fellows has a public service announcement:

The concat function is super handy in the database world but be aware that the SQL Server one is way better because it solves two problems. It combines everything into a string and it does not require NULL checking. In the before times, one had to down cast to a n/var/char type as well as check for NULL before appending strings via the plus sign.

The point of difference is so important that Bill busted out the marquee HTML tag. Which now leads me to wonder, was marquee or blink the bigger evil in the mid-to-late ’90s web?
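
To see the NULL-handling difference Bill calls out, here is a quick sketch you can run in any Spark session:

```python
# A minimal sketch: Spark SQL's concat returns NULL if any argument is NULL,
# while T-SQL's CONCAT treats NULL arguments as empty strings.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT concat('abc', NULL, 'def') AS result").show()
# Spark returns NULL here; in SQL Server, SELECT CONCAT('abc', NULL, 'def')
# returns 'abcdef'.
```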

Setting a Spark Compute Pool Size in Microsoft Fabric

Reitse Eskens manages compute pools:

This next blog won’t be a long one and will probably serve mostly as a reminder to myself of where to find the settings for the Spark compute pool.

When you create a workspace, you get the default starter pool, and it has taken me way longer than I care to admit to find where that setting lives and, more importantly, how to change it.

Read on to learn more about how to create a Spark pool of the size you desire. The sizing method is essentially the same as with Azure Synapse Analytics.

Generating Tables from Files in Microsoft Fabric via Notebook

Dennes Torres performs a bit of ELT:

When Microsoft Fabric was born, the only method to convert files to tables was using notebooks. Nowadays we have an easy-to-use UI feature for the conversion.

As I explained in the article about lakehouses and ETL, there are some scenarios where we still need to use notebooks for the conversion. One of these scenarios is when we need table partitioning.

In this blog post, let’s walk step by step through how to use notebooks and table partitioning.

Click through to see how it all works.
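
As a rough sketch of the notebook approach (not necessarily Dennes’s exact code; the path, table name, and partition column are hypothetical):

```python
# A minimal sketch, assuming a Fabric notebook attached to a lakehouse, where
# the spark session is already available.
df = spark.read.option("header", "true").csv("Files/sales/*.csv")

(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("sale_year")           # hypothetical partition column
   .saveAsTable("sales_partitioned"))  # lands as a partitioned lakehouse table
```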

TINYINT Casts in Spark SQL vs T-SQL

Bill Fellows runs into an interesting oddity:

Yet another thing that has bitten me working in SparkSQL in Databricks—this time it’s data types.

In SQL Server, a tinyint ranges from 0 to 255, while Spark SQL’s tinyint ranges from -128 to 127, but both of them allow for 256 total values. If you attempt to cast a value that doesn’t fit in that range, you’re going to raise an error.

SQL Server’s TINYINT data type is an unsigned one-byte number, whereas TINYINT in Spark SQL is a signed one-byte number. But that’s not the biggest difference Bill finds, so check out the post to learn more.
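
A quick sketch of the signed-versus-unsigned part, assuming a notebook where the spark session is already defined (and noting that your runtime’s default ANSI setting may differ):

```python
# A minimal sketch: 200 fits in SQL Server's unsigned TINYINT (0 to 255) but
# not in Spark SQL's signed TINYINT (-128 to 127).
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(200 AS TINYINT) AS v").show()  # wraps around to -56, no error

# With ANSI mode enabled, the same cast raises an overflow error instead.
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT CAST(200 AS TINYINT) AS v").show()  # would fail here
```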

Querying the Power BI REST API from Fabric Spark

Gerhard Brueckl makes the call:

Microsoft Fabric has a lot of different components which usually work very well together. However, even though Power BI is a fundamental part of Fabric, there is not really a tight integration between the Data Engineering components and Power BI. In this blog post I will show you an easy and reusable way to query the Power BI REST API via Fabric Spark SQL in a very straightforward way. The extracted data can then be stored in the data lake, e.g. to create a history of your dataset refreshes, the state of your workspaces, or any other information that is provided by the REST API.

Click through for a list of operations, followed by the code you’ll need to pull this off.
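
Gerhard has his own reusable approach, so do read the post; purely as a generic sketch of the idea (the token audience, workspace ID, and endpoint here are my assumptions, not his code):

```python
# A minimal sketch, assuming a Fabric notebook where mssparkutils and spark are
# available and the executing identity has Power BI API permissions.
import requests

token = mssparkutils.credentials.getToken("pbi")   # assumed audience name
headers = {"Authorization": f"Bearer {token}"}

# Hypothetical endpoint: list the datasets in one workspace.
url = "https://api.powerbi.com/v1.0/myorg/groups/<workspace-id>/datasets"
resp = requests.get(url, headers=headers)
resp.raise_for_status()

# Land the results in a Spark DataFrame so they can be written to the lakehouse.
datasets = spark.createDataFrame(resp.json()["value"])
display(datasets)
```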
