Press "Enter" to skip to content

Category: Spark

Creating a Spark Job Definition

Miles Cole builds a job:

A Spark Job Definition is effectively a way to run a packaged Spark application, Fabric’s version of executing a spark-submit job. You define:

  • what code should run (the entry point),
  • which files or resources should be shipped with it,
  • and which command-line arguments should control its behavior.

Unlike a notebook, there is no interactive editor or cell output, but this is arguably not a missing feature; it’s the whole point… an SJD is not meant for exploration; it is meant to deterministically run a Spark application.
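As a rough sketch of the shape such a packaged application takes (the file name, arguments, and paths here are hypothetical, not from the post), something like this, assuming PySpark:

```python
# entry_point.py -- a minimal, hypothetical Spark application suitable for
# running as a Spark Job Definition (or via spark-submit).
import argparse

from pyspark.sql import SparkSession


def main() -> None:
    # Command-line arguments control behavior; nothing is interactive.
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True, help="Input table or path")
    parser.add_argument("--target", required=True, help="Output table or path")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("example-sjd").getOrCreate()

    # Deterministic transform: read, reshape, write. No cells, no output pane.
    df = spark.read.format("delta").load(args.source)
    df.dropDuplicates().write.format("delta").mode("overwrite").save(args.target)

    spark.stop()


if __name__ == "__main__":
    main()
```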

With that concept in mind, click through for the process.


Combining Fabric Real-Time Intelligence, Notebooks, and Spark Structured Streaming

Arindam Chatterjee and QiXiao Wang show off some preview functionality:

Building event-driven, real-time applications using Fabric Eventstreams and Spark Notebooks just got a whole lot easier. With the Preview of Spark Notebooks and Real-Time Intelligence integration — a new capability that brings together the open-source community supported richness of Spark Structured Streaming with the real-time stream processing power of Fabric Eventstreams — developers can now build low-latency, end-to-end real-time analytics and AI pipelines all within Microsoft Fabric.

You can now seamlessly access streaming data from Eventstreams directly inside Spark notebooks, enabling real-time insights and decision-making without the complexity and tedium of manual coding and configuration.
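The Eventstream connector specifics are in the linked post; for orientation, this is the general shape of a Spark Structured Streaming pipeline, using the built-in rate source purely as a stand-in for the Eventstream input:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source stands in for the Eventstream input here;
# the actual connector and its options are covered in the linked post.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A simple windowed aggregation over the stream.
counts = (events
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .count())

query = (counts.writeStream
    .outputMode("update")
    .format("console")
    .start())

query.awaitTermination()
```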

Click through to learn more.


More Spark Jobs, Fewer Notebooks

Miles Cole lays out an argument:

I’m guilty. I’ve peddled the #NotebookEverything tagline more than a few times.

To be fair, notebooks are an amazing entry point to coding, documentation, and exploration. But this post is dedicated to convincing you that notebooks are not, in fact, everything, and that many production Spark workloads would be better executed as a non-interactive Spark Job.

Miles has a “controversial claim” at the end that I don’t think is particularly controversial at all. I agree with pretty much the entire article, especially around the difficulties of testing notebooks properly.


Efficient Sampling of Spark Datasets

Rajesh Vakkalagadda needs a sample:

Sampling is a fundamental process in machine learning that involves selecting a subset of data from a larger dataset. This technique is used to make training and evaluation more efficient, especially when working with massive datasets where processing every data point is impractical.

However, sampling comes with its own challenges. Ensuring that samples are representative is crucial to prevent biases that could lead to poor model generalization and inaccurate evaluation results. The sample size must strike a balance between performance and resource constraints. Additionally, sampling strategies need to account for factors such as class imbalance, temporal dependencies, and other dataset-specific characteristics to maintain data integrity.

Click through for an answer in Scala. The Python implementation would be very similar.
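For instance, a minimal PySpark illustration of the two built-in sampling methods, simple and stratified (the dataset path and the "label" column are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling-sketch").getOrCreate()
df = spark.read.format("delta").load("/path/to/dataset")  # hypothetical path

# Simple random sample: roughly 10% of rows, seeded for reproducibility.
simple = df.sample(withReplacement=False, fraction=0.10, seed=42)

# Stratified sample via sampleBy: a different fraction per class value,
# which helps compensate for class imbalance.
stratified = df.stat.sampleBy("label", fractions={0: 0.05, 1: 0.50}, seed=42)
```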


Writing Sparse Pandas DataFrames to S3

Pooja Chhabra tries a few things:

If you’ve worked with large-scale machine learning pipelines, you likely know that one of the most frustrating bottlenecks isn’t always found in the complexity of the model or the elegance of the architecture: it’s writing the output efficiently.

Recently, I found myself navigating a complex data engineering hurdle where I needed to write a massive Pandas sparse DataFrame — the high-dimensional output of a CountVectorizer — directly to Amazon S3. By massive, I mean tens of gigabytes of feature data stored in a memory-efficient sparse format that needed to be materialized as a raw CSV file. This legacy requirement existed because our downstream machine learning model was specifically built to ingest only that format, leaving us with a significant I/O challenge that threatened to derail our entire processing timeline.

Read on for two major constraints, a variety of false starts, and what eventually worked.
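As a rough sketch of one chunked approach you might start from (not necessarily the author’s eventual solution; this assumes s3fs is installed so fsspec can stream to S3, and the matrix shape and bucket are hypothetical):

```python
import fsspec
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output: a sparse matrix
# wrapped in a pandas DataFrame with sparse dtypes.
matrix = sparse.random(50_000, 2_000, density=0.01, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(matrix)

# Densify one slice at a time so peak memory stays bounded, streaming all
# chunks into a single S3 object (s3fs handles the multipart upload).
chunk_size = 5_000
with fsspec.open("s3://my-bucket/features.csv", "w") as f:  # hypothetical bucket
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].sparse.to_dense()
        chunk.to_csv(f, header=(start == 0), index=False)
```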


When repartition() Beats coalesce() in Spark

Janani Annur Thiruvengadam stands some common advice on its head:

If you’ve worked with Apache Spark, you’ve probably heard the conventional wisdom: “Use coalesce() instead of repartition() when reducing partitions — it’s faster because it avoids a shuffle.” This advice appears in documentation, blog posts, and is repeated across Stack Overflow threads. But what if I told you this isn’t always true?

In a recent production workload, I discovered that using repartition() instead of coalesce() resulted in a 33% performance improvement (16 minutes vs. 23 minutes) when writing data to fewer partitions. This counterintuitive result reveals an important lesson about Spark’s Catalyst optimizer that every Spark developer should understand.
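The usual mechanical explanation, in a sketch (table and path names hypothetical): coalesce(n) merges partitions without a shuffle, which also caps the parallelism of everything upstream of it in the same stage, while repartition(n) inserts a shuffle, letting the expensive upstream work run at full parallelism before the data is collapsed to n output files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()
df = spark.read.format("delta").load("/data/events")  # hypothetical input

transformed = df.filter("status = 'active'").withColumnRenamed("ts", "event_ts")

# coalesce: no shuffle, but the filter above now also runs on only 16 tasks.
transformed.coalesce(16).write.mode("overwrite").parquet("/out/coalesced")

# repartition: adds a shuffle, but upstream work keeps its full parallelism
# and only the final write happens on 16 partitions.
transformed.repartition(16).write.mode("overwrite").parquet("/out/repartitioned")
```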

Read on for the details on that scenario.


Job-Level Bursting in Microsoft Fabric Spark Jobs

Santhosh Kumar Ravindran announces a new feature:

  • Enabled (Default): When enabled, a single Spark job can leverage the full burst limit, consuming up to 3× CUs. This is ideal for demanding ETL processes or large analytical tasks that benefit from maximum immediate compute power.
  • Disabled: If you disable this switch, individual Spark jobs will be capped at the base capacity allocation. This prevents a single job from monopolizing the burst capacity, thereby preserving concurrency and improving the experience for multi-user, interactive scenarios.

Read on for the list of caveats and the note that it will cost extra money to flip that switch.


Optimized Compaction in Microsoft Fabric Spark

Miles Cole crunches things down:

Compaction is one of the most necessary but also challenging aspects of managing a Lakehouse architecture. Similar to file systems and even relational databases, unless closely managed, data will get fragmented over time, and can lead to excessive compute costs. The OPTIMIZE command exists to solve this challenge: small files are grouped into bins targeting a specific ideal file size and then rewritten to blob storage. The result is the same data, but contained in fewer, larger files.

However, imagine this scenario: you have a nightly OPTIMIZE job which runs to keep your tables, all under 1GB, nicely compacted. Upon inspection of the Delta table transaction log, you find that most of your data is being rewritten after every ELT cycle, leading to expensive OPTIMIZE jobs, even though you are only changing a small portion of the overall data every night. Meanwhile, as business requirements lead to more frequent Delta table updates, in between ELT cycles, it appears that jobs get slower and slower until the next scheduled OPTIMIZE job is run. Sound familiar?
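For reference, compaction on a Delta table is triggered with the standard OPTIMIZE command (the table name below is hypothetical); what the post covers is how much of the table gets rewritten on each run:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

# SQL form: bin-pack small files into larger ones.
spark.sql("OPTIMIZE my_lakehouse.sales")  # hypothetical table

# Equivalent Python API in Delta Lake 2.0+.
DeltaTable.forName(spark, "my_lakehouse.sales").optimize().executeCompaction()
```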

Read on to see what’s new and how you can enable it in your Fabric workspace.


Microsoft Fabric Spark Connector for SQL Databases

Arshad Ali makes an announcement:

Fabric Spark connector for SQL databases (Azure SQL databases, Azure SQL Managed Instances, Fabric SQL databases and SQL Server in Azure VM) in the Fabric Spark runtime is now available. This connector enables Spark developers and data scientists to access and work with data from SQL database engines using a simplified Spark API. The connector will be included as a default library within the Fabric Runtime, eliminating the need for separate installation.

This is a preview feature and works with Scala and Python code against SQL Server-ish databases in Azure (Azure SQL DB, Azure SQL Managed Instance, and virtual machines running SQL Server in Azure).
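The simplified connector API itself is in the announcement; for comparison, here is the generic Spark JDBC plumbing that a connector like this wraps (server, database, table, and credentials all hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-sketch").getOrCreate()

# Generic Spark JDBC read against Azure SQL; the new Fabric connector hides
# this kind of driver, auth, and option wiring behind a simpler API.
df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.orders")
    .option("user", "reader")
    .option("password", "<secret>")
    .load())
df.show()
```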


Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with sprawling stack traces.

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel very difficult.

Read on for five patterns that can help with error handling in PySpark.
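As a taste of one such pattern (not necessarily one of the author’s five): Spark’s permissive read mode quarantines malformed records into a corrupt-record column instead of failing the job. The file path below is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("error-handling-sketch").getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("_corrupt_record", StringType()),  # raw text lands here on parse failure
])

# PERMISSIVE mode keeps the job alive: bad rows get nulls plus the raw record
# in _corrupt_record. Caching is required because Spark disallows queries that
# reference only the corrupt-record column on an uncached DataFrame.
df = (spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/data/events.json")  # hypothetical path
    .cache())

bad = df.filter("_corrupt_record IS NOT NULL")   # quarantine for inspection
good = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
```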
