Press "Enter" to skip to content

Category: Spark

Optimized Compaction in Microsoft Fabric Spark

Miles Cole crunches things down:

Compaction is one of the most necessary but also challenging aspects of managing a Lakehouse architecture. As with file systems and even relational databases, unless closely managed, data will become fragmented over time, which can lead to excessive compute costs. The OPTIMIZE command exists to solve this challenge: small files are grouped into bins targeting a specific ideal file size and then rewritten to blob storage. The result is the same data, but contained in fewer, larger files.

However, imagine this scenario: you have a nightly OPTIMIZE job which runs to keep your tables, all under 1GB, nicely compacted. Upon inspection of the Delta table transaction log, you find that most of your data is being rewritten after every ELT cycle, leading to expensive OPTIMIZE jobs, even though you are only changing a small portion of the overall data every night. Meanwhile, as business requirements lead to more frequent Delta table updates, in between ELT cycles, it appears that jobs get slower and slower until the next scheduled OPTIMIZE job is run. Sound familiar?

Read on to see what’s new and how you can enable it in your Fabric workspace.
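
As a refresher on the baseline behavior the excerpt describes, here is a minimal sketch of running compaction from PySpark; the lakehouse_demo.orders table and its order_date partition column are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SQL form: bin-pack small files into larger ones, optionally scoped to a
# subset of partitions (order_date is assumed to be a partition column).
spark.sql("OPTIMIZE lakehouse_demo.orders WHERE order_date >= '2025-01-01'")

# Equivalent Delta Lake Python API.
DeltaTable.forName(spark, "lakehouse_demo.orders").optimize().executeCompaction()
```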

Microsoft Fabric Spark Connector for SQL Databases

Arshad Ali makes an announcement:

Fabric Spark connector for SQL databases (Azure SQL databases, Azure SQL Managed Instances, Fabric SQL databases and SQL Server in Azure VM) in the Fabric Spark runtime is now available. This connector enables Spark developers and data scientists to access and work with data from SQL database engines using a simplified Spark API. The connector will be included as a default library within the Fabric Runtime, eliminating the need for separate installation.

This is a preview feature and works with Scala and Python code against SQL Server-ish databases in Azure (Azure SQL DB, Azure SQL Managed Instance, and virtual machines running SQL Server in Azure).
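
The new connector's own syntax is not reproduced here; for context, this is a sketch of the generic JDBC route Spark developers have typically used to read from Azure SQL, which the connector aims to simplify. The server, database, credentials, and table names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain JDBC read against an Azure SQL database; every option here is boilerplate
# that a first-class connector can hide behind a simpler API.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=salesdb")
    .option("dbtable", "dbo.Customers")
    .option("user", "spark_reader")
    .option("password", "<secret>")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .load()
)

df.show(5)
```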

Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with sprawling stack traces to dig through.

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel very difficult.

Read on for five patterns that can help with error handling in PySpark.
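
One widely used pattern in this space (not necessarily among the article's five) is to make the reader tolerant of bad records and then force an action early inside a try/except, so failures surface where you can handle them. A minimal sketch, with a hypothetical file path and schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("amount", IntegerType()),
    StructField("_corrupt_record", StringType()),  # malformed rows land here
])

df = (
    spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")  # keep going instead of failing on bad rows
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("/data/transactions.csv")
)
df.cache()  # needed before querying only the corrupt-record column

# Lazy evaluation means nothing has actually run yet; trigger an action here so
# problems show up early rather than deep inside a later job.
try:
    bad_rows = df.filter(df["_corrupt_record"].isNotNull()).count()
    print(f"Malformed rows quarantined: {bad_rows}")
except Exception as exc:
    print(f"Read failed: {exc}")
```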

Comparing Spark Application Performance in Microsoft Fabric

Jenny Jiang announces a new capability:

The Spark Applications Comparison feature is now in preview in Microsoft Fabric. This new capability empowers developers and data engineers to analyze, debug, and optimize Spark performance across multiple application runs—whether you’re tracking changes from code updates or data variations to improve performance.

The image in the blog post is pretty small and hard to read, but I do wonder if (or how well) it will capture cases where you’re twiddling your thumbs to get a machine so that you can execute your code. This seems to be a big problem sometimes.

Join Strategies in Apache Spark

Ram Ghadiyaram looks at three join strategies in Apache Spark:

In this article, we are going to discuss three essential join strategies in Apache Spark.

The data frame or table join operation is the one most commonly used for data transformations in Apache Spark. With Apache Spark, a developer can use joins to merge two or more data frames according to specific (sortable) keys. Writing a join operation has straightforward syntax, but the inner workings are sometimes obscured. Apache Spark’s internals offer several join algorithms and select one of them. A basic join operation could become costly if you do not know what these core algorithms are or which one Spark uses.

This is not a comprehensive list, but it does cover three of the more common strategies when dealing with larger datasets.
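
To see which strategy Spark picks, you can nudge it with a join hint and inspect the physical plan; a minimal PySpark sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
dims = spark.createDataFrame(
    [(i, f"segment_{i % 5}") for i in range(100)], ["customer_id", "segment"]
)

# Without a hint, Spark chooses among broadcast hash, shuffle hash, and
# sort-merge joins based on size estimates and whether the keys are sortable.
joined = facts.join(broadcast(dims), "customer_id")

joined.explain()  # look for BroadcastHashJoin in the physical plan
```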

Auto-Scale Billing for Spark in Microsoft Fabric now GA

Santhosh Kumar Ravindran announces a feature in general availability:

We’re thrilled to announce the general availability (GA) of Autoscale Billing for Apache Spark in Microsoft Fabric — a serverless billing model designed to offer greater flexibility, transparency, and cost efficiency for running Spark workloads at scale.

With this model now fully supported, Spark Jobs can run independently of your Fabric capacity and are billed on a pay-as-you-go basis — similar to how Spark works in Azure Synapse. This gives teams the freedom to scale compute as needed without impacting other workloads running on your shared Fabric capacity.

I’m of two minds here. On the one hand, there is value to having this as an option. On the other hand, one of the talking points for Microsoft Fabric is that you have one billing model. But because it’s an optional thing you can enable rather than something you must use, I’m fine with it.

Optimizing Multi-Notebook Jobs in Microsoft Fabric and AWS Glue

Daniel Janik flips a switch:

Are your Microsoft Fabric pipelines with multiple notebooks running slower than you’d like? Are you paying for more Spark compute time than you should be? The culprit might be a simple setting that’s easy to miss. In this blog post, we’ll dive into the “For pipeline running multiple notebooks” setting in Microsoft Fabric and explain why enabling it can significantly improve your pipeline’s performance and reduce your costs.

Click through for this, as well as a comparison with AWS Glue and ways to perform something similar there.

Accessing Delta Lake via Spark Connect

Ed Elliott has some code for us:

I have just finished an update for the spark connect dotnet lib that contains the DeltaTable implementation, so that we can now use .NET to maintain Delta tables over and above what we get out of the box by using DataFrame.Write.Format("delta"). This is an example of how to use the Delta API from .NET:

Click through for the example. You can learn more about Spark Connect on the Spark website, and Ed has a .NET client for the task.
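
Ed's example is written in .NET against his Spark Connect client; for orientation, the analogous flow from PySpark against a Spark Connect endpoint looks like the sketch below. The endpoint address and table path are hypothetical, and the server needs the Delta Lake packages available.

```python
from pyspark.sql import SparkSession

# Attach to a remote Spark Connect server instead of starting a local driver.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).withColumnRenamed("id", "value")
df.write.format("delta").mode("overwrite").save("/tmp/delta/demo")

spark.read.format("delta").load("/tmp/delta/demo").show()
```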

Spark Streaming plus Drools

Ram Ghadiyaram builds a tool:

Near real-time decision-making systems are critical for modern business applications. Integrating Apache Spark (Streaming) and Drools provides scalability and flexibility, enabling efficient handling of rule-based decision-making at scale. This article showcases their integration through a loan approval system, demonstrating its architecture, implementation, and advantages.  

Click through for a bit of sample code.
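
The article's rule evaluation runs through Drools on the JVM; a Python-side sketch can only stand in for that step with a placeholder function, but it shows the shape of the pipeline: stream applications in, evaluate rules per micro-batch, and write out decisions. The paths, schema, and rule logic below are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

applications = (
    spark.readStream.format("json")
    .schema("applicant_id STRING, credit_score INT, amount DOUBLE")
    .load("/data/loan_applications/")
)

def apply_rules(batch_df: DataFrame, batch_id: int) -> None:
    # Stand-in for invoking a Drools session on each record.
    decisions = batch_df.withColumn(
        "decision",
        when((col("credit_score") >= 650) & (col("amount") <= 50000), "APPROVED")
        .otherwise("REFERRED"),
    )
    decisions.write.mode("append").format("delta").save("/data/loan_decisions/")

query = (
    applications.writeStream
    .foreachBatch(apply_rules)
    .option("checkpointLocation", "/data/checkpoints/loan_rules")
    .start()
)
query.awaitTermination()
```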

Apache Spark 3.5 Support in Azure Synapse Analytics

Arshad Ali has an announcement:

You can now create Azure Synapse Runtime for Apache Spark 3.5. The essential changes include features which come from upgrading Apache Spark to version 3.5 and Delta Lake 3.2. Please review the official release notes for Apache Spark 3.5 to check the complete list of fixes and features. In addition, review the migration guidelines between Spark 3.4 and 3.5 to assess potential changes to your applications, jobs and notebooks. 

Credit where credit is due: I’ve made light of the utter lack of work on Azure Synapse Analytics since Microsoft Fabric’s release. But hey, they did a thing. Granted, the impetus behind this was to “prepare for migrating to Microsoft Fabric Spark.”
