Category: Spark

Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with sprawling stack traces.

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early and debugging them all the more difficult.

Read on for five patterns that can help with error handling in PySpark.
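
To give a flavor of the kind of pattern involved, here is a minimal sketch of my own (not taken from the article) that reads a CSV file permissively and quarantines malformed rows instead of letting them kill the job; the file paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("permissive-read").getOrCreate()

# PERMISSIVE mode keeps malformed rows instead of failing the job and
# stashes the raw text of each bad row in a designated column.
df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema("id INT, amount DOUBLE, _corrupt_record STRING")
    .csv("/data/transactions.csv")  # hypothetical input path
    .cache()  # Spark restricts queries touching only the corrupt-record column; caching sidesteps that
)

# Quarantine the bad rows for later inspection and keep processing the good ones.
bad = df.filter(df["_corrupt_record"].isNotNull())
good = df.filter(df["_corrupt_record"].isNull()).drop("_corrupt_record")

bad.write.mode("append").json("/data/quarantine/")  # hypothetical output path
```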

Comparing Spark Application Performance in Microsoft Fabric

Jenny Jiang announces a new capability:

The Spark Applications Comparison feature is now in preview in Microsoft Fabric. This new capability empowers developers and data engineers to analyze, debug, and optimize Spark performance across multiple application runs—whether you’re tracking changes from code updates or data variations to improve performance.

The image in the blog post is pretty small and hard to read, but I do wonder if (or how well) it will capture cases where you’re twiddling your thumbs waiting for a machine on which to execute your code. That seems to be a big problem at times.

Join Strategies in Apache Spark

Ram Ghadiyaram looks at three join strategies in Apache Spark:

In this article, we are going to discuss three essential join strategies in Apache Spark.

The data frame or table join operation is one of the most common data transformations in Apache Spark. With joins, a developer can merge two or more data frames according to specific (sortable) keys. The syntax of a join operation is straightforward, but the inner workings are sometimes obscured: Spark’s internals implement several join algorithms and select one for you. A basic join operation can become costly if you do not know what these core algorithms are or which one Spark chooses.

This is not a comprehensive list, but it does cover three of the more common strategies when dealing with larger datasets.
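
For a sense of how much control you have over the choice, here is a small sketch of my own (not from the article) that forces a broadcast hash join with a hint and confirms the strategy in the physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table.
orders = spark.range(10_000_000).withColumnRenamed("id", "customer_id")
customers = spark.range(1_000).withColumnRenamed("id", "customer_id")

# The broadcast() hint ships the small side to every executor, avoiding
# a shuffle of the large side (a broadcast hash join).
joined = orders.join(broadcast(customers), "customer_id")

# The physical plan should show BroadcastHashJoin rather than SortMergeJoin.
joined.explain()
```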

Auto-Scale Billing for Spark in Microsoft Fabric now GA

Santhosh Kumar Ravindran announces a feature in general availability:

We’re thrilled to announce the general availability (GA) of Autoscale Billing for Apache Spark in Microsoft Fabric — a serverless billing model designed to offer greater flexibility, transparency, and cost efficiency for running Spark workloads at scale.

With this model now fully supported, Spark Jobs can run independently of your Fabric capacity and are billed on a pay-as-you-go basis — similar to how Spark works in Azure Synapse. This gives teams the freedom to scale compute as needed without impacting other workloads running on your shared Fabric capacity.

I’m of two minds here. On the one hand, there is value to having this as an option. On the other hand, one of the talking points for Microsoft Fabric is that you have one billing model. But because it’s an optional thing you can enable rather than something you must use, I’m fine with it.

Optimizing Multi-Notebook Jobs in Microsoft Fabric and AWS Glue

Daniel Janik flips a switch:

Are your Microsoft Fabric pipelines with multiple notebooks running slower than you’d like? Are you paying for more Spark compute time than you should be? The culprit might be a simple setting that’s easy to miss. In this blog post, we’ll dive into the “For pipeline running multiple notebooks” setting in Microsoft Fabric and explain why enabling it can significantly improve your pipeline’s performance and reduce your costs.

Click through for this, as well as a comparison with AWS Glue and ways to perform something similar there.
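
The setting itself is a pipeline toggle rather than code, but if you orchestrate from a notebook instead, you can get a similar session-sharing effect yourself. A sketch, assuming Fabric’s preloaded mssparkutils and hypothetical notebook names:

```python
# Inside a Microsoft Fabric notebook, where mssparkutils comes preloaded.
# runMultiple executes several notebooks within the current Spark session,
# avoiding the cost of spinning up a separate session for each one.
results = mssparkutils.notebook.runMultiple(
    ["IngestOrders", "TransformOrders", "PublishOrders"]  # hypothetical notebooks
)
print(results)
```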

Accessing Delta Lake via Spark Connect

Ed Elliott has some code for us:

I have just finished an update for the spark connect dotnet lib that contains the DeltaTable implementation, so we can now use .NET to maintain Delta tables, over and above what we get out of the box by using DataFrame.Write.Format("delta"). This is an example of how to use the Delta API from .NET:

Click through for the example. You can learn more about Spark Connect on the Spark website, and Ed has a .NET client for the task.
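
Ed’s library covers .NET, but for comparison, the Python side of the same idea looks roughly like this: connect to a remote Spark Connect endpoint and write a Delta table through it. The server address and path are hypothetical, and it assumes the server has the Delta Lake packages available:

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of a local JVM.
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

df = spark.createDataFrame(
    [(1, "pending"), (2, "shipped")],
    schema="order_id INT, status STRING",
)

# Write and read a Delta table; this needs the Delta Lake jars on the server side.
df.write.format("delta").mode("overwrite").save("/tmp/orders_delta")
spark.read.format("delta").load("/tmp/orders_delta").show()
```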

Spark Streaming plus Drools

Ram Ghadiyaram builds a tool:

Near real-time decision-making systems are critical for modern business applications. Integrating Apache Spark (Streaming) and Drools provides scalability and flexibility, enabling efficient handling of rule-based decision-making at scale. This article showcases their integration through a loan approval system, demonstrating its architecture, implementation, and advantages.  

Click through for a bit of sample code.
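
Drools itself runs on the JVM, so the article’s integration is in Java. As a rough Python stand-in for the same shape, here is a Structured Streaming sketch in which each micro-batch passes through a stubbed rule-evaluation function (the rate source and credit-score rule are my own inventions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-rules").getOrCreate()

# The built-in rate source stands in for a real stream of loan applications.
applications = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .withColumn("credit_score", (col("value") % 500 + 350).cast("int"))
)

def apply_rules(batch_df, batch_id):
    # In the article, this is where Drools would evaluate its rule set;
    # a simple predicate stands in for the rules engine here.
    approved = batch_df.filter(col("credit_score") >= 650)
    approved.show(truncate=False)  # stand-in for writing decisions downstream

# foreachBatch hands each micro-batch to the rule logic as a plain DataFrame.
query = applications.writeStream.foreachBatch(apply_rules).start()
query.awaitTermination()
```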

What’s New in Apache Spark 4.0

Ram Ghadiyaram looks at recent updates to Apache Spark:

Hurray! Apache Spark 4.0, released in 2025, redefines big data processing with innovations that enhance performance, accessibility, and developer productivity. With contributions from over 400 developers across organizations like Databricks, Apple, and NVIDIA, Spark 4.0 resolves thousands of JIRA issues, introducing transformative features: native plotting in PySpark, Python Data Source API, polymorphic User-Defined Table Functions (UDTFs), state store enhancements, SQL scripting, and Spark Connect improvements. This report provides an in-depth exploration of these features, their technical underpinnings, and practical applications through original examples and diagrams.

Click through to see what’s on the list of major features.
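
As one illustrative slice of the release, here is a sketch of a Python user-defined table function, the feature that lets one call return many rows; it assumes Spark 3.5 or later, and the names are my own:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udtf

spark = SparkSession.builder.appName("udtf-demo").getOrCreate()

# A UDTF yields zero or more rows per input, with a declared output schema.
@udtf(returnType="word: string, length: int")
class SplitWords:
    def eval(self, text: str):
        for word in text.split():
            yield word, len(word)

# Register it, then call it like a table in SQL.
spark.udtf.register("split_words", SplitWords)
spark.sql("SELECT * FROM split_words('spark four point oh')").show()
```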

Apache Spark 3.5 Support in Azure Synapse Analytics

Arshad Ali has an announcement:

You can now create Azure Synapse Runtime for Apache Spark 3.5. The essential changes include features which come from upgrading Apache Spark to version 3.5 and Delta Lake 3.2. Please review the official release notes for Apache Spark 3.5 to check the complete list of fixes and features. In addition, review the migration guidelines between Spark 3.4 and 3.5 to assess potential changes to your applications, jobs and notebooks. 

Credit where credit is due: I’ve made light of the utter lack of work on Azure Synapse Analytics since Microsoft Fabric’s release. But hey, they did a thing. Granted, the impetus behind this was to “prepare for migrating to Microsoft Fabric Spark.”

Automated Table Statistics on Delta Tables in Microsoft Fabric

Santhosh Kumar Ravindran makes an announcement:

We’re thrilled to introduce Automated Table Statistics in Microsoft Fabric Data Engineering — a major upgrade that helps you get blazing-fast query performance with zero manual effort.

Whether you’re running complex joins, large aggregations, or heavy filtering workloads, Fabric’s new automated statistics will help Spark make smarter decisions, saving you time, compute, and money.

Click through to see what’s included, as well as the limitations associated with this. You can still create manual statistics if you’d like, so on the whole, I approve.
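
For reference, the manual route on the Spark side is the ANALYZE TABLE command. A quick sketch, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-stats").getOrCreate()

# Collect table-level and per-column statistics that the optimizer can use
# for join reordering and selectivity estimates.
spark.sql("""
    ANALYZE TABLE sales.orders
    COMPUTE STATISTICS FOR COLUMNS order_date, customer_id
""")

# Inspect the gathered column statistics.
spark.sql("DESCRIBE EXTENDED sales.orders order_date").show(truncate=False)
```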
