Press "Enter" to skip to content

Category: Spark

Creating a Spark Job Definition

Miles Cole builds a job:

A Spark Job Definition is effectively a way to run a packaged Spark application, Fabric’s version of executing a spark-submit job. You define:

  • what code should run (the entry point),
  • which files or resources should be shipped with it,
  • and which command-line arguments should control its behavior.

Unlike a notebook, there is no interactive editor or cell output, but this is arguably not a missing feature; it’s the whole point… an SJD is not meant for exploration; it is meant to deterministically run a Spark application.
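As a rough sketch of the shape such a packaged application takes (the file name, arguments, and paths here are hypothetical, not from the post), something like this, assuming PySpark:

```python
# entry_point.py -- a minimal, hypothetical Spark application suitable for
# running as a Spark Job Definition (or via spark-submit).
import argparse

from pyspark.sql import SparkSession


def main() -> None:
    # Command-line arguments control behavior; nothing is interactive.
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True, help="Input table or path")
    parser.add_argument("--target", required=True, help="Output table or path")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("example-sjd").getOrCreate()

    # Deterministic transform: read, reshape, write. No cells, no output pane.
    df = spark.read.format("delta").load(args.source)
    df.dropDuplicates().write.format("delta").mode("overwrite").save(args.target)

    spark.stop()


if __name__ == "__main__":
    main()
```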

With that concept in mind, click through for the process.


Combining Fabric Real-Time Intelligence, Notebooks, and Spark Structured Streaming

Arindam Chatterjee and QiXiao Wang show off some preview functionality:

Building event-driven, real-time applications using Fabric Eventstreams and Spark Notebooks just got a whole lot easier. With the Preview of Spark Notebooks and Real-Time Intelligence integration — a new capability that brings together the open-source community supported richness of Spark Structured Streaming with the real-time stream processing power of Fabric Eventstreams — developers can now build low-latency, end-to-end real-time analytics and AI pipelines all within Microsoft Fabric.

You can now seamlessly access streaming data from Eventstreams directly inside Spark notebooks, enabling real-time insights and decision-making without the complexity and tedium of manual coding and configuration.
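The Eventstream connector specifics are in the linked post; for orientation, this is the general shape of a Spark Structured Streaming pipeline, using the built-in rate source purely as a stand-in for the Eventstream input:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source stands in for the Eventstream input here;
# the actual connector and its options are covered in the linked post.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# A simple windowed aggregation over the stream.
counts = (events
    .withWatermark("timestamp", "1 minute")
    .groupBy(F.window("timestamp", "30 seconds"))
    .count())

query = (counts.writeStream
    .outputMode("update")
    .format("console")
    .start())

query.awaitTermination()
```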

Click through to learn more.


More Spark Jobs, Fewer Notebooks

Miles Cole lays out an argument:

I’m guilty. I’ve peddled the #NotebookEverything tagline more than a few times.

To be fair, notebooks are an amazing entry point to coding, documentation, and exploration. But this post is dedicated to convincing you that notebooks are not, in fact, everything, and that many production Spark workloads would be better executed as a non-interactive Spark Job.

Miles has a “controversial claim” at the end that I don’t think is particularly controversial at all. I agree with pretty much the entire article, especially around the difficulties of testing notebooks properly.


Efficient Sampling of Spark Datasets

Rajesh Vakkalagadda needs a sample:

Sampling is a fundamental process in machine learning that involves selecting a subset of data from a larger dataset. This technique is used to make training and evaluation more efficient, especially when working with massive datasets where processing every data point is impractical.

However, sampling comes with its own challenges. Ensuring that samples are representative is crucial to prevent biases that could lead to poor model generalization and inaccurate evaluation results. The sample size must strike a balance between performance and resource constraints. Additionally, sampling strategies need to account for factors such as class imbalance, temporal dependencies, and other dataset-specific characteristics to maintain data integrity.

Click through for an answer in Scala. The Python implementation would be very similar.
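For instance, a minimal PySpark illustration of the two built-in sampling methods, simple and stratified (the dataset path and the "label" column are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling-sketch").getOrCreate()
df = spark.read.format("delta").load("/path/to/dataset")  # hypothetical path

# Simple random sample: roughly 10% of rows, seeded for reproducibility.
simple = df.sample(withReplacement=False, fraction=0.10, seed=42)

# Stratified sample via sampleBy: a different fraction per class value,
# which helps compensate for class imbalance.
stratified = df.stat.sampleBy("label", fractions={0: 0.05, 1: 0.50}, seed=42)
```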


Writing Sparse Pandas DataFrames to S3

Pooja Chhabra tries a few things:

If you’ve worked with large-scale machine learning pipelines, you likely know that one of the most frustrating bottlenecks isn’t always found in the complexity of the model or the elegance of the architecture: it’s writing the output efficiently.

Recently, I found myself navigating a complex data engineering hurdle where I needed to write a massive Pandas sparse DataFrame — the high-dimensional output of a CountVectorizer — directly to Amazon S3. By massive, I mean tens of gigabytes of feature data stored in a memory-efficient sparse format that needed to be materialized as a raw CSV file. This legacy requirement existed because our downstream machine learning model was specifically built to ingest only that format, leaving us with a significant I/O challenge that threatened to derail our entire processing timeline.

Read on for two major constraints, a variety of false starts, and what eventually worked.
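As a rough sketch of one chunked approach you might start from (not necessarily the author’s eventual solution; this assumes s3fs is installed so fsspec can stream to S3, and the matrix shape and bucket are hypothetical):

```python
import fsspec
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output: a sparse matrix
# wrapped in a pandas DataFrame with sparse dtypes.
matrix = sparse.random(50_000, 2_000, density=0.01, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(matrix)

# Densify one slice at a time so peak memory stays bounded, streaming all
# chunks into a single S3 object (s3fs handles the multipart upload).
chunk_size = 5_000
with fsspec.open("s3://my-bucket/features.csv", "w") as f:  # hypothetical bucket
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size].sparse.to_dense()
        chunk.to_csv(f, header=(start == 0), index=False)
```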


When repartition() Beats coalesce() in Spark

Janani Annur Thiruvengadam stands some common advice on its head:

If you’ve worked with Apache Spark, you’ve probably heard the conventional wisdom: “Use coalesce() instead of repartition() when reducing partitions — it’s faster because it avoids a shuffle.” This advice appears in documentation, blog posts, and is repeated across Stack Overflow threads. But what if I told you this isn’t always true?

In a recent production workload, I discovered that using repartition() instead of coalesce() resulted in a 33% performance improvement (16 minutes vs. 23 minutes) when writing data to fewer partitions. This counterintuitive result reveals an important lesson about Spark’s Catalyst optimizer that every Spark developer should understand.
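The usual mechanical explanation, in a sketch (table and path names hypothetical): coalesce(n) merges partitions without a shuffle, which also caps the parallelism of everything upstream of it in the same stage, while repartition(n) inserts a shuffle, letting the expensive upstream work run at full parallelism before the data is collapsed to n output files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()
df = spark.read.format("delta").load("/data/events")  # hypothetical input

transformed = df.filter("status = 'active'").withColumnRenamed("ts", "event_ts")

# coalesce: no shuffle, but the filter above now also runs on only 16 tasks.
transformed.coalesce(16).write.mode("overwrite").parquet("/out/coalesced")

# repartition: adds a shuffle, but upstream work keeps its full parallelism
# and only the final write happens on 16 partitions.
transformed.repartition(16).write.mode("overwrite").parquet("/out/repartitioned")
```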

Read on for the details on that scenario.


Job-Level Bursting in Microsoft Fabric Spark Jobs

Santhosh Kumar Ravindran announces a new feature:

  • Enabled (Default): When enabled, a single Spark job can leverage the full burst limit, consuming up to 3× CUs. This is ideal for demanding ETL processes or large analytical tasks that benefit from maximum immediate compute power.
  • Disabled: If you disable this switch, individual Spark jobs will be capped at the base capacity allocation. This prevents a single job from monopolizing the burst capacity, thereby preserving concurrency and improving the experience for multi-user, interactive scenarios.

Read on for the list of caveats and the note that it will cost extra money to flip that switch.


Optimized Compaction in Microsoft Fabric Spark

Miles Cole crunches things down:

Compaction is one of the most necessary but also challenging aspects of managing a Lakehouse architecture. Similar to file systems and even relational databases, unless closely managed, data will get fragmented over time, and can lead to excessive compute costs. The OPTIMIZE command exists to solve this challenge: small files are grouped into bins targeting a specific ideal file size and then rewritten to blob storage. The result is the same data, but contained in fewer, larger files.

However, imagine this scenario: you have a nightly OPTIMIZE job which runs to keep your tables, all under 1GB, nicely compacted. Upon inspection of the Delta table transaction log, you find that most of your data is being rewritten after every ELT cycle, leading to expensive OPTIMIZE jobs, even though you are only changing a small portion of the overall data every night. Meanwhile, as business requirements lead to more frequent Delta table updates, in between ELT cycles, it appears that jobs get slower and slower until the next scheduled OPTIMIZE job is run. Sound familiar?
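For reference, compaction on a Delta table is triggered with the standard OPTIMIZE command (the table name below is hypothetical); what the post covers is how much of the table gets rewritten on each run:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

# SQL form: bin-pack small files into larger ones.
spark.sql("OPTIMIZE my_lakehouse.sales")  # hypothetical table

# Equivalent Python API in Delta Lake 2.0+.
DeltaTable.forName(spark, "my_lakehouse.sales").optimize().executeCompaction()
```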

Read on to see what’s new and how you can enable it in your Fabric workspace.


Microsoft Fabric Spark Connector for SQL Databases

Arshad Ali makes an announcement:

Fabric Spark connector for SQL databases (Azure SQL databases, Azure SQL Managed Instances, Fabric SQL databases and SQL Server in Azure VM) in the Fabric Spark runtime is now available. This connector enables Spark developers and data scientists to access and work with data from SQL database engines using a simplified Spark API. The connector will be included as a default library within the Fabric Runtime, eliminating the need for separate installation.

This is a preview feature and works with Scala and Python code against SQL Server-ish databases in Azure (Azure SQL DB, Azure SQL Managed Instance, and virtual machines running SQL Server in Azure).
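The simplified connector API itself is in the announcement; for comparison, here is the generic Spark JDBC plumbing that a connector like this wraps (server, database, table, and credentials all hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-sketch").getOrCreate()

# Generic Spark JDBC read against Azure SQL; the new Fabric connector hides
# this kind of driver, auth, and option wiring behind a simpler API.
df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
    .option("dbtable", "dbo.orders")
    .option("user", "reader")
    .option("password", "<secret>")
    .load())
df.show()
```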


Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with sprawling stack traces.

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel very difficult.

Read on for five patterns that can help with error handling in PySpark.
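As a taste of one such pattern (not necessarily one of the author’s five): Spark’s permissive read mode quarantines malformed records into a corrupt-record column instead of failing the job. The file path below is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("error-handling-sketch").getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("_corrupt_record", StringType()),  # raw text lands here on parse failure
])

# PERMISSIVE mode keeps the job alive: bad rows get nulls plus the raw record
# in _corrupt_record. Caching is required because Spark disallows queries that
# reference only the corrupt-record column on an uncached DataFrame.
df = (spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/data/events.json")  # hypothetical path
    .cache())

bad = df.filter("_corrupt_record IS NOT NULL")   # quarantine for inspection
good = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
```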
