Press "Enter" to skip to content

Category: Spark

The Showdown: Spark vs DuckDB vs Polars in Microsoft Fabric

Miles Cole puts together a benchmark:

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines— is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With recent folks in the community posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to pull up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

Read on for the method and results from several thoughtful tests.
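To get a feel for the kind of head-to-head involved, here is a minimal timing sketch (not Miles's methodology) that runs the same aggregation through DuckDB and Polars; the Parquet path and column names are invented for illustration.

```python
import time
import duckdb
import polars as pl

PATH = "/lakehouse/default/Files/sales.parquet"  # hypothetical sample file

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# DuckDB: push the SQL down to the single-node engine
timed("duckdb", lambda: duckdb.sql(
    f"SELECT region, SUM(amount) AS total FROM '{PATH}' GROUP BY region"
).fetchall())

# Polars: lazy scan so the optimizer can prune columns before collecting
timed("polars", lambda: (
    pl.scan_parquet(PATH)
      .group_by("region")
      .agg(pl.col("amount").sum().alias("total"))
      .collect()
))
```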


Updating Spark Pool Runtime Versions in Microsoft Fabric

Sandeep Pawar keeps things up to date:

It’s always a good idea to use the latest GA runtime for the default Spark pool in Fabric workspaces. Unless you change it manually, the workspace will always use the previously set runtime even if a new version is available. To help identify the runtime that workspaces are using and to upgrade multiple workspaces at once, use the code below, powered by Semantic Link.

Read on to see how you can do it using a bit of Python scripting.
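For a sense of what that scripting looks like, here is a minimal sketch assuming Semantic Link's FabricRestClient and the workspace Spark settings REST endpoint; the runtime identifier, payload shape, and DataFrame column names are assumptions, so treat Sandeep's code as the authoritative version.

```python
import sempy.fabric as fabric

TARGET_RUNTIME = "1.3"  # assumed GA runtime identifier; confirm against the Fabric docs

client = fabric.FabricRestClient()

# list_workspaces() returns a pandas DataFrame of workspaces you can access;
# column names ("Id", "Name") are assumed here
for _, ws in fabric.list_workspaces().iterrows():
    ws_id = ws["Id"]
    # Assumed endpoint: GET/PATCH v1/workspaces/{workspaceId}/spark/settings
    settings = client.get(f"v1/workspaces/{ws_id}/spark/settings").json()
    current = settings.get("environment", {}).get("runtimeVersion")
    print(ws["Name"], current)
    if current and current != TARGET_RUNTIME:
        # Assumes the client exposes a patch() helper for HTTP PATCH
        client.patch(
            f"v1/workspaces/{ws_id}/spark/settings",
            json={"environment": {"runtimeVersion": TARGET_RUNTIME}},
        )
```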


Metadata-Driven Spark Clusters in Azure Databricks

Matt Collins ties the room together with a bit of metadata:

In this article, we will discuss some options for improving interoperability between Azure Orchestration tools, like Data Factory, and Databricks Spark Compute. By using some simple metadata, we will show how to dynamically configure pipelines with appropriately sized clusters for all your orchestration and transformation needs as part of a data analytics platform.

Click through for an explanation of the challenge, followed by the how-to.
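As a rough illustration of the pattern (not Matt's implementation), a small metadata lookup can translate a workload size tag into the cluster specification an orchestration pipeline hands to Databricks; the size names and node types below are invented.

```python
# Hypothetical metadata: workload size -> cluster shape
CLUSTER_METADATA = {
    "small":  {"node_type_id": "Standard_DS3_v2", "num_workers": 2},
    "medium": {"node_type_id": "Standard_DS4_v2", "num_workers": 4},
    "large":  {"node_type_id": "Standard_DS5_v2", "num_workers": 8},
}

def build_new_cluster_spec(size: str, spark_version: str = "15.4.x-scala2.12") -> dict:
    """Return a new_cluster block that a pipeline (e.g. an ADF Databricks
    activity or a Databricks job definition) could inject at runtime."""
    meta = CLUSTER_METADATA[size]
    return {
        "spark_version": spark_version,
        "node_type_id": meta["node_type_id"],
        "num_workers": meta["num_workers"],
    }

print(build_new_cluster_spec("medium"))
```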


Debugging in Databricks

Chen Hirsh enables a debugger:

Do you know that feeling, when you write beautiful code and everything just works perfectly on the first try?

I don’t.

Every time I write code, it doesn’t work at first, and I have to debug it, make changes, test it…

Databricks introduced a debugger you can use on a code cell, and I’ve wanted to try it for quite some time now. Well, I guess the time is now.

I’m having trouble finding the utility of a debugger here. Notebooks are already set up for debugging: you can easily add or remove cells, and the underlying session maintains state between cells.
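For comparison, here is the sort of thing a notebook already gives you without a dedicated debugger, alongside Python's built-in pdb; the function and data are invented for illustration.

```python
import pdb

def transform(rows):
    # Suspect logic you want to inspect
    return [r * 2 for r in rows if r is not None]

# Notebook-style debugging: run the cell, then poke at `result` in the next cell,
# since session state persists between cells
result = transform([1, None, 3])
print(result)

# Or drop into the standard Python debugger at a specific point
# pdb.set_trace()  # uncomment to step through interactively
```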


Updates on the Spark Connect Dotnet Library

Ed Elliott has an update for us:

There have been quite a few changes in the last couple of months, and I just wanted to give a quick update on the current state of the project. In terms of usage, I am starting to hear from people using the library and submitting PRs and requests, so although usage is pretty low (which is expected, given that usage of the Microsoft-supported version wasn’t very high), it is growing, which is interesting.

Read on for thoughts on production readiness, support for Spark 4.0, a couple of other updates, and some future plans.


A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.
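As a taste of the basics, here is a minimal, self-contained PySpark SQL example (not taken from the article); the sample data is invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-sql-primer").getOrCreate()

# Build a small DataFrame and expose it to SQL as a temporary view
df = spark.createDataFrame(
    [("alice", "engineering", 95000), ("bob", "sales", 72000), ("carol", "sales", 81000)],
    ["name", "department", "salary"],
)
df.createOrReplaceTempView("employees")

# Standard SQL over the view, returning another DataFrame
spark.sql("""
    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""").show()
```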


Enabling System Tables on Databricks

Chen Hirsh wants to see system tables:

This post is about two things that depend on each other. First, it explains how to enable system tables (other than the ones enabled by default), and second, how to use these system tables to view data about workflow runs and costs.

Please note that Unity Catalog is required to use these features. And a premium workspace is required for using dashboards.

Click through to learn more about what system tables are and what you can get from them.
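Once the relevant schemas are enabled, querying them is plain SQL. Here is a minimal sketch against the billing usage system table; the schema and column names for workflow runs vary by release, so treat the specifics as assumptions and follow Chen's post for the real thing.

```python
# Assumes a Unity Catalog-enabled Databricks workspace where the billing
# system schema is enabled, and the notebook-provided `spark` session
usage = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
""")
usage.show()
```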


Finding Mutable and Immutable Properties in Microsoft Fabric Spark

Sandeep Pawar wants to make a change:

Spark properties are divided into mutable and immutable configurations based on whether they can be safely modified at runtime, after the Spark session is created.

Mutable properties can be changed dynamically using spark.conf.set() without requiring a restart of the Spark application. These typically include performance tuning parameters like shuffle partitions, broadcast thresholds, AQE settings, etc.

Immutable properties, on the other hand, are global configurations that affect core Spark behavior and cluster setup. These must be set at or before session initialization, as they require a fresh session to take effect.

Read on to see how you can tell which is which.
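One quick way to check from a notebook is PySpark's spark.conf.isModifiable(), which reports whether a given key can be changed on a live session. A minimal sketch, with example keys:

```python
# Assumes the notebook-provided SparkSession named `spark`
candidates = [
    "spark.sql.shuffle.partitions",   # typically mutable at runtime
    "spark.sql.adaptive.enabled",     # typically mutable at runtime
    "spark.executor.memory",          # typically fixed at session start
]

for key in candidates:
    print(key, "mutable" if spark.conf.isModifiable(key) else "immutable")

# Mutable settings can be changed in place without a new session
spark.conf.set("spark.sql.shuffle.partitions", "64")
```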


Power BI Automatic Aggregations and Databricks

Katie Cummiskey, et al., do a bit of caching:

Automatic aggregations streamline the process of improving BI query performance by maintaining an in-memory cache of aggregated data. This means that a substantial portion of report queries can be served directly from this in-memory cache instead of relying on the backend data sources. Power BI automatically builds these aggregations using AI based on your query patterns and then intelligently decides which queries can be served from the in-memory cache and which are routed to the data source through DirectQuery, resulting in faster visualizations and reduced load on the backend systems.

Click through to learn more about automatic aggregations, which SKUs of Power BI / Fabric are eligible, and how you can enable it for data coming from Databricks.

Comments closed

Map and FlatMap in PySpark

Vipul Kumar does a bit of work with resilient distributed datasets:

PySpark, the Python API for Apache Spark, is widely used for big data processing and distributed computing. It enables data engineers and data scientists to efficiently process large datasets using resilient distributed datasets (RDDs) and DataFrames. Two commonly used transformations in PySpark are map() and flatMap(). These functions allow users to perform operations on RDDs and are pivotal in distributed data processing.

In this blog, we’ll explore the key differences between map() and flatMap(), their use cases, and how they can be applied in PySpark.

The DataFrame approach has all but obviated having developers use the original Hadoop-like map-reduce approach to writing code in Spark. Even so, I do think it’s useful to know how it all works.
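As a quick refresher on the difference, here is a minimal RDD example with invented data: map() returns one output element per input, while flatMap() flattens each returned iterable.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-vs-flatmap").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["hello world", "spark is fun"])

# map: one list per input line -> [['hello', 'world'], ['spark', 'is', 'fun']]
print(lines.map(lambda line: line.split()).collect())

# flatMap: the lists are flattened -> ['hello', 'world', 'spark', 'is', 'fun']
print(lines.flatMap(lambda line: line.split()).collect())
```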
