Spark – Page 3 – Curated SQL

Session, DataFrameWriter, and Table Configurations in Spark

Published 2024-12-24 by Kevin Feasel

Miles Cole makes a configuration change:

With Spark and Delta Lake, just like with Hudi and Iceberg, there are several ways to enable or disable settings that impact how tables are created. These settings may affect data layout or table format features, but it can be confusing to understand why different methods exist, when each should be used, and how property inheritance works.

While platform defaults should account for most use cases, Spark provides flexibility to optimize various workloads, whether adjusting for read or write performance, or for hot or cold path data processing. Inevitably, the need to adjust configurations from the default will arise. So, how do we do this effectively?

Read on to learn how.

Comments closed

Switching between Python and PySpark Notebooks in Fabric

Published 2024-12-23 by Kevin Feasel

Sandeep Pawar wants to save some money:

File this under a test I have been wanting to do for some time. If I am exploring some data in a Fabric notebook using PySpark, can I switch between Python and PySpark engines with minimal code changes in an interactive session? The goal is to use the Python notebook for some exploration or use existing PySpark/SparkSQL or develop the logic in a low compute environment (to save CUs) and scale it in a distributed Spark environment. Understandably, there will be limitations with this approach given the difference in environments, configs etc., but can it be done?

Read on for the answer, as well as plenty of notes around it.

Comments closed

The Showdown: Spark vs DuckDB vs Polars in Microsoft Fabric

Published 2024-12-13 by Kevin Feasel

Miles Cole puts together a benchmark:

There’s been a lot of excitement lately about single-machine compute engines like DuckDB and Polars. With the recent release of pure Python Notebooks in Microsoft Fabric, the excitement about these lightweight native engines has risen to a new high. Out with Spark and in with the new and cool animal-themed engines— is it time to finally migrate your small and medium workloads off of Spark?

Before writing this blog post, honestly, I couldn’t have answered with anything besides a gut feeling largely based on having a confirmation bias towards Spark. With recent folks in the community posting their own benchmarks highlighting the power of these lightweight engines, I felt it was finally time to pull up my sleeves and explore whether or not I should abandon everything I know and become a DuckDB and/or Polars convert.

Read on for the method and results from several thoughtful tests.

Comments closed

Updating Spark Pool Runtime Versions in Microsoft Fabric

Published 2024-12-12 by Kevin Feasel

Sandeep Pawar keeps things up to date:

It’s always a good idea to use the latest GA runtime for the default Spark pool in Fabric workspaces. Unless you change it manually, the workspace will always use the previously set runtime even if a new version is available. To help identify the runtime that workspaces are using and to upgrade multiple workspaces at once, use the code below, powered by Semantic Link.

Read on to see how you can do it using a bit of Python scripting.

Comments closed

Metadata-Driven Spark Clusters in Azure Databricks

Published 2024-12-10 by Kevin Feasel

Matt Collins ties the room together with a bit of metadata:

In this article, we will discuss some options for improving interoperability between Azure Orchestration tools, like Data Factory, and Databricks Spark Compute. By using some simple metadata, we will show how to dynamically configure pipelines with appropriately sized clusters for all your orchestration and transformation needs as part of a data analytics platform.

Click through for an explanation of the challenge, followed by the how-to.

Comments closed

Debugging in Databricks

Published 2024-11-19 by Kevin Feasel

Chen Hirsh enables a debugger:

Do you know that feeling, when you write beautiful code and everything just works perfectly on the first try?

I don’t.

Every time I write code It doesn’t work in the beginning, and I have to debug it, make changes, test it…

Databricks introduced a debugger you can use on a code cell, and I’ve wanted to try it for quite some time now. Well, I guess the time is now

I’m having trouble in finding the utility for a debugger here. Notebooks are already set up for debugging: you can easily add or remove cells and the underlying session maintains state between cells.

Comments closed

Updates on the Spark Connect Dotnet Library

Published 2024-11-18 by Kevin Feasel

Ed Elliott has an update for us:

There have been quite a few changes in the last couple of months and I just wanted to give a quick update on the current state of the project. In terms of usage I am starting to hear from people using the library and submitting pr’s and requests so although usage is pretty low (which is expected from the fact that the Microsoft supported version usage wasn’t very high) it is growing which is interesting.

Read on for thoughts on production readiness, support for Spark 4.0, a couple of other updates, and some future plans.

Comments closed

A Primer on SparkSQL and PySpark

Published 2024-11-06 by Kevin Feasel

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.

Comments closed

Enabling System Tables on Databricks

Published 2024-11-04 by Kevin Feasel

Chen Hirsh wants to see system tables:

This post is about two things that are dependent on each other. First, it explains how to enable system tables (other than the one enabled by default) and second, how to use these system tables to view data about workflow runs and costs.

Please note that Unity Catalog is required to use these features. And a premium workspace is required for using dashboards.

Click through to learn more about what system tables are and what you can get from them.

Comments closed

Finding Mutable and Immutable Properties in Microsoft Fabric Spark

Published 2024-11-01 by Kevin Feasel

Sandeep Pawar wants to make a change:

Spark properties are divided into mutable and immutable configurations based on whether they can be safely modified during runtime after the spark session is created.

Mutable properties can be changed dynamically using spark.conf.set() without requiring a restart of the Spark application – these typically include performance tuning parameters like shuffle partitions, broadcast thresholds, AQE etc.

Immutable properties, on the other hand, are global configurations that affect core spark behavior and cluster setup and these must be set before/at session initialization as they require a fresh session to take effect.

Read on to see how you can tell which is which.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: Spark