2025-10-03 – Curated SQL

Error Handling in PySpark Jobs

Published 2025-10-03 by Kevin Feasel

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with stack traces that have many lines.

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel like very, very difficult.

Read on for five patterns that can help with error handling in PySpark.

Comments closed

Performance of User-Defined Functions in Fabric Warehouses

Published 2025-10-03 by Kevin Feasel

Jared Westover shares some findings:

In Part One, we saw that simple scalar user-defined functions (UDFs) perform as well as inline code in a Fabric warehouse. But with a more complex UDF, does performance change? If it drops, is the code-reuse convenience worth the price?

I’m surprised that the performance profile was so good. I had assumed it would perform like T-SQL user-defined functions—namely, worse in general.

Comments closed

Building an UPDATE … LIMIT in PostgreSQL

Published 2025-10-03 by Kevin Feasel

Laurenz Albe doesn’t have MySQL envy:

If you are reading this hoping that PostgreSQL finally got UPDATE ... LIMIT like MySQL, I have to disappoint you. The LIMIT clause is not yet supported for DML statements in PostgreSQL. If you want to UPDATE only a limited number of rows, you have to use workarounds. This article will describe how to do this and how to avoid the pitfalls and race condition you may encounter. Note that most of the following also applies to DELETE ... LIMIT!

Click through for what you can do in PostgreSQL instead. In T-SQL, we can use UPDATE TOP(n).

Comments closed

RCSI Scenarios

Published 2025-10-03 by Kevin Feasel

Haripriya Naidu digs into a few scenarios:

When RCSI is enabled, I’d like to discuss three scenarios:

UPDATE is in progress, and SELECT starts to run.
Where does SELECT read from?

SELECT runs, and there are no concurrent operations or uncommitted transactions. Where does SELECT read from?

SELECT is running, and now an UPDATE starts to run concurrently. What happens to SELECT that is in progress? What about the UPDATE that started? Does it wait for SELECT to finish?

Click through to see what happens during each of these scenarios.

Comments closed

Creating Database Snapshots with sp_snapshot

Published 2025-10-03 by Kevin Feasel

David Fowler announces a tool update:

Presenting you with an updated version of our sp_snapshot procedure, allowing you to easily create database snapshots.

This new version fixes a bug that we’ve found in version 2 where snapshots will fail for databases with multiple data files.

We’ve also added the @STMTOnly parameter, allowing you to generate the scripts for creating the required snapshots without actually doing so.

Click through for more information, as well as where you can go to download the script.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Day: October 3, 2025

Error Handling in PySpark Jobs

Performance of User-Defined Functions in Fabric Warehouses

Building an UPDATE … LIMIT in PostgreSQL

RCSI Scenarios

Creating Database Snapshots with sp_snapshot