Press "Enter" to skip to content


Scaling Kafka Streams Applications

The Confluent employee mines have a new article:

As the adoption of real-time data processing accelerates, the ability to scale stream processing applications to handle high-volume traffic is paramount. Apache Kafka®, the de facto standard for distributed event streaming, provides a powerful and scalable library in Kafka Streams for building such applications. 

Scaling a Kafka Streams application effectively involves a multi-faceted approach that encompasses architectural design, configuration tuning, and diligent monitoring. This guide will walk you through the essential strategies and best practices to ensure your Kafka Streams applications can gracefully handle massive throughput.

The post gets into some details around the kinds of limits you’ll hit during scaling, scale-up versus scale-out, and configuration settings to help with that scale.


Use Cases for Window Functions

I have a new video:

In this video, I take you through a variety of use cases for window functions, showing how you can solve common (and sometimes uncommon) business problems efficiently and effectively.

This video builds on the prior two, which showed what the different window functions are and how they work. This one focuses primarily on solving business problems in sometimes-clever ways.
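
To give a flavor of the kinds of problems in play, here is a minimal, generic sketch (not an example taken from the video) of the classic "latest row per group" pattern, using ROW_NUMBER() against a hypothetical Orders table:

-- Hypothetical Orders table: return each customer's most recent order.
WITH NumberedOrders AS
(
    SELECT
        o.CustomerID,
        o.OrderID,
        o.OrderDate,
        o.OrderTotal,
        ROW_NUMBER() OVER (
            PARTITION BY o.CustomerID
            ORDER BY o.OrderDate DESC
        ) AS rn
    FROM dbo.Orders o
)
SELECT
    CustomerID,
    OrderID,
    OrderDate,
    OrderTotal
FROM NumberedOrders
WHERE rn = 1;

The same shape handles de-duplication, latest-status lookups, and most other "top 1 per group" questions without resorting to a self-join.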


Tag-Based Masking in Snowflake

Kevin Wilkie gets tagging:

If you’ve followed our site for a while, you would have seen in a previous post how powerful tag-based masking policies are in Snowflake. They let you enforce consistent data masking rules across columns without constantly rewriting logic. But Snowflake hasn’t stopped there—recent enhancements now make it even easier to classify, tag, and mask data at scale. In this post, we’ll recap the essentials of tag-based masking, highlight the new functionality, and share some practical tips for rolling it out in your environment.
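
If you have not seen tag-based masking in action, here is a minimal sketch of how the pieces fit together (all object names are hypothetical): create a tag and a masking policy, bind the policy to the tag, and any string column carrying that tag is masked automatically.

-- Hypothetical object names; a minimal tag-based masking setup in Snowflake.
CREATE TAG IF NOT EXISTS governance.tags.pii_type;

CREATE MASKING POLICY governance.policies.mask_pii_string
  AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***MASKED***'
  END;

-- Bind the policy to the tag.
ALTER TAG governance.tags.pii_type
  SET MASKING POLICY governance.policies.mask_pii_string;

-- Any string column tagged with pii_type now gets the policy applied.
ALTER TABLE sales.public.customers
  MODIFY COLUMN email
  SET TAG governance.tags.pii_type = 'EMAIL';

The appeal of the tag route is that the policy rides along with the classification, so new columns only need the tag rather than a fresh policy assignment.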

Kevin has a new blog theme and everything.


Set MAXDOP in Azure SQL DB

Brent Ozar has a public service announcement:

In Azure SQL DB, you set max degree of parallelism at the database level. You right-click on the database, go into properties, and set the MAXDOP number.

I say “you” because it really is “you” – this is on you, bucko. Microsoft’s magical self-tuning database doesn’t do this for you.

And where this backfires, badly, is that Azure SQL DB has much, much lower caps on the maximum number of worker threads your database can consume before it gets cut off. 

Click through to see what kind of error message you get and just how low these limits are.
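
Brent covers the properties dialog; if you would rather script the change, the database-scoped configuration route does the same thing (4 below is only a placeholder value; pick what fits your workload):

-- Run inside the target Azure SQL DB database; 4 is just an example value.
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;

-- Confirm the current setting.
SELECT [name], [value]
FROM sys.database_scoped_configurations
WHERE [name] = N'MAXDOP';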


Adding a Drillthrough Button in Power BI

Elena Drakulevska adds a button:

If you’ve been building Power BI reports, you probably know about drillthrough.

In short: drillthrough lets users move from a summary view to a detail page focused on one data point. For example, you can right-click on Austria in a sales chart and jump straight to a page showing visuals and metrics only about Austria.

Sounds powerful, right?

The catch: most users don’t even know it’s been implemented.

The other catch: those of us sad souls using Power BI Report Server don’t get drillthrough at all.


Microsoft Fabric Copy Job Updates

Ye Xu has an update:

Copy job is the go-to solution in Microsoft Fabric Data Factory for simplified data movement. With native support for multiple delivery styles, including bulk copy, incremental copy, and change data capture (CDC) replication, Copy job offers the flexibility to handle a wide range of scenarios—all through an intuitive, easy-to-use experience.

This update introduces several enhancements, including connection parameterization, expanded CDC capabilities, new connectors, and a streamlined Copy Assistant powered by Copy job.

Read on to see what’s new. Some of the items in this list are preview features, and it looks like others are currently GA.


Finding Rows with Errors in Power Query

Gilbert Quevauvilliers goes around looking for trouble:

In the past, when there has been an error loading data into the semantic model, clicking on View errors can either take a very long time to show those errors or, in some cases, never show the error at all.

In this blog post I am going to show you an alternative way to quickly find the errors.

The column quality data preview option is absolutely worth keeping on at all times.
