Press "Enter" to skip to content

Author: Kevin Feasel

Accounting for Index Usage over Availability Groups

Aaron Bertrand wants a cross-replica accounting system:

As a part of optimizing performance, I evaluate index usage across many instances and databases. I often find that some indexes aren’t used much or, at least on first glance, appear to go unused. Since we use availability groups (AG), different workloads run against replicas in different roles. All writes happen on the primary, obviously. However, some queries only happen on read-only secondaries (either because read-only routing is in use, or because some processes are manually directed at specific secondaries, or both). Unfortunately, index usage is not rolled up anywhere across all replicas. This means that looking at the primary alone gives an incomplete picture. How do I make sure I account for index activity everywhere, not just on the primary?

This is part one of a series but already has enough to get us started.

Leave a Comment

Updates to Fabric Data Factory

Abhishek Narain has a list of updates:

Workspace Private Link Support for Data Factory (Preview): Microsoft Fabric enables secure data integration through Private Link support in Dataflows Gen2, Pipelines, and Copy jobs. This ensures that inbound data access remains isolated and compliant within protected workspaces. By leveraging VNet data gateways, organizations can securely connect to data sources across Private Link-enabled environments—eliminating exposure to public networks and reinforcing enterprise-grade security for sensitive data operations.

Most of these are security-related updates, with a mixture of things now GA, things currently in preview, and a pair of items coming soon.

Leave a Comment

Error Handling in PySpark Jobs

Ram Ghadiyaram adds some error handling logic:

In PySpark, processing massive datasets across distributed clusters is powerful but comes with challenges. A single bad record, missing file, or network glitch can crash an entire job, wasting compute resources and leaving you with stack traces that have many lines. 

Spark’s lazy evaluation, where transformations don’t execute until an action is triggered, makes errors harder to catch early, and debugging them can feel like very, very difficult.

Read on for five patterns that can help with error handling in PySpark.

Leave a Comment

Performance of User-Defined Functions in Fabric Warehouses

Jared Westover shares some findings:

In Part One, we saw that simple scalar user-defined functions (UDFs) perform as well as inline code in a Fabric warehouse. But with a more complex UDF, does performance change? If it drops, is the code-reuse convenience worth the price?

I’m surprised that the performance profile was so good. I had assumed it would perform like T-SQL user-defined functions—namely, worse in general.

Leave a Comment

Building an UPDATE … LIMIT in PostgreSQL

Laurenz Albe doesn’t have MySQL envy:

If you are reading this hoping that PostgreSQL finally got UPDATE ... LIMIT like MySQL, I have to disappoint you. The LIMIT clause is not yet supported for DML statements in PostgreSQL. If you want to UPDATE only a limited number of rows, you have to use workarounds. This article will describe how to do this and how to avoid the pitfalls and race condition you may encounter. Note that most of the following also applies to DELETE ... LIMIT!

Click through for what you can do in PostgreSQL instead. In T-SQL, we can use UPDATE TOP(n).

Leave a Comment

Creating Database Snapshots with sp_snapshot

David Fowler announces a tool update:

Presenting you with an updated version of our sp_snapshot procedure, allowing you to easily create database snapshots.

This new version fixes a bug that we’ve found in version 2 where snapshots will fail for databases with multiple data files.

We’ve also added the @STMTOnly parameter, allowing you to generate the scripts for creating the required snapshots without actually doing so.

Click through for more information, as well as where you can go to download the script.

Leave a Comment

RCSI Scenarios

Haripriya Naidu digs into a few scenarios:

When RCSI is enabled, I’d like to discuss three scenarios:

  1. UPDATE is in progress, and SELECT starts to run.
    Where does SELECT read from?
  2. SELECT runs, and there are no concurrent operations or uncommitted transactions. Where does SELECT read from?
  3. SELECT is running, and now an UPDATE starts to run concurrently. What happens to SELECT that is in progress? What about the UPDATE that started? Does it wait for SELECT to finish?

Click through to see what happens during each of these scenarios.

Leave a Comment

Choosing between Data Scalers in a Data Science Project

Bala Pirya C performs a comparison:

In this article, you will learn how MinMaxScaler, StandardScaler, and RobustScaler transform skewed, outlier-heavy data, and how to pick the right one for your modeling pipeline.

Topics we will cover include:

  • How each scaler works and where it breaks on skewed or outlier-rich data
  • A realistic synthetic dataset to stress-test the scalers
  • A practical, code-ready heuristic for choosing a scaler

Read on to learn more about each of these three scaler types, the use cases that best fit each of them, and even a flow chart at the end.

Leave a Comment

A Primer on Principal Component Analysis

Harris Amjad explains the basics of principal component analysis:

In this series of tips, we will delve into the unsupervised learning branch of Machine Learning. Principal Component Analysis (PCA) is a powerful technique for dimensionality reduction, but its mathematical foundation involving eigenvalues and eigenvectors can be intimidating. This tip aims to demystify PCA, explaining its purpose, how it works, and its use in visualizing high-dimensional data.

Click through to learn how it works. This is a solid primer.

Leave a Comment