Press "Enter" to skip to content

Curated SQL Posts

Avoid Exposing PostgreSQL Port 5432 to the Internet

Christophe Pettus shares some good advice:

Sometimes, we run into a client who has port 5432 exposed to the public Internet, usually as a convenience measure to allow remote applications to access the database without having to go through an intermediate server appllication.

Do not do this.

This is the equivalent of exposing port 1433 on a SQL Server instance to the broader internet, and is a bad idea for many of the same reasons.

Leave a Comment

B-Tree Indexes in PostgreSQL vs SQL Server

Lukas Fittl compares and contrasts:

When it comes to optimizing query performance, indexing is one of the most powerful tools available to database engineers. Both PostgreSQL and Microsoft SQL Server (or Azure SQL) use B-Tree indexes as their default indexing structure, but the way each system implements, maintains, and uses those indexes varies in subtle but important ways.

In this blog post, we explore key areas where PostgreSQL and SQL Server diverge: how their B-Tree indexes implementations behave under the hood and how they store and access data on disk. We’ll also benchmark the impact of deduplication of values on index size in each database system.

I love this kind of post because you hear that SQL Server has indexes and PostgreSQL has indexes (or Oracle has indexes or whatever), and thus, all of your index building knowledge in one applies to the other…right?

One thing that changes the article a bit is that the author doesn’t use page-level compression on indexes in SQL Server. I’d expect the results to change a fair amount, even if the SQL Server non-clustered indexes still ended up larger in the end than PostgreSQL indexes.

Leave a Comment

Understanding Multi-Version Concurrency Control in PostgreSQL

Grant Fritchey explains a mechanism:

Let me start by giving you the short version of what MVCC is, and then the rest of the article explains more details. Basically, PostgreSQL is focused on ensuring, as much as possible, that reads don’t block writes and writes don’t block reads. This is done by always, only, inserting rows (tuples). No updates to an existing row. No actual deletes or updates. Instead, it uses a logical delete mechanism, which we’ll get into. This means that data in motion doesn’t interfere with data at rest, meaning a write doesn’t interfere with a read, therefore, less contention & blocking. There’s a lot to how all that works, so let’s get into it.

Click through for the dive. The pattern for MVCC is interesting, though quite different from pretty much any other implementation of concurrency management in a relational database system.

Leave a Comment

Azure SQL Mirroring in Microsoft Fabric Updates

Idris Motiwala announces a set of changes:

Attention data engineers, database developers, and data analysts! We’re pumped to reveal exciting upgrades to Mirroring for Azure SQL Database in Fabric today at the Fabric Conference in Las Vegas 2025. Since it became Generally Available, Mirroring for Azure SQL Database has been a game-changer, letting you replicate data seamlessly and integrate it within the Fabric environment. We’re talking near real-time data accessibility and some serious analytics power!

Click through to see what’s new, as well as what’s upcoming. This is specifically for Azure SQL DB and Azure SQL Managed Instance, versus SQL Server running as an Azure VM or on-premises (or some other cloud), so calibrate expectations accordingly.

Leave a Comment

Fine-Tuning a DistilBERT Model for Question Answering

Muhammad Asad Iqbal Khan builds upon a simple model:

The transformers library provides a clean and well-documented interface for many popular transformer models. Not only it makes the source code easier to read and understand, it also provided a standardize way to interact with the model. You have seen in the previous post how to use a model such as DistilBERT for natural language processing tasks. In this post, you will learn how to fine-tune the model for your own purpose. This expands the use of the model from inference to training. Specifically, you will learn:

  • How to prepare the dataset for training
  • How to train a model using a helper library

DistilBERT is a major simplification of BERT, but it comes with the advantage that it’s very easy to train on modest hardware and performance is in the same realm of acceptability as the full BERT model. Switching from DistilBERT to BERT isn’t as easy as just swapping out model classes, though it’s pretty close.

Leave a Comment

The Power of Virtual Environments in Python

I have a new video:

In this video, I explain why virtual environments are such an important concept in Python and why you should generally be using them. I also talk about virtual environments versus Docker containers and how these are not mutually exclusive.

It took me a while to understand why virtual environments make sense, and I think part of the difficulty in adapting to this mental model was that I was used to the .NET mechanism for package management: per-project library installation. Sure, there was the Global Assembly Cache (GAC) in .NET Framework and that had similar problems to installing packages in base Python installations, but we didn’t use it that often. Or at least, I’ve sublimated however many hours of pain I fought the GAC to the point that I don’t remember them anymore.

Leave a Comment

The New Fabric CLI

Hasan Abo Shally announces a CLI:

  • The Fabric CLI is now in preview
  • It offers a developer-first, file-system-inspired way to explore and manage Microsoft Fabric
  • Use it interactively or script it into your workflows — from your terminal, in seconds
  • Built on Fabric APIs, designed for automation, and constantly evolving
  • Open source is on the horizon — with plans to empower the community to extend and shape the CLI

Give it a try. Break things. Tell us what you want next.

Click through for the full announcement. The idea here is to be the az cli for Fabric. Between this and Semantic Link Labs, it will make automating tasks in Microsoft Fabric easier.

Leave a Comment

A New Dashboard for Distributed Availability Groups

David Fowler has been busy:

This comes off of the back of my last post looking at using a distributed availability group (DAG) to help facilitate a SQL server migration. SQL Server Migration Using a Distributed Availability Group

One thing that I mentioned in that post was that, although SSMS gives us a nice dashboard to check the health of our regular AGs. There’s nothing there to look at the state that the DAGs are in. The only choice that we’ve got is to tap up and compare results from a couple of DMVs on each side.

David has met that demand. Read on to see what the solution includes and how you can get your hands on it.

Leave a Comment

Calling a Microsoft Fabric REST API via Azure Data Factory

Koen Verbeeck makes the call:

Suppose you want to call a certain Microsoft Fabric REST API endpoint from Azure Data Factory (or Synapse Pipelines). This can be done using a Web Activity, and most Fabric APIs now support service principals or managed identities. Let’s illustrate with an example. I’m going to call the REST API endpoint to create a new lakehouse. 

Click through for the instructions.

Leave a Comment

Deploying and Using Custom Python Libraries in Microsoft Fabric

Miles Cole picks up from part one:

This is part 2 of my prior post that continues where I left off. I previously showed how you can use Resource folders in either the Notebook or Environment in Microsoft Fabric to do some pretty agile development of Python modules/libraries.

Now, how exactly can you package up your code to distribute and leverage it across multiple Workspaces or Environment items? How could we acomplish something like the below?

Read on for the answer.

Leave a Comment