Press "Enter" to skip to content

Curated SQL Posts

Building a Graph for Its Takeaway

Cole Nussbaumer Knaflic reminds us that visuals should have purpose:

I was facilitating a workshop recently when someone asked one of my favorite questions about a graph on the screen: “So… what are we supposed to take away from this?”

Such a simple—and useful—question.

One challenge was that the graph was attempting to show multiple comparisons at once, so it wasn’t clear what mattered most. To further complicate things, the data in question spanned very different magnitudes, with one category dwarfing the rest.

Click through for a demonstration and how changing the visual layout can affect the message. The challenge I tend to run into is that, when I’m developing a visual for an application or a report, I don’t know what the precise message should be at that moment in time. I have to design with an idea of the data, but what actually emerges will depend upon what data is in there. Tailoring a visual for a specific message at a specific point in time is a lot easier when building a presentation, but it gets tricky when you’re building an application for the long haul.

Leave a Comment

The Makeup of an 8KB Page in PostgreSQL

Radim Marek takes us through the structure of a page:

If you read previous post about buffers, you already know PostgreSQL might not necessarily care about your rows. You might be inserting a user profile, or retrieving payment details, but all that Postgres works with are blocks of data. 8KB blocks, to be precise. You want to retrieve one tiny row? PostgreSQL hauls an entire 8,192-byte page off the disk just to give it to you. You update a single boolean flag? Same thing. The 8KB page is THE atomic unit of I/O.

But knowing those pages exist isn’t enough. To understand why the database behaves the way it does, you need to understand how it works. Every time you execute INSERT, PostgreSQL needs to figure out how to fit it into one of those 8,192-byte pages.

It is a little wild how three of the largest relational database systems use 8KB pages. I know that, on the SQL Server side, they’ve experimented with different page sizes internally and have repeatedly said that, even recently (the last time I heard this was maybe about 3 years ago at a SQL Saturday), there just isn’t a benefit from moving away from 8KB. But what’s in those 8KB differ, and Radim goes into details on what’s in PostgreSQL.

Leave a Comment

SQL Server Transaction Log Architecture

Paul Randal re-continues a series:

In the first part of this series I introduced basic terminology around logging so I recommend you read that before continuing with this post. Everything else I’ll cover in the series requires knowing some of the architecture of the transaction log, so that’s what I’m going to discuss this time. Even if you’re not going to follow the series, some of the concepts I’m going to explain below are worth knowing for everyday tasks that DBAs handle in production.

Note: as I progress through this series and talk about aspects of the log, there are often little edge-cases or weird behaviors in niche circumstances that have been added or changed over the years. I’ll ignore those unless I specifically want to call them out, otherwise the posts would be riddled with rat-holes and mazes of twisty-turny little passages (yes, I loved ’80s text-based adventure games 🙂 that would distract from the main things to learn about in each post.

Click through for a primer on virtual log files, log blocks, and log sequence numbers.

Leave a Comment

Official Support for fabric-cicd Tool

Yaron Pri Gal announces support for a library:

Today, we’re announcing that fabric‑cicd—the open‑source Python deployment library for Microsoft Fabric—is now an officially supported, Microsoft‑backed tool for CI/CD automation across Fabric workspaces.

Over the past year, fabric‑cicd has rapidly evolved through collaboration with engineering, CAT, MVPs, enterprise customers, and the community. Growing usage, strong sentiment across internal and external channels, and adoption by organizations building enterprise‑grade deployment pipelines helped solidify its value within the Fabric ecosystem.

Read on to learn what this means.

Leave a Comment

Comparing Techniques for Text Featurization in Classification Problems

Ivan Palomaras Carrascosa tries a few things:

In this article, you will learn how Bag-of-Words, TF-IDF, and LLM-generated embeddings compare when used as text features for classification and clustering in scikit-learn.

Topics we will cover include:

  • How to generate Bag-of-Words, TF-IDF, and LLM embeddings for the same dataset.
  • How these representations compare on text classification performance and training speed.
  • How they behave differently for unsupervised document clustering.

Click through for results. Granted, the specific embedding model can alter the quality of results, but even so, I do enjoy the comparison of techniques and the reminder that neural networks aren’t the ultimate solution to everything.

Leave a Comment

A Review of the Portmanteau Theorem

Ben Smith digs into a theorem:

The Portmanteau Theorem provides a set of equivalences of weak convergence that still remains relevant for establishing asymptotic results in probability and statistics. While the theory around weak convergence is well developed, I was inspired to put together a writeup proving all the equivalences in a self contained manner, by first presenting the relevant theorems applied (without proving them) along with along with a visual on the implication cycle created for the proof and some discussion about other presentations available in popular textbooks and some historical notes.

Click through for the PDF.

Leave a Comment

Web Scraping with Python

Jason Yousef has a script:

Below is a production-friendly pattern that:

  • Uses a requests.Session with retries, backoff, and a real User-Agent
  • Sets sane timeouts and handles common HTTP errors
  • Respects robots.txt (and tells you if scraping is disallowed)
  • Parses only mailto: links by default to avoid scraping personal data you shouldn’t
  • Handles pagination with a “Next” link when present
  • Exports to CSV
  • Can be run from the command line with arguments

Click through for the code, some explanation of how it works, and a few tips.

Leave a Comment

Running Totals over Arbitrary Date Ranges

Louis Davidson solves an interval problem:

Say you want to find the most recent 30-day period during which a person purchased some amount of products from your company. How you market to a customer might change if they have been active over a time period recently, or even in the past. But this also means that for each day going back in history, you need to sum historic data over and over, and the previous 29 days of activity. This is generally known as a rolling total. Doing this sort of calculation has been an interesting problem for many years.

When window functions came around, they became quite useful for such tasks, but they have one kind of complicated problem: gaps in source data patterns.

Funnily enough, there is a solution using window functions: range intervals. The ANSI SQL definition for RANGE (versus ROWS) for window functions does allow for the specification of a date range, like RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW. Very impressive.

Unfortunately, SQL Server doesn’t support these. PostgreSQL does, but it’s an area I’ve agitated about for a few years and I do hope that someday, the SQL Server product team will support this functionality. In the meantime, Louis has a solution that works well for the task.

Leave a Comment

Read Efficiency in PostgreSQL Queries

Michael Christofides explains what’s happening under the covers:

A lot of the time in database land, our queries are I/O constrained. As such, performance work often involves reducing the number of page reads. Indexes are a prime example, but they don’t solve every issue (a couple of which we’ll now explore).

The way Postgres handles consistency while serving concurrent queries is by maintaining multiple row versions in both the main part of a table (the “heap”) as well as in the indexes (docs). Old row versions take up space, at least until they are no longer needed, and the space can be reused. This extra space is commonly referred to as “bloat”. Below we’ll look into both heap bloat and index bloat, how they can affect query performance, and what you can do to both prevent and respond to issues.

Read on for a detailed explanation.

Leave a Comment