Idle PostgreSQL Transactions and Table Bloat

Umair Shahid notes that some tables are feeling a bit bloated:

Yup, you read it right. Idle transactions can cause massive table bloat that the vacuum process may not be able to address. Bloat causes degradation in performance and can keep encroaching on disk space with dead tuples.

This blog delves into how idle transactions cause table bloat, why this is problematic, and practical strategies to avoid it.

Read on to understand how this can be and what you can do about it. And do check out the comments for a quick explanation of why connection pooling doesn’t exhibit this same problem.
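
As a minimal sketch of both halves of the fix, using standard PostgreSQL features rather than code from Umair's post: find the sessions sitting idle inside an open transaction, then cap how long any transaction may idle at all.

    -- Sessions idle inside an open transaction; these hold back the xmin
    -- horizon, so vacuum cannot remove dead tuples they might still see.
    SELECT pid,
           usename,
           xact_start,
           now() - xact_start AS xact_age
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_start;

    -- Have the server kill transactions that idle too long. The
    -- five-minute threshold is an arbitrary example value.
    ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
    SELECT pg_reload_conf();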

Comments closed

Dealing with Duplicate Data via ROW_NUMBER()

Andy Brownsword removes the duplicates:

Data quality and consistency is key to the services we support and solutions we deliver. A gremlin which can undermine that is duplicate data. Let’s start the new year dealing with duplicate data and having a good clear-out.

For our example we’ll consider an Order Product table which contains an OrderID and ProductID, and the combination of these should be unique. Other fields for the duplicate records may differ so we may want to be selective about which records are removed.

This is where I get on my high horse and complain about laziness in data modeling, a very common problem. This takes nothing away from Andy’s post, which is a good method for solving a problem that has gotten out of hand. But if you know that some combination of attributes is unique, add a unique key constraint or a unique non-clustered index right then and there. Doing so will prevent improper duplicate data from ever being an issue. If you don’t know that some combination of attributes must be unique, discuss this with the business side in a way that makes sense for them. Yes, there’s always the risk that you’ll have a conversation later like, “Oh, it turns out that this really should be unique,” but in most cases, you can easily sort this kind of thing out up-front and save a lot of time and effort later on.
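
To make the pattern concrete, here is a minimal sketch of the ROW_NUMBER() approach followed by the constraint I am arguing for. The table comes from Andy's example, but the ModifiedDate tiebreaker and the constraint name are illustrative assumptions, not his exact code.

    -- Keep one row per (OrderID, ProductID); delete the rest.
    WITH Ranked AS
    (
        SELECT
            OrderID,
            ProductID,
            ROW_NUMBER() OVER
            (
                PARTITION BY OrderID, ProductID
                ORDER BY ModifiedDate DESC  -- keep the most recent row
            ) AS rn
        FROM dbo.OrderProduct
    )
    DELETE FROM Ranked
    WHERE rn > 1;

    -- Then enforce uniqueness so the duplicates cannot come back.
    ALTER TABLE dbo.OrderProduct
        ADD CONSTRAINT UQ_OrderProduct_OrderID_ProductID
        UNIQUE (OrderID, ProductID);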

Comments closed

Prevalence Adjustment in Binary Classifiers

David Lindelöf deals with an issue in analyzing classification models:

When you run a binary classifier over a population you get an estimate of the proportion of true positives in that population. This is known as the prevalence.

But that estimate is biased, because no classifier is perfect. 

Read on to learn what this means for precision, as well as one technique for tracking prevalence changes over time.
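
For a sense of the arithmetic, one standard correction is the Rogan-Gladen estimator (David's technique may differ). Given the classifier's sensitivity se and specificity sp, it recovers the true prevalence from the apparent positive rate \hat{q}:

    \hat{p} = \frac{\hat{q} + sp - 1}{se + sp - 1}

For example, with se = 0.9, sp = 0.8, and an observed positive rate of 30%, the adjusted prevalence is (0.3 + 0.8 - 1) / (0.9 + 0.8 - 1) ≈ 0.143, well below the naive 30%.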

Comments closed

Entity Framework and Default Data Lengths

Brent Ozar points out one issue you might run into when using Entity Framework:

Most of the time, I love Entity Framework, and ORMs in general. These tools make it easier for companies to ship applications. Are the apps perfect? Of course not – but they’re good enough to get to market, bring in revenue to pay salaries, and move a company forwards.

However, just like any tool, if you don’t know how to use it, you’re gonna get hurt.

One classic example popped up again last month with a client who’d used EF Core to design their database for them. The developers just had to say which columns were numbers, dates, or strings, and EF Core handled the rest.

Read on for the scenario.
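
Without spoiling the details, a classic gotcha in this area (possibly the one in Brent's post) is that EF Core, absent an explicit maximum length, maps .NET strings on SQL Server to nvarchar(max), which cannot serve as an index key column and inflates memory grant estimates. A hedged illustration with a hypothetical table, not the client's schema:

    -- What EF Core generates for unconfigured string properties:
    CREATE TABLE dbo.Customers
    (
        CustomerID INT IDENTITY(1,1) PRIMARY KEY,
        Name NVARCHAR(MAX) NOT NULL,
        Email NVARCHAR(MAX) NOT NULL
    );

    -- What you get after specifying HasMaxLength() in the EF model:
    -- sized columns that can participate in index keys.
    CREATE TABLE dbo.Customers_Sized
    (
        CustomerID INT IDENTITY(1,1) PRIMARY KEY,
        Name NVARCHAR(100) NOT NULL,
        Email NVARCHAR(320) NOT NULL
    );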

Comments closed

Window Function Ranges: UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

Chad Callihan engages the limit breaker:

I’m familiar with using the OVER clause and don’t think it’s too uncommon to see it used for including row numbers by using ROW_NUMBER() and aggregating data. But even though they’ve been around since SQL Server 2012, I’m not too familiar with using the OVER clause with the UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING to affect the window being queried.

Let’s take a look at a couple of examples using UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.

Click through for those examples. The default ranges for window functions usually make a lot of sense, but it’s good to understand your options for frames: ROWS vs RANGE, as well as the frame values (UNBOUNDED PRECEDING, {N} PRECEDING, CURRENT ROW, {N} FOLLOWING, and UNBOUNDED FOLLOWING).
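
As a quick illustration (sample table assumed, not Chad's examples), the same aggregate over two frames gives a running total versus a window-wide total. Remember that with an ORDER BY and no explicit frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    SELECT
        OrderDate,
        Amount,
        -- Everything from the start of the window up to this row.
        SUM(Amount) OVER
        (
            ORDER BY OrderDate
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS RunningTotal,
        -- Everything in the window, regardless of the current row.
        SUM(Amount) OVER
        (
            ORDER BY OrderDate
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS GrandTotal
    FROM dbo.Orders;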

Comments closed

Reading a SQL Server XML Deadlock Report

Stephen Planck reads a report:

SQL Server includes an Extended Events session called system_health, which runs by default and, among other things, captures information about deadlocks as they occur. When two or more sessions block each other in such a way that no progress can be made (a deadlock), SQL Server chooses one session as the “victim,” rolls back its transaction, and frees resources so other sessions can continue. By reviewing the deadlock report in the system_health session’s XML output, you can see precisely why the deadlock happened and identify which queries or procedures were involved.

Below is a walkthrough of how to interpret a sample XML deadlock report, followed by a brief note on how to access this output.

Read on for that walkthrough.
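
If you want to pull those reports yourself, one common approach (not necessarily the method Stephen uses) is to query the system_health session's ring buffer target directly:

    -- Extract xml_deadlock_report events from the system_health ring buffer.
    SELECT
        XEvent.value('@timestamp', 'datetime2') AS EventTime,
        XEvent.query('(data/value/deadlock)[1]') AS DeadlockGraph
    FROM
    (
        SELECT CAST(st.target_data AS XML) AS TargetData
        FROM sys.dm_xe_sessions AS s
        JOIN sys.dm_xe_session_targets AS st
            ON s.address = st.event_session_address
        WHERE s.name = N'system_health'
          AND st.target_name = N'ring_buffer'
    ) AS tgt
    CROSS APPLY tgt.TargetData.nodes
        ('RingBufferTarget/event[@name="xml_deadlock_report"]') AS x(XEvent);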

Comments closed

Building a QR Code Clock

Tomaz Kastrun checks what time it is:

Ever wanted to have a clock on the wall or in the office that is not binary, but is a QR code clock? Well, now you can have it.

This useless R function generates a new QR code for every given period and tells the time.

Click through for the code. I could see this being useful in scenarios where you want to avoid people copying the QR code, so you embed the time in there. Then, your reader service can check to see if the time is within some valid boundary, returning an error if not.

Comments closed

The Power of One Data Point

I have a new video:

In this video, I demonstrate how much information we can gain from one sample of a distribution.

Some aspect of this is “that’s a neat parlor trick” but it does speak to the marginal information gain of a small amount of data.

Comments closed