Press "Enter" to skip to content

Day: September 11, 2025

Challenges of High-Dimensional Optimization

John Mount lays out a demonstration:

My experience is that common objective functions tend to be structured and full of coincidences and symmetries. And because they have these structures they are hard to optimize.

Let’s work up what I claim to be a fairly typical optimization problem that arises from planning or scheduling. I’ll call it the train arrival schedule problem.

Click through for the article, which includes demonstration code.

Leave a Comment

Comparing the ROC Curve to a Precision-Recall Curve

Ivan Palomares Carrascosa looks at two ways to plot classification model trade-offs:

When building machine learning models to classify imbalanced data — i.e. datasets where the presence of one class (like spam email for example) is much less frequent than the presence of the other class (non-spam email, for instance) — certain traditional metrics like accuracy or even the ROC AUC (Receiving Operating Characteristic curve and the area under it) may not reflect the model performance in realistic terms, giving overly optimistic estimates due to the dominance of the so-called negative class.

Precision-recall curves (or PR curves for short), on the other hand, are designed to focus specifically on the positive, typically rarer class, which is a much more informative measure for skewed datasets due to class imbalance.

Read on to see how these two curves can diverge and when you might trust one over the other. Ivan’s post does rely on the idea of the positive class being the smaller one and the dataset being markedly unbalanced

Leave a Comment

Accelerated Database Recovery and NOLOCK

Brent Ozar has a new episode of Customers Say the Darndest Things:

I have never seen a T-SQL feature that people love as much as NOLOCK.

I keep thinking I’ve written enough blog posts about it, but a client came up with a new one:

We use SQL Server 2022’s Accelerated Database Recovery, which keeps copies of versions inside the table. Plus, we don’t use transactions – our inserts, updates, and deletes are done one table at a time, and your demos always have transactions in them, so we’re not affected.

That’s not how this works. That’s not how any of this works.

It might be because it’s early in the morning when I type this out, but I’m having a hard time even conceptualizing what the customer could be thinking here. Brent does lay it out in the comments below, however, and yeah, I’d still raise an eyebrow if someone said that out loud to me.

But the moral of the story is, find someone who loves you like T-SQL developers love NOLOCK.

Leave a Comment

Substring Search with Regular Expressions in SQL Server

Louis Davidson continues a series on regular expressions:

The REGEXP_SUBSTR function extracts parts of a string based on a regular expression pattern. It has some similarieties with the SUBSTRING function, but with some important (and interesting) differences. This function returns Nth occurrence of a substring that matches the regex pattern.

Read on to see how it compares to the traditional SUBSTRING() function.

Leave a Comment

Sparse Columns and Space Utilization

Steve Jones gins up a demo:

I saw this as a question submitted at SQL Server Central, and wasn’t sure it was correct, but when I checked, I was surprised. If you choose to designate columns as sparse, but you have a lot of data, you can use more space.

This post looks at how things are stored and the impact if much of your data isn’t null.

I consider sparse columns a relic of the mid-aughts era, when storage was a lot more expensive and compression was an Enterprise Edition-only feature. Given that you can use page compression in any edition of SQL Server nowadays, I don’t think there’s a viable reason ever to have a sparse column.

Also, definitely check out the comments, where Jeff Moden has a great one.

Leave a Comment

K-Means Clustering in SQL Server

Sebastiao Pereira implements k-means clustering in T-SQL:

K-means clustering is an unsupervised machine learning algorithm used to group data into k distinct clusters based on their similarity, allowing for customer segmentation, anomaly detection, trend analysis, etc. The most common machine learning tutorials focus on Python or R. Normally, data is stored in SQL Server, and it is necessary to move data out of the database to apply clustering algorithms and then, if necessary, to update the original data with the cluster numbers. Is it possible to do it directly in SQL Server?

Given the work you have to do to implement this, I can’t imagine that it would be particularly fast. But it is neat to see that it’s possible.

Leave a Comment