Press "Enter" to skip to content

Category: T-SQL

Substring Search with Regular Expressions in SQL Server

Louis Davidson continues a series on regular expressions:

The REGEXP_SUBSTR function extracts parts of a string based on a regular expression pattern. It has some similarities with the SUBSTRING function, but with some important (and interesting) differences. This function returns the Nth occurrence of a substring that matches the regex pattern.

Read on to see how it compares to the traditional SUBSTRING() function.
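
To get a feel for the difference, here is a minimal sketch, assuming SQL Server 2025's REGEXP_SUBSTR(string, pattern, start, occurrence) parameter ordering; the sample string is made up:

DECLARE @s varchar(100) = 'Order 123 shipped in 4 boxes on day 17';

SELECT
	SUBSTRING(@s, 7, 3) AS via_substring,     -- requires knowing the exact position and length
	REGEXP_SUBSTR(@s, '[0-9]+', 1, 2) AS second_number; -- second match of the pattern: '4'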

K-Means Clustering in SQL Server

Sebastiao Pereira implements k-means clustering in T-SQL:

K-means clustering is an unsupervised machine learning algorithm used to group data into k distinct clusters based on their similarity, allowing for customer segmentation, anomaly detection, trend analysis, etc. The most common machine learning tutorials focus on Python or R. Normally, data is stored in SQL Server, and it is necessary to move data out of the database to apply clustering algorithms and then, if necessary, to update the original data with the cluster numbers. Is it possible to do it directly in SQL Server?

Given the work you have to do to implement this, I can’t imagine that it would be particularly fast. But it is neat to see that it’s possible.
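
As a rough illustration of where the work goes, the core assignment step (finding each point's nearest centroid) might look something like this sketch, assuming hypothetical Points(point_id, x, y) and Centroids(cluster_id, cx, cy) tables; a real implementation would loop over assignment and centroid-recalculation steps until the assignments stabilize:

SELECT
	p.point_id,
	ca.cluster_id
FROM Points p
CROSS APPLY
(
	SELECT TOP (1) c.cluster_id
	FROM Centroids c
	ORDER BY SQUARE(p.x - c.cx) + SQUARE(p.y - c.cy) -- squared Euclidean distance
) ca;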

Sparse Columns and Space Utilization

Steve Jones gins up a demo:

I saw this as a question submitted at SQL Server Central, and wasn’t sure it was correct, but when I checked, I was surprised. If you choose to designate columns as sparse but much of your data isn’t null, you can actually use more space.

This post looks at how things are stored and the impact if much of your data isn’t null.

I consider sparse columns a relic of the mid-aughts era, when storage was a lot more expensive and compression was an Enterprise Edition-only feature. Given that you can use page compression in any edition of SQL Server nowadays, I don’t think there’s a viable reason ever to have a sparse column.
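
For a quick contrast, here is a sketch with made-up table names. Note that the two options are mutually exclusive: a table with sparse columns cannot have row or page compression applied, so it really is one or the other.

CREATE TABLE dbo.SparseDemo
(
	Id int NOT NULL,
	RarelyPopulated int SPARSE NULL -- free when NULL, but extra overhead per non-NULL value
);

CREATE TABLE dbo.CompressedDemo
(
	Id int NOT NULL,
	RarelyPopulated int NULL
);

-- Page compression has been available in every edition since SQL Server 2016 SP1
ALTER TABLE dbo.CompressedDemo REBUILD WITH (DATA_COMPRESSION = PAGE);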

Also, definitely check out the comments, where Jeff Moden has a great one.

Batching Large Data Operations via Key Ranges

Andy Brownsword updates or deletes a batch of rows:

Effective batching in general helps us by:

  • Reducing transaction length and minimising blocking
  • Avoiding unnecessary checking of the same rows repeatedly
  • Introducing graceful pacing to reduce impact on busy environments or data replication

I’m not the biggest fan of the OFFSET/FETCH combination there, at least if your key column is fairly well packed—like, say, 99+% of the rows are contiguous and you occasionally have a jump of a few thousand rows. Also, that batch size of 100K might be a little high, although that will certainly depend on what the operation is. Batch updating a column based on some fairly straightforward calculation? You can probably get away with 100K, though I’d still prefer 10K. But as you add more complexities (deleting rows, very high server throughput, triggers, limited hardware, etc.), that number should edge downward.
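
For reference, one way to batch by key range might look like this sketch (not necessarily Andy's exact approach), with a hypothetical dbo.BigTable, a clustered integer key Id, and a made-up filter; the batch size is illustrative:

DECLARE @BatchSize int = 10000;
DECLARE @MinId bigint, @MaxId bigint;

SELECT @MinId = MIN(Id), @MaxId = MAX(Id)
FROM dbo.BigTable;

WHILE @MinId <= @MaxId
BEGIN
	DELETE FROM dbo.BigTable
	WHERE Id >= @MinId
		AND Id < @MinId + @BatchSize
		AND SomeFilter = 1; -- hypothetical predicate

	SET @MinId += @BatchSize; -- walk the key range instead of re-scanning with OFFSET/FETCH
END;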

Splitting to a Table via Regular Expression

Louis Davidson creates a table:

Continuing on with the REGEXP_ functions series, the next one I want to cover is the table-valued function REGEXP_SPLIT_TO_TABLE. This function is definitely one of the ones you probably ought to know, especially if you are ever tasked with pulling some data out of a data structure.

This function is a lot like the STRING_SPLIT function, and unlike the REGEXP_LIKE function, you can basically use the same main parameters as in STRING_SPLIT for simple cases. From there, though, the possibilities are nearly endless, because you can define almost any delimiters you want. It isn’t perfect, for a few reasons, but we will discuss that more later on.

Read on to see how it works, including one major caveat.
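
As a minimal sketch of the appeal over STRING_SPLIT (which only accepts a single-character delimiter), assuming SQL Server 2025 and that the function returns value and ordinal columns:

SELECT value, ordinal
FROM REGEXP_SPLIT_TO_TABLE('alpha, beta;gamma  delta', '[,;\s]+');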

Date Intervals in PostgreSQL Window Functions

Hubert Lubaczewski solves a problem:

Since I can’t copy and paste the text, I’ll try to write what I remember:

Given a table sessions, with columns user_id, login_time, and country_id, list all cases where a single account logged into the system from more than one country within a 2-hour time frame.

The idea behind it is that it would be a tool to find hacked accounts, based on the idea that you generally can’t change country within 2 hours. Which is somewhat true.

The solution in the blog post suggested joining the sessions table with itself, using some inequality condition. I think we can do better…

Click through for a solution that works for PostgreSQL but not SQL Server because the latter doesn’t offer date and time intervals on window function frames.
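
For illustration, the kind of interval-bounded frame PostgreSQL allows looks roughly like this sketch (not Hubert’s actual solution, just the frame syntax, counting a user’s logins over the prior two hours):

SELECT user_id, login_time, country_id,
	COUNT(*) OVER (
		PARTITION BY user_id
		ORDER BY login_time
		RANGE BETWEEN INTERVAL '2 hours' PRECEDING AND CURRENT ROW
	) AS logins_in_prior_2h
FROM sessions;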

To do this in SQL Server, I’d probably use LAG() and get the prior value of country ID and the prior login time. Something like the following query, though I didn’t run detailed performance checks.

WITH records AS
(
	SELECT
		s.user_id,
		s.login_time,
		s.country_id,
		-- Prior login time and country per user, in login order
		LAG(s.login_time) OVER (PARTITION BY s.user_id ORDER BY s.login_time) AS prior_login_time,
		LAG(s.country_id) OVER (PARTITION BY s.user_id ORDER BY s.login_time) AS prior_country_id
	FROM sessions s
)
SELECT *
FROM records r
WHERE
	r.prior_country_id <> r.country_id
	-- DATEDIFF(HOUR) only counts hour-boundary crossings, so compare minutes for a true 2-hour window
	AND DATEDIFF(MINUTE, r.prior_login_time, r.login_time) <= 120;

Replacing Text in SQL Server 2025 via Regular Expression

Louis Davidson continues a series on regular expressions in SQL Server 2025:

Okay, we have gone through as much of the RegEx filtering as I think is a part of the SQL Server 2025 implementation. Now it is time to focus on the functions that are not REGEXP_LIKE. We have already talked about REGEXP_MATCHES, which will come in handy for the rest of the series.

I will start with REGEXP_REPLACE, which is like the typical SQL REPLACE function. But instead of replacing based on a static delimiter, it can be used to replace multiple values (or a specific one) that match the RegEx pattern. All of my examples for this entry will simply use a variable with a value we are working on, so there is no need to create or load any objects.

Read on to see how it works, including plenty of examples.
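
As a minimal sketch of the idea, assuming SQL Server 2025's REGEXP_REPLACE(string, pattern, replacement) form, which replaces every match by default; the sample value is made up:

DECLARE @v varchar(100) = 'Card  4111 1111   1111 1111';

SELECT
	REGEXP_REPLACE(@v, '[0-9]', 'X') AS digits_masked,      -- every digit becomes X
	REGEXP_REPLACE(@v, '\s+', ' ') AS whitespace_collapsed; -- runs of whitespace become one space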

Migrating Azure Data Studio SQL Notebooks to VS Code Polyglot Notebooks

Haroon Ashraf gives us a somewhat unwieldy process:

As a SQL/BI developer, I want to run and store my SQL scripts and documentation efficiently in a Notebook as an alternative to using Azure Data Studio SQL Notebooks, since Azure Data Studio is retiring soon. Read on to learn more about Visual Studio Code Polyglot Notebooks.

I liked the simplicity of having a SQL kernel in Azure Data Studio. Haroon shows how to work around it and get to roughly the same spot, but I do hope the SQL Server tools team is able to migrate that SQL kernel over to VS Code prior to Azure Data Studio’s ultimate demise.
