Category: Syntax

Set Operations in T-SQL

Published 2025-01-13 by Kevin Feasel

Erik Darling has a pair of videos covering 3 1/2 set operations. First up is UNION and UNION ALL. These are the set operations that most people know about and use fairly regularly, and Erik explains the difference between the two, including the performance difference between the two.

Then, Erik hits upon the two lesser-known set operations: INTERSECT and EXCEPT. These are extremely useful in certain circumstances, and tend to perform much better than other alternatives. For example, to figure out if two datasets are exactly the same, it’s really hard to go wrong with the following pair of queries:

SELECT a.* FROM a EXCEPT SELECT b.* FROM b;
SELECT b.* FROM b EXCEPT SELECT a.* FROM a;

You need both queries because the former tells you if there are any records in A that do not exist in B, either because the record simply is not there or because there is a difference in one or more values in B’s matching record. But then, you also have to check the opposite side, because there might be a record in B that does not exist in A and the first query will not expose it.

This is typically the way I’d write test cases, ensuring that both queries return 0 results. Granted, you could always just check that the count of the intersection equals the count of records:

DECLARE @c1 INT, @c2 INT;
SELECT @c1 = SELECT COUNT(*) FROM (SELECT a.* FROM a INTERSECT SELECT b.* FROM b);
SELECT @c2 = SELECT COUNT(*) FROM a;

In this case, @c1 and @c2 should be the same number.

1 Comment

Window Function Ranges: UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING

Published 2025-01-09 by Kevin Feasel

Chad Callihan engages the limit breaker:

I’m familiar with using the OVER clause and don’t think it’s too uncommon to see it used for including row numbers by using ROW_NUMBER() and aggregating data. But even though they’ve been around since SQL Server 2012, I’m not too familiar with using the OVER clause with the UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING to affect the window being queried.

Let’s take a look at a couple of examples using UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.

Click through for those examples. The default ranges for window functions usually make a lot of sense, but it’s good to understand your options for frames: ROWS vs RANGE, as well as the frame values (UNBOUNDED PRECEDING, {N} PRECEDING, CURRENT ROW, {N} FOLLOWING, and UNBOUNDED FOLLOWING).

Comments closed

Calculating Inter-Quartile Range and Z Score in T-SQL

Published 2025-01-07 by Kevin Feasel

Sebastiao Pereira hunts for outliers:

Outliers can significantly distort statistical analysis and lead to incorrect conclusions when interpreting data. In this article, we will look at how to find outliers in SQL Server using various T-SQL queries. Understanding how to find outliers in SQL is crucial for accurate data analysis.

Sebastiao uses PERCENTILE_CONT() in this demonstration. That works fine for relatively small tables, though it does not scale well at all. Once you’re in the millions of records, it gets slow. From there, my joke is that, if you have 100 million or more records, you can start a query with PERCENTILE_CONT() on one instance. Meanwhile, on a separate instance, as soon as you kick off that query, go install SQL Server ML Services, configure it, check out a tutorial on R or Python, figure out how you can calculate the inter-quartile range in that language, learn how ML Services works, and you’ll still get the answer before your first query finishes.

If you’re using SQL Server 2022, there is a new APPROX_PERCENTILE_CONT() that is orders of magnitude faster as you get increasingly large datasets. It’s also accurate to within 1.33% (on each side of the correct answer) within a 99% confidence. The way the query works is a bit different, though, because the approximation is a nested set function using a WITHIN GROUP() clause, whereas PERCENTILE_CONT() is a window function that uses an OVER() clause. That means it’s not quite as easy as slapping “APPROX_” to the start of the query, but because Sebastiao uses WITHIN GROUP in the T-SQL, it’s pretty close: PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [ObsValue]) OVER() AS Q1 becomes APPROX_PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY [ObsValue]) AS Q1 or something like that–I’m compiling in the browser here.

Comments closed

Using the APPLY Operator

Published 2025-01-06 by Kevin Feasel

Erik Darling knows how to get my interest.

No text for me to copy, but Erik goes into detail on good use cases of the APPLY operator. Erik hits on several of the use cases that I like to use APPLY for and does a good job demonstrating its value.

Comments closed

Azure SQL DB String Concatenation and JSON Functions

Published 2024-12-16 by Kevin Feasel

Magda Bronowska takes a look at some functionality currently available only in Azure SQL Database and Managed Instance:

Microsoft releases the classic SQL Server every couple of years, with some functionality added through regular updates. On the other hand, the SQL Server offering in Azure (Azure SQL Database and Managed Instance) receives the latest features earlier.

This post highlights some of the T-SQL functions currently available in Azure SQL but not yet in classic SQL Server. However, with the recent announcement of SQL Server 2025, this might change next year. Keep in mind that some of these functions are in preview, so their behavior might evolve as they reach general availability.

Click through for those examples.

Comments closed

Alternatives to Long IN() Lists in SQL Server

Published 2024-12-05 by Kevin Feasel

David Fowler still has a list:

This post comes off the back of my last, where I looked at issues caused by explicitly declaring a large number of values in an IN clause. The query processor ran out of internal resources and could not produce a query plan – When You’ve Managed to Confuse SQL With a Crazy IN Clause.

My suggestion was to put the values into a temp table and carry out a semi-join.

That got me thinking, which of the two methods would perform better.

I’m going to look at the two methods for different amounts of values and have a look at how they get on in terms of reads and total time.

Read on for the answer. If I’m surprised by anything in it, it’s that the threshold where temp tables out-perform the IN() clause is so high.

Comments closed

Against ODBC Functions

Published 2024-12-05 by Kevin Feasel

Andy Brownsword stakes a claim:

It’s a function, but not quite as we know it. They’re surrounded by { curly braces } and prefixed with fn. The kind of thing I’d expect to see in PowerShell.

It turns out they were introduced in ODBC versions 1-3, with the latter being based on the SQL-92 specification. These were introduced in 1995 and 1992 respectively, which likely explains why I don’t see them used.

I’m thinking hard and don’t believe I’ve ever seen anyone use these in code I’ve maintained. Thus, I agree with Andy: you probably don’t need them.

Comments closed

Avoid WHERE COUNT(*) = 0

Published 2024-12-02 by Kevin Feasel

Hubert Lubaczewski makes a comparison:

Every now and then I see something like this:
SELECT u.* FROM users u
WHERE 0 = (SELECT COUNT(*) FROM addresses a WHERE a.user_id = u.id);
and it kinda pains me. So figured, I’ll write about what is the problem with it, and how to avoid such constructs.

Read on to understand what the problem is and how you can fix it.

6 Comments

GROUP BY Clause Order and Performance in PostgreSQL

Published 2024-11-26 by Kevin Feasel

Andrei Lepikhov runs some tests:

PostgreSQL users often employ analytical queries that sort and group data by different rules. Optimising these operators can significantly reduce the time and cost of query execution. In this post, I will discuss one such optimisation: choosing the order of columns in the GROUP BY expression.

Postgres can already reshuffle the list of grouped expressions according to the ORDER BY condition to eliminate additional sorting and save computing resources. We went further and implemented an additional strategy of group-by-clause list permutation in a series of patches (the first attempt and the second one) for discussion with the Postgres community, expecting it to be included in the next version of PostgreSQL core. You can also try it in action in the commercial Postgres Pro Enterprise fork.

From a “this is a 4th generation language” perspective, there should never be a performance difference in the ordering of grouped columns in a GROUP BY clause. The database optimizer should always determine the optimal ordering. Read Andrei’s article to see how much of a challenge that can be in practice.

Comments closed

New Syntax in SQL Server 2022

Published 2024-11-25 by Kevin Feasel

Steve Jones talks syntax:

At the recent 2024 PASS Data Community Summit, I saw a lightning talk from Mikey Bronowski on the New T-SQL Functions in SQL Server 2022. Before the talk, I made a joke with him that none of these were new because it’s 2024. They’d been out for 2 years.

Mikey did a nice job, given that he only had 10 minutes, but it was enough to give attendees an idea of some new things they might investigate to use in their own code. If you missed the talk, or you don’t have access to the recordings, we have a series on SQL Server Central that covers these (Part 1, Part 2, Part 3) and helps you understand the new options. The list of language changes is also in the MS Docs under What’s New in the Language.

Steve asks about how much utilization there is of these. I think the problem is that so few organizations have adopted SQL Server 2022 that knowledge that these things even exist is quite low.

By the way, I do have a script covering many of these new bits of syntax as part of a talk on getting up to speed with core T-SQL changes since SQL Server 2012.

Comments closed