SQL Server 2008 is reaching end of support this year, so upgrading your SQL Server might be on your mind. One of the big changes when you upgrade your SQL Servers is upgrading the compatibility level, which by default will upgrade the cardinality estimator (CE).
This can change query performance, for better or for worse. This post won’t focus on whether it’s good or bad, but instead I want to show you how you can check to see what CE was used by your queries.
It’s not a 100% guarantee, but generally I’ve found the new estimator to be superior.
Sometimes, the optimizer can take a query with a complex where clause, and turn it into two queries.
This only happens up to a certain point in complexity, and only if you have really specific indexes to allow these kinds of plan choices.
This is a case where I generally don’t trust the optimizer to get it right. Even when it does, I’d be concerned that this won’t be a stable solution and a minor change somewhere could result in regression to a bad plan.
I’m going to talk about my favorite example, because it can cause a lot of confusion, and can hide a lot of the work it’s doing behind what appears to be a friendly little operator.
Something to keep in mind is that I’m looking at the actual plans. If you’re looking at estimated/cached plans, the information you get back may be inaccurate, or may only be accurate for the cached version of the plan. A query plan reused by with parameters that require a different amount of work may have very different numbers.
I like nested loop joins a lot, but there’s a big difference between a loop running a few dozen times and a loop running a couple hundred thousand times, even if the operator doesn’t show you that immediately.
Making plan choices with IF branches like this plain doesn’t work.
The optimizer compiles a plan for both branches based on the initial compile value.
What you end up with is a stored proc that doesn’t do what it’s supposed to do, and parameter sniffing times two.
Read on to see an example of this. If you really, really want to use an IF block, you could separate the components out into individual stored procedures and call those stored procedures independently.
When building indexes for your queries, the order of your index key columns matters. SQL Server can make the most effective use of an index if the data in that index is stored in the same order as what your query requires for a join, where predicate, grouping, or order by clause.
But if your query requires multiple key columns because of multiple predicates (eg. WHERE Color = ‘Red’ AND Size= ‘Medium’), what order should you define the columns in your index key column definition?
One of my favorite books for query tuning is a bit long in the tooth at this point but remains quite relevant, and a key point there is to look for ways to drop the largest percent of rows as soon as possible. This applies for good indexes as well: they’ll let you ignore as large a percentage of your irrelevant data as you can, as soon as possible.
When I do performance tuning for clients, I really pride myself on making as few changes as possible in order to make a tremendous difference. I consider it a failure when I have to tell someone to rewrite a bunch of queries from scratch.
However, there are some cases where I just can’t work around a perfect storm of anti-patterns. To understand why, let’s take an example. I’ve kept the general idea the same, but I’ve rewritten the entire example in the Stack Overflow database to protect the innocent.
I’ve seen cases similar to what Brent has. Developers understand encapsulation and minimizing code repetition, so they naturally want to do that with SQL, but the optimizer eventually gives up and picks a terrible plan. DRY is great for application code and normalization, but unfortunately, it’s not always great for T-SQL.
In the original plan, the TOP asked for rows, and quickly got them.
In the second plan, the TOP kept asking for rows, getting them from the Votes table, and then losing them on the join to Posts.
There was no parameter sniffing, there were no out of date stats, no blocking, or any other oddities. It’s just plain bad luck because of the data’s relationship.
Read the whole thing.
Next query, for median calculation was a window function query.
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c1)
OVER (PARTITION BY (SELECT 1)) AS MedianCont
To my surprise, the performance was even worse, and at this time, I have to say, I was running this on SQL Server 2017 with CU7. But luckily, I had a SQL Server 2019 CTP 2.0 also installed and here, with no further optimization the query ran little over 1 second.
I’ve warned people away from this function for all but tiny data sets because of how poorly it performs. With SQL Server 2019, I might be able to recommend it.
That’s a lot of desired memory, 1,493,120 KB aka 1.4 GB, but there was only
25 MB used in the sort. So why was SQL Server so far off the right estimate? It’s the data types we picked.
That’s a lot of memory for a fairly simple query returning 300,000 rows, each containing a string and an int.
Hash Match joins are the dependable workhorses of physical join operators.
While Nested Loops joins will fail if the data is too large to fit into memory, and Merge Joins require that the input data are sorted, a Hash Match will join any two data inputs you throw at it (as long as the join has an equality predicate and you have enough space in tempdb).
Bert has some great animated GIFs too.