It could have chosen a Hash Join, but then the order of the Id column from the Posts table wouldn’t have been preserved on the other side.
Merge Joins are order preserving, Hash Joins aren’t. If we use a Hash Join, we’re looking at ordering the results of the join after it’s done.
Read on to learn why, as well as why a few other things are so.
SQL Server Spool operators are a mixed bag. On one hand, they can negatively impact performance when writing data to disk in tempdb. On the other hand, they allow filtered and transformed result sets to be temporarily staged, making it easier for that data to be reused again during that query execution.
The problem with the latter scenario is that SQL Server doesn’t always decide to use a spool; often it’s happy to re-read (and re-process) the same data repeatedly. When this happens, one option you have is to explicitly create your own temporary staging table that will help SQL Server cache data it needs to reuse.
The other problem with spooling is that the spool doesn’t have indexes and so performance can be awful. When I look at an execution plan, one of my immediate red flags is spooling: if we have that, removing it is one of the first candidates for optimization after the trivial stuff (expected scan/seek behavior, “fat pipes” from excessive row counts, residual I/O, etc.).
Let’s also say that bad query is taking part in a modification.
SET u2.Reputation *= 2
FROM Users AS u
JOIN dbo.Users AS u2
ON CHARINDEX(u.DisplayName, u2.DisplayName) > 0
WHERE u2.Reputation >= 100000;
AND u.Id <> u2.Id;
This query will run for so long that we’ll get sick of waiting for it. It’s really holding up writing this blog post.
Erik rewrites this query a couple of times. Click through to learn what he does and why he does it.
When inserting rows using
INSERT...SELECTinto a heap with no nonclustered indexes, the documentation universally states that such inserts will be minimally logged as long as a
TABLOCKhint is present. This is reflected in the summary tables included in the Data Loading Performance Guide and the Tiger Team post. The summary rows for heap tables without indexes are the same in both documents (no changes for SQL Server 2016):
But it’s not quite that straightforward, as Paul shows. Read the whole thing.
Even though this query reads the whole clustered index to get the
Benefactorrows, the total number of logical reads is still smaller than the seek/key lookup pattern seen in the combined query with IN(). This UNION ALL version gives SQL Server the ability to build a hybrid execution plan, combining two different techniques to generate a plan with fewer overall reads.
Click through for the example.
Linked Servers offer a great way to connect two SQL Servers together, allowing remote querying and DML operations. Frequently, this is used to copy data from production to reporting. However, the temptation is to run the copy operation on the production, or source side. If you do that, even with a single
INSERT INTOstatement, SQL Server will process each individual row as a discrete
INSERT INTOstatement via a cursor operation. This makes for very slow inserts across a linked server. Running the operation from the destination server means SQL Server can simply query the remote source table for all the rows, inserting them as a set into the destination table. The difference in speed can be eye-watering.
Click through for a slightly creepy picture and a less creepy example.
Locking hints can be really handy in these situations, especially the READPAST hint. The documentation for it says that it allows you to skip over row level locks (that means you can’t skip over page or object level locks).
What it leaves out is that your READPAST query may also need to try to take row level shared locks.
Read on for an example as well as an alternative which ends up being better in this case.
If you’ve been query tuning for a while, you probably know about SARGability, and that wrapping columns in functions is generally a bad idea.
But just like there are slightly different rules for CAST and CONVERT with dates, the repercussions of the function also vary.
Read the whole thing. Maybe “go to brunch” in the middle of it for maximum effect.
If you’ve used
FIRST_VALUEbefore, this query should be easy to interpret: for each badge
Name, return the first
Date(earliest date to receive the badge) and
UserId(pick the lowest UserId when there are ties on
This query was easy to write and is simple to understand. However, the performance is not great: it takes 46 seconds to finish returning results on my machine.
Bert’s response is to rewrite the query using a correlated subquery. My first shot would look at using
APPLY though needing to aggregate the “parent” could lead to an awful result there if the join happened before aggregation.
The moral of the story here is to know different ways to write a query, as you can nudge the optimizer to better (or worse) behavior.
Somewhere along the way in your career, you were told that:
– Index seeks are quick, lightweight operations
– Table scans are ugly, slow operations
And ever since, you’ve kept an eye on your execution plans looking for those performance-intensive clustered index scans. When you see ’em, you go root ’em out, believing you’ve got a performance problem.
Thing is, … they lied to you. Seeks aren’t necessarily good, nor are scans necessarily bad. To straighten you out, we’re going to walk through a series of demos.
The rule of thumb I like to use is: if you need to go through more than 20% of the data, you’re generally better off scanning. If you need to go through less than 0.5% of the data, you’re generally better off seeking. Everything in between is the “it depends” zone.