A probe residual is important because they can indicate key performance problems that might not otherwise be brought to your attention.
What is a probe residual?
Simply put, a probe residual is an extra operation that must be performed to compete the matching process. Extra being left over things to do.
Click through for an example brought about by implicit conversion.
To bust a myth, if you have good reasons to think that a differently written, but semantically equivalent query might be faster (on your database), you should measure. Don’t even trust any execution plan, because ultimately, what really counts is the wall clock time in your production system.
If you can measure your queries in production, that’s perfect. But often, you cannot – but you don’t always have to. One way to compare two queries with each other is to benchmark them by executing each query hundreds or even thousands of times in a row.
Lukas goes on to compare a left join to a correlated subquery in three separate database products and the results are very interesting although absolutely not comparable across products because the DeWitt Clause is still a thing. Great read.
So, for all you NOLOCKers out there, you can now save yourselves oodles of time by only using the hint in outer references to your CTEs and Views.
Congratulations, I suppose.
(Please stop using NOLOCK.)
Agreed, whenever possible.
You may see that in the first plan, there are two clustered index scans of the table SalesOrderDetail, however the subquery is exactly the same “exists (select * from Sales.SalesOrderDetail f where f.SalesOrderID = d.SalesOrderID)” but referenced twice.
In the second case, compiled under next compatibility level, the double reference of the subquery is collapsed and we see only one reference to the SalesOrderDetails table and more efficient plan, despite the query still has two subqueries with SalesOrderDetails.
In the third case, also compiled under vNext level, we see the second branch with the SalesOrderDetail again, but that is because we turned off the rule CollapseIdenticalScalarSubquery with an undocumented hint queryruleoff (which I originally described in my blog post).
I think Dmitry has the expected use case nailed: ORMs. But I can see people writing (well, copy-pasting) similar queries, so maybe it’ll be useful in more contexts as well.
The table has 10,000,000 rows. I’ve create a non-clustered columnstore index on the table, which I’ll talk about in a future post. I’ve included it here because it provides a succinct difference in the two plans.
To compare the plans visually, side-by-side, you need to save the first plan by right-clicking on the plan window, clicking “Save Execution Plan As…”, and specifying a filename. Next, right-click on the plan window, and choose “Compare Showplan”:
I’ve only used this once or twice, but it is an interesting feature.
Ultimately I think any thought of the readable secondary having a vastly different plan was a red herrings. Statistics are going to be the same on both instances, and if there were a missing statistic on the secondary, SQL Server would create it in TempDB. Anyway, columnstore indexes don’t use statistics in the traditional sense.
Fortunately I was able to catch a query in the process of waiting on HTDELETE, so I no longer had to look for the needle in the haystack, and I could get to tuning the plans. I was able to grab the SELECT part of the query and generate an estimated plan on both the primary and secondary nodes. The plans were virtually the same on both nodes, with just a minor difference in memory grant between them.
Click through for the solution.
Our query memory grants range from around 8 MB to around 560 MB. This isn’t even ordering BY the larger columns, this is just doing the work to sort results by them. Even if you’re a smarty pants, and you don’t use unnecessary ORDER BY clauses in your queries, SQL may inject them into your query plans to support operations that require sorted data. Things like stream aggregates, merge joins, and occasionally key lookups may still be considered a ‘cheaper’ option by the optimizer, even with a sort in the plan.
Of course, in our query plans, we have warnings on the last two queries, which had to order the VARCHAR(8000) column.
This shows just how much difference a simple column size can make.
Once the pattern is down, the use is pretty straightforward. There’s also more options accessible to you. If we just look at the RunTimeCountersPerThread node, we can compare other values such as Rows, Scans, and CPU time. We could really get crazy and extract all the different statements within the batch. There are numerous possibilities for analysis and review.
I’m not here to tell you that you should start using PowerShell to automate query tuning. Query performance is an art form and requires a lot of case-by-case analysis. However, like any great carpenter, it’s good to know the capabilities of your tool set. Understanding the options available to you not only helps you be more effective, but can also provide answers you may not have had access to.
It’s another tool for the belt.
So yeah, that function seems to get up to something once for every ID you pass in. Remember that in our STUFF… query, we grabbed the TOP 10 each time. In the XE session, each time we call the proc, the string splitting function compiles and executes code 10 times. Bummerino. That’s the life of a loop.
On SQL Server 2016 (and really, with any non-looping code), we can get around the constant compilations with a simple rewrite. In this case, I’m calling 2016’s STRING_SPLIT function instead of the MSTVF function.
Read on for more.
If we remember, for the CE 120 it was a one row estimate, and in this case server decided, that it is cheaper to use a non-clustered index and then make a lookup into clustered. Not very effective if we remember that our predicate returns all rows.
In CE 130 there was a 365 rows estimate, which is too expensive for key lookup and server decided to make a clustered index scan.
But, wait, what we see is that in the second plan the estimate is also 1 row!
That fact seemed to me very curious and that’s why I’m writing this post. To find the answer, let’s look in more deep details at how the optimization process goes.
This was an interesting look at how the optimizer looks at scalar user-defined functions.