Ultimately I think any thought of the readable secondary having a vastly different plan was a red herrings. Statistics are going to be the same on both instances, and if there were a missing statistic on the secondary, SQL Server would create it in TempDB. Anyway, columnstore indexes don’t use statistics in the traditional sense.
Fortunately I was able to catch a query in the process of waiting on HTDELETE, so I no longer had to look for the needle in the haystack, and I could get to tuning the plans. I was able to grab the SELECT part of the query and generate an estimated plan on both the primary and secondary nodes. The plans were virtually the same on both nodes, with just a minor difference in memory grant between them.
Click through for the solution.
Our query memory grants range from around 8 MB to around 560 MB. This isn’t even ordering BY the larger columns, this is just doing the work to sort results by them. Even if you’re a smarty pants, and you don’t use unnecessary ORDER BY clauses in your queries, SQL may inject them into your query plans to support operations that require sorted data. Things like stream aggregates, merge joins, and occasionally key lookups may still be considered a ‘cheaper’ option by the optimizer, even with a sort in the plan.
Of course, in our query plans, we have warnings on the last two queries, which had to order the VARCHAR(8000) column.
This shows just how much difference a simple column size can make.
Once the pattern is down, the use is pretty straightforward. There’s also more options accessible to you. If we just look at the RunTimeCountersPerThread node, we can compare other values such as Rows, Scans, and CPU time. We could really get crazy and extract all the different statements within the batch. There are numerous possibilities for analysis and review.
I’m not here to tell you that you should start using PowerShell to automate query tuning. Query performance is an art form and requires a lot of case-by-case analysis. However, like any great carpenter, it’s good to know the capabilities of your tool set. Understanding the options available to you not only helps you be more effective, but can also provide answers you may not have had access to.
It’s another tool for the belt.
So yeah, that function seems to get up to something once for every ID you pass in. Remember that in our STUFF… query, we grabbed the TOP 10 each time. In the XE session, each time we call the proc, the string splitting function compiles and executes code 10 times. Bummerino. That’s the life of a loop.
On SQL Server 2016 (and really, with any non-looping code), we can get around the constant compilations with a simple rewrite. In this case, I’m calling 2016’s STRING_SPLIT function instead of the MSTVF function.
Read on for more.
If we remember, for the CE 120 it was a one row estimate, and in this case server decided, that it is cheaper to use a non-clustered index and then make a lookup into clustered. Not very effective if we remember that our predicate returns all rows.
In CE 130 there was a 365 rows estimate, which is too expensive for key lookup and server decided to make a clustered index scan.
But, wait, what we see is that in the second plan the estimate is also 1 row!
That fact seemed to me very curious and that’s why I’m writing this post. To find the answer, let’s look in more deep details at how the optimization process goes.
This was an interesting look at how the optimizer looks at scalar user-defined functions.
There is a DMV that isn’t used a lot of the time because the information within it frequently doesn’t have a lot of bearing on solving fundamental query tuning issues such as out of date statistics, bad or missing indexes, or poorly structured T-SQL. This DMV, sys.dm_exec_plan_attributes, contains a bunch of values that are used by the optimizer to identify a plan in cache, such as object_id (if any), database_id, and compatibility level (compat_level). In addition to these clear & easy to understand attributes, there’s one more, set_options, that’s not immediately clear.
Read on for more information and a sample call.
I opened it up, and sure enough, no sign of that 7,276 value. It looks just the same as the estimated plan I just showed.
Getting plans out of the cache is where the estimated values come into their own. It’s not just that I’d prefer to not actually run potentially-expensive queries on customer databases. Querying the plan cache is one thing, but running queries to get the actuals – that’s a lot harder.
With SQL 2016 SP1 installed, thanks to that Connect item, I can now see the Estimated Number of Rows to be Read property in estimated plans, and in the plan cache. The operator tooltip shown here is taken from the cache, and I can easily see that Estimated property showing 7,276. This is shown from Management Studio because Plan Explorer doesn’t yet call out this property explicitly:
If you’re looking to use SQL Server 2016 SP1, read the whole thing; this will make query tuning without running those horribly expensive queries a bit easier.
I would like to see a hint that causes the optimiser to consider a parallel plan no matter the cost of the query. It’s possible to get this behaviour with trace flag 8649 but it’s unsupported by Microsoft and therefore unfit for production use.
I only tend to use query hints as a very last resort. It’s almost always better to allow the optimiser to make these decisions and continue to reevaluate these decisions as your data changes but sometimes they can be a get out of jail free card.
Click through for the full argument, and then hit the Connect item if you agree.
On my VM with 4 cores it takes 33 seconds to execute this query on SQL Server 2016 with Service Pack 1, while it burns almost 48 seconds of the CPU Time.
The relevant part of the execution plan can be found below, showing so many performance problems that this query is suffering, such as INNER LOOP JOIN, INDEX SPOOL, besides even worse part that is actually hidden and is identifiable only once you open the properties of any of the lower tree (left side of the LOOP JOIN), seeing that it all runs with the Row Execution Mode actually.
To show you the problem, on the left side you will find the properties of the sort iterator that is to be found in the lower (left) part of the LOOP Join that was executed around 770.000 times in the Row Execution Mode, effectively taking any chances away from this query to be executed in a fast way. One might argue that it might that it might be more effective to do the loop part in Row Mode, but given that we are sorting around 3.1 Million Rows there – for me there is no doubt that it would be faster to do it within a Batch Execution Mode. Consulting the last sort iterator in the execution plan (TOP N SORT), you will find that it is running with the help of the Batch Execution Mode, even though it is processing around 770.000 rows.
There’s some valuable information here.
As you might guess, things may not always get evenly accessed, even if you have evenly sized tempdb files. One of my queries did a select into a temp table. Although it used all four tempdb files whether or not it went parallel, there were more file_read events against the first tempdb file than against the other four.
It’s an interesting look at this specific question, but also as pedagogical technique.