We found that only at the beginning of the run, there was contention on memory grants (RESOURCE_SEMAPHORE waits), for a short period of time. After that and later into the process, we could see some latch contention on regular data pages, which we didn’t expect as each thread was supposed to insert into its own row group. You would also see this same data by querying sys.dm_exec_requests live, if you caught it within the first minute of execution, as displayed below.
This is useful in case you run into the issue, but also useful as a case study on effective troubleshooting.
Previously I dug into preemptive waits in SQLOS, and to be honest, I equated “preemptive” with “external”. For the most part the two go hand in hand after all.
To recap, a preemptive wait isn’t necessarily a wait at all. What happens is that a worker needs to run some code that can’t be trusted to play by cooperative scheduling rules. And rather than put the SQLOS scheduler (and all its sibling workers) at the mercy of that code, the worker detaches itself from the scheduler and cedes control to a sibling runnable worker.
Read the whole thing.
Erik Darling has a couple more posts on adaptive joins in SQL Server 2017. First, he wonders what happens when you add scalar functions to the mix:
See, this isn’t SARGable either (and no, SCHEMABINDING doesn’t change this). When a predicate isn’t SARGable, you take away an index seek as an access choice. You don’t see too many Nested Loops with an index scan on the other end, do you?
So there you go. It’s not the function itself that bops our Adaptive Join on the head, but the lack of SARGability.
I do have to point out that Cross Apply used to only be implemented as a Nested Loops Join. I learned that many years ago from one of the best articles written about Cross Apply by Paul White. That changed recently — it’s possible to see it implemented with a Hash Join in at least 2016. I’ve seen it crop up in Cross Apply queries without a TOP operator.
The latter results are a bit surprising.
This optimization provides a great boost with a sufficient number of rows. You can read more about its test results in the blog OPTIMIZED Nested Loops Joins, created by Craig Freedman, an optimizer developer.
However, if the actual number of rows is less than the expected one, then CPU additional costs to build this sort may hide its benefits, increase CPU consumption and reduce its performance.
Read the whole thing. I think the likelihood of using either this hint or the trace flag is near nil, but crazy things do come up.
We have three types of physical join algorithms in SQL Server: hash, nested loops and merge. Adaptive join allows SQL Server automatically choose an actual physical algorithm on the fly between the first two – hash (HM) and nested loops (NL).
NL has two join strategies – naive nested loops join (inner loop scans the whole inner table or index) and index nested loops join (index on the join column of the inner table is used to find necessary rows and then those rows are applied to the outer row, also called Nested Loops Apply). Typically, the second one performs very well if you have rather small input on the outer side and indexed rather big input on the inner side.
HM is more universal and uses hash algorithms to match rows, so no indexes are necessary. You may refer to my blog post Hash Join Execution Internals for more details.
Adaptive Join starts execution as a Hash Join. It consumes all the input of the build phase and looks at the adaptive join threshold, if the number of rows is more or equal this threshold it will continue as a hash join. However, if the number of rows is less than this threshold, it will switch to a NL.
If you want to get a better understanding of how adaptive joins works, Dmitry’s post is a great start.
The focal point of the mutex’s state – in fact one might say the mutex itself – is the single Spinlock bit within the 32-bit lock member. Anybody who finds it zero, and manages to set it to one atomically, becomes the owner.
Additionally, if you express an interest in acquiring the lock, you need to increment the WaiterCount, whether or not you managed to achieve ownership at the first try. Finally, to release the lock, atomically set the spinlock to zero and decrement the WaiterCount; if the resultant WaiterCount is nonzero, wake up all waiters.
Now one hallmark of a light-footed synchronisation object is that it is very cheap to acquire in the non-contended case, and this class checks that box. If not owned, taking ownership (the method SOS_UnfairMutexPair::AcquirePair()) requires just a handful of instructions, and no looping. The synchronisation doesn’t get in the way until it is needed.
However, if the lock is currently owned, we enter a more complicated world within the SOS_UnfairMutexPair::LongWait() method.
I love the statement that “This is not a very British class at all.” Read the whole thing.
And selecting the initial 10 rows can be demonstrated to return the 3rd column using the initial default set in step 3. (It makes no difference if any rows are added between steps 3 and 4.)
This means that there *must* be two default values stored when a new column is added: one for the set of already-existing rows that don’t have the new column and one for any new rows. Initially these two default values will be the same, but the one for new rows can change (e.g. in steps 4 and 5 above) with breaking the old rows. This works because after the new column is added (step 3 above), it’s impossible to add any more rows that *don’t* have the new column.
And this is exactly how it works. Let’s investigate!
In typical Paul Randal fashion, this is both a look at internals and an interesting explanation.
What have we here?
Of particular interest are last_sql_handle, query_hash, and query_plan_hash. It appears that we’ll finally be able to easily tie missing index requests to their queries, without doing a lot of painful XML processing. I had planned on adding something like this, but couldn’t find a good fit between 1) adding XML processing to sp_BlitzIndex, or adding more DMV queries and rather unpleasant XML processing to sp_BlitzCache. This will make implementing it far easier, assuming it works the way it looks like it will work.
Erik has three examples of interesting additions in CTP 2.0.
First of all, note that Estimated Number of Rows is 10 000 now, which is correct and equals Actual Number of Rows. Due to the correct estimate the optimizer decided that there are enough rows to benefit from a partial (local/global) aggregation and introduced a partial Hash aggregate before the join.
If you take a Profiler, enable events SP:StmtStarting, SP:StmtCompleted, SQL:StmtStarting, SQL:StmtCompleted and run the query without and with a TF, you’ll see what does it actually mean “interleaved” in terms of the execution sequence.
During the regular execution the query starts executing, then the function is executed, the query continues execution and finishes the execution. We see the following event sequence:
This is a very thorough post, but if you have multi-statement TVFs, you absolutely want to read it.
Joe Sack has a couple blog posts on adaptive query processing enhancements in SQL Server 2017 CTP 2.0. First, Batch Mode Adaptive Joins:
We have seen numerous cases where providing a specific join hint solved query performance issues for our customers. However, the drawback of adding a hint is that we remove join algorithm decisions from the optimizer for that statement. While fixing a short-term issue, the hard-coded hint may not be the optimal decision as data distributions shift over time.
Another scenario is where we do not know up front what the optimal join should be, for example, with a parameter sensitive query where a low or high number of rows may flow through the plan based on the actual parameter value.
With these scenarios in mind, the Query Processing team introduced the ability to sense a bad join choice in a plan and then dynamically switch to a better join strategy during execution.
That one’s really cool. Joe also talks about interleaved execution for multi-statement TVFs:
SQL Server has historically used a unidirectional “pipeline” for optimizing and executing queries. During optimization, the cardinality estimation process is responsible for providing row count estimates for operators in order to derive estimated costs. The estimated costs help determine which plan gets selected for use in execution. If cardinality estimates are incorrect, we will still end up using the original plan despite the poor original assumptions.
Interleaved execution changes the unidirectional boundary between the optimization and execution phases for a single-query execution and enables plans to adapt based on the revised estimates. During optimization if we encounter a candidate for interleaved execution, which for this first version will be multi-statement table valued functions (MSTVFs), we will pause optimization, execute the applicable subtree, capture accurate cardinality estimates and then resume optimization for downstream operations.
The goal here is to make Table-Valued Functions viable from a performance perspective. The team has a long way to go there, but they’re working on it. Also, Joe gives a shout out to Arun Sirpal, frequent curatee.