Category: Wait Stats

Recently a customer reported an interesting issue, while querying against recently added readable replica, SELECT statement is shown as suspended and session is shown as waiting on HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

[…]

Upon more investigation, it appeared to be waiting on with a wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

The behavior is by design as mention in the SQL Server product documentation and applicable to all version of SQL Server that supports availability group.

Read on for the explanation.

Comments closed

FCB_REPLICA_SYNC Spinlock Explanation

Published 2018-06-29 by Kevin Feasel

Paul Randal explains what the FCB_REPLICA_SYNC spinlock is and what it does:

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.

If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.

The original question was because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Great information. And a good reminder that if you are using database snapshots in SQL Server, you generally don’t want to have more than one on the same database.

Comments closed

Understanding NESTING_TRANSACTION_FULL

Published 2018-06-20 by Kevin Feasel

Joe Obbish digs into the NESTING_TRANSACTION_FULL latch, explains what it does, and shows how it might be a performance bottleneck:

We know that only one worker can get the exclusive latch for the transaction at a time. Let’s use a greatly simplified model for what each parallel worker does for this query. It reads a row, does processing for a row, and goes on to the next one. Once it has enough rows to write out a log record it tries to acquire the latch. If no one else has the latch in exclusive mode it can get the latch, update some structure in the parent transaction, release the latch, and continue reading rows. If another worker has the latch in exclusive mode then it adds itself to the FIFO wait queue for the latch subresource and suspends itself. When the resource is available the worker status changes from SUSPENDED to RUNNABLE. When it changes again from RUNNABLE to RUNNINGit acquires the latch, updates some structure in the parent transaction, releases the latch, and continues working until it either needs to suspend again or hits the end of its 4 ms quantum. When it hits the end of its 4 ms quantum it will immediately select itself to run again because there are no other runnable workers on the scheduler.

So what determines the level of contention? One important factor is the number of workers that are contending over the same subresource. For this latch and type of query (rows are pretty evenly distributed between worker threads), this is simply MAXDOP. There’s a tipping point for this query where adding more workers is simply counterproductive.

For years I’ve seen people in the community state that running queries at MAXDOPthat’s too high can be harmful. I’ve always been after simple demos that show why that can happen. The NESTING_TRANSACTION_FULL latch is an excellent example of why some queries run longer if MAXDOP is increased too far. There’s simply too much contention over a shared resource.

Read the whole thing.

Comments closed

Demonstrating Page Latch Waits

Published 2018-06-11 by Kevin Feasel

Brent Ozar shows us how to generate PAGELATCH_UP and PAGELATCH_EX waits:

That’s what it looks like when your system is under heavy contention due to a lot of queries partying in table variables: in each second, each core on this system is spending 48 seconds waiting on PAGELATCH_UP. That’s awesome.

What about temp tables instead, you ask? Same symptoms: PAGELATCH_UP.

Read the whole thing.

Comments closed

The CXCONSUMER Wait Type

Published 2018-06-07 by Kevin Feasel

Paul Randal explains the CXCONSUMER wait type:

Back in September 2016 I created a Connect item (3102145) to split the CXPACKET wait into benign and actionable waits, so that when a CXPACKET wait happens, there’s really something to investigate. Unfortunately the Connect website has been shut down by Microsoft so I can’t point you to the original request I made (I also tried in the Internet Archive but couldn’t find it).

The reason I wanted this change is that CXPACKET waits have always been registered by both producer and consumer threads for query plan operators where some threads produce data (i.e. producer threads) and some threads consume the produced data (i.e. consumer threads). However, the waits for the consumer threads are not actionable, because it’s the *producer* threads that are the cause of the consumer thread waits, and so it’s the producer thread waits that are actionable. By splitting the consumer waits out, the number of CXPACKET waits should be reduced, and those that are left should be investigated.

During the PASS Summit in 2017, my friend Pedro Lopes (b|t) on the Tiger Team announced that they’d made the change. The new wait type, CXCONSUMER, was initially added to SQL Server 2017 RTM CU3 and was added to SQL Server 2016 SP2 in April. You can read Pedro’s blog post here. Note that there was initially a bug with the implementation which was fixed in SQL Server 2017 RTM CU4.

It’s nice to see this split, as CXPACKET has been one of the most ambiguous wait types for a while, due to it having multiple potential meanings.

Comments closed

Getting Wait Info From Extended Events

Published 2018-06-05 by Kevin Feasel

Grant Fritchey shows how to get wait information for particular sessions from the system_health extended event:

On your servers, any of them that are SQL Server 2008 or newer, right now, unless you’ve performed actions to prevent this, you’re running the Extended Events system_health session. It’s just happening, currently, on all your servers. Nothing you need to do about it at all. I’ll be a lot of you never even knew it was there.

If you follow the link you can see all the various types of information being gathered by the Extended Event system_health session. I won’t detail all of it here. Let me just provide a little context around how the session works. First and foremost, similar to the error log, this session consists of four files, each 5mb in size, rolling over as they get filled. For systems with a very high degree of activity, that means the information here may only be hours old. However, for most of us, this provides days, if not weeks worth of information about the behavior of your system.

The system_health extended event misses a lot of stuff, but it’s quite useful when you don’t have a purpose-built monitoring solution in place.

Comments closed

Some Waits Just Need Ignoring

Published 2018-05-24 by Kevin Feasel

Paul Randal explains that not all SQL Server wait types are pernicious:

Wait statistics analysis is one of my favorite things to talk about because it’s so incredibly useful for performance tuning and can dramatically shorten the time it takes to zero in on the root cause of a performance problem. But you have to do it correctly. You can’t just do a SELECT * FROM sys.dm_os_wait_stats. Various people have published scripts online to aggregate and display wait statistics in an actionable way, and my script is one of the most popular (latest version is always in this post).

One question I’m often asked is why does my script have a list of wait types that it specifically filters out? The answer is that those wait types are what I call ‘benign’ – they’re usually not a problem but happen frequently enough from regular SQL Server operations that they would show up as the top waits and so would obscure the waits that you can do something about.

Read on for the rest of the story.

Comments closed

Welcome, CXCONSUMER

Published 2018-01-12 by Kevin Feasel

Erik Darling points out that CXCONSUMER is now a wait type in SQL Server:

According to Pedro’s slide, but not the ENTIRELY MISSING DOCUMENTATION, this wait is the “safe” type of parallelism wait.

It’s a good thing Pedro is a dutiful blogger, so we don’t have to pull our hair out while unfurling these mysteries.

Speaking of documentation, our new CXCONSUMER friend isn’t mentioned in Query Store Wait Stats, either.

This is a very useful addition.

Comments closed

Getting Wait Stats From The Query Store

Published 2017-11-28 by Kevin Feasel

Erin Stellato shows how to retrieve wait stats per query from Query Store:

In SQL Server 2016 a new DMV is exposed, sys.dm_exec_session_wait_stats, which provides information about waits for an existing, active session. If you know the session_id, you can track waits for a query when it starts and when it completes (snapshot the information at the beginning and end of the query and then diff the information). The challenge is that you have to know the session_id for the query, and you have to set up data capture in advance – which isn’t trivial when you’re in the midst of a high priority issue.

Wait statistics information exists in an actual execution plan starting in SQL Server 2016 SP1. Only the top 10 waits are captured, and there are limitations in terms of what this data represents. For example, CXPACKET is ignored and not included in the output, but it will be included in 2016 SP2 and 2017 CU3 and up – where irrelevant parallelism waits are instead captured by CXCONSUMER (which will not be included in actual plan waits).

So how can we see what a specific query is truly waiting on? We can use Query Store! SQL Server 2017 includes the capture of wait statistics information in Query Store, and this functionality is also available in Azure SQL Database. Wait statistics are tied to a query plan, and are captured over time, just like the runtime statistics. The addition of wait statistics information in Query Store was the number one feature request after its initial release, and all that information together creates powerful troubleshooting capabilities.

Read on for a demo-filled post. It’s hard to overstate just how useful this information is for query tuning.

Comments closed

Making CXPACKET The Bad Guy

Published 2017-11-14 by Kevin Feasel

Pedro Lopes goes into detail regarding the new CXCONSUMER wait stat:

It’s important to mention a couple notes before getting into details:

Parallelism use is intrinsically a benefit for costly queries, bound by the Cost Threshold for Parallelism and Max Degree of Parallelismserver configurations among other considerations. See the Query Processing Guide for more in-depth information on how SQL Server implements parallelism. Also refer to Craig Freedman’s blog series on the topic.

Waits are a normal part of the waits and queues model, allowing SQL Server to concurrently execute many more requests than there are schedulers available. See more about waits and queues here,

It’s also important to understand that parallelism is implemented as if it were two operators. There’s the producers that push data to consumers, and the consumers that may have to wait for data from producers.

And so it becomes easier to understand how producer waits are the ones that may require attention, while consumer waits are inevitable as a passive consequence of longer running producers.

I actually would have preferred the rename go the other way and create a CXPRODUCER which is bad, rather than a CXCONSUMER which is benign. The problem is that if you’re running in a heterogeneous environment, you have to remember that CXPACKET on Server A might indicate a problem, whereas it’s much fuzzier on Server B because you haven’t upgraded Server B to the latest CU yet. Regardless, I’m happy that we’re going to be able to have a better idea when parallelism is actually a problem.

Comments closed