New Diagnostics For Synchronous Statistics Updates

Joe Sack announces a new wait type and request command:

Consider the following query execution scenario:

  • You execute a SELECT query that triggers an automatic synchronous statistics update.
  • The synchronous statistics update begins execution and your query waits (is essentially blocked) until the fresh statistics are generated.
  • The query compilation and execution does not resume until the synchronous statistics update operation completes.

During this time, there are no external signs via common troubleshooting channels that the query is specifically waiting for the synchronous statistics update operation to complete.  If the statistics update takes a long time (due to a large table and\or busy system), there is no easy way to determine root cause of the high duration.

This is now available in SQL Server 2019 CTP 2.1.  Read the whole thing.

Taking Action With Wait Stats

Aaron Bertrand lays out a course of action (or inaction) when dealing with the most common wait types in SQL Server:

I started going a little further than this, mapping out some of the more common wait types, and noting some of the properties they shared. Translated into questions a tuner might have about a wait type they are experiencing:

  • Can the wait type be solved at the query level?
  • Is the core symptom of the wait likely to be affecting other queries?
  • Is it likely you will need more information outside the context of a single query and the wait types it experienced in order to “solve” the problem?

When I set out to write this post, my goal was just to group the most common wait types together, and then start jotting notes about them relating to the above questions. Jason pulled the most common ones from the library, and then I drew some chicken scratch on a whiteboard, which I later tidied up a bit. This initial research led to a talk that Jason gave on the most recent TechOutbound SQL Cruise in Alaska. I’m kind of embarrassed that he put a talk together months before I could finish this post, so let’s just get on with it. Here are the top waits we see (which largely match Paul’s survey from 2014), my answers to the above questions, and some commentary on each:

Read on for the top 10 list.

SOS_WORK_DISPATCHER

Joe Obbish digs into a new wait type in SQL Server 2019:

Upon upgrading to SQL Server 2019 CTP2, you may see the new SOS_WORK_DISPATCHER wait type at the top of the list:

The above screenshot is server level wait stats from my four core desktop after SQL Server was running for a few hours. SQL Server wasn’t really doing much since start up, so it felt unlikely that this wait was a sign of a problem. However, I was curious about what this wait type meant and wanted to know more.

Click through for Joe’s findings and what you should do with this wait type.

What To Watch When Using VSS Snapshots

Erik Darling shows us the wait stats associated with the Volume Shadow Copy Service (VSS):

A while back I wrote about the Perils of VSS Snaps.

After working with several more clients having similar issues, I decided it was time to look at things again. This time, I wanted blood. I wanted to simulate a slow VSS Snap and see what kind of waits stats I’d have to look out for.

Getting software and rigging stuff up to be slow would have been difficult.

Instead, we’re going to cheat and use some old DBCC commands.

This one almost got the “Wacky Ideas” tag but I’m grading on a curve for that category.

Storing Wait Stats In tempdb

Max Vernon has a script which loads a bunch of wait stats definitions and then collects wait stat details:

Performance troubleshooting should begin with capturing wait stats so we can understand where SQL Server is busy. The script below captures wait stats into a table in tempdb; the script should be ran via a SQL Server Agent job or some other scheduling mechanism.

I like the definitions that Max provides.  My only recommendation would be to store this data someplace a bit more permanent than tempdb.

Wanted: Per-Database Wait Stat Collection Built In

Erik Darling wants configurable wait stat collections on a database level built into SQL Server:

I’m hoping that a feature like this could solve some intermediate problems that Query Store doesn’t.

Namely, being lower overhead, not collecting any PII, and not taking up a lot of disk space — after all, we’re not storing any massive stored proc text or query plans, here, just snapshots of wait stats.

This will help even if you’re already logging wait stats on your own. You still don’t have a clear picture of which database the problem is coming from. If you’ve got a server with lots of databases on it, figuring that out can be tough.

Understanding what waits (and perhaps bottlenecks) a single database is experiencing can also help admins figure out what kind of instance size they’d need as part of a migration, too.

It’s an interesting approach.  If you agree with Erik, go vote.

When Wait Stats Aren’t Enough

Joe Obbish has an example of diagnosing performance problems when wait stats don’t indicate any problems:

In summary, page allocations and page free events rapidly occur, sometimes in an alternating pattern. SQL Server will often free a number of pages just to immediately request allocations for a similar number of pages. If all of the free page events result in returned memory to the OS then the reason for the scalability bottleneck becomes clear. When running the full workaround with 96 concurrent sessions, a total of 341965 page freed operations were performed. Those events freed about 71.3 million pages in total. That amounts to about 584 GB of memory returned to the OS in total, based on the previous assumptions.

This is a great investigation into the depths of debugging in SQL Server.  Joe wasn’t able to get a definitive solution to his problem, but he showed us a lot along the way.

CLR_MANUAL_EVENT Waits

Jonathan Kehayias traces out the cause of CLR_MANUAL_EVENT waits on SQL Server:

The fact that no data has been collected for this type throughout a good cross-section of their customers really confirmed for me that this isn’t something that is commonly a problem, so I was intrigued by the fact that this specific workload was now exhibiting problems with this wait. I wasn’t sure where to go to further investigate the issue so I replied to the email saying I was sorry that I couldn’t help further because I didn’t have any idea what would be causing literally dozens of threads performing spatial queries to all of sudden start having to wait for 2-4 seconds at a time on this wait type.

A day later, I received a kind follow-up email from the person that asked the question that informed me that they had resolved the problem. Indeed, nothing in the actual application workload had changed, but there was a change to the environment that occurred. A third-party software package was installed on all of the servers in their infrastructure by their security team, and this software was collecting data at five-minute intervals and causing the .NET garbage collection processing to run incredibly aggressively and “go nuts” as they said. Armed with this information and some of my past knowledge of .NET development I decided I wanted to play around with this some and see if I could reproduce the behavior and how we could go about troubleshooting the causes further.

Read the whole thing if you use CLR.

HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING Wait Type

Chirag Shah explains what the HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING wait type really means:

Recently a customer reported an interesting issue, while querying against recently added readable replica, SELECT statement is shown as suspended and session is shown as waiting on HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

[…]

Upon more investigation, it appeared to be waiting on with a wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

The behavior is by design as mention in the SQL Server product documentation and applicable to all version of SQL Server that supports availability group.

Read on for the explanation.

FCB_REPLICA_SYNC Spinlock Explanation

Paul Randal explains what the FCB_REPLICA_SYNC spinlock is and what it does:

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

  • If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.
  • If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.

The original question was because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Great information.  And a good reminder that if you are using database snapshots in SQL Server, you generally don’t want to have more than one on the same database.

Categories

December 2018
MTWTFSS
« Nov  
 12
3456789
10111213141516
17181920212223
24252627282930
31