Ewald Cress looks at the SOS_WaitableAddress class next in his series on internals:

All the synchronisation mechanisms I have discussed so far have one things in common: in order to be globally visible, they involve synchronisation objects embedded within the things they protect. So for instance, if we want to protect the global scheduler list with a spinlock, that spinlock lives within the global scheduler list, and allocating the spinlock’s storage (lightweight as it is) is the responsibility of whoever creates that list. But what if there were millions of things we might occasionally want to lock, and we don’t want to embed a lock within each such item due to the hassle and/or overhead involved? What if we just wanted a publicly visible corkboard upon which we can pin notes describing the items which are currently locked?

This is a rather different locking structure than we’ve seen so far.  Read on for details, including magic within the Signal method.

2016 SOS_RWLock

Ewald Cress continues his series on internals, and looks at how SOS_RWLock has changed in SQL Server 2016:

Allow me to call out some layout comparison points against the 2014 version:

  • There is no separate member to track the shared reader count.

  • The four-byte spinlock is gone.

  • The four-byte waiting writer count is gone.

  • The two chunks of four-byte padding (for qword alignment of pointers) are gone.

  • The WaitListEntry structure hasn’t changed at all.

Ewald also covers Compare-And-Swap operations in detail.  Definitely a good read.


Ewald Cress looks at SOS_RWLock, a reader-writer lock (at least the pre-2016 version):

This lock class can best be appreciated by comparing it to a mutex. Like the mutex, a reader-writer lock can only be acquired in exclusive mode by one requestor at a time, but instead of only exposing this exclusive-acquire (Writer) option, it alternatively allows acquisition in shared (Reader) mode. This stuff is completely natural to us database folks of course, because the semantics is a subset of the behaviours we get from familiar database locks.

Basic rules of the road:

  • Any number of simultaneous “clients” can share ownership of the lock in Read mode.

  • Readers block writers.

  • Writers block readers and other writers.

  • Blocking means that the requesting worker gets suspended (scheduled off the processor) and accrues a wait of a type specified in the lock acquisition request.

There’s a huge amount of detail here, and I for one am glad that there isn’t a quiz later..


Ewald Cress continues his dive into system internals, this time looking at SOS_Mutex:

Put differently, we can build a mutex from an auto-reset EventInternal by tacking on an owner attribute, making a rule that only the owner has the right to signal the event, and adding assignment of ownership as a fringe benefit of a successful wait. A nonsignalled event means an acquired mutex, and a signalled event means that the next acquisition attempt will succeed without waiting, since nobody currently owns the mutex. The end result is that our SOS_Mutex class exposes the underlying event’sSignal() method and its own take on Wait(). From the viewpoint of the mutex consumer, the result of a successful wait is that it owns the mutex, and it should act honourably by calling Signal() as soon as it is done using the resource that the mutex stands guard over.

There’s some deep detail here, so this is definitely one of those “after your first cup of coffee” posts to read.


Ewald Cress continues to dig into scheduling, this time looking at EventInternal:

signalMode adds a twist. The behaviour described for the traffic light corresponds to a signal mode of 0, also known as a manual reset event. Here the event stays signalled irrespective of how many consumers pass through it (=successfully wait on it).

A signal mode of 1, however, turns it into an auto-reset event, where the act of successfully waiting on the event resets it to unsignalled. This is now more akin to a turnstile that only lets one person through after being signalled, e.g. by a scan of a valid transport pass or a button press by a security guard.

Interestingly, a event object is also sometimes known as a latch – that’s something to chew on for SQL Server folks. Don’t get hung up about who or what signals it; that is a separate issue altogether. Just keep in mind that the signal mode is a permanent attribute of the event – you construct it as manual reset or auto-reset. Full disclosure: there seems to be at least one more SignalMode (2, used by the related SOS_WaitableAddress), but let’s ignore it today.

This is part of a great series, and I hope Ewald keeps it up.  I’d probably drop a few bucks on a cleaned up and edited version of his discussion of internals in an 80-page or so e-book.


Ewald Cress looks at SystemThread:

SystemThread, a class within sqldk.dll, can be considered to be at the root of SQLOS’s scheduling capabilities. While it doesn’t expose much obviously exciting functionality, it encapsulates a lot of the state that is necessary to give a thread a sense of self in SQLOS, serving as the beacon for any code to find its way to an associated SQLOS scheduler etc. I won’t go into much of the SQLOS object hierarchy here, but suffice it to say that everything becomes derivable by knowing one’s SystemThread. As such, this class jumps the gap between a Windows thread and the object-oriented SQLOS.

The simplest way to conceptualise a thread in SQL Server is to think of a spid or connection busy executing a simple query, old skool sysprocesses style. It’s not hip, but it’s close enough to the truth to be useful. This conflates a few things that are separate entities, but it is a good starting point for teasing things apart as needed

This is another dive into internals; prepare your thinking caps.

Thread-Local Storage

Ewald Cress digs into fundamentals:

As a teaser for where this is heading, I’ll reframe the problem as classic SQL Server examples. Firstly, when a latch wait occurs somewhere in the bowels of a LatchBase subclass instance, how does that latch method know to track the wait against an instance of a Worker, or make it known to the world that it is holding up that Worker? And secondly, at a much higher abstraction level, when a task executes a user query and needs to access a table, how does the access methods code know what security principal to do security checks against? We are taking the first steps towards answering these questions here.

I enjoy Ewald’s explanations because when I’m done, I really feel like I have a clue of what’s going on.  It all fades away as soon as I look away from the screen, but that’s on me, not him.

Linked Lists

Ewald Cress digs into linked lists to explain (deep) SQLOS internals:

The memory layout of a linked list doesn’t imply specific usage semantics. If we consistently insert at the head and remove from the tail, we have a queue. If we both insert and remove items from the head, we have a stack. And it is possible to have variations of these as well.

Finally, it is clear that insert and remove operations are multi-step, and the list is in an inconsistent state – i.e. not safe to traverse or modify – in the middle of such an operation. For this reason, locking semantics must be implemented. This will typically take the form of a spinlock which must be aquired before trying to access the list for any purpose. The object which owns the list head will then normally have a spinlock as a data member associated with the list head, although it is possible to have one spinlock protect multiple items beyond just a single linked list; this could be a sign of sane design, but conversely it means a coarser locking grain, which can sometimes work against you.

Even at this “simple” level, we’re digging pretty deep here.

Analyze Transaction Log Hierarchy

Paul Randal has written code to determine how much log an operation has generated:

Over the weekend there was a discussion on the MVP distribution list about the sys.dm_tran_database_transactions DMV and how one cannot use it to accurately determine how much log an operation has generated because it doesn’t provide a roll-up of the sub-transaction metrics to the outer transaction. This makes the output somewhat non-intuitive.

The discussion prompted me to write some code I’ve been meaning to do since 2012, when SQL Server 2012 introduced a field in LOP_BEGIN_XACT log records that tracks the transaction ID of the parent transaction, allowing the hierarchy of transactions to be investigated.

The actual code is at the bottom of the article, and is available in a zip file here.

It provides two stored procs, sp_SQLskillsAnalyzeLog and sp_SQLskillsAnalyzeLogInner, with the former making use of the latter, and the latter calling itself recursively.

I’d consider this squarely in the advanced troubleshooting realm.  Definitely read the whole thing and Paul’s code.

Updated Scheduling

The CSS SQL Server Engineers team talks about a new scheduling algorithm in SQL Server 2016:

SQL Server 2016 gets a scalability boost from scheduling updates.   Testing uncovered issues with the percentile scheduling based algorithms in SQL Server 2012 and 2014.  A large, CPU quantum worker and a short, CPU quantum worker can receive unbalanced access to the scheduling resources.

Take the following example.  Worker 1 is a large, read query using read ahead and in-memory database pages and Worker 2 is doing shorter activities.   Worker 1 finds information already in buffer pool and does not have to yield for I/O operations.    Worker 1 can consume its full CPU quantum. 

On the other hand, Worker 2 is performing operations that require it to yield.  For discussion let’s say Worker 2 yields at 1/20th of its CPU, quantum target.  Taking resource governance and other activities out of the picture the scheduling pattern looks like the following.

I’m going to have to reserve judgment on this.  It’s been in Azure SQL Database for a while, so I’m not expecting bugs, but I wonder if there are any edge cases in which performance gets worse as a result of this.


July 2017
« Jun