Clustered Index And Physical Storage

Wayne Sheffield busts a myth:

In several of my last few blog posts, I’ve shared several methods of getting internal information from a database by using the DBCC PAGE command and utilizing the “WITH TABLERESULTS” option to be allowed to automate this process for further processing. This post will also do this, but in this case, we’ll be using it to bust a common myth—data in a clustered index is physically stored on disk in the order of the clustered index.

Busting this myth

To bust this myth, we’ll create a database, put a table with a clustered index into this database, and then we’ll add some rows in random order. Next, we will show that the rows are stored on the pages in logical order, and then we’ll take a deeper look at the page internals to see that the rows are not stored in physical order.

Read on for the proof.

Thinking About Contexts

Ewald Cress has started a new series on system internals:

Per-processor partitioning of certain thread management functions makes perfect sense, since we’d aim to minimise the amount of global state. Thus each processor would have its own dispatcher state, its own timer list… And hang on, this is familiar territory we know from SQLOS! The only difference is that SQLOS operates on the premise of virtualising a CPU in the form of a Scheduler, whereas the OS kernel deals with physical CPUs, or at least what it honestly believes to be physical CPUs even in the face of CPU virtualisation.

This is a start to a very interesting series.

Thoughts On Cost-Based Optimizers

Joe Chang has the makings of an academic paper on the shortcomings of the current SQL Server cost optimizer model:

It might seem that given the pace of hardware change, such old model cannot possibly valid, resulting horrible execution plans. Around 1995 or so, the high-performance HDD was 7200RPM with a sequential bandwidth of 4-5MB/s. The mean rotational latency for 7200RPM is 4ms. An average seek time of 8.5ms seems reasonable, though I have not kept any documentation from this period. This would correspond to 80 IOPS per disk at queue depth 1 per HDD. So, it seems curious that the SQL Server cost model is based on the random IOPS of 4 disks, but the sequential IO of 2 HDDs.

Performance HDDs progressed first to 10K RPM around 1996, and then to 15K around 2000, with corresponding rotational latencies of 3 and 2ms respectively. Average seek time was reduced over time to 3ms with developments in powerful rare-earth magnets. The 10K HDD could support 125 IOPS and 200 IOPS for the 15K HDD. But no further progress was made on HDD rotational speed. In same time period, hard disk sequential IO phenomenally quickly exceeding 100MB/s in the fourth generation 15K HDD around 2005.

In other words, the SQL Server cost model is based on a 1350/320 = 4.2 ratio. But 15K HDDs in the mid-2000’s were 100MB/s × 128 pages/MB = 12,800 pages/sec sequential to 200 IOPS random for a ratio of 64:1. It turns out that achieving nearly the sequential IO capability of HDDs from SQL Server required a special layout strategy, as outlined in the Fast Track Data Warehouse Reference Architecture papers, which few people followed. This was due to the fixed, inflexible IO pattern of SQL Server, which required the disk configuration to match that of SQL Server instead of being able to adjust the IO pattern to match that of the storage configuration.

It’s worth taking the time to read.  I like Glenn Berry’s idea in the comments of building relative CPU/IO/memory measures and applying them rather than using the same values that were good for twenty years ago.

Virtual Function Calls

Ewald Cress is thinking about virtual function calls:

A virtual function call, on the other hand, is only resolved at runtime. The compiler literally does not know what address is going to get called, and neither does the runtime except in the heat of the moment, because that is going to depend on the type of the object instance that the function is called on. Bear with me, I’ll try and simplify.

A C++ object is just a little chunk of memory: a bunch of related instance variables if you like. All objects of the same class have the same structure in this regard. If you’re wondering about functions (a.k.a. methods), these belong to the class, or put differently, to ALL objects of that class. Once compiled, each method is a chunk of memory with a known address, containing the compiled instructions.

From there, it’s a harrowing journey through bigger layers of indirection.

Looking For Wait Types

Ewald Cress uses the debugger to search for particular waits:

In this case I was looking for PREEMPTIVE_COM_RELEASE, and sys.dm_xe_map_values tells me that on my 2014 RTM instance it has an index of 01d4 hexadecimal. Crazy as it sounds, I’m going to do a simple search through the code to look for places that magic number is used. As a two-byte (word) pattern we’ll get lots of false positives, but fortunately wait types are internally doublewords, with only one bit set in the high-order word. In other words, we’re going to look for the pattern 000101d4, 000201d4, 000401d4 and so forth up to 800001d4. Ignore the meaning of when which bit is going to be set; with only sixteen permutations, it’s quick enough to try them all.

Let’s focus on sqllang as the likely source – the below would apply to any other module too.

This post reminds me that my debugger skills aren’t very good.

Understanding Status In DMVs

Ewald Cress looks at a number of DMVs and how they expose query status:

sys.dm_exec_sessions: status

This metric is completely disjunct from the above ones, and mostly reflects attributes of a CSession class instance. The respective values are derived through the following decision tree:

  • If the internal Boolean member m_fIsConnReset is set, return dormant

  • Else if a flag living outside of CSession itself is set, return preconnect (I’ll touch on the source of this mystery flag below)

  • Else if a flag within the CSession itself is set (indicating that it has been provided some work to do) return running

  • Else return sleeping

It’s interesting to see how something so very similar can have so many different understandings.

SQL Server EXE Size

Arun Sirpal points out that sqlservr.exe is a lot smaller in 2012 and up as compared  to 2008 R2:

I never really noticed the difference before, but I understand why.

From 2012 onwards the architecture changed, it has been broken up into multiple DLLs. I can see the extra DLL files within the BINN folder these being sqllang.dll and sqlmin.dll where each are roughly 30MB each.

Makes me a bit curious as to the reason behind the breakout.

Thinking About Linux Internals

Anthony Nocentino speculates on the internals of SQL Server on Linux:

OK, so everyone wants to know how Microsoft did it…how they got SQL Server running on Linux. In this article, I’m going to try to figure out how.

There’s a couple of approaches they could take…a direct port or some abstraction layer…A direct port would have been hard, basically any OS interaction would have had to been looked at and that would have been time consuming and risk prone. Who comes along to save the day? Abstraction. The word you hear about a million times when you take Operating Systems classes in undergrad and grad computer science courses. 🙂

Anthony talks about picoprocesses, which causes me to say that containers (like Docker) are probably the most important administrative concept of the decade.  If you don’t fundamentally get the concept, learning it opens so many doors.

Preemptive Scheduling

Ewald Cress looks at preemptive scheduling:

Cooperative scheduling is a relay race: you simply don’t stop without passing over the baton. If you write code which reaches a point where it may have to wait to acquire a resource, this waiting behaviour must be implemented by registering your desire with the resource, and then passing over control to a sibling worker. Once the resource becomes available, it or its proxy lets the scheduler know that you aren’t waiting anymore, and in due course a sibling worker (as the outgoing bearer of the scheduler’s soul) will hand the baton back to you.

This is complicated stuff, and not something that just happens by accident. The textbook scenario for such cooperative waiting is the traditional storage engine’s asynchronous disk I/O behaviour, mediated by page latches. Notionally, if a page isn’t in buffer cache, you want to call some form of Read() method on a database file, a method which only returns once the page has been read from disk. The issue is that other useful work could be getting done during this wait.

Read on for a detailed example looking at xp_cmdshell.

Scheduler Stories

Ewald Cress has a couple of posts about the scheduler.  First, fiber mode scheduling:

The title of this post is of course an allusion to Ken Henderson’s classic article The perils of fiber mode, where he hammers home the point that fiber scheduling, a.k.a. lightweight pooling, appears seductive until you realise what you have to give up to use it.

We’ll get to the juicy detail in a moment, but as a reminder, the perils of fibers lie in their promiscuity: many fibers may share one thread, its kernel structures and its thread-local storage. This is no problem for code that was written with fibers in mind, including all of SQLOS, but unfortunately there are bodies of code for which this isn’t true.

Next up is the Windows scheduler:

Hardware interrupts, which run in kernel mode and return to user mode quickly, should be nothing more than tiny hiccups in a running thread’s quantum. The other 90% of the interrupt iceberg manifests in user mode as Deferred Procedure Calls (DPCs, or “bottom halves” to the Linux crowd) but should still only steal small change in terms of CPU cycles. Context switches to another thread represent a completely different story, because it could be ages before control returns to our thread, meaning that our fiber scheduler is completely out of commission for a while.

This possibility – a SQLOS scheduler losing the CPU for an extended period – is just one of those things we need to live with, but on a sane server, it shouldn’t be something to be too concerned about. Consider that this happens all the time in virtualised environments, where our vCPU can essentially cease to exist while another VM has a ride on the physical CPU.

These are fairly long reads, but we’re getting to levels where you can see these settings in the Database Engine (like Lightweight Pooling).

Categories

February 2017
MTWTFSS
« Jan  
 12345
6789101112
13141516171819
20212223242526
2728