Category: Internals

Scheduler Timing

Ewald Cress continues his look at schedulers:

To simplify things initially, we’ll forget about hidden schedulers and assume hard CPU affinity. That gives us an execution environment that looks like this:

  • Each CPU is physically tied to a scheduler.

  • Therefore, out of all the workers in the system, there is a subset of workers that will only run on that CPU.

  • Workers occasionally hand over control of their CPU to a different worker in their scheduler.

  • At any given moment, each CPU is expected to be running a worker that does something of interest to the middle or upper layers of SQL Server.

  • Some of this useful work will be done on behalf of the worker’s scheduler siblings.

  • However, a (hopefully) tiny percentage of a worker’s time is spent within the act of scheduling.

As usual, this is worth the read.

Understanding Page Splits

Wayne Sheffield goes into detail on page splits:

In considering which of these methods is preferred, we need to consider whether page splits impact these methods – especially nasty page splits. Furthermore, how will index maintenance affect each choice? So let’s think this through.

When there are negative values in this column, and the index is rebuilt, there will be a page with both negative and positive values in it. If the identity column is set to (-1, -1), there won’t be a gap (excluding the 0) in the values, and newly added rows will get a new page allocated – a good page split. If the identity column is set to (-2147483648, 1), then there will be a full page with the records for the most recently used identity value, and with the values starting with 1 – a rather large gap.

This is worth reading in its entirety.

ReaderWriterSpinlock

Ewald Cress looks at the new ReaderWriterSpinlock in SQL Server 2016 CU2:

As a quick refresher, a traditional SQLOS spinlock is a 32-bit integer, or of course 64-bit as of 2016, with a value of either zero (lock not acquired) or the 32-bit Windows thread ID of the thread that owns it. All very simple and clean in terms of atomic acquire semantics; the only fun part is the exponential backoff tango that results from a collision.

We have also observed how the 2016 flavour of the SOS_RWLock packs a lot of state into 64 bits, allowing more complicated semantics to be implemented in an atomic compare-and-swap. What seems to be politically incorrect to acknowledge is that these semantics boil down to a simplified version of a storage engine latch, who is the unloved and uncool grandpa nowadays.

Clearly a lot can happen in the middle of 64 bits.

Definitely worth a read, as it seems that this is going to get more play in the years to come.
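
The "zero or owning thread ID" idea from the excerpt can be sketched in a few lines. This is a toy model, not SQLOS code: Python has no user-level atomic compare-and-swap, so the hypothetical `_compare_and_swap` helper fakes one with a `threading.Lock`, and the backoff delays are arbitrary values chosen for illustration.

```python
import threading
import time


class SpinLock:
    """Toy spinlock: the lock word is 0 (free) or the owner's thread ID."""

    def __init__(self):
        self._owner = 0
        # Stand-in for a hardware compare-and-swap instruction (assumption).
        self._cas_guard = threading.Lock()

    def _compare_and_swap(self, expected, new):
        # Atomically set the lock word to `new` if it currently equals `expected`.
        with self._cas_guard:
            if self._owner == expected:
                self._owner = new
                return True
            return False

    def acquire(self, max_backoff=0.01):
        me = threading.get_ident()  # documented to be a nonzero integer
        delay = 1e-6
        while not self._compare_and_swap(0, me):
            # Collision: wait, then double the wait (exponential backoff).
            time.sleep(delay)
            delay = min(delay * 2, max_backoff)

    def release(self):
        if not self._compare_and_swap(threading.get_ident(), 0):
            raise RuntimeError("released by a thread that does not own the lock")
```

A thread calls `acquire()` before touching shared state and `release()` afterwards; only the thread whose ID is stored in the lock word can release it.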

Proportional Fill Algorithm

Paul Randal discusses the proportional fill algorithm that SQL Server uses for extent allocation:

Proportional fill works by assigning a number to each file in the filegroup, called a ‘skip target’. You can think of this as an inverse weighting, where the higher the value is above 1, the more times that file will be skipped when going round the round robin loop. During the round robin, the skip target for a file is examined, and if it’s equal to 1, an allocation takes place. If the skip target is higher than 1, it’s decremented by 1 (to a minimum value of 1), no allocation takes place, and consideration moves to the next file in the filegroup.

(Note that there’s a further twist to this: when the -E startup parameter is used, each file with a skip target of 1 will be used for 64 consecutive extent allocations before the round robin loop progresses. This is documented in Books Online here and is useful for increasing the contiguity of index leaf levels for very large scans – think data warehouses.)

Read on for some implementation details as well as a good scenario for why it’s important to know about this.
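
To make the skip-target loop concrete, here is a minimal sketch of just the rules quoted above. The excerpt does not say how skip targets are calculated or refreshed, so this sketch takes them as inputs and never recalculates them; `DataFile` and `next_allocation` are hypothetical names, and the -E behavior is left out.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    skip_target: int  # 1 means "allocate on this visit"; higher means "skip"


def next_allocation(files, start):
    """Walk the round robin from index `start` until a file allocates.

    Returns (file_name, next_start_index). A file with a skip target above 1
    is decremented and passed over; a file at 1 receives the allocation.
    """
    i = start
    while True:
        f = files[i]
        i = (i + 1) % len(files)
        if f.skip_target == 1:
            return f.name, i
        f.skip_target -= 1  # never drops below 1, since we only get here above 1


files = [DataFile("A", 1), DataFile("B", 3)]
position = 0
order = []
for _ in range(4):
    name, position = next_allocation(files, position)
    order.append(name)
print(order)  # ['A', 'A', 'A', 'B']
```

In this sketch, once every skip target has been decremented to 1 the loop becomes a plain round robin, so the weighting only lasts as long as the targets are periodically refreshed by something outside the code shown here.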

Context Switching

Ewald Cress has finally snapped:

You start with a blank sheet,
three nuts and a bolt,
a strong sense of fairness,
a large can of Jolt.

And you try to imagine,
as best as you’re able
a rulerless kingdom
that won’t grow unstable.

That’s what happens when you dig into internals for too long.

LOB On Memory-Optimized Tables

Dmitri Korotkevitch digs into LOB data when you build a memory-optimized table:

There is also considerable overhead in terms of memory usage. Every non-empty off-row value adds 50+ bytes of overhead regardless of its size. Those 50+ bytes consist of three artificial ID values (in-row, off-row in data row and leaf-level of the range index) and the off-row data row structure. It is even larger in the case of LOB columns, where data is stored in the LOB Page Allocator.

One of the key points to remember is that the decision about which columns go off-row is made based on the table schema. This is very different from on-disk tables, where that decision is made on a per-row basis and depends on the data row size. With on-disk tables, data is stored in-row when it fits on the data page.

In-Memory OLTP works in a different way. (Max) columns are always stored off-row. For other columns, if the data row size in the table definition can exceed 8,060 bytes, SQL Server pushes the largest variable-length column(s) off-row. Again, it does not depend on the amount of data you store there.

This is a great article getting into the internals of how memory-optimized tables work in SQL Server 2016, as well as a solid reason to avoid LOB types and very large VARCHAR values on memory-optimized tables if you can. Absolutely worth a read.
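
The schema-based rule in the excerpt can be illustrated with a rough sketch. It is a simplification under stated assumptions: it ignores row headers and the in-row cost of off-row pointers, it assumes columns are pushed off-row strictly largest-first until the declared size fits, and `Column` and `off_row_columns` are hypothetical names rather than anything SQL Server exposes.

```python
from dataclasses import dataclass

ROW_LIMIT = 8060  # bytes, the in-row limit mentioned in the excerpt


@dataclass
class Column:
    name: str
    max_bytes: int = 0      # declared maximum size; unused for (max) columns
    variable: bool = False  # True for variable-length types
    is_max: bool = False    # True for (max) columns


def off_row_columns(columns):
    """Decide off-row placement from the table definition alone, never from data."""
    # (max) columns always go off-row.
    off_row = [c.name for c in columns if c.is_max]
    in_row = [c for c in columns if not c.is_max]
    declared = sum(c.max_bytes for c in in_row)
    # Assumption: push the largest variable-length columns first.
    candidates = sorted(
        (c for c in in_row if c.variable),
        key=lambda c: c.max_bytes,
        reverse=True,
    )
    for c in candidates:
        if declared <= ROW_LIMIT:
            break
        off_row.append(c.name)
        declared -= c.max_bytes
    return off_row


table = [
    Column("id", 4),
    Column("notes", 5000, variable=True),
    Column("summary", 4000, variable=True),
    Column("body", is_max=True),
]
print(off_row_columns(table))  # ['body', 'notes']
```

The point of the example is that `notes` goes off-row even if every row stores only a few bytes in it, because the decision looks at declared sizes only.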

The SQLOS Scheduler

Ewald Cress builds an extended metaphor for the SQLOS scheduler:

When the time comes for a bunny to pass the battery, it may be out of free choice, or it might be because its script went down a path where passing it along is The Thing To Do. At this juncture, the team’s collective memory and playbook comes to the fore, and agreed rules dictate who the battery goes to. It doesn’t really matter what those rules are for the moment. The important point is that control is transferred by the players themselves using shared rules and a team whiteboard tracking who is ready to go, which team member might be most deserving, who has been waiting the longest etc. This code of conduct and state, this bushido of bunny bonhomie, is what we call a scheduler.

This is building up to something big…
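
For readers who prefer code to bunnies, cooperative hand-over can be approximated with Python generators. This is only an analogy: in the description above the workers themselves apply the shared rules, whereas generators need a driver, so the hypothetical `run` function below plays the part of the whiteboard and playbook, and its rule (longest-waiting worker goes next) is just one possible choice.

```python
from collections import deque


def worker(name, steps):
    """A worker that does one slice of work, then voluntarily yields control."""
    for i in range(steps):
        yield f"{name}:{i}"  # the 'battery' is handed over at each yield


def run(workers):
    """Pass control between workers until all of them have finished."""
    runnable = deque(workers)  # the 'whiteboard' of who is ready to go
    trace = []
    while runnable:
        w = runnable.popleft()  # rule: whoever has waited longest goes next
        try:
            trace.append(next(w))
        except StopIteration:
            continue  # this worker is done and leaves the team
        runnable.append(w)
    return trace


print(run([worker("a", 2), worker("b", 1)]))  # ['a:0', 'b:0', 'a:1']
```

Nothing preempts a worker here: control moves only when a worker reaches a `yield`, which is the property the metaphor is built around.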

Counting Without Counts

Ewald Cress discusses the breferences member:

I’ll spare you my false starts, but I think I finally have it. The first observation is that, on the occasions breferences increments, it does not increment linearly, but instead has an exponential growth pattern. These increments take it through the sequence 0, 1, 3, 7, 15, 31, 63, 127, 255 etc. Or in binary: 0, 1, 11, 111, 1111, 11111, 111111, 1111111, 11111111…

Those numbers can be seen as off-by-one variations of powers of two. Forget the offset, and think of the number as simply doubling on each increment if it keeps your head clearer – instead of accuracy, we have an order-of-magnitude reference count.

I’d never heard of an algorithm like this, although that could be due to my having dealt with relatively little low-level structural code.  I’m glad Ewald sussed out the mechanics driving breferences.
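
The sequence in the excerpt is what you get by shifting left and setting the low bit on each increment. The sketch below reproduces the numbers only; it makes no claim about how SQL Server actually updates breferences, and `bump` and `bumps_so_far` are hypothetical helper names.

```python
def bump(counter: int) -> int:
    """Shift left one bit and set the low bit: 0 -> 1 -> 3 -> 7 -> ..."""
    return (counter << 1) | 1


def bumps_so_far(counter: int) -> int:
    """For values produced by bump(), the number of set bits equals the bump count."""
    return counter.bit_length()


value = 0
sequence = []
for _ in range(8):
    value = bump(value)
    sequence.append(value)

print(sequence)             # [1, 3, 7, 15, 31, 63, 127, 255]
print(bumps_so_far(value))  # 8
```

Each value is one less than a power of two, which matches the "off-by-one variations of powers of two" observation in the quote.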

Superlatches

Ewald Cress discusses superlatch promotion:

There are quite a few pieces of machinery that are involved in our little drama. First, I’ll introduce some instance-global settings:

  • A flag that controls whether latch promotion is enabled at all. Although I don’t have any information about this, let’s assume that it will be enabled on any system that “warrants it”.
  • A flag that controls whether cycle-based promotion is enabled. Again, I can’t currently tell you what determines this setting.
  • sm_promotionThreshold, the current calculated cycle-based promotion threshold described in Part 3.
  • sm_promotionUpperBoundCpuTicks, used as a ceiling value to prevent outliers from skewing stats. As described in Part 3, this is simply sm_promotionThreshold * 5.
  • Trace flag 844, which lowers the threshold for non-cycle-based promotions.
  • Trace flag 827, which causes each latch promotion to be noted in the SQL Server log (“Latch promotion, page %u:%u in database %u, objid %u.”)

Assume that the first flag is set on our system of interest, otherwise promotions won’t happen and we have nothing to talk about.

Read the whole thing.

Latch Promotion

Ewald Cress discusses latch promotion threshold calculations:

Now I wish I could use the phrase “cycle-based promotion threshold” in a tone that suggests we were all born knowing the context, but to be honest, I don’t yet have all the pieces. Here is the low-down as it stands in SQL Server 2014:

  • Everything I’m describing applies only to page latches.

  • A cycle-based promotion simply means one that is triggered by the observation that the average acquire time for a given page latch (i.e. the latch for a given page) has exceeded a threshold.

  • Because the times involved are so short, they are measured not in time units but in CPU ticks.

  • There exists a global flag that can enable cycle-based promotions, although I do not know what controls that flag.

  • If cycle-based promotion is disabled, there is another path to promotion; this will be discussed in Part 4.

I don’t think I’d ever seen the informational message Ewald mentions, so this was a brand new topic to me.
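
As a rough illustration of a cycle-based check, the sketch below tracks the average acquire time for one page latch in CPU ticks and compares it with a threshold, clamping each sample at five times the threshold as the superlatch excerpt describes for sm_promotionUpperBoundCpuTicks. How SQL Server really accumulates the average is not stated in either excerpt, so the plain arithmetic mean and the `PageLatchStats` class are assumptions made for the example.

```python
class PageLatchStats:
    """Toy per-page-latch statistics for a cycle-based promotion check."""

    def __init__(self, promotion_threshold):
        self.promotion_threshold = promotion_threshold
        # Ceiling that keeps outliers from skewing the average (threshold * 5).
        self.upper_bound = promotion_threshold * 5
        self.samples = 0
        self.total_ticks = 0

    def record_acquire(self, ticks):
        # Clamp the sample to the ceiling before adding it to the running total.
        self.total_ticks += min(ticks, self.upper_bound)
        self.samples += 1

    def average(self):
        return self.total_ticks / self.samples if self.samples else 0.0

    def should_promote(self):
        # Promote when the average acquire time exceeds the threshold.
        return self.average() > self.promotion_threshold


stats = PageLatchStats(promotion_threshold=100)
stats.record_acquire(50)
stats.record_acquire(10_000)   # outlier, counted as 500 ticks
print(stats.average())         # 275.0
print(stats.should_promote())  # True
```

Without the clamp, the single 10,000-tick sample would put the average above 5,000; with it, the outlier still pushes the latch over the threshold but cannot dominate the statistic indefinitely.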
