Nested Loops And Implicit Reordering

Dmitry Piliugin shows how the SQL Server optimizer can end up reordering data in a nested loops join to improve performance:

The purpose is to minimize random access impact. If we perform an Index Seek (with a partial scan, probably) we read the entries in the index order, in our case, in the order of CustomerID, which is clearly seen on the first result set. The index on CustomerID does not cover our query, so we have to ask the clustered index for the column SomeData, and actually, we perform one another seek, seeking by the SalesOrderID column. This is a random seek, so what if, before searching by the SalesOrderID we will sort by that key, and then issue an ordered sequence of Index Seeks, turning the random acces into the sequential one, wouldn’t it be more effective?

Yes, it would in some cases, and that is what “optimized” property tells us about. However, we remember, that it is not necessarily leads to the real reordering. As for comparing the real impact, I will refer you to the actual Craig’s post or leave it as a homework.

Read the whole thing.  This is one reason why it’s important to emphasize that in SQL, you can only assume order if you have an explicit ORDER BY clause.

What Read Committed Isolation Level Gets You

Paul Randal explains the answer, which is “not much”:

The ‘weird’ behavior is that when the “Batch 2” select completes, after having been blocked by the “Batch 1” transaction, it doesn’t return all 1,000 rows (even though “Batch 1” has completed). Furthermore, depending on when the “Batch 2” select is started, during the 10-seconds that “Batch 1” executes, “Batch 2” returns different numbers of rows. This behavior had also been reported on earlier versions of SQL Server as well. It’s easy to reproduce on SQL Server 2016/2017 and can be reproduced in all earlier versions with a single configuration change (more details in a moment).

Additionally, if the table has a clustered index created, 1,000 rows are returned every time, on all versions of SQL Server.

So why is this weird? Many people expect that all 1,000 rows will be returned every time AND that the structure of the table or the version of SQL Server should not make any difference.

Unfortunately, that assumption is not correct when using read committed.

Read Committed is a trade-off, not an ideal.

How The Hash Match Join Operation Works

Dmitry Piliugin explains what happens when SQL Server calls for a hash match to join two tables together:

Hash Match in the join mode consumes two inputs, as we are joining two tables. The main idea is to build the hash table using the first “build” input, and then apply the same approach hash the second “probe” input to see if there will be matches of hashed values.

Query Processor (QP) is doing many efforts while building the plan to choose the correct join order. From the Hash Match prospective, it means that QP should choose what table is on the Build side and what is on the Probe side. The Build size should be smaller as it will be stored in memory when building a hash table.

Building a hash table begins with hashing join key values of the build table and placing them to one or another bucket depending on the hash value. Then QP starts processing the probe side, it applies the same hash function to the probe values, determining the bucket and compares the values inside of the bucket. If there is a match – the row is returned.

That would be the whole story if we had infinite memory, but in the real world, it is not true. More to the point, SQL Server allocates memory to the query before the execution starts and does not change it during the execution. That means that if the allocated memory amount is much less than the data size came during the execution, a Hash Match should be able to partition the joining data, and process it in portions that fit allocated memory, while the rest of the data is spilled to the disk waiting to be processed. Here is where the dancing begins.

Read on to learn more about the details of this operation.

The Value Of Spinlocks

Jeremiah Peschka walks us through the concepts behind spinlocks:

Here’s our fist stab at a spinlock, badly written in C:

// lock_value is an integer that's shared between multiple threads
while (lock_value != 0) { // spin
lock_value = 1;
lock_value = 0;

The general idea here is correct – we have some lock_value and we only want to allow processes into the critical section of code if the another process doesn’t “hold the lock.” Holding the lock means that lock_value is set to a non-zero value.

There are few things that make this code bad. These things, coincidentally, are part of what make concurrent programming difficult.

Spinlocks are a critical part of maintaining internal consistency, but they can also accidentally ruin performance on your system if done wrong.  Read the whole thing.

How LSNs Get Generated

Stuart Moore looks at how SQL Server builds log sequence numbers:

If you’ve ever dug down in the SQL Server transaction logs or had to build up restore chains, then you’ll have come across Log Sequence Numbers (LSNs). Ever wondered why they’re so large, why they all look suspiciously the same, why don’t they start from 0 and just how does SQL Server generate these LSNs? Well, here we’re going to take a look at them

Below we’ll go through examples of how to look inside the current transaction log, and backed up transaction logs. This will involve using some DBCC commands and the undocumented fn_dblog and fn_dump_dblog function. The last 2 are very handy for digging into SQL Server internals, but be wary about running them on a production system without understanding what’s going on. They can leave filehandles and processes behind that can impact on your system.

It’s an interesting look into SQL Server’s internals.

Row Goals On Anti-Joins

Paul White continues his row goals series:

The optimizer assumes that people write a semi join (indirectly e.g. using EXISTS) with the expectation that the row being searched for will be found. An apply semi join row goal is set by the optimizer to help find that expected matching row quickly.

For anti join (expressed e.g. using NOT EXISTS) the optimizer’s assumption is that a matching row will not be found. An apply anti join row goal is not set by the optimizer, because it expects to have to check all rows to confirm there is no match.

If there does turn out to be a matching row, the apply anti join might take longer to locate this row than it would if a row goal had been used. Nevertheless, the anti join will still terminate its search as soon as the (unexpected) match is encountered.

This is a shorter article but very useful in understanding row goals, along with the rest of his series.

Using Extended Properties For Documentation

Phil Factor shows us how we can use Extended Properties to build database documentation:

Once you’ve got into the habit of using Extended Properties to document your database, there are obvious benefits:

  • You can explain why you added that index or modified that constraint.
  • You can describe exactly what that rather obscure column does.
  • You can add a reasoned explanation to the use of a table.

You will often need these explanations because, sadly, DDL code isn’t ‘self-documenting’, and human memory is fallible. Extended Properties are easily searched because they are all exposed in one system view.

It is great to add explanations to lists of procedures, functions and views once the database becomes sizeable. Extended Properties are useful when exploring the metadata, but the requirement isn’t quite so essential because comments are preserved along with the source code. Tables, however, are a big problem because SQL Server throws away the script that produces the table, along with all the comments. The reason that this happens is that there are many ways you can alter parts of a table without scripting the entire table. How could one infallibly preserve all these ALTER statements in the preserved script? It’s tricky. Table scripts that you get from SSMS or script via SMO are therefore synthesised from the system tables but without those comments or even Extended Properties.

Extended Properties are useful, but I think the lack of tooling around them prevented widespread adoption.  Now that there are a few tools which support them (including SSMS’s data classification mechanism), I wonder if these will get a second look.

Row Goals And Semi Joins

Paul White continues his row goals series:

The remaining physical join type is nested loops, which comes in two flavours: regular (uncorrelated) nested loops and apply nested loops (sometimes also referred to as a correlated or lateral join).

Regular nested loops join is similar to hash and merge join in that the join predicate is evaluated at the join. As before, this means there is no value in setting a row goal on either input. The left (upper) input will always be fully consumed eventually, and the inner input has no way to determine which row(s) should be prioritized, since we cannot know if a row will join or not until the predicate is tested at the join.

By contrast, an apply nested loops join has one or more outer references (correlated parameters) at the join, with the join predicate pushed down the inner (lower) side of the join. This creates an opportunity for the useful application of a row goal. Recall that a semi join only requires us to check for the existence of a row on join input B that matches the current row on join input A (thinking just about nested loops join strategies now).

In other words, on each iteration of an apply, we can stop looking at input B as soon as the first match is found, using the pushed-down join predicate. This is exactly the sort of thing a row goal is good for: generating part of a plan optimized to return the first n matching rows quickly (where n = 1 here).

This has the depth and quality that you naturally expect from Paul, making it an immediate read.

Digging Into The In-Memory Columnstore Location

Niko Neugebauer does some investigation into where, exactly, memory-optimized columnstore data goes:

This is a rather simple blog post that is dedicated to the theme of the In-Memory Columnstore Indexes location. This has been a constant topic of discussion over a long period of time, even during the public events – and there is a need to clear out this topic.

I have assumed that the In-Memory Columnstore structures (Segments, Dictionaries, …) are located in the In-Memory, but there have been voices that I greatly respect, pointing that actually the Columnstore Object Pool is the exact location of any Columnstore structures, and there is nothing better than to take this feature for a ride and see what the SQL Server engine is actually doing.

Niko shows off a couple of useful DMVs along the way, too.

The Links That Tie Row To LOB

Steve Stedman shows how to use DBCC PAGE and DBCC IND to piece together where LOB data is stored for a particular row:

The question came up as how to find a link from blog storage that is corrupt back to the table and row that contains that data.

The is no link from the blob storage back to the table and row, but this is a link from the data page containing the table and row off to the blob data.

Read the whole thing.


May 2018
« Apr