Updating Statistics On System Tables

Dan Guzman shows that you can update statistics on system tables:

The solution in many cases to simply update statistics on the underlying system tables indexes used by the problem DMV queries. This can be done selectively by identifying the system table indexes referenced in execution plan seek and scan operators of the problem query execution plan and then executing UPDATE STATISTICS on each index. However, the task is somewhat tedious.

Alternatively, one can simply update stats on all the system tables. Below is a script that generates and executes DDL to update stats on all non-empty system tables, making quick work of this.

Click through for the script, as well as an interesting note if you try to use constructs like @sql = @sql + N’some string’ in your code.

Base Versus Simple Containment

Joe Obbish takes a crack at explaining the difference between base containment and simple containment for cardinality estimation:

We know that the first query will return 500k rows and the second query will return 0 rows. However, can SQL Server know that? Each statistics object only contains information about its own column. There’s no correlation between the UNIQUE_ID and MOD_FILTER columns, so there isn’t a way for SQL Server to know that the queries will return different estimates. The query optimizer can create an estimate based on the filters on the WHERE clause and on the histograms of the join columns, but there’s no foolproof way to do that calculation. The presence of the filters introduces uncertainty into the estimate, even with statistics that perfectly describe the data for each column. The containment assumption is all about the modeling assumption that SQL Server has to make to resolve that uncertainty.

It’s an interesting post aimed at trying to get you to think like a simplified cardinality estimator.  SQL Server doesn’t behave exactly like this, but it’s a good mental reference point.

Asynchronous Stats Updates

Monica Rathbun explains why you might want to turn on asynchronous statistics updates in your OLTP environment:

By default, when Auto Update Statistics is set to True, the SQL Server Query Optimizer will automatically update statistics when data has met a threshold of changes (insert, update, delete, or merge) and the estimated rows are now potentially stale. When statistics are stale, execution plans can become suboptimal which can lead to degradation in performance.

This best practice option ensures your statistics stay up to date as much as possible. Each time a cached query plan is executed the Optimizer checks for data changes and potentially generates new statistics. This behavior is exactly what we want, but there is a catch. The caveat to this is that a cached query plan will be “held” while the statistics are updated and will recompile to use the new values before running. This caveat can slow down the execution process dramatically.

The advice Monica provides is generally sound, though there are rare cases when asynchronous statistics updates end up causing more problems than they solve, so test the change first.

How Statistics In SQL Server Have Changed Over The Years

Erin Stellato gives us a version-based timeline of how SQL Server has handled statistics over the years:

SQL Server 2008

This is a very interesting historical look.  Most interesting to me was the decreases in the number of steps available.

Row Goals On Nested Loops

Joe Obbish has performed a very interesting investigation of how row goals work with nested loop joins and the TOP operator:

This does not happen. The cost remains the same as before: 0.294842 units. This is because the scan is costed according to density instead of by looking at the histogram of the outer table. The following query with a local variable repeated five times also has a cost of 0.294842 optimizer units:

VALUES (@var), (@var), (@var), (@var), (@var)
) s (ID)

The problem with using density instead of looking at the data in the outer table is mostly apparent when the outer table contains rows without a match in the inner table.

It’s a great bit of investigative legwork and Joe has a Connect item he’d like you to upvote.

Linked Servers And Inaccessible Statistics

Jason Brimhall troubleshoots an error message involving linked servers and statistics:

On more than one occasion I have had an emergency request because everything was broken. The everything in almost every incident is an SSIS package that is failing with error messages. The error message will typically have text similar to the following:

Could not locate statistics ‘_WA_Sys_00000015_346C780E’ in the system catalogs.

Due to the error, the package fails processing and grinds to a halt. When diving into the package it is discovered that the missing stats happen to be coming from a linked server query. This raises a big bright blaring alarm for me. Why is the SSIS package accessing the data via a linked server? This is rather counter-productive and definitely contrary to what is desired from a performance perspective.

Jason methodically walks us through the troubleshooting process and provides the solution at the end.

Estimates Outside The Histogram

Lonny Niederstadt is building up some information on how the cardinality estimator works when it needs to generate an estimate outside the histogram it has:

SQL Server keeps track of how many inserts and deletes since last stats update – when the number of inserts/deletes exceeds the stats update threshold the next time a query requests those stats it’ll qualify for an update.  Trace flag 2371 alters the threshold function before SQL Server 2016. With 2016 compatibility mode, the T2371 function becomes default behavior.  Auto-stats update and auto-stats update async settings of the database determine what happens once the stats qualify for an update.  But whether an auto-stats update or a manual stats update, the density, histogram, etc are all updated.

Trace flags 2389, 2390, 4139, and the ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS hint operate outside the full stats framework, bringing in the quickstats update.  They have slightly different scope in terms of which stats qualify for quickstats updates – but in each case its *only* stats for indexes, not stats for non-indexed columns that can qualify.  After 3 consecutive stats updates on an index, SQL Server “brands” the stats type as ascending or static, until then it is branded ‘unknown’. The brand of a stat can be seen by setting trace flag 2388 at the session level and using dbcc show_statistics.

Right now there are just a few details and several links, but it does look like he’s going to expand it out.

Poorly-Performing Parallel Queries

Joe Obbish shows off how skewed data can cause SQL Server parallelism to perform poorly in certain scenarios:

The query above is designed to not be able to take advantage of parallelism. The useless repartition streams step and the spill to tempdb suggest that the query might perform better with a MAXDOP 1 hint. With a MAXDOP 1 hint the query runs with an average time of 2473 ms. There is no longer a spill to tempdb.

What happens if the query is run with MAXDOP 3? Earlier I said that the hashing function or thread boundaries can change based on DOP. With MAXDOP 3 I get a much more even row distribution on threads:

I think the number of cases where it makes sense to use a specific, non-1 MAXDOP hint is pretty small, but here’s one of them.  The problem is that if this data changes regularly, the skewness of the data could change along with it, making your brilliant optimization unnecessary or even harmful.

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works:

At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore the properties of the data being processed, which severely limits their applicability. Let us demonstrate this with a simple example. Consider a query shown below that filters a table t1 of size 500GB and joins the output with another table t2of size 20GB. Spark implements this query using a hash join by choosing the smaller join relation as the build side (to build a hash table) and the larger relation as the probe side 1. Given that t2 is smaller than t1, Apache Spark 2.1 would choose the right side as the build side without factoring in the effect of the filter operator (which in this case filters out the majority of t1‘s records). Choosing the incorrect side as the build side often forces the system to give up on a fast hash join and turn to sort-merge join due to memory constraints.

Click through for a very interesting look at this query optimzier.

Saving Statistics Sample Rates

Pedro Lopes shows off a new feature in the latest SQL Server 2016 CU:

When SQL Server creates or updates statistics and a sampling rate is not manually specified, SQL Server calculates a default sampling rate. Depending on the real distribution of data in the underlying table, the default sampling rate may not accurately represent the data distribution and then cause degradation of query plan efficiency.

To improve this scenario, a database administrator can choose to manually update statistics with a specific sampling rate that can better represent the distribution of data. However, a subsequent automatic update statistics operation will reset back to the default sampling rate, possibly reintroducing degradation of query plan efficiency.

With the most recent SQL Server 2016 SP1 CU4, we released an enhancement for the CREATE and UPDATE STATISTICS command – the ability to persist sampling rates between updates with a PERSIST_SAMPLE_PERCENT keyword.

This seems rather useful.


December 2017
« Nov