Estimates Outside The Histogram

Lonny Niederstadt is building up some information on how the cardinality estimator works when it needs to generate an estimate outside the histogram it has:

SQL Server keeps track of how many inserts and deletes since last stats update – when the number of inserts/deletes exceeds the stats update threshold the next time a query requests those stats it’ll qualify for an update.  Trace flag 2371 alters the threshold function before SQL Server 2016. With 2016 compatibility mode, the T2371 function becomes default behavior.  Auto-stats update and auto-stats update async settings of the database determine what happens once the stats qualify for an update.  But whether an auto-stats update or a manual stats update, the density, histogram, etc are all updated.

Trace flags 2389, 2390, 4139, and the ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS hint operate outside the full stats framework, bringing in the quickstats update.  They have slightly different scope in terms of which stats qualify for quickstats updates – but in each case its *only* stats for indexes, not stats for non-indexed columns that can qualify.  After 3 consecutive stats updates on an index, SQL Server “brands” the stats type as ascending or static, until then it is branded ‘unknown’. The brand of a stat can be seen by setting trace flag 2388 at the session level and using dbcc show_statistics.

Right now there are just a few details and several links, but it does look like he’s going to expand it out.

Poorly-Performing Parallel Queries

Joe Obbish shows off how skewed data can cause SQL Server parallelism to perform poorly in certain scenarios:

The query above is designed to not be able to take advantage of parallelism. The useless repartition streams step and the spill to tempdb suggest that the query might perform better with a MAXDOP 1 hint. With a MAXDOP 1 hint the query runs with an average time of 2473 ms. There is no longer a spill to tempdb.

What happens if the query is run with MAXDOP 3? Earlier I said that the hashing function or thread boundaries can change based on DOP. With MAXDOP 3 I get a much more even row distribution on threads:

I think the number of cases where it makes sense to use a specific, non-1 MAXDOP hint is pretty small, but here’s one of them.  The problem is that if this data changes regularly, the skewness of the data could change along with it, making your brilliant optimization unnecessary or even harmful.

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works:

At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore the properties of the data being processed, which severely limits their applicability. Let us demonstrate this with a simple example. Consider a query shown below that filters a table t1 of size 500GB and joins the output with another table t2of size 20GB. Spark implements this query using a hash join by choosing the smaller join relation as the build side (to build a hash table) and the larger relation as the probe side 1. Given that t2 is smaller than t1, Apache Spark 2.1 would choose the right side as the build side without factoring in the effect of the filter operator (which in this case filters out the majority of t1‘s records). Choosing the incorrect side as the build side often forces the system to give up on a fast hash join and turn to sort-merge join due to memory constraints.

Click through for a very interesting look at this query optimzier.

Saving Statistics Sample Rates

Pedro Lopes shows off a new feature in the latest SQL Server 2016 CU:

When SQL Server creates or updates statistics and a sampling rate is not manually specified, SQL Server calculates a default sampling rate. Depending on the real distribution of data in the underlying table, the default sampling rate may not accurately represent the data distribution and then cause degradation of query plan efficiency.

To improve this scenario, a database administrator can choose to manually update statistics with a specific sampling rate that can better represent the distribution of data. However, a subsequent automatic update statistics operation will reset back to the default sampling rate, possibly reintroducing degradation of query plan efficiency.

With the most recent SQL Server 2016 SP1 CU4, we released an enhancement for the CREATE and UPDATE STATISTICS command – the ability to persist sampling rates between updates with a PERSIST_SAMPLE_PERCENT keyword.

This seems rather useful.

Filtered Statistics

William Wolf shows us the value of filtered statistics:

Wolf only had 700 complaints, but 166,900 records were estimated for return. He is looking much worse than reality shows.

So, what is happening is that there are 3 possible employee results for complaints. It is rather simple. CE is taking the total amount of records(500,701) and dividing by 3 assuming that all 3 will have roughly the same amount of records. We see that along with the estimated number of records being the same, the execution plan operators are the same. For such a variation in amount of records, there must be a better way.

I rarely create filtered statistics, in part because I don’t have a good idea of exactly which values people will use when searching.  But one slight change to Wolf’s scenario might help:  having a filter where name = Sunshine and a filter where name <> Sunshine (or name is null).  That might help a case where there’s extreme skew with one value and the rest are much closer to uniformly distributed.

Combining Densities

Paul White explains how the SQL Server cardinality estimator will build an estimate involving multiple single-column statistics:

The task of estimating the number of rows produced by a GROUP BY clause is trivial when only a single column is involved (assuming no other predicates). For example, it is easy to see that GROUP BY Shelf will produce 21 rows; GROUP BY Bin will produce 62.

However, it is not immediately clear how SQL Server can estimate the number of distinct (Shelf, Bin) combinations for our GROUP BY Shelf, Bin query. To put the question in a slightly different way: Given 21 shelves and 62 bins, how many unique shelf and bin combinations will there be? Leaving aside physical aspects and other human knowledge of the problem domain, the answer could be anywhere from max(21, 62) = 62 to (21 * 62) = 1,302. Without more information, there is no obvious way to know where to pitch an estimate in that range.

Yet, for our example query, SQL Server estimates 744.312 rows (rounded to 744 in the Plan Explorer view) but on what basis?

Read on for debugger usage, Shannon entropy calculations, and all kinds of other fun stuff.

The Costs Of Statistics Updates With FULLSCAN

Kendra Little explains what happens when you update a table’s statistics with FULLSCAN:

On my test instance, the command that uses the default sampling takes 6 seconds to complete.

The command which adds “WITH FULLSCAN” takes just over five minutes to complete.

The reason is that those two little words can add a whole lot of extra IO to the work of updating statistics.

Kendra shows the query plans for each statistics update in some detail.  It’s a very interesting post, well worth taking the time to read.

NULL Values In The Histogram

Taiob Ali explains how NULL values show up in the SQL Server histogram when you create statistics:

In the density_vector section ‘All density’ value for column ‘PickingCompletedWhen’ is 0.0004705882 which was calculated from: 1/(Number of distinct values of column ‘PickingCompletedWhen’).
In this case which is 1/2125. All NULL values were considered as one. If you do a count of distinct values you will get a result of 2124. Reason is explained here. If you do a select of all distinct values NULL will show along with other 2124 values.

Taiob explains that there’s really nothing special about NULL when it comes to statistics.

Seeing Statistics In Execution Plans

Pedro Lopes announces that the statistics used to compile a plan are now available as part of the execution plan details:

OptimizerStatsUsage is available in cached plans, so getting the “estimated execution plan” and the “actual execution plan” will have this information.

In the above example, I see the ModificationCount is very high (almost as much as the table cardinality itself) which after closer observation, the statistic had been updated with NORECOMPUTE.

And looking and the Seek itself, there is a large skew between estimated and actual rows. In this case, I now know a good course of action is to update statistics. Doing so produces this new result: ModificationCounter is back to zero and estimations are now correct.

This will be a good addition to SQL Server 2017.

The Story Of Nick

Kenneth Fisher tells the story of where the optimizer’s cost value comes from:

Obviously, it’s an important subject, right? And yet we keep seeing comments about how the cost is in seconds.

And to be fair, it is. It’s an estimate of how many seconds a query would take, if it was running on a developers workstation from back in the 90’s. At least that’s the story. In fact Dave Dustin (t) posted this interesting story today:

The best way to think of cost is as a probabilistic, ordinal, unitless value:  3 might be greater than 2; 1000 is almost certainly greater than 2; and “2 what?” is undefined.


October 2017
« Sep