Press "Enter" to skip to content

Category: Statistics

Poorly-Performing Parallel Queries

Joe Obbish shows off how skewed data can cause SQL Server parallelism to perform poorly in certain scenarios:

The query above is designed to not be able to take advantage of parallelism. The useless repartition streams step and the spill to tempdb suggest that the query might perform better with a MAXDOP 1 hint. With a MAXDOP 1 hint the query runs with an average time of 2473 ms. There is no longer a spill to tempdb.

What happens if the query is run with MAXDOP 3? Earlier I said that the hashing function or thread boundaries can change based on DOP. With MAXDOP 3 I get a much more even row distribution on threads:

I think the number of cases where it makes sense to use a specific, non-1 MAXDOP hint is pretty small, but here’s one of them.  The problem is that if this data changes regularly, the skewness of the data could change along with it, making your brilliant optimization unnecessary or even harmful.

Comments closed

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works:

At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore the properties of the data being processed, which severely limits their applicability. Let us demonstrate this with a simple example. Consider a query shown below that filters a table t1 of size 500GB and joins the output with another table t2of size 20GB. Spark implements this query using a hash join by choosing the smaller join relation as the build side (to build a hash table) and the larger relation as the probe side 1. Given that t2 is smaller than t1, Apache Spark 2.1 would choose the right side as the build side without factoring in the effect of the filter operator (which in this case filters out the majority of t1‘s records). Choosing the incorrect side as the build side often forces the system to give up on a fast hash join and turn to sort-merge join due to memory constraints.

Click through for a very interesting look at this query optimzier.

Comments closed

Saving Statistics Sample Rates

Pedro Lopes shows off a new feature in the latest SQL Server 2016 CU:

When SQL Server creates or updates statistics and a sampling rate is not manually specified, SQL Server calculates a default sampling rate. Depending on the real distribution of data in the underlying table, the default sampling rate may not accurately represent the data distribution and then cause degradation of query plan efficiency.

To improve this scenario, a database administrator can choose to manually update statistics with a specific sampling rate that can better represent the distribution of data. However, a subsequent automatic update statistics operation will reset back to the default sampling rate, possibly reintroducing degradation of query plan efficiency.

With the most recent SQL Server 2016 SP1 CU4, we released an enhancement for the CREATE and UPDATE STATISTICS command – the ability to persist sampling rates between updates with a PERSIST_SAMPLE_PERCENT keyword.

This seems rather useful.

Comments closed

Filtered Statistics

William Wolf shows us the value of filtered statistics:

Wolf only had 700 complaints, but 166,900 records were estimated for return. He is looking much worse than reality shows.

So, what is happening is that there are 3 possible employee results for complaints. It is rather simple. CE is taking the total amount of records(500,701) and dividing by 3 assuming that all 3 will have roughly the same amount of records. We see that along with the estimated number of records being the same, the execution plan operators are the same. For such a variation in amount of records, there must be a better way.

I rarely create filtered statistics, in part because I don’t have a good idea of exactly which values people will use when searching.  But one slight change to Wolf’s scenario might help:  having a filter where name = Sunshine and a filter where name <> Sunshine (or name is null).  That might help a case where there’s extreme skew with one value and the rest are much closer to uniformly distributed.

Comments closed

Combining Densities

Paul White explains how the SQL Server cardinality estimator will build an estimate involving multiple single-column statistics:

The task of estimating the number of rows produced by a GROUP BY clause is trivial when only a single column is involved (assuming no other predicates). For example, it is easy to see that GROUP BY Shelf will produce 21 rows; GROUP BY Bin will produce 62.

However, it is not immediately clear how SQL Server can estimate the number of distinct (Shelf, Bin) combinations for our GROUP BY Shelf, Bin query. To put the question in a slightly different way: Given 21 shelves and 62 bins, how many unique shelf and bin combinations will there be? Leaving aside physical aspects and other human knowledge of the problem domain, the answer could be anywhere from max(21, 62) = 62 to (21 * 62) = 1,302. Without more information, there is no obvious way to know where to pitch an estimate in that range.

Yet, for our example query, SQL Server estimates 744.312 rows (rounded to 744 in the Plan Explorer view) but on what basis?

Read on for debugger usage, Shannon entropy calculations, and all kinds of other fun stuff.

Comments closed

The Costs Of Statistics Updates With FULLSCAN

Kendra Little explains what happens when you update a table’s statistics with FULLSCAN:

On my test instance, the command that uses the default sampling takes 6 seconds to complete.

The command which adds “WITH FULLSCAN” takes just over five minutes to complete.

The reason is that those two little words can add a whole lot of extra IO to the work of updating statistics.

Kendra shows the query plans for each statistics update in some detail.  It’s a very interesting post, well worth taking the time to read.

Comments closed

NULL Values In The Histogram

Taiob Ali explains how NULL values show up in the SQL Server histogram when you create statistics:

In the density_vector section ‘All density’ value for column ‘PickingCompletedWhen’ is 0.0004705882 which was calculated from: 1/(Number of distinct values of column ‘PickingCompletedWhen’).
In this case which is 1/2125. All NULL values were considered as one. If you do a count of distinct values you will get a result of 2124. Reason is explained here. If you do a select of all distinct values NULL will show along with other 2124 values.

Taiob explains that there’s really nothing special about NULL when it comes to statistics.

Comments closed

Seeing Statistics In Execution Plans

Pedro Lopes announces that the statistics used to compile a plan are now available as part of the execution plan details:

OptimizerStatsUsage is available in cached plans, so getting the “estimated execution plan” and the “actual execution plan” will have this information.

In the above example, I see the ModificationCount is very high (almost as much as the table cardinality itself) which after closer observation, the statistic had been updated with NORECOMPUTE.

And looking and the Seek itself, there is a large skew between estimated and actual rows. In this case, I now know a good course of action is to update statistics. Doing so produces this new result: ModificationCounter is back to zero and estimations are now correct.

This will be a good addition to SQL Server 2017.

Comments closed

The Story Of Nick

Kenneth Fisher tells the story of where the optimizer’s cost value comes from:

Obviously, it’s an important subject, right? And yet we keep seeing comments about how the cost is in seconds.

And to be fair, it is. It’s an estimate of how many seconds a query would take, if it was running on a developers workstation from back in the 90’s. At least that’s the story. In fact Dave Dustin (t) posted this interesting story today:

The best way to think of cost is as a probabilistic, ordinal, unitless value:  3 might be greater than 2; 1000 is almost certainly greater than 2; and “2 what?” is undefined.

Comments closed

Cardinality Estimation On COUNT(*)

Paul White digs into how the cardinality estimator works with COUNT aggregations containing HAVING clauses:

The approach SQL Server takes is to assume that each group is most likely to contain the overall mean (average) number of rows. This is simply the cardinality divided by the number of unique values. For example, for 1000 rows with 20 unique values, SQL Server would assume that (1000 / 20) = 50 rows per group is the most likely value.

Turning back to our original example, this means that the computed count column is “most likely” to contain a value around (19614 / 575) ~= 34.1113. Since density is the reciprocal of the number of unique values, we can also express that as cardinality * density = (19614 * 0.00173913), giving a very similar result.

Definitely worth a careful read.

Comments closed