Obviously, it’s an important subject, right? And yet we keep seeing comments about how the cost is in seconds.
And to be fair, it is. It's an estimate of how many seconds a query would take if it were running on a developer's workstation from back in the '90s. At least, that's the story. In fact, Dave Dustin (t) posted this interesting story today:
The best way to think of cost is as a probabilistic, ordinal, unitless value: 3 might be greater than 2; 1000 is almost certainly greater than 2; and “2 what?” is undefined.
The approach SQL Server takes is to assume that each group is most likely to contain the overall mean (average) number of rows. This is simply the cardinality divided by the number of unique values. For example, for 1000 rows with 20 unique values, SQL Server would assume that (1000 / 20) = 50 rows per group is the most likely value.
Turning back to our original example, this means that the computed count column is “most likely” to contain a value around (19614 / 575) ~= 34.1113. Since density is the reciprocal of the number of unique values, we can also express that as cardinality * density = (19614 * 0.00173913), giving a very similar result.
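To make the arithmetic concrete, here is a trivial T-SQL restatement of the figures in the excerpt (the 19,614 rows and 575 unique values come straight from it; nothing else is assumed):

-- Mean rows per group two ways: cardinality / distinct values,
-- and cardinality * density (density = 1 / distinct values).
DECLARE @cardinality   float = 19614,
        @distinct_vals float = 575;

SELECT @cardinality / @distinct_vals           AS mean_rows_per_group,       -- ~34.1113
       @cardinality * (1.0 / @distinct_vals)   AS cardinality_times_density; -- same value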
Definitely worth a careful read.
Unlike traditional B-tree indexes, when a columnstore index is created, no index statistics are created on the columns of the columnstore index. However, an empty stats object with the same name as the columnstore index is created, and an entry is added to sys.stats at index creation time. The stats object is populated on the fly when a query is executed against the columnstore index or when DBCC SHOW_STATISTICS is run against it, but the columnstore index statistics are not persisted in storage. These index statistics are different from the auto-created statistics on the individual columns of the columnstore index, which are generated on the fly and persisted in the statistics object. Since the index statistics are not persisted in storage, the cloned database will not contain them, leading to inaccurate stats and different query plans when the same query is run against the database clone as opposed to the production database.
Click through for the script.
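As a rough companion check (this is not the linked script, and the table name dbo.FactSales is hypothetical), pairing sys.stats with sys.dm_db_stats_properties shows whether each stats object on a table has ever been populated; on a clone, the columnstore index's stats object should come back with empty properties.

-- List every stats object on the table and its last update / row counts.
-- NULL properties indicate a stats object that has never been populated
-- (or that was not carried over by DBCC CLONEDATABASE).
SELECT  s.name,
        s.auto_created,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled
FROM sys.stats AS s
OUTER APPLY sys.dm_db_stats_properties(s.[object_id], s.stats_id) AS sp
WHERE s.[object_id] = OBJECT_ID(N'dbo.FactSales');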
Once I update my statistics with fullscan, within 10-20 seconds some of the statistics on the same table get updated on the secondary with a sampled percentage of rows, meaning my best (fullscan) statistics are being overwritten by merely good (sampled) statistics. On the primary node, once I run "UPDATE STATISTICS TableName WITH FULLSCAN", I see the following statistics status.
If I check the status of the same statistics on my secondary nodes 10-20 seconds after updating them on the primary node, I see the fullscan statistics have been replaced by sampled statistics. Look at the rows_sampled and last_updated columns: you will see the sampled row count, and the last_updated time is within a few seconds of the update on the primary. The RowsModified column still shows zero records.
It’s happening on an Availability Group secondary. Taiob has a workaround, so read on for that.
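A minimal way to reproduce the check Taiob describes (the table name dbo.Orders is hypothetical) is to run the fullscan update on the primary and then compare rows_sampled and last_updated on each replica:

-- On the primary:
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;

-- On the primary and, a few seconds later, on each readable secondary:
SELECT  s.name,
        sp.rows,
        sp.rows_sampled,        -- equals rows after a fullscan; smaller if sampled
        sp.last_updated,
        sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.[object_id], s.stats_id) AS sp
WHERE s.[object_id] = OBJECT_ID(N'dbo.Orders');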
Yeah, there you go: all these _WA_Sys_ stats tell me they have been automatically created (there is a flag in sys.stats if you don't believe me), but I can see there are only 31, whereas I created 34 columns.
That’s funny, let’s see which data types did get statistics.
The results are pretty interesting.
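A sketch of that check, assuming a hypothetical dbo.StatsTest table: test each column against sys.stats and sys.stats_columns to see which data types picked up auto-created (_WA_Sys_) statistics and which did not.

SELECT  c.name AS column_name,
        t.name AS data_type,
        CASE WHEN a.stats_id IS NULL THEN 'no' ELSE 'yes' END AS has_auto_created_stats
FROM sys.columns AS c
JOIN sys.types   AS t
      ON t.user_type_id = c.user_type_id
OUTER APPLY ( SELECT TOP (1) s.stats_id
              FROM sys.stats AS s
              JOIN sys.stats_columns AS sc
                    ON sc.[object_id] = s.[object_id]
                   AND sc.stats_id    = s.stats_id
              WHERE s.[object_id]  = c.[object_id]
                AND sc.column_id   = c.column_id
                AND s.auto_created = 1 ) AS a
WHERE c.[object_id] = OBJECT_ID(N'dbo.StatsTest')
ORDER BY c.column_id;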
NOTE: The most important habit you can start with in Azure is putting everything into discrete, planned Resource Groups. These make management so much easier.
Once the account is set up, the first thing you need to do is create a Runbook. There is a collection of them for your use within Azure, but none of them are immediately applicable to what I need, so I'm just writing a really simple PowerShell script to do what I want:
Runbooks are an important part of Azure maintenance, and this is a gentle introduction to them.
Running multiple UPDATE STATISTICS commands for different statistics on a single table concurrently has been available under global Trace Flag 7471 since SQL Server 2014 SP1 CU6 and SQL Server 2016 CU1. Microsoft have documented this trace flag here and here.
It sounds like, for the most part, you might not want this flag turned on, but read the whole post.
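For reference, the flag is enabled globally and then separate sessions can update different statistics on the same table at the same time; the table and statistics names below are hypothetical.

-- Enable the documented trace flag globally, as described in the excerpt.
DBCC TRACEON (7471, -1);

-- Session 1:
UPDATE STATISTICS dbo.BigTable (stat_col1) WITH FULLSCAN;

-- Session 2, concurrently (no longer blocked by session 1 under TF 7471):
UPDATE STATISTICS dbo.BigTable (stat_col2) WITH FULLSCAN;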
“Starting with SQL Server 2016, sampling of data to build statistics is done in parallel, when using compatibility level 130, to improve the performance of statistics collection. The query optimiser will use parallel sample statistics whenever a table size exceeds a certain threshold.”
As per the previous demos, prior to SQL Server 2016 the gathering of sampled statistics was a serial, single-threaded operation; only statistics gathered using a full scan would qualify for parallelism. Now, in SQL Server 2016, all automatic and manual statistics operations qualify for parallelism.
He also has a neat trick for invalidating stats on a large table, so check out this article-length blog post.
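The compatibility-level dependence is the part that is easy to miss; a minimal illustration (the database and table names are hypothetical):

-- Parallel sampled statistics require compatibility level 130 or higher.
ALTER DATABASE StatsDemo SET COMPATIBILITY_LEVEL = 130;

-- On a sufficiently large table, this sampled scan is now eligible for a
-- parallel plan, not just FULLSCAN as in earlier versions.
UPDATE STATISTICS dbo.BigTable WITH SAMPLE 10 PERCENT;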
It’s not exactly perfect
For instance, if you just let it loose without filters, you get a severe error. The same thing happens if you try to filter on one of the columns in the function, rather than a column in sys.stats, like this.
Very cool. It’s one step closer to us removing our dependencies on DBCC SHOW_STATISTICS.
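The excerpt doesn't name the function, but it appears to be sys.dm_db_stats_histogram (available from SQL Server 2016 SP1 CU2); if so, the safe pattern is to drive it from sys.stats via CROSS APPLY and put your filters on the sys.stats columns, as in this sketch (the table and statistics names are hypothetical):

-- Assumption: the new DMF in question is sys.dm_db_stats_histogram.
-- Filter on sys.stats columns (object_id / name), not on the function's own columns.
SELECT  s.name,
        h.step_number,
        h.range_high_key,
        h.range_rows,
        h.equal_rows,
        h.average_range_rows
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.[object_id], s.stats_id) AS h
WHERE s.[object_id] = OBJECT_ID(N'dbo.Orders')
  AND s.name = N'PK_Orders';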
With that said, it would seem that asynchronous statistics are better suited to OLTP environments and synchronous statistics are better suited to OLAP environments. As synchronous statistics are the default, though, people are reluctant to change this setting without a good reason. I've worked exclusively in OLTP environments over the last few years and have never seen asynchronous statistics rolled out as the default. I have personally been bitten by synchronous statistics updates on large tables causing query timeouts, which I resolved by switching to asynchronous statistics updates.
This is an interesting, nuanced take on the issue. My bias is toward asynchronous stats updates because I have been burned, but it’s interesting to read someone thinking through the implications of this seemingly simple choice.
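For anyone who wants to try the asynchronous option, the switch itself is a one-liner (the database name is hypothetical); AUTO_UPDATE_STATISTICS still needs to be ON for it to have any effect.

-- Queries no longer wait for a statistics rebuild; the update runs in the background.
ALTER DATABASE OltpDb SET AUTO_UPDATE_STATISTICS_ASYNC ON;
ALTER DATABASE OltpDb SET AUTO_UPDATE_STATISTICS ON;   -- the default, shown for completeness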