Years ago I was of the opinion that it wasn’t really possible to see the statistics used in the generation of a query plan. If you read the comments here, I was corrected of that notion. However, I’ve never been a fan of using undocumented trace flags. Yeah, super heroes like Fabiano Amorim and Paul White use them, but for regular individuals like me, it seems like a recipe for disaster. Further, if you read about these trace flags, they cause problems on your system. Even Fabiano was getting the occasional crash.
So, what’s a safe way to get that information? First up, Extended Events. If you use the auto_stats event, you can see the statistics getting created and getting loaded and used. Even if they’re not created, you can see them getting loaded. It’s an easy way to quickly see which statistics were used to generate a plan. One note, you’ll have to compile or recompile a given query to see this in action.
Read on for more.
When I was putting together the lesson plans for this, I wanted to make my own query for the comparisons, not borrow one from another site or blog. Yes, I borrow plenty, but I wanted this to be mine. When I was presenting my “code tuning” class, I had recently upgraded my instance from 2012 to 2017. I had also put my database into 2017 compatibility mode. I had used this query to show that unions that are intensive can cause issues with tempdb and cause spill over. To my “joy”, when I ran the query in the class I did not get the tempdb spillover. And right then I realized that I was not in Kansas(2012 compatibility) any longer. But this proved to be opportunistic for the statistics/optimizer comparison.
Read on for a discussion of the cardinality estimator as well.
Azure SQL Data Warehouse is a massively parallel processing (MPP) architecture designed for large-scale data warehouses. An MPP system creates logical / physical slices of the data. In SQL Data Warehouse’s case, the data has 60 logical slices, at all performance tiers. This means that a single table can have up to 60 different object_ids. This is why, in SQL Data Warehouse, there is the concept of physical and logical object_ids along with physical names.
Below is a query for finding row counts of tables in SQL Data Warehouse which accounts for the differences in architecture between my earlier script, written for SQL Server, and SQL Data Warehouse.
Click through for the script.
The solution in many cases to simply update statistics on the underlying system tables indexes used by the problem DMV queries. This can be done selectively by identifying the system table indexes referenced in execution plan seek and scan operators of the problem query execution plan and then executing UPDATE STATISTICS on each index. However, the task is somewhat tedious.
Alternatively, one can simply update stats on all the system tables. Below is a script that generates and executes DDL to update stats on all non-empty system tables, making quick work of this.
Click through for the script, as well as an interesting note if you try to use constructs like @sql = @sql + N’some string’ in your code.
We know that the first query will return 500k rows and the second query will return 0 rows. However, can SQL Server know that? Each statistics object only contains information about its own column. There’s no correlation between the
MOD_FILTERcolumns, so there isn’t a way for SQL Server to know that the queries will return different estimates. The query optimizer can create an estimate based on the filters on the
WHEREclause and on the histograms of the join columns, but there’s no foolproof way to do that calculation. The presence of the filters introduces uncertainty into the estimate, even with statistics that perfectly describe the data for each column. The containment assumption is all about the modeling assumption that SQL Server has to make to resolve that uncertainty.
It’s an interesting post aimed at trying to get you to think like a simplified cardinality estimator. SQL Server doesn’t behave exactly like this, but it’s a good mental reference point.
By default, when Auto Update Statistics is set to True, the SQL Server Query Optimizer will automatically update statistics when data has met a threshold of changes (insert, update, delete, or merge) and the estimated rows are now potentially stale. When statistics are stale, execution plans can become suboptimal which can lead to degradation in performance.
This best practice option ensures your statistics stay up to date as much as possible. Each time a cached query plan is executed the Optimizer checks for data changes and potentially generates new statistics. This behavior is exactly what we want, but there is a catch. The caveat to this is that a cached query plan will be “held” while the statistics are updated and will recompile to use the new values before running. This caveat can slow down the execution process dramatically.
The advice Monica provides is generally sound, though there are rare cases when asynchronous statistics updates end up causing more problems than they solve, so test the change first.
SQL Server 2008
Filtered statistics are introduced, and these can be created separately from a filtered index. There are some limitations around filtered indexes with regard to the Query Optimizer (see Tim Chapman’s post The Pains of Filtered Indexes and Paul White’s post Optimizer Limitations with Filtered Indexes) post, and it’s important to understand the behavior of the counter that tracks modifications (and thus can trigger automatic updates). See Kimberly’s post Filtered indexes and filtered stats might become seriously out-of-date for more details, and I also recommend checking out her stored procedure that analyzes data skew and recommends where you can create filtered statistics to provide more information to the Query Optimizer. I’ve implemented this for several large customers that have VLTs and skewed distribution across columns frequently used in predicates.
Two new catalog views, sys.stats and sys.stats_columns, are added to provide easier insight into statistics and included columns. Use these two views instead of sp_helpstats, which is deprecated and provides less information.
This is a very interesting historical look. Most interesting to me was the decreases in the number of steps available.
This does not happen. The cost remains the same as before: 0.294842 units. This is because the scan is costed according to density instead of by looking at the histogram of the outer table. The following query with a local variable repeated five times also has a cost of 0.294842 optimizer units:DECLARE @var BIGINT = 1; SELECT * FROM ( VALUES (@var), (@var), (@var), (@var), (@var) ) s (ID) WHERE NOT EXISTS ( SELECT 1 FROM dbo.BIG_HEAP b WHERE s.ID = b.ID ) OPTION (NO_PERFORMANCE_SPOOL);
The problem with using density instead of looking at the data in the outer table is mostly apparent when the outer table contains rows without a match in the inner table.
It’s a great bit of investigative legwork and Joe has a Connect item he’d like you to upvote.
On more than one occasion I have had an emergency request because everything was broken. The everything in almost every incident is an SSIS package that is failing with error messages. The error message will typically have text similar to the following:
Could not locate statistics ‘_WA_Sys_00000015_346C780E’ in the system catalogs.
Due to the error, the package fails processing and grinds to a halt. When diving into the package it is discovered that the missing stats happen to be coming from a linked server query. This raises a big bright blaring alarm for me. Why is the SSIS package accessing the data via a linked server? This is rather counter-productive and definitely contrary to what is desired from a performance perspective.
Jason methodically walks us through the troubleshooting process and provides the solution at the end.
SQL Server keeps track of how many inserts and deletes since last stats update – when the number of inserts/deletes exceeds the stats update threshold the next time a query requests those stats it’ll qualify for an update. Trace flag 2371 alters the threshold function before SQL Server 2016. With 2016 compatibility mode, the T2371 function becomes default behavior. Auto-stats update and auto-stats update async settings of the database determine what happens once the stats qualify for an update. But whether an auto-stats update or a manual stats update, the density, histogram, etc are all updated.
Trace flags 2389, 2390, 4139, and the ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS hint operate outside the full stats framework, bringing in the quickstats update. They have slightly different scope in terms of which stats qualify for quickstats updates – but in each case its *only* stats for indexes, not stats for non-indexed columns that can qualify. After 3 consecutive stats updates on an index, SQL Server “brands” the stats type as ascending or static, until then it is branded ‘unknown’. The brand of a stat can be seen by setting trace flag 2388 at the session level and using dbcc show_statistics.
Right now there are just a few details and several links, but it does look like he’s going to expand it out.