Press "Enter" to skip to content

Category: Statistics

Limitations Of Statistics

Erik Darling discusses statistics and limitations of statistics outside of indexes:

Head rush moment: SQL may use information from histograms outside of the index it chooses for cardinality estimation.

Back to earth: If you just create a bunch of statistics instead of indexes, you’re (at best) using your Clustered Index for everything (which is still bad), or you’re using a HEAP for everything (which is usually worse). You’re still generally better off creating good indexes for your workload. They’ll get statistics objects created and associated with them, and if SQL thinks another column is interesting, it will create a single column statistics object for it, as long as you haven’t turned off auto create stats.

Sure, you can put on your black cloak and goat mask and create some multi-column or filtered statistics, but in the words of a wise man (Doug), you end up with more stats to maintain and understanding query behavior gets more difficult.

Erik argues that at the margin, keeping existing statistics up to date tends to be a better solution than creating new statistics.
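
If you want to poke at this yourself, here's a minimal sketch (table, column, and statistics names are hypothetical) that checks whether auto create stats is on, lists what statistics already exist on a table, and hand-rolls one of those black-cloak multi-column statistics:

  -- Is auto create statistics enabled for the current database?
  SELECT name, is_auto_create_stats_on
  FROM sys.databases
  WHERE name = DB_NAME();

  -- What statistics already exist on a (hypothetical) table, and which
  -- were auto-created single-column stats versus index or user stats?
  SELECT s.name, s.auto_created, s.user_created, s.has_filter
  FROM sys.stats AS s
  WHERE s.object_id = OBJECT_ID('dbo.Orders');

  -- A hand-rolled multi-column statistic, if you insist.
  CREATE STATISTICS st_Orders_Status_Region
  ON dbo.Orders (Status, Region)
  WITH FULLSCAN;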

Indexes And Stats

Brent Ozar looks at a case when adding a suggested index monkeys with stats:

The query runs faster, make no mistake – but check out the estimates:

  • Estimated number of rows = 1
  • Actual number of rows = 165,367

Those estimates are built by SQL Server’s cardinality estimator (CE), and there have been major changes to it over the last couple of versions. You can control which CE you’re using by changing the database’s compatibility level. This particular StackOverflow database is running in 2016 compat mode – so what happens if we switch it back to 2012 compat mode?

Based on this result, there might be further optimizations available.  Read on for more of Brent’s thoughts.
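
If you want to run the same experiment, switching which cardinality estimator is in play is just a compatibility level change; a quick sketch, assuming a database named StackOverflow:

  -- 110 = SQL Server 2012 compatibility (legacy CE);
  -- 130 = SQL Server 2016 compatibility (new CE).
  ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 110;
  -- ...re-run the query and compare estimates, then switch back...
  ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 130;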

Understanding Estimation Variance

Aaron Bertrand discusses drift between estimates and actuals in execution plans:

Now, inaccurate estimates won’t always be a problem, but it can cause issues with inefficient plan choices at the two extremes. A single plan might not be optimal when the chosen range will yield a very small or very large percentage of the table or index, and this can get very hard for SQL Server to predict when the data distribution is uneven. Joseph Sack outlined the more typical things bad estimates can affect in his post, “Ten Common Threats to Execution Plan Quality.”

Read the whole thing.
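
One practical way to see where an estimate came from is to look at the histogram behind the statistic the optimizer used; a small sketch with hypothetical table and statistic names:

  -- Show the histogram steps the optimizer uses to estimate rows
  -- for a range predicate on OrderDate.
  DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate') WITH HISTOGRAM;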

Where Polybase Stats Live

I dig into where the statistics against a Polybase table actually live:

Today, we learned that Polybase statistics are stored in the same way as other statistics; as far as SQL Server is concerned, they’re just more statistics built from a table (remembering that the way stats get created involves loading data into a temp table and building stats off of that temp table).  We can do most of what you’d expect with these stats, but beware calling sys.dm_db_stats_properties() on Polybase stats, as they may not show up.

Also, remember that you cannot maintain, auto-create, auto-update, or otherwise modify these stats.  The only way to modify Polybase stats is to drop and re-create them, and if you’re dealing with a large enough table, you might want to take a sample.

The result isn’t very surprising in retrospect, and it’s good that “stats are stats are stats” is the correct answer.
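
Since there's no auto-update or maintenance for them, the drop-and-re-create cycle looks something like this (external table, column, and statistic names are made up):

  -- Drop the existing statistics object on the external table...
  DROP STATISTICS dbo.SalesExternal.st_SalesExternal_OrderDate;

  -- ...and re-create it, sampling if the table is large enough to hurt.
  CREATE STATISTICS st_SalesExternal_OrderDate
  ON dbo.SalesExternal (OrderDate)
  WITH SAMPLE 25 PERCENT;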

Messing With Statistics

Erik Darling shows how to fake stats:

One thing I’ve found is that the inflated counts don’t seem to change anything for Identities, or Primary Keys. You’ll always get very reasonable plans and estimates regardless of how high you set row and page counts for those. Regular old clustered indexes are fair game.

Some really interesting things can start to happen to execution plans when SQL thinks there’s this many rows in a table. The first is that SQL will use a rare (in my experience) plan choice: Index Intersection. You can think of this like a Key Lookup but with two nonclustered indexes rather than from one nonclustered index to the clustered index.

This is very useful when you don’t have many rows in dev, can’t put many rows in dev, and can’t restore a stats-only database from prod.
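
The faking itself is a one-liner against a dev table (table name and counts here are made up); at the next recompile, the optimizer plans as though the table really were that big:

  -- Tell the optimizer there are 10 million rows across 1 million pages,
  -- regardless of what's actually in the table.
  UPDATE STATISTICS dbo.Orders
  WITH ROWCOUNT = 10000000, PAGECOUNT = 1000000;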

Stats Terminology

Erik Darling fills in gaps on statistics terminology in his unique style:

SELECTIVITY

This tells you how special your snowflakes are. When a column is called “highly selective” that usually means values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

DENSITY

This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique. They’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.

Combine that with Kendra Little’s statistics FAQ for additional learning.
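
Density has a precise definition, too: 1 divided by the number of distinct values, and you can read it straight out of the statistics object (table and statistic names below are hypothetical):

  -- The "All density" column is 1 / (number of distinct values)
  -- for each column prefix in the statistic.
  DBCC SHOW_STATISTICS ('dbo.People', 'IX_People_ZipCode') WITH DENSITY_VECTOR;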

Statistics FAQ

Kendra Little has a great FAQ on statistics, from the standpoint of developers as well as administrators:

I’ve been asked a lot of questions about updating statistics in SQL Server over the years. And I’ve asked a lot of questions myself! Here’s a rundown of all the practical questions that I tend to get about how to maintain these in SQL Server.

I don’t dig into the internals of statistics and optimization in this post. If you’re interested in that, head on over and read the fahhhbulous white paper, Statistics Used by the Query Optimizer in SQL Server 2008. Then wait a couple days and chase it with its charming cousin, Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator.

I’m also not talking about statistics for memory optimized tables in this article. If you’re interested in those, head over here.

This is a great distillation of important and oft-misunderstood content.

Auto-Update Stats Threshold Change

Erik Darling points out that automatic statistics updates will happen more frequently in 2016 for large tables than in prior versions:

Here’s an abridged version of 10-20 million and 30-40 million rows, and how many modifications they took before a stats update occurred. If you follow the PercentMod column down, the returns diminish a bit the higher up you get. I’m not saying that I’d prefer to wait for 20% + 500 rows to modify, by any stretch. My only point here is that there’s not a set percentage to point to.

And, because you’re probably wondering, turning on Trace Flag 2371 in 2016 doesn’t make any difference.

This is a good change, though as Erik points out, if you’re managing very large tables, you might already have the trace flag on and thereby won’t see any difference.
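
For a sense of scale, the 2016 behavior uses the same dynamic threshold Trace Flag 2371 introduced, roughly SQRT(1000 * rows) modifications instead of 20% + 500; a quick back-of-the-envelope comparison for a 10-million-row table:

  SELECT
      0.20 * 10000000 + 500   AS OldThreshold,  -- 2,000,500 modifications
      SQRT(1000.0 * 10000000) AS NewThreshold;  -- 100,000 modifications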

Specifying Statistic Row Counts

Daniel Hutmacher shows how to fake row counts using the UPDATE STATISTICS command:

In this case, the only change was that the query went parallel, which introduces a few more operators. However, a lot can change in a query when you scale the volume, and quite often, the entire layout of the plan can change dramatically.

Note that the query optimizer not only considers the rowcount, but often also the page count (which translates to how many megabytes of data need to be moved), so you may do well to include WITH PAGECOUNT as well.

This isn’t something I’ve ever done, but could be interesting in some scenarios, such as finding out how an application will run as the database grows.
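
If you do experiment with faked counts, you can put the real numbers back without waiting for an index rebuild; a sketch with placeholder database and table names:

  -- Recount pages and rows and write the actual values back to the catalog,
  -- undoing the faked ROWCOUNT and PAGECOUNT.
  DBCC UPDATEUSAGE ('TestDB', 'dbo.Orders') WITH COUNT_ROWS;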

Object Level Auto-Stats

Andy Mallon on auto-statistics:

I’m not going to say whether I think Auto Update Statistics should be on or off. Instead, I’m going to argue that there are definitely scenarios when you want to have this on AND there are scenarios where you want it turned off. Can you really have this both ways? Absolutely.

I’m going to say yes to auto-update unless you know the answer is no for an object.  But it’s nice to know that the fine-grained option is available.
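
That fine-grained switch lives at the statistics level: NORECOMPUTE turns auto-update off for a particular statistic, and sp_autostats shows or flips the setting per table (object names below are hypothetical):

  -- Disable auto-update for one statistic while updating it...
  UPDATE STATISTICS dbo.Orders st_Orders_Status WITH FULLSCAN, NORECOMPUTE;

  -- ...or review and toggle auto-update for everything on a table.
  EXEC sp_autostats 'dbo.Orders';          -- show current settings
  EXEC sp_autostats 'dbo.Orders', 'OFF';   -- turn auto-update stats off
  EXEC sp_autostats 'dbo.Orders', 'ON';    -- turn it back on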
