Press "Enter" to skip to content

Category: Statistics

Porting Statistics in PostgreSQL

Radim Marek imports production statistics:

In the previous article we covered how the PostgreSQL planner reads pg_class and pg_statistic to estimate row counts, choose join strategies, and decide whether an index scan is worth it. The message was clear: when statistics are wrong, everything else goes with it.

But there was one thing we didn’t talk about. Statistics are specific to the database cluster that generated them. The primary way to populate them is `ANALYZE` which requires the actual data.

Click through to see how Postgres handles this. It’s quite similar to SQL Server’s DBCC CLONEDATABASE in practice, it seems.

Leave a Comment

Permanently Empty Statistics

Guy Glantser takes us through an edge case:

Many SQL Server DBAs rely on automated statistics maintenance solutions such as Ola Hallengren’s maintenance scripts. These scripts typically update statistics only when the modification counter exceeds a threshold.

But there is a corner case that can cause statistics to remain empty forever, and many DBAs are not aware of it.

Read on to see how you can end up with no statistics at all on a table.

Leave a Comment

Reducing Row Count Estimation Errors in PostgreSQL

Shinya Kato lays out four approaches:

PostgreSQL’s query planner relies on table statistics to estimate the number of rows (estimated rows) each operation will process, and then selects an optimal execution plan based on these estimates. When the estimated rows diverge significantly from the actual rows, the planner can choose a suboptimal plan, leading to severe query performance degradation.

This article walks through four approaches I used to reduce row count estimation errors, ordered from least to most invasive. Due to confidentiality constraints, I cannot share actual SQL or execution plans, so the focus is on the diagnostic thought process and the techniques applied.

Click through for those thought processes.

Comments closed

Automatic Stats Updates and Plan Cache Invalidation

Brent Ozar threads the needle:

Normally, when SQL Server updates statistics on an object, it invalidates the cached plans that rely on that statistic as well. That’s why you’ll see recompiles happen after stats updates: SQL Server knows the stats have changed, so it’s a good time to build new execution plans based on the changes in the data.

However, updates to system-created stats don’t necessarily cause plan recompiles.

Read on for the rare situation in which invalidation doesn’t happen.

Also, Brent has me wondering if the lemon popcorn is more sweet/tart (like a lemon bar) or savory (like lemon pepper) and I may have to try both styles.

Comments closed

Table Statistics and Planning Slowdowns

Andrei Lepikhov digs into a performance issue:

A query executes in just 2 milliseconds, yet its planning phase takes 500 ms. The database is reasonably sized, the query involves 9 tables, and the default_statistics_target is set to only 500. Where does this discrepancy come from?

This question was recently raised on the pgsql-performance mailing list, and the investigation revealed a somewhat surprising culprit: the column statistics stored in PostgreSQL’s pg_statistic table.

Read on for Andrei’s analysis and some interesting thoughts on possible avenues for improvement.

Comments closed

Statistics on Partitioned Tables in PostgreSQL

Laurenz Albe gathers stats:

I recently helped a customer with a slow query. Eventually, an ANALYZE on a partitioned table was enough to fix the problem. This came as a surprise for the customer, since autovacuum was enabled. So I decided to write an article on how PostgreSQL collects partitioned table statistics and how they affect PostgreSQL’s estimates.

Read on to see how it works and how you can generate statistics at the table level and not just the partition level.

Comments closed

PostgreSQL Query Tuning for the Oracle DBA

Kellyn Gorman continues a series on PostgreSQL for Oracle DBAs:

As an Oracle DBA venturing into the world of PostgreSQL, one of the most important areas to get comfortable with is performance tuning. While Oracle’s Cost-Based Optimizer (CBO) is a well-known powerhouse that many DBAs have learned to both respect and wrestle with, PostgreSQL offers its own sophisticated query planner that behaves differently.  Understanding these differences is key to becoming proficient with PostgreSQL as we step into tuning.

Read on to gain an overview of what options are available on the PostgreSQL side and how they differ from what’s available in Oracle.

Comments closed

Maintaining Statistics Information Post-Update in PostgreSQL 18

Laurenz Albe takes a peek at an upcoming feature:

Everybody wants good performance. When it comes to the execution of SQL statements, accurate optimizer statistics are key. With the upcoming v18 release, PostgreSQL will preserve the optimizer statistics during an upgrade with dump/restore or pg_upgrade (see commit 1fd1bd8710 and following). With the beta testing season for PostgreSQL v18 opened, it is time to get acquainted with the new feature.

It’s kind of wild to me that this wasn’t in place years ago for PostgreSQL.

Comments closed

Automated Table Statistics on Delta Tables in Microsoft Fabric

Santhosh Kumar Ravindran makes an announcement:

We’re thrilled to introduce Automated Table Statistics in Microsoft Fabric Data Engineering — a major upgrade that helps you get blazing-fast query performance with zero manual effort.

Whether you’re running complex joins, large aggregations, or heavy filtering workloads, Fabric’s new automated statistics will help Spark make smarter decisions, saving you time, compute, and money.

Click through to see what’s included, as well as the limitations associated with this. You can still create manual statistics if you’d like, so on the whole, I approve.

Comments closed

Comparing Varieties of Statistics in SQL Server

Kendra Little gets the smorgasbord:

Statistics in SQL Server are simple in theory: they help the optimizer estimate how many rows a query might return.

In practice? Things get weird fast. Especially when you start filtering on multiple columns, or wondering why the optimizer thinks millions of rows are coming back when you know it’s more like a few hundred thousand.

In this post, I’ll walk through examples using single-column, multi-column, and filtered statistics—and show where estimates go off the rails, when they get back on track, and why that doesn’t always mean you need to update everything with FULLSCAN.

Read on for a review of the three types of statistics. Admittedly, I’ve never had much luck with filtered statistics improving the performance of queries. If I were to speculate, I’d say that they’re good for a very specific type of problem that maybe I just don’t run into that often.

Comments closed