Category: Statistics

Reducing Row Count Estimation Errors in PostgreSQL

Shinya Kato lays out four approaches:

PostgreSQL’s query planner relies on table statistics to estimate the number of rows (estimated rows) each operation will process, and then selects an optimal execution plan based on these estimates. When the estimated rows diverge significantly from the actual rows, the planner can choose a suboptimal plan, leading to severe query performance degradation.

This article walks through four approaches I used to reduce row count estimation errors, ordered from least to most invasive. Due to confidentiality constraints, I cannot share actual SQL or execution plans, so the focus is on the diagnostic thought process and the techniques applied.

Click through for those thought processes.
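Since the original post can't share real plans, here is a small sketch of the first diagnostic step as I read it: compare estimated rows with actual rows per plan node and flag the big misses. The node names, row counts, and the 10x threshold are all made up for illustration; in practice the numbers would come from EXPLAIN ANALYZE output.

```python
def estimation_error(estimated_rows: float, actual_rows: float) -> float:
    """Return the factor by which the estimate is off (always >= 1.0)."""
    # Clamp both sides to 1 so a zero-row estimate or result cannot divide by zero.
    est = max(float(estimated_rows), 1.0)
    act = max(float(actual_rows), 1.0)
    return max(est / act, act / est)


# Hypothetical (node, estimated rows, actual rows) triples, not from the article.
nodes = [
    ("Seq Scan on orders", 1000, 1200),
    ("Hash Join", 50, 40000),
    ("Index Scan on customers", 10, 8),
]

# Assumed cutoff: treat anything off by 10x or more as worth investigating.
THRESHOLD = 10.0

suspects = [name for name, est, act in nodes if estimation_error(est, act) >= THRESHOLD]
print(suspects)
```

Using a symmetric ratio means underestimates and overestimates are scored the same way, which is handy because either direction can push the planner toward a bad join strategy.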

Automatic Stats Updates and Plan Cache Invalidation

Brent Ozar threads the needle:

Normally, when SQL Server updates statistics on an object, it invalidates the cached plans that rely on that statistic as well. That’s why you’ll see recompiles happen after stats updates: SQL Server knows the stats have changed, so it’s a good time to build new execution plans based on the changes in the data.

However, updates to system-created stats don’t necessarily cause plan recompiles.

Read on for the rare situation in which invalidation doesn’t happen.

Also, Brent has me wondering if the lemon popcorn is more sweet/tart (like a lemon bar) or savory (like lemon pepper) and I may have to try both styles.

Table Statistics and Planning Slowdowns

Andrei Lepikhov digs into a performance issue:

A query executes in just 2 milliseconds, yet its planning phase takes 500 ms. The database is reasonably sized, the query involves 9 tables, and the default_statistics_target is set to only 500. Where does this discrepancy come from?

This question was recently raised on the pgsql-performance mailing list, and the investigation revealed a somewhat surprising culprit: the column statistics stored in PostgreSQL’s pg_statistic table.

Read on for Andrei’s analysis and some interesting thoughts on possible avenues for improvement.

Statistics on Partitioned Tables in PostgreSQL

Laurenz Albe gathers stats:

I recently helped a customer with a slow query. Eventually, an ANALYZE on a partitioned table was enough to fix the problem. This came as a surprise for the customer, since autovacuum was enabled. So I decided to write an article on how PostgreSQL collects partitioned table statistics and how they affect PostgreSQL’s estimates.

Read on to see how it works and how you can generate statistics at the table level and not just the partition level.

PostgreSQL Query Tuning for the Oracle DBA

Kellyn Gorman continues a series on PostgreSQL for Oracle DBAs:

As an Oracle DBA venturing into the world of PostgreSQL, one of the most important areas to get comfortable with is performance tuning. While Oracle’s Cost-Based Optimizer (CBO) is a well-known powerhouse that many DBAs have learned to both respect and wrestle with, PostgreSQL offers its own sophisticated query planner that behaves differently. Understanding these differences is key to becoming proficient with PostgreSQL as we step into tuning.

Read on to gain an overview of what options are available on the PostgreSQL side and how they differ from what’s available in Oracle.

Maintaining Statistics Information Post-Update in PostgreSQL 18

Laurenz Albe takes a peek at an upcoming feature:

Everybody wants good performance. When it comes to the execution of SQL statements, accurate optimizer statistics are key. With the upcoming v18 release, PostgreSQL will preserve the optimizer statistics during an upgrade with dump/restore or pg_upgrade (see commit 1fd1bd8710 and following). With the beta testing season for PostgreSQL v18 opened, it is time to get acquainted with the new feature.

It’s kind of wild to me that this wasn’t in place years ago for PostgreSQL.

Automated Table Statistics on Delta Tables in Microsoft Fabric

Santhosh Kumar Ravindran makes an announcement:

We’re thrilled to introduce Automated Table Statistics in Microsoft Fabric Data Engineering — a major upgrade that helps you get blazing-fast query performance with zero manual effort.

Whether you’re running complex joins, large aggregations, or heavy filtering workloads, Fabric’s new automated statistics will help Spark make smarter decisions, saving you time, compute, and money.

Click through to see what’s included, as well as the limitations associated with this. You can still create manual statistics if you’d like, so on the whole, I approve.

Comparing Varieties of Statistics in SQL Server

Kendra Little gets the smorgasbord:

Statistics in SQL Server are simple in theory: they help the optimizer estimate how many rows a query might return.

In practice? Things get weird fast. Especially when you start filtering on multiple columns, or wondering why the optimizer thinks millions of rows are coming back when you know it’s more like a few hundred thousand.

In this post, I’ll walk through examples using single-column, multi-column, and filtered statistics—and show where estimates go off the rails, when they get back on track, and why that doesn’t always mean you need to update everything with FULLSCAN.

Read on for a review of the three types of statistics. Admittedly, I’ve never had much luck with filtered statistics improving the performance of queries. If I were to speculate, I’d say that they’re good for a very specific type of problem that maybe I just don’t run into that often.

Automating Management of Extended Statistics in PostgreSQL

Andrei Lepikhov builds an extension:

The extended statistics tool allows you to tell Postgres that additional statistics should be collected for a particular set of table columns. Why is this necessary? – I will try to quickly explain using the example of an open power plant database. For example, the fuel type (primary_fuel) used by a power plant is implicitly associated with the country’s name.

Click through to learn more about what extended statistics are and the nature of the extension.
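To make the country and fuel type example concrete, here is a rough Python sketch of why correlated columns throw off an estimate when each column's selectivity is treated as independent. The data is invented (it is not the power plant database from the post), and the arithmetic is a simplified model of per-column selectivity, not PostgreSQL's actual estimator.

```python
from collections import Counter

# Hypothetical rows of (country, primary_fuel); fuel is strongly tied to country.
rows = (
    [("Norway", "Hydro")] * 90 + [("Norway", "Gas")] * 10
    + [("Poland", "Coal")] * 80 + [("Poland", "Gas")] * 20
)
total = len(rows)

# Per-column frequencies, standing in for single-column statistics.
country_freq = Counter(country for country, _ in rows)
fuel_freq = Counter(fuel for _, fuel in rows)


def independent_estimate(country: str, fuel: str) -> float:
    """Estimate matching rows by multiplying per-column selectivities."""
    selectivity = (country_freq[country] / total) * (fuel_freq[fuel] / total)
    return selectivity * total


def actual_count(country: str, fuel: str) -> int:
    """Count rows that really match both predicates."""
    return sum(1 for row in rows if row == (country, fuel))


for pair in [("Norway", "Hydro"), ("Norway", "Coal")]:
    print(pair, round(independent_estimate(*pair), 1), actual_count(*pair))
```

In this toy data the independent estimate lands at about half the true count for Norway and Hydro, and predicts dozens of rows for Norway and Coal when there are none. Statistics that cover both columns together are meant to close exactly that kind of gap.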

Multi-Column Statistics in PostgreSQL

Hans-Jürgen Schönig creates new statistics:

If you are using PostgreSQL for analytics or large-scale aggregations, you might occasionally notice the planner making false assumptions regarding the number of rows. While this isn’t a problem for small aggregates, it is indeed an issue for large-scale aggregations featuring many different dimensions.

In short: The more columns your GROUP BY statement contains, the more likely it is that optimizer overestimates the row count.

This blog explains how this can be handled in PostgreSQL.

Maybe it’s just me, but I don’t recall many instances in which adding multi-column statistics without any sort of index change significantly improved a query’s performance. I can understand how it could improve things like memory grants, so perhaps that’s how I’m selling it short. But I struggle to recall a specific case in which a query got measurably faster as a result.
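For what it's worth, here is a quick Python sketch of the GROUP BY overestimate described above. It assumes a simplified model in which the group count is the product of each column's distinct count, capped at the table's row count; that is my illustration of the idea, not PostgreSQL's real estimation code, and the data is made up.

```python
# Hypothetical rows of (country, city); each city belongs to exactly one country.
rows = [
    ("AT", "Vienna"), ("AT", "Graz"),
    ("DE", "Berlin"), ("DE", "Munich"),
    ("FR", "Paris"), ("FR", "Lyon"),
] * 50

# Per-column distinct counts, standing in for single-column statistics.
n_country = len({country for country, _ in rows})
n_city = len({city for _, city in rows})

# Simplified model: assume every combination of values can occur,
# then cap the result at the number of rows in the table.
naive_groups = min(n_country * n_city, len(rows))

# What GROUP BY country, city would actually produce.
actual_groups = len(set(rows))

print(naive_groups, actual_groups)
```

The naive figure comes out three times too high with only two columns, and the gap widens with every additional dependent column in the GROUP BY. A multi-column distinct count would let the estimate match the real number of groups.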