Comparative Query Analysis and Planning

Christophe Pettus has a two-parter. The first post covers how a half-dozen data platform technologies handle cost-based analysis:

PostgreSQL has ANALYZE. You run it (or autovacuum runs it for you), it draws a sample of 300 × default_statistics_target rows, and it writes a row per column into pg_statistic: a null fraction, an n-distinct estimate, a most-common-values list, an equi-depth histogram, and a physical-vs-logical correlation. The planner reads those numbers, multiplies selectivities together, costs a handful of join strategies, and picks one. Three join algorithms are on the menu: nested loop, merge join, hash join.

That is the entire shape of the problem, and every cost-based optimizer ever shipped solves the same one. They differ in three places, and only three: where the numbers come from, how stale the numbers are allowed to get, and which plan shapes are even legal to choose between. The algorithms are the boring part. Everybody hash-joins. The interesting part is the bookkeeping.

Then there’s how each of the systems generates a query plan:

Statistics are the input. Planning is what the database does with them: it takes a declarative query, which describes what you want and says nothing about how, and turns it into an executable plan, which is nothing but how. There are two jobs inside that. First, rewrite the query into a logically equivalent but more tractable shape, which is where subquery flattening, predicate pushdown, and view merging live. Second, search the space of physical plans (join orders, join algorithms, access paths) for the cheapest one the cost model can find. The second job is the hard one, because the number of possible join orders for a query grows faster than anyone wants to contemplate, and every database in this article is, underneath, a strategy for not enumerating all of them.

Two questions separate the six systems here. How does each one tame that search space? And once it has an answer, how much will it let you argue with the result? Those sound like the same question. They are not, and the most useful thing this comparison does is pull them apart. A database can search brilliantly and refuse you any override at all (Snowflake), search crudely and hand you a fistful of hints anyway (MySQL until recently), or search hard and expose every lever ever machined (Oracle). Sophistication of the search and generosity of the control surface are independent axes. Knowing where a system sits on each tells you most of what its planner feels like to live with.

Slightly odd is that there’s a section of DB2 but not on SQL Server. But it is a good cross-comparison of several of the top relational database options.

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30