Internals – Page 5 – Curated SQL

Transforming Queries Based on Human Intent

Published 2024-10-04 by Kevin Feasel

Andrei Lepikhov and Alena Rybakina ask a question:

As usual, this project was prompted by multiple user reports with typical complaints, like ‘SQL server executes the query times faster’ or ‘Postgres doesn’t pick up my index’. The underlying issue that united these reports was frequently used VALUES sequences, typically transformed in the query tree into an SEMI JOIN.

I also want to argue one general question: Should an open-source DBMS correct user errors? I mean optimising a query even before the search for an optimal plan begins, eliminating self-joins, subqueries, and simplifying expressions – everything that can be achieved by proper query tuning. The question is not that simple since DBAs point out that the cost of query planning in Oracle overgrows with the complexity of the query text, which is most likely caused, among other things, by the extensive range of optimisation rules.

My short answer is, yes. SQL is a 4th generation language, meaning that end users describe the results they need but leave it to the engine to determine how to get there. As performance tuners, we may understand some of the foibles of the database engine and how it does (or does not) perform these translations, but in an ideal world, every unique representation of an end state for a given query should have the same, maximally optimized internal way of getting there. This is impossible in practice, but it should be a guiding principle for engine behavior.

Comments closed

The Internals of Data Updates in PostgreSQL

Published 2024-10-01 by Kevin Feasel

Cary Huang explains how update operations work:

In previous blogs, we talked about an overview of PostgreSQL’s table access method API here , how sequential scan is handled within this API here, and how data insertion is handled here. Today in this blog, we will look closely into how PostgreSQL handles update. A successful update in PostgreSQL can be viewed as “insert a new record” while “marking the old record as invisible” due to the MVCC technique that PostgreSQL employs. It sounds simple enough, but there are quite a lot of considerations in place to make a successful update. Let’s dive in.

There’s a lot going on behind the scenes, and Cary does a good job of explaining it all.

Comments closed

Kafka Internals: Handling a Producer Request

Published 2024-09-26 by Kevin Feasel

Danica Fine continues a series on Kafka internals:

Welcome to the second installment of our blog series to understand the inner workings of the beautiful black box that is Apache Kafka®.

We’re diving headfirst into Kafka to see how we actually interact with the cluster through producers and consumers. Along the way, we explore the configurations that affect each step of this epic journey and the metrics that we can use to more effectively monitor the process.

In the last blog, we explored what the Kafka producer client does behind the scenes each time we call producer.send() (or similar, depending on your language of choice). In this post, we follow our brave hero, a well-formed produce request, that’s on its way to the broker to be processed and have its data stored on the cluster.

Click through to learn more about how it all works.

Comments closed

Self-Joins and Halloween Protection

Published 2024-09-19 by Kevin Feasel

Paul White has an explanation:

I was asked recently why Halloween Protection was needed for data modification statements that include a self-join of the target table. This gives me a chance to explain, while also covering some interesting product bug history from the SQL Server 7 and 2000 days.

Read on for that explanation.

Paul’s explanation of the bug reminded me of the “quirky update” approach to building a running total, except that, instead of fixing a bug that eliminated it, the process always remained on a knife’s edge of “unsupported but works…at least until we change something and it doesn’t work anymore.”

Comments closed

The State of the ANY Aggregate Transformation

Published 2024-09-16 by Kevin Feasel

Paul White covers an aggregate operator:

SQL Server provides a way to select any one row from a group of rows, provided you write the statement using a specific syntax. This method returns any one row from each group, not the minimum, maximum or anything else. In principle, the one row chosen from each group is unpredictable.

The general idea of the required syntax is to logically number rows starting with 1 in each group in no particular order, then return only the rows numbered 1. The outer statement must not select the numbering column for this query optimizer transformation (SelSeqPrjToAnyAgg) to work.

Read on for information about this internal operator, a bug that existed in it for a long time, and the current state of fixes.

Comments closed

Vacuum, MERGE, and ON CONFLICT Directives

Published 2024-09-06 by Kevin Feasel

Shane Borden looks back at a directive:

I previously blogged about ensuring that the “ON CONFLICT” directive is used in order to avoid vacuum from having to do additional work. You can read the original blog here: Reduce Vacuum by Using “ON CONFLICT” Directive

Now that Postgres has incorporated the “MERGE” functionality into Postgres 15 and above, I wanted to ensure that there was no “strange” behavior as it relates to vacuum when using merge.

Read on to see how things look now.

Comments closed

Search in a Vector Database

Published 2024-09-05 by Kevin Feasel

Brendan Tierney continues a series on vector databases:

Searching semantic similarity in a data set is now equivalent to searching for nearest neighbors in a vector space instead of using traditional keyword searches using query predicates. The distance between “dog” and “wolf” in this vector space is shorter than the distance between “dog” and “kitten”. A “dog” is more similar to a “wolf” than it is to a “kitten”.

Click through to learn more about some of the common techniques for performing similarity searches against vectorized data.

Comments closed

Performance Profile of Fast-Forward Cursors

Published 2024-09-03 by Kevin Feasel

Hugo Kornelis continues a deep dive into cursors:

One of the things that has always bothered me about the fast forward cursor type is the shocking lack of documentation of what it does exactly. Okay, the name suggests that it is fast. But is it really? When I first looked at cursor performance (granted, a long time ago), I found that a static cursor was actually faster than a fast forward cursor for the same query. So… fastish forward?

The name also suggests that this cursor is forward only. That is indeed the case. At least they got that right in the naming.

Read on to learn more about this cursor property, as well as how it compares to static and dynamic cursors. Hugo ends on a spicy take you won’t want to miss.

Comments closed

Write-Ahead Logging in PostgreSQL

Published 2024-08-21 by Kevin Feasel

Semab Tariq continues a series on internals by covering write-ahead logging:

Write-Ahead Logging (WAL) is a critical feature in PostgreSQL that ensures data integrity and durability. It records changes to the database before they are applied, allowing PostgreSQL to recover data and restore the database to its most recent state in case of a crash or failure.

Before the introduction of Write-Ahead Logging (WAL) in PostgreSQL, PostgreSQL relied on a simpler mechanism for ensuring data integrity, which was less robust and did not support advanced features like point-in-time recovery and replication.

Read on for a bit of history, how write-ahead logging has changed over the years, and the basics of how it works. Write-ahead logging is a common feature in major relational databases and serves a critical purpose for database consistency. There are also some sneaky performance gains you can get out of it.

Comments closed

Query Re-Optimization in Postgres

Published 2024-08-20 by Kevin Feasel

Andrei Lepikhov walks through an interesting scenario:

What was the impetus to begin this work? It was caused by many real cases that may be demonstrated clearly by the Join Order Benchmark. How much performance do you think Postgres loses if you change its preference of employing parallel workers from one to zero? Two times regression? What about 10 or 100 times slower?

The black line in the graph below shows the change in execution time of each query between two cases: with parallel workers disabled and with a single parallel worker per gather allowed. For details, see the test script and EXPLAINs, with and without parallel workers.

Click through for an overview of what Andrei wrote, including architectural notes. But stick around until the end to see just how difficult the challenge is to re-optimize without making performance worse in the end.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

Category: Internals