
Category: Internals

Handling a Consumer Fetch Request in Kafka

Multiple Confluent employees (who apparently don’t get to have names this time around) wrap up a series:

It’s been a long time coming, but we’ve finally arrived at the fourth and final installment of our blog series. In this series, we’ve been peeling back the layers of Apache Kafka® to get a deeper understanding of how best to interact with the cluster using producer and consumer clients.

Read on for the final part, as well as links to previous parts if you missed them.

Comments closed

The Difficulty of Deletion in PostgreSQL

Radim Marek takes us through the process of deleting rows:

Your database is ticking along nicely – until a simple DELETE brings it to its knees. What went wrong? While we tend to focus on optimizing SELECT and INSERT operations, we often overlook the hidden complexities of DELETE. Yet, removing unnecessary data is just as critical. Outdated or irrelevant data can bloat your database, degrade performance, and make maintenance a nightmare. Worse, retaining some types of data without valid justification might even lead to compliance issues.

Read on to learn about the process, some challenges, and a common pattern for resolving these challenges. The solution is pretty similar in SQL Server as well: batching delete operations, ideally with a supporting index.
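The batching pattern mentioned above can be sketched roughly as follows. This is only an illustration, using Python's built-in sqlite3 module as a stand-in for PostgreSQL or SQL Server. The events table, the created_at column, the index name, and the helpers delete_in_batches and build_demo_db are all hypothetical and do not come from the linked article.

```python
import sqlite3


def delete_in_batches(conn, cutoff, batch_size=1000):
    """Delete rows with created_at < cutoff in small batches.

    Each batch is its own short transaction, so no single DELETE
    has to touch (and lock) every qualifying row at once.
    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN ("
            "  SELECT id FROM events WHERE created_at < ? LIMIT ?"
            ")",
            (cutoff, batch_size),
        )
        conn.commit()  # end the batch's transaction before starting the next
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total


def build_demo_db(rows=5000):
    """Create an in-memory table (hypothetical schema) with a supporting index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at INTEGER)"
    )
    # The supporting index lets each batch find its rows without a full scan.
    conn.execute("CREATE INDEX ix_events_created_at ON events (created_at)")
    conn.executemany(
        "INSERT INTO events (created_at) VALUES (?)",
        ((i,) for i in range(rows)),
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    print(delete_in_batches(demo, cutoff=3000, batch_size=500))
```

The batch size is a tuning knob: smaller batches hold locks for less time but take more round trips, and the right value depends on the workload.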


Query Hash Value Changes

Kendra Little digs into a problem:

This is the worst bug I’ve found in SQL Server to date. Previously, my top find was SQL Server Online Index Rebuild sometimes happens offline without warning. This one has taken top slot because it makes my life more difficult on a daily basis.

Background: SQL Server generates a query_hash for each query. This is stored in sys.query_store_query and it’s one of the primary ways you can identify what a query is across different Query Stores, or even the same Query Store over time, as surrogate query_id values get reset if Query Store is cleared or data ages out. The query_hash is a “Zobrist hash over the shape of the individual query, based on the bound (input) logical query tree. Query hints aren’t included as part of the hash.” (Source)

Except that’s wrong.

Read the whole thing.


What Happens upon Row Modification in Postgres

Semab Tariq takes us on a journey:

But behind the scenes, it’s more complex to ensure that data remains consistent and accessible. In today’s blog, I’ll answer some frequently asked questions from our customers and dive into why PostgreSQL relies on a process called VACUUM to efficiently manage data updates.

Updating a row in PostgreSQL isn’t as straightforward as directly changing the existing data. Instead, PostgreSQL avoids in-place updates, meaning it doesn’t overwrite rows directly. 

But what does this actually mean? 

Read on to learn what that actually means.


Column Eviction in Power BI and Direct Lake

Paul Turley talks about fashion:

One of the core best practice guidance principles for Power BI modeling is to avoid including columns that aren’t absolutely necessary for analytic reporting. Every column uses precious memory and especially long, unique values that don’t compress very well. When consulting clients bring me large models that require expensive capacity licensing and pose report performance issues, my first inclination is to see what column data can be carved out of the model; and perhaps moved to another table for a drill-through report.

The product team came up with a very clever way to reduce the in-memory footprint of a Direct Lake semantic model: hold a popularity contest! The semantic model engine will only keep columns in memory based on their hotness. I mean this literally…

Read on to learn a bit more about the algorithm in play and how it differs from a naive Least Recently Used cache.


Value Filter Behavior in Power BI

Jeffrey Wang digs into a new feature:

The October 2024 Power BI update introduces an inconspicuous yet significant preview feature: Value Filter Behavior. This feature is activated by setting a new model-level property, ValueFilterBehavior, to Independent. The default setting of Automatic preserves the existing behavior, at least during the public preview period. This property controls how the DAX SUMMARIZECOLUMNS function behaves, which is central to most DAX queries generated by Power BI visuals.

Don’t just take my word for it: create any Power BI visual by adding columns, filters, and measures. If you are familiar with the Performance Analyzer or other tools that capture the DAX query issued by the visual, you will see something like this:

Read on for Jeffrey’s example and a dive into what’s going on.


Preparing a Fetch Operation in a Kafka Consumer

Danica Fine continues a series on Kafka internals:

Welcome back to the third installment of our blog series where we’re diving into the beautiful black box that is Apache Kafka® to better understand how we interact with the cluster through producer and consumer clients.

Earlier in the series, we took a look at the Kafka producer to see how the client works before following a produce request as it’s processed by the cluster.

In this post, we’ll switch our attention to Kafka Consumer clients to see how consumers interact with the brokers, coordinate their partitions, and send requests to read data from your Kafka topics.

Read on to see what it takes for a consumer to operate in Apache Kafka.


Transforming Queries Based on Human Intent

Andrei Lepikhov and Alena Rybakina ask a question:

As usual, this project was prompted by multiple user reports with typical complaints, like ‘SQL server executes the query times faster’ or ‘Postgres doesn’t pick up my index’. The underlying issue that united these reports was frequently used VALUES sequences, typically transformed in the query tree into a SEMI JOIN.

I also want to argue one general question: Should an open-source DBMS correct user errors? I mean optimising a query even before the search for an optimal plan begins, eliminating self-joins, subqueries, and simplifying expressions – everything that can be achieved by proper query tuning. The question is not that simple since DBAs point out that the cost of query planning in Oracle overgrows with the complexity of the query text, which is most likely caused, among other things, by the extensive range of optimisation rules.

My short answer is, yes. SQL is a 4th generation language, meaning that end users describe the results they need but leave it to the engine to determine how to get there. As performance tuners, we may understand some of the foibles of the database engine and how it does (or does not) perform these translations, but in an ideal world, every unique representation of an end state for a given query should have the same, maximally optimized internal way of getting there. This is impossible in practice, but it should be a guiding principle for engine behavior.


The Internals of Data Updates in PostgreSQL

Cary Huang explains how update operations work:

In previous blogs, we talked about an overview of PostgreSQL’s table access method API here, how sequential scan is handled within this API here, and how data insertion is handled here. Today in this blog, we will look closely into how PostgreSQL handles update. A successful update in PostgreSQL can be viewed as “insert a new record” while “marking the old record as invisible” due to the MVCC technique that PostgreSQL employs. It sounds simple enough, but there are quite a lot of considerations in place to make a successful update. Let’s dive in.

There’s a lot going on behind the scenes, and Cary does a good job of explaining it all.
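To make the "insert a new record while marking the old record as invisible" idea concrete, here is a toy model in Python. It is a deliberately simplified sketch and not PostgreSQL's actual implementation. The ToyHeap and TupleVersion classes are hypothetical. The xmin and xmax fields borrow the names of PostgreSQL's system columns, but the visibility rule below uses bare transaction IDs and ignores commit status, snapshots, and everything else the real engine has to consider.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TupleVersion:
    key: int
    value: str
    xmin: int                   # ID of the transaction that created this version
    xmax: Optional[int] = None  # ID of the transaction that superseded it, if any


class ToyHeap:
    """Hypothetical append-only table: updates never overwrite in place."""

    def __init__(self) -> None:
        self.versions: List[TupleVersion] = []

    def insert(self, key: int, value: str, txid: int) -> None:
        self.versions.append(TupleVersion(key, value, xmin=txid))

    def update(self, key: int, value: str, txid: int) -> bool:
        for version in self.versions:
            if version.key == key and version.xmax is None:
                version.xmax = txid             # mark the old version invisible
                self.insert(key, value, txid)   # add the new version
                return True
        return False  # no live version of this key exists

    def read(self, key: int, txid: int) -> Optional[str]:
        # Simplified rule (an assumption of this sketch): a version is visible
        # if it was created at or before txid and not yet superseded as of txid.
        for version in self.versions:
            created = version.xmin <= txid
            live = version.xmax is None or version.xmax > txid
            if version.key == key and created and live:
                return version.value
        return None


if __name__ == "__main__":
    heap = ToyHeap()
    heap.insert(1, "a", txid=10)
    heap.update(1, "b", txid=20)
    print(len(heap.versions), heap.read(1, txid=15), heap.read(1, txid=20))
```

Note that the old version stays in the list after the update, which is why a reader at an earlier transaction ID can still see the previous value.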
