Press "Enter" to skip to content

Month: January 2024

Incremental Backup in Postgres

Robert Haas talks about a new feature:

Five days before Christmas I committed my patch to add incremental backup to PostgreSQL. Actually, I’ve been committing preparatory patches for some months now, but December 20 saw the two main patches land. Since then, there’s been a bunch of bug-fix commits, and there are still a few pending items that need to be addressed, but the core of the feature is now committed. If you want a quick overview of the feature, Lukas Fittl has a great video about that. Here, I’d like to talk about the architecture of the feature itself in a little more detail, and specifically with how we decide which data to copy.

Most people who are likely to read this blog are probably already familiar with the core idea of an incremental backup: instead of copying the whole database instance, just copy the data that has changed. That’s faster, and takes up less space on disk. But, to work properly, you have to be able to quickly and reliably identify which data has, in fact, changed. There’s more than one way to do that.

Read on for some of the complexity around this. It’s interesting to see what goes on behind the scenes in a relational database.

Comments closed

Pagination in Stored Procedures

Erik Darling hits on a bugbear of mine:

A common-enough practice to limit search results is to write paginated queries. You may give users the ability to specify how many rows per page they want to see, or you may have a set number of rows per page.

But the end goal is to allow users to very quickly get a smaller number of rows returned to them. Almost no one needs to get many thousands of rows back, unless they’re planning on exporting the data.

Pagination is such a common activity that I wish there were a way to say, “Here is my data. Hang onto it in memory and quickly retrieve a subset of rows upon request” without doing all kinds of shenanigans on my end. Something like a data snapshot that remains in memory as long as the session is active, until the service restarts, until there is memory pressure, or until the caller manually evicts the data. That would make OFFSET and FETCH really useful instead of barely usable for most pagination scenarios because you wouldn’t need to re-run the entire query for every offset/fetch page.

There are ways to make pagination efficient, but the efficient ones aren’t easy or intuitive.

Comments closed

New Challenge: 2D Interval Packing

Itzik Ben-Gan has a new challenge for the new year:

Packing intervals is a classic SQL task that involves packing groups of intersecting intervals to their respective continuous intervals. In mathematics, an interval is the subset of all values of a given type, e.g., integer numbers, between some low value and some high value. In databases, intervals can manifest as date and time intervals representing things like sessions, prescription periods, hospitalization periods, schedules, or numeric intervals representing things like ranges of mile posts on a road, temperature ranges, and so on.

An example for an interval packing task is packing intervals representing sessions for billing purposes. If you conduct a web search of the term packing intervals SQL, you’ll find scores of articles on the subject, including my own.

Read on for more information about this challenge and one solution to it.

Comments closed

An Overview of Polars

Dylan Jones talks about a Rust-based data frame library:

Polars is a high-performance DataFrame library implemented in Rust, and can be used with Rust natively or via its Python wrapper. It is designed to handle large datasets with ease, providing an user-friendly interface for data manipulation and analysis. The library offers two modules: polars-core for the core functionality, and polars-io for input/output operations, allowing you to read and write data in various formats such as CSV, JSON, Parquet, Delta and more.

Read on to see how it works in Python compared to Pandas, as well as some speed comparisons.

Comments closed

A Call for Quality

Kurt Buhler sounds the clarion call:

We have a quality problem, and it’s getting worse. It creates higher costs, hurts our productivity, and threatens our capability to achieve success. The problem: too often, we prioritize quicker results and newer features over lasting quality and consistency in the data and analytics solutions that we deliver. Too often, we don’t collect the right requirements, we don’t test, we don’t automate, and we rely on hope and heroism to save the day. The result: we’re besieged by issues, fighting constant battles against an avoidable enemy that we ourselves created.

This is a long article with a lot of depth to it. I think the topic is well worth thinking about, though it’s quite a challenge.

Comments closed

Useful Operations in dbatools

Rod Edwards shows off some nice functionality in dbatools:

I often build solutions around the dbatools functions, the below is just some of my Operational favourites. With some I’ve included the output pipe that I use most frequently, but obviously, you can view and use the output however you choose to. Clearly, DBATools has many functions to add/remove/update SQL as well, but i’m just folking on the ‘gets’ here.

Naturally, as mentioned…its powershell, you can programmatically use this for any of your automation needs. Marvellous.

The ever growing list of commands can be found here: command index – dbatools . This can prove daunting to new users of the toolset, so here’s a starter for 10.

Click through for those 10.

Comments closed

Math Operations in T-SQL

Daniel Hutmacher builds a few functions:

As part of spending waaaaaay to much time trying to solve the 2023 Advent of Code challenges, I came across multiple instances where I had to dust off some old math that I hadn’t paid attention to since I went to school back in the 90ies.

So for my own convenience, and yours, I’ve built functions for some common math that you might perhaps encounter at some point. I found this whole experience to be a great way to familiarize myself with a lot of the new functionality in SQL Server 2022, including GENERATE_SERIES(), LEAST(), GREATEST() and more. The Github repo contains a SQL Server 2019 version where I’ve built drop-in versions of the 2022 functions, but they probably won’t perform as well as the built-in stuff.

Click through for demonstrations of determining whether something is a prime number, finding the greatest common divisor and least common multiple, factorization, factorials, and even a bit of combinatorics.

Comments closed

Thoughts on Clean Code

Chad Callihan resolves to write better code:

I’ve been involved with more official development work on top of database responsibilities in the last few months, which led to the recommendation to read Clean Code by Robert C. Martin. It’s an older book published in 2011, but plenty of rules and guidelines still apply. Along with the more technical details, one area jumped out at me in Chapter 12 that can apply to anyone writing code, queries, or scripts:

Click through for a pair of salient quotations and some more thoughts from Chad.

Comments closed