Five days before Christmas I committed my patch to add incremental backup to PostgreSQL. Actually, I’ve been committing preparatory patches for some months now, but December 20 saw the two main patches land. Since then, there’s been a bunch of bug-fix commits, and there are still a few pending items that need to be addressed, but the core of the feature is now committed. If you want a quick overview of the feature, Lukas Fittl has a great video about that. Here, I’d like to talk about the architecture of the feature itself in a little more detail, and specifically with how we decide which data to copy.
Most people who are likely to read this blog are probably already familiar with the core idea of an incremental backup: instead of copying the whole database instance, just copy the data that has changed. That’s faster, and takes up less space on disk. But, to work properly, you have to be able to quickly and reliably identify which data has, in fact, changed. There’s more than one way to do that.
Read on for some of the complexity around this. It’s interesting to see what goes on behind the scenes in a relational database.