The SQL Engine we are running in Azure SQL Database is the very latest version of the same engine customers run on their own servers, except we manage and update it. To update SQL Server or the underlying infrastructure (i.e. Service Fabric or the operating system), we must stop the SQL Server process. If that process hosts the primary database replica, we move the replica to another machine (requiring a failover).
During failover, the database may be offline for a second and still meet our 99.995% SLA. However, failover of the primary replica impacts workload because it aborts in-flight queries and transactions. We built features such as resumable index (re)build and accelerated database recovery to address these situations, but not all running operations are automatically resumable. It may be expensive to restart complex queries or transactions that were aborted due to an upgrade. So even though failovers are quick, we want to avoid them.
Read on to see how they do it. There’s no on-prem analogue yet, though perhaps that will come in time.