Category: Availability Groups

Unplanned Failover and SQL Server on Kubernetes

Anthony Nocentino performs additional testing:

In my planned failover walkthrough, I showed what happens when you deliberately move the primary role to another replica. That’s the easy case. Now I want to show what happens when the primary pod just disappears unexpectedly, like during a node failure or a container crash. No graceful shutdown, no demotion, just gone.

I ran two test scenarios, each cycling the primary role across all three pods by force-deleting the current primary three times in a row: first with an idle 5GB TPC-C database, then with that same database under sustained HammerDB TPC-C load. Six force-deletes total, six successful automatic failovers. I’ll walk through the error log from the promoted replica, the operator’s detection and recovery behavior, and the full timing data.

Read on to see how Anthony’s SQL Server Kubernetes operator handles when things go bump in the night.
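If you want to follow along on a replica of your own, the promotion messages Anthony walks through live in the SQL Server error log. A minimal way to surface them (assuming the default error log and a promoted replica you can connect to):

```sql
-- Search the current SQL Server error log (log 0, type 1 = SQL Server)
-- for availability-group state-change messages on the promoted replica.
EXEC sp_readerrorlog 0, 1, N'availability group';
```

This is just a generic way to pull the relevant log entries, not part of the operator itself.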

Leave a Comment

Planned Failover of Availability Groups on Kubernetes

Anthony Nocentino runs a test:

When building the sql-on-k8s-operator, I wanted to make sure it could handle both planned and unplanned failovers. The easy case is a planned failover, where you deliberately move the primary role to another replica. The harder case is an unplanned failover, where the primary pod just disappears. The operator needs to handle both.

I recently ran a full planned failover rotation on a three-replica SQL Server Availability Group managed by sql-on-k8s-operator, and I want to show you exactly what happens inside SQL Server and the operator during each hop. If you’ve been following my Introducing the SQL Server on Kubernetes Operator post, this is the logical next step: what does the error log actually look like during a planned failover, what does the operator do in response, and how long does the whole thing take?

I ran the same three-hop rotation twice: once with an idle 5GB database to establish a baseline, and once under a sustained TPC-C workload with HammerDB. In this post, I’ll walk through the SQL Server error log entries, the operator’s reconcile behavior, and the timing data for both runs. In the next blog post, I’ll show what happens during an unplanned failover. Let’s go.

Click through to see how it all works.
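For context, the T-SQL the operator is effectively automating during each hop is the standard planned (no data loss) failover, issued against the replica that should become the new primary. A sketch, with `[MyAg]` as a placeholder availability group name:

```sql
-- Run while connected to the target secondary replica;
-- it becomes the new primary with no data loss.
ALTER AVAILABILITY GROUP [MyAg] FAILOVER;
```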

Comments closed

Fast Failover in SQL Server 2025

John Deardurff lays out some changes in SQL Server 2025:

A client requested a presentation discussing key improvements to Always On Availability Group fast failover in SQL Server 2025. I decided that a summary would be appropriate for a blog post. So, here I discuss Enhanced Telemetry, Persistent Health, and Intelligent Fast Failover.

John has a very positive take on fast failover. I haven’t tried any of this functionality, but some of this does sound promising.

Comments closed

Optimizing Planned Availability Group Failover in SQL Server

Aaron Bertrand shares some advice:

Shaving even a handful of seconds from the process can improve the application and end user experience; it can also drastically reduce alert noise or, at least, how long alerts have to stay muted. There’s a lot of material out there about performing AG failovers correctly (no data loss), but far less that focuses on shortening the disruption window. The difference is usually some combination of redo volume, checkpoint behavior, open transactions, and secondary readiness.

I wanted to share some techniques I use to make planned failovers faster and more predictable. Some of these techniques are well documented, while others come from real-world patterns I’ve observed across many SQL Server environments. I’ll talk about what I do before, during, and after the failover to minimize disruption and increase the chance that end users are oblivious that anything happened.

Aaron provides several tips to help reduce the pain of failover.
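Two of the levers Aaron mentions, checkpoint behavior and redo volume, are easy to inspect before you pull the trigger. A hedged sketch (`[SalesDb]` is a placeholder database name):

```sql
-- Issue a manual checkpoint in each AG database on the current primary
-- to shrink the amount of log the new primary must redo after failover.
USE [SalesDb];
CHECKPOINT;

-- Check how far behind each secondary is before initiating the
-- failover (queue sizes are reported in KB).
SELECT ar.replica_server_name,
       drs.database_id,
       drs.log_send_queue_size,
       drs.redo_queue_size,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id;
```

Waiting for `redo_queue_size` to drain to near zero before failing over is one of the simpler ways to shorten the disruption window.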

Comments closed

Migrating from a Contained Availability Group

Warren Departee undoes a problem:

A client was running a Contained Availability Group in SQL Server 2022, but wasn’t using the AG Listener for their application connections. This negated most of the benefits the Contained AG was designed to provide. They also had some security misunderstandings and missteps, as this was built for them without any real knowledge transfer – one of the reasons they reached out to us for help. After review, it became clear that there was no need for a contained AG here, so we helped them migrate to a Basic Availability Group (SQL Server Availability Group in SQL Server Standard Edition) while preserving their database configurations and minimizing downtime.

Read on for a step-by-step process and a few hints on configuration.

Comments closed

Thoughts on Creating Databases in Contained Availability Groups

Andreas Wolter digs into some updated functionality:

First of all: It’s always encouraging to see the Product team act on user feedback. SQL Server 2025 CU1 introduces an improvement that allows the creation and restoration of databases within contained availability groups (CAG). This is a step in the right direction, but as you’ll see, there are still some bumps to smooth out. Keep the feedback coming (here: Allow creation and restore of databases in contained availability group) — progress is happening, but we’re not quite there yet.

Read on for a litany of issues, as well as Andreas’s recommended solutions.

Comments closed

SIDs and Distributed Availability Groups

Evan Corbett troubleshoots an issue:

After building a contained availability group in SQL Server, a customer was experiencing intermittent issues connecting to their primary database. Our investigation revealed that the SQL Authentication login being used had been created both within the context of the contained AG and directly on the primary node, but had different SIDs in each location.

This is a pretty common issue when using SQL authentication, and it always seems to bite at the least opportune times.
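When you hit this, the usual diagnosis and fix is to compare the login's SID on each replica and recreate it with an explicit, matching SID. A sketch, with `app_login` as a hypothetical login name:

```sql
-- Compare the login's SID on each replica; mismatched SIDs break
-- database-user mappings after a failover.
SELECT name, sid
FROM sys.server_principals
WHERE name = N'app_login';

-- On the replica with the wrong SID: drop and recreate the login,
-- copying the SID value returned by the query above on the primary.
CREATE LOGIN [app_login]
WITH PASSWORD = N'...',  -- placeholder: the real password
     SID = 0x...;        -- placeholder: SID copied from the primary
```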

Comments closed

Checking SQL Server Availability Groups

Jeff Iannucci announces a new procedure:

SQL Server Availability Groups can be a great feature to help support your High Availability needs, but what happens when they fail to work as expected?

Do you have an expiring certificate used by an endpoint? Do you have timeout settings that could contribute to unexpected failovers? Are you suffering from a high number of HADR_SYNC_COMMIT waits?

We’ve seen all those things happen, and like Marvin Gaye we’ve wondered: what’s going on? And we’ve wanted a tool to help us see if other clients were having these problems, and more.

Read on for more information and check it out yourself.
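If you want a quick manual spot-check of two of the items Jeff calls out before running the procedure, something like this works:

```sql
-- Endpoint certificates expiring within the next 30 days.
SELECT name, expiry_date
FROM master.sys.certificates
WHERE expiry_date < DATEADD(DAY, 30, GETDATE());

-- Cumulative synchronous-commit waits since the last restart.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'HADR_SYNC_COMMIT';
```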

Comments closed

Troubleshooting a Distributed Availability Group Failure

Jordan Boich digs in:

To give some background on this topology, they have a DAG composed of two individual AGs. Their Global Primary AG (we’ll call this AG1) has three replicas, while their Forwarder AG (we’ll call this AG2) has two replicas. The replicas in AG1 are all in the same subnet and all Azure VMs. The replicas in AG2 are likewise all Azure VMs, in a separate subnet of their own.

By the time we got logged in and connected, the Global Primary Replica was online and able to receive connections. The secondary replicas in the Global Primary AG, however, were unable to communicate with the Global Primary Replica. This is problem 1. The second problem is that several databases on the Forwarder Primary Replica were in crash recovery. This is problem 2. Considering problem 2 was happening in the DR environment, it was put aside for the time being.

Read on for the troubleshooting process and solution.

Comments closed

Log Truncation in Distributed Availability Groups

Paul Randal phones a friend:

This is a question that came in through email from a prior student, which I’ll summarize as: with a distributed availability group, what are the semantics of log truncation?

As I’m not an AG expert (I just know enough to be dangerous ha-ha), I asked Jonathan to jump in, as he is most definitely an AG expert! I’ll use AG for availability group and DAG for distributed availability group in the explanation below. Whether the AGs in question have synchronous or asynchronous replicas is irrelevant for this discussion.

Read on for the answer.
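As a starting point for investigating this yourself, two queries cover most of the ground: what is holding up truncation in each database (a value of AVAILABILITY_REPLICA means a secondary somewhere has not hardened the log yet), and each replica's truncation point within the AG:

```sql
-- What is currently preventing log truncation in each database?
SELECT name, log_reuse_wait_desc
FROM sys.databases;

-- Per-replica truncation point: log older than truncation_lsn has
-- been hardened everywhere it needs to be and is eligible to clear.
SELECT replica_id, database_id, truncation_lsn, last_hardened_lsn
FROM sys.dm_hadr_database_replica_states;
```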

Comments closed