I recently had a two-node availability group (AG) with a file share witness experience an unexpected failover.
The synchronous secondary was being patched, and when it came back up from a reboot, the current primary unexpectedly failed over. We weren’t done with all the patching on the secondary, so this caused a short outage, and we had to fail back to the original primary to finish the patching (which, of course, meant another short interruption in availability).
The root cause was interesting enough that I decided to share the story here, and provide some general advice and debugging tips along the way.
Click through to understand why this happened and how you might be able to avoid the pain Josh experienced.