Category: Availability Groups

Preventing Unexpected Failover with AG Patching

Published 2020-05-27 by Kevin Feasel

Josh Darnell dealt with an issue with Availability Groups:

I had a 2-node availability group (AG) + fileshare witness system experience an unexpected failover recently.
The synchronous secondary was being patched, and when it came back up from a reboot, the current primary unexpectedly failed over. We weren’t done with all the patching on the secondary, so this caused a short outage, and we had to fail back to the original primary to finish the patching (which is of course another short interruption in availability).
The root cause was interesting enough that I decided to share the story here, and provide some general advice and debugging tips along the way.

Click through to understand why this happened and how you might be able to avoid the pain Josh experienced.

Comments closed

Thread Pool Exhaustion and Availability Groups

Published 2020-05-13 by Kevin Feasel

Sean Gallardy lays down the gauntlet:

You’re probably wondering why you couldn’t spawn a new thread, why this error happened, why you shouldn’t just up the max worker threads, and probably also thinking this must be some kind of “bug” in SQL Server. So here’s where our awkward conversation starts… It’s you. Contrary to every relationship breakup you’ve ever had, it’s definitely you. I’m not saying this to be mean but to really drive the point home here. The major reasons for this occurring are large blocking chains, too much workload for the server size (databases, users, etc.), and/or your virtual infrastructure sucks. There aren’t too many reasons for getting yourself into this situation, and while what I’ll be putting forth here isn’t exhaustive of all edge cases and scenarios, these are by far the majority of all the items in the wild that I’ve either worked on or have been involved in at some level. Side Note: If you’ve read this far, are shaking your head, calling me names that an irate sailor might utter, and telling yourself that upping the max worker threads as the product error suggests and Microsoft should fix their bugs then you can stop reading here as you’re probably not open to learning why you have issues in your environments.

One more scenario I’ve seen is mirroring thousands of databases on a single instance. That scenario fit none of Sean’s criteria—there was very little blocking, most of the databases were small and infrequently-used, and the infrastructure was the right size. It was just a huge number of databases and each database requiring a minimum of X worker threads. Mind you, it was still a bad idea…

Comments closed

Don’t Use sys.dm_hadr_cluster_members for Quorum Info

Published 2020-05-11 by Kevin Feasel

Sean Gallardy explains a limitation of sys.dm_hadr_cluster_members:

I’ve now run across a few different instances where the monitoring for quorum was done via this DMV. On the surface, it seems like nothing would be wrong with using the “number_of_quorum_votes” column to check on the members of the cluster and see their voting status. However, this isn’t quite the case… you see there are various mechanisms that influence whether or not a member (or witness) has a vote and these continue to be expanded in each version of WSFC.

Click through for a short history lesson as well as some good advice on how accurately to get this information.

Comments closed

Availability Group Offerings in Standard Edition

Published 2020-05-01 by Kevin Feasel

Guy Glantser notes an issue with Availability Group documentation:

In SQL Server 2017 Microsoft added a new flavor called Read-Scale Availability Groups. This is different, because the goal here is not high availability or disaster recovery, but rather read-scalability. As opposed to the other flavors, in RSAG there is no cluster, and there is also no automatic failover mechanism. But you can set up multiple secondary replicas with read-only access and load balancing, and offload read workloads from the primary replica. This is a great scalability feature, and you can read more about it here.
Now, if you check Microsoft documentation regarding the editions and supported features of SQL Server, you will be happy to see that RSAG is supported in Standard Edition. I was happy to see it too. Unfortunately, if you try to set up a Read-Scale Availability Group on Standard Edition, it will not work. You will only be able to create a Basic Availability Group, as discussed earlier.

Click through for the answer, as well as what you can do in Standard Edition.

Comments closed

Managing Jobs in Availability Groups

Published 2020-03-27 by Kevin Feasel

Goncalo Cruz has a plan to run SQL Agent jobs on the primary node in an availability group:

In SQL Availability Groups the SQL jobs have to be created in all replicas and you need to add logic at the beginning of each relevant job to make it execute on the primary database. (this only applies when the local replica is the primary for the database)
If you do not add the logic they will execute with success in the primary replica but they will fail in the secondary replica.

Read on for a process which keeps jobs from running except on the primary.

Comments closed

Monitoring Availability Groups

Published 2020-02-25 by Kevin Feasel

Nisarg Upadhyay gives us some of the low-down on monitoring availability groups:

In my previous articles, I have explained the step-by-step process of deploying an AlwaysOn Availability group on SQL Server 2017. In this article, I am going to explain how to monitor AlwaysOn availability groups.
First, let’s review the configuration of the availability group we had deployed previously. To do that, open SQL Server Management Studio  Expand database engine from the object explorer  Expand “AlwaysOn High Availability”  Expand “Availability Groups.” You can see the availability group named SQLAAG. Under this availability group (SQLAAG), you can see the list of availability replicas, availability databases, and availability group listeners.

Click through for some tooling built into SQL Server Management Studio, as well as relevant Perfmon counters.

Comments closed

Ghost Records and Availability Groups

Published 2020-01-30 by Kevin Feasel

Aaron Bertrand ran into an interesting problem:

I recently came across an issue in a SQL Server Availability Group scenario where queries against a heavily-used queue table were taking longer and longer over time. The symptoms of the query were that logical reads were increasing rapidly, and we ultimately tracked it down to ghost records (and version ghost records) that were being created as rows from the queue table were consumed and deleted. Because the database was being used in a readable secondary, the ghost cleanup process simply wasn’t able to keep up with the volume of deletes against the table. The first workaround implemented was to suspend the queue consumers and rebuild the table.

Read on to see how Aaron used filtered indexes and soft deletes to mitigate some of the pain.

Comments closed

PARALLEL_REDO_FLOW_CONTROL Waits on Availability Groups

Published 2019-12-10 by Kevin Feasel

Taryn Pratt goes through a short outage at Stack Overflow:

While I can’t be 100% sure of the trigger, I’m 99.9% sure, because the job was running before the outage, so the timing is right. After looking through our monitoring logs, everything pointed to the job being the cause, so yes, I’m confident it caused it.
We don’t have regular maintenance windows for any of our servers, so we run jobs throughout the week, and if possible, try to schedule them during low-usage times. In this case, the job was an index maintenance job.
Now, before you scream at me about running an index maintenance job, I’m not going to argue the pros and cons of using it or whether or not we should run it — we can do that at another time. For this post, just accept the fact that we were running a job to rebuild/reorganize indexes

This is an interesting after-analysis of an outage. I have a lot of respect for people who can put these together and make them public—I would have a lot of trouble doing that myself.

Comments closed

High Availability Changes in SQL Server 2016

Published 2019-11-26 by Kevin Feasel

Martin Surasky looks at what SQL Server 2016 changed with respect to High Availability options:

AlwaysOn Availability Groups, first introduced in SQL Server 2012 is a feature that is conceptually similar to database mirroring. I’m going to assume you already know what AlwaysOn Availability Groups are in general, their main purpose and how they are different (essential aspects at least) from other technologies to provide replication such as Database Mirroring.
In SQL Server 2014, the significant enhancement to availability groups was the increase in the number of supported secondary replicas from three to eight. SQL Server 2016 includes a number of new enhancements

The biggest thing about it, as I recall, was stability: I wouldn’t have recommended too many places go into production with Availability Groups in 2012, but by 2016, many of the biggest bugs were ironed out.

Comments closed

Creating a Failover Cluster Instance with Shared Storage

Published 2019-11-21 by Kevin Feasel

Ryan Adams wraps up a video series on setting up a SQL Server lab environment:

You are going to create a SQL Server Failover Cluster Instance in Part 4 of our series on how to build a SQL Cluster Lab. The FCI will only be installed on Node1 and Node2. FCIs require shared storage so you will make your domain controller an iSCSI target. Last you will create your FCI using the iSCSI drives you presented to the cluster.

Click through for links to the entire series.

Comments closed