Category: Availability Groups

Frequently Asked Availability Group Questions

Published 2020-06-17 by Kevin Feasel

Sean Gallardy answers the voices swarming in his head:

Q3: Do my secondary replicas really need to be the same level of hardware as my primary replicas?
Sean: They should, this assumes you are eventually going to fail over to them or they might need to run your workload. It does *not* assume they will be used for readable workloads. Remember that the secondary replica is already accepting and writing log data, redoing that data, and then if you’re running a bunch of read workloads on top of that it might actually be *busier* than your primary. Yes, I’ve witnessed secondary replicas getting hammered harder than their associated primary.

Read on for informative and entertaining answers, and avoid those rusty nails.

Comments closed

SQL Agent Job Control Based on AG/Mirroring Status

Published 2020-05-28 by Kevin Feasel

Eitan Blumin has an interesting solution for us:

Lo and behold, my new and improved scripts, which implement the following logic:
Create a single Master Control Job with an hourly schedule, and also set it to run when SQL Agent is started up. The Master Control Job will do the following:
Automatically detect which jobs have steps that run on databases that are involved in an HADR solution.
For each such job, automatically detect whether there’s at least one step run on a database which currently has the Primary/Principle role.
If so, make sure it’s enabled. Otherwise, make sure it’s disabled.
Create an alert for a role/state change event, if such doesn’t exist yet, and set it to run the Master Control Job when triggered.

Click through for more details and links to scripts for Availability Group and database mirroring scenarios.

Comments closed

Preventing Unexpected Failover with AG Patching

Published 2020-05-27 by Kevin Feasel

Josh Darnell dealt with an issue with Availability Groups:

I had a 2-node availability group (AG) + fileshare witness system experience an unexpected failover recently.
The synchronous secondary was being patched, and when it came back up from a reboot, the current primary unexpectedly failed over. We weren’t done with all the patching on the secondary, so this caused a short outage, and we had to fail back to the original primary to finish the patching (which is of course another short interruption in availability).
The root cause was interesting enough that I decided to share the story here, and provide some general advice and debugging tips along the way.

Click through to understand why this happened and how you might be able to avoid the pain Josh experienced.

Comments closed

Thread Pool Exhaustion and Availability Groups

Published 2020-05-13 by Kevin Feasel

Sean Gallardy lays down the gauntlet:

You’re probably wondering why you couldn’t spawn a new thread, why this error happened, why you shouldn’t just up the max worker threads, and probably also thinking this must be some kind of “bug” in SQL Server. So here’s where our awkward conversation starts… It’s you. Contrary to every relationship breakup you’ve ever had, it’s definitely you. I’m not saying this to be mean but to really drive the point home here. The major reasons for this occurring are large blocking chains, too much workload for the server size (databases, users, etc.), and/or your virtual infrastructure sucks. There aren’t too many reasons for getting yourself into this situation, and while what I’ll be putting forth here isn’t exhaustive of all edge cases and scenarios, these are by far the majority of all the items in the wild that I’ve either worked on or have been involved in at some level. Side Note: If you’ve read this far, are shaking your head, calling me names that an irate sailor might utter, and telling yourself that upping the max worker threads as the product error suggests and Microsoft should fix their bugs then you can stop reading here as you’re probably not open to learning why you have issues in your environments.

One more scenario I’ve seen is mirroring thousands of databases on a single instance. That scenario fit none of Sean’s criteria—there was very little blocking, most of the databases were small and infrequently-used, and the infrastructure was the right size. It was just a huge number of databases and each database requiring a minimum of X worker threads. Mind you, it was still a bad idea…

Comments closed

Don’t Use sys.dm_hadr_cluster_members for Quorum Info

Published 2020-05-11 by Kevin Feasel

Sean Gallardy explains a limitation of sys.dm_hadr_cluster_members:

I’ve now run across a few different instances where the monitoring for quorum was done via this DMV. On the surface, it seems like nothing would be wrong with using the “number_of_quorum_votes” column to check on the members of the cluster and see their voting status. However, this isn’t quite the case… you see there are various mechanisms that influence whether or not a member (or witness) has a vote and these continue to be expanded in each version of WSFC.

Click through for a short history lesson as well as some good advice on how accurately to get this information.

Comments closed

Availability Group Offerings in Standard Edition

Published 2020-05-01 by Kevin Feasel

Guy Glantser notes an issue with Availability Group documentation:

In SQL Server 2017 Microsoft added a new flavor called Read-Scale Availability Groups. This is different, because the goal here is not high availability or disaster recovery, but rather read-scalability. As opposed to the other flavors, in RSAG there is no cluster, and there is also no automatic failover mechanism. But you can set up multiple secondary replicas with read-only access and load balancing, and offload read workloads from the primary replica. This is a great scalability feature, and you can read more about it here.
Now, if you check Microsoft documentation regarding the editions and supported features of SQL Server, you will be happy to see that RSAG is supported in Standard Edition. I was happy to see it too. Unfortunately, if you try to set up a Read-Scale Availability Group on Standard Edition, it will not work. You will only be able to create a Basic Availability Group, as discussed earlier.

Click through for the answer, as well as what you can do in Standard Edition.

Comments closed

Managing Jobs in Availability Groups

Published 2020-03-27 by Kevin Feasel

Goncalo Cruz has a plan to run SQL Agent jobs on the primary node in an availability group:

In SQL Availability Groups the SQL jobs have to be created in all replicas and you need to add logic at the beginning of each relevant job to make it execute on the primary database. (this only applies when the local replica is the primary for the database)
If you do not add the logic they will execute with success in the primary replica but they will fail in the secondary replica.

Read on for a process which keeps jobs from running except on the primary.

Comments closed

Monitoring Availability Groups

Published 2020-02-25 by Kevin Feasel

Nisarg Upadhyay gives us some of the low-down on monitoring availability groups:

In my previous articles, I have explained the step-by-step process of deploying an AlwaysOn Availability group on SQL Server 2017. In this article, I am going to explain how to monitor AlwaysOn availability groups.
First, let’s review the configuration of the availability group we had deployed previously. To do that, open SQL Server Management Studio  Expand database engine from the object explorer  Expand “AlwaysOn High Availability”  Expand “Availability Groups.” You can see the availability group named SQLAAG. Under this availability group (SQLAAG), you can see the list of availability replicas, availability databases, and availability group listeners.

Click through for some tooling built into SQL Server Management Studio, as well as relevant Perfmon counters.

Comments closed

Ghost Records and Availability Groups

Published 2020-01-30 by Kevin Feasel

Aaron Bertrand ran into an interesting problem:

I recently came across an issue in a SQL Server Availability Group scenario where queries against a heavily-used queue table were taking longer and longer over time. The symptoms of the query were that logical reads were increasing rapidly, and we ultimately tracked it down to ghost records (and version ghost records) that were being created as rows from the queue table were consumed and deleted. Because the database was being used in a readable secondary, the ghost cleanup process simply wasn’t able to keep up with the volume of deletes against the table. The first workaround implemented was to suspend the queue consumers and rebuild the table.

Read on to see how Aaron used filtered indexes and soft deletes to mitigate some of the pain.

Comments closed

PARALLEL_REDO_FLOW_CONTROL Waits on Availability Groups

Published 2019-12-10 by Kevin Feasel

Taryn Pratt goes through a short outage at Stack Overflow:

While I can’t be 100% sure of the trigger, I’m 99.9% sure, because the job was running before the outage, so the timing is right. After looking through our monitoring logs, everything pointed to the job being the cause, so yes, I’m confident it caused it.
We don’t have regular maintenance windows for any of our servers, so we run jobs throughout the week, and if possible, try to schedule them during low-usage times. In this case, the job was an index maintenance job.
Now, before you scream at me about running an index maintenance job, I’m not going to argue the pros and cons of using it or whether or not we should run it — we can do that at another time. For this post, just accept the fact that we were running a job to rebuild/reorganize indexes

This is an interesting after-analysis of an outage. I have a lot of respect for people who can put these together and make them public—I would have a lot of trouble doing that myself.

Comments closed