Improving The Azure Automated AG Experience

Allan Hirt would like to see a few improvements to the experience when creating availability groups on Azure VMs:

What does this mean? To have a supported WSFC-based configuration (doesn’t matter what you are running on it – could be something non-SQL Server), you need to pass validation. xFailOverCluster does not allow this to be run. You can create the WSFC, you just can’t validate it. The point from a support view is that the WSFC has to be vetted before you create it. Could you run it after? Sure, but you still have no proof you had a valid configuration to start with which is what matters. This is a crucial step for all AGs and FCIs, especially since AGs do not check this whereas the installation process for FCIs does.

If you look at MSFT_xCluster, you’ll see what I am saying is true. It builds the WSFC without a whiff of Test-Cluster. To be fair, this can be done in non-Azure environments, too, but Microsoft givs you warnings not to do that for good reason. I understand why Microsoft did it this way. There is currently no tool, parser, or cmdlet to examine the output of Test-Cluster results. This goes back to why building WSFCs is *very* hard to automate.

I’m not sure how easy some of these fixes would be, but they’d definitely be nice.

When Synchronous AG Secondaries Are Out Of Sync

David Fowler explains that just because an Availability Group is set up as synchronous, doesn’t mean you can never experience data loss on failover:

The primary replica is constantly monitoring the state of it’s secondaries. With the use of a continuous ping, the primary node always knows if the secondaries are up or down.

It’s when SQL detects that one of it’s synchronous replicas goes offline is when interesting things can happen.

So here’s the discussion that came up, if a synchronous replica goes offline for whatever reason, SQL won’t be able to commit any transactions and that means we can be confident that the secondary is up to date, right?

Read on to learn the answer. Which is “no.” But David explains why, so you should read that instead of just having me say it.

When A Database In An AG Has Different Query Store Settings

Erin Stellato explains how we can have a discrepancy in Query Store settings between the primary and a secondary in an Availability Group:

Last week there was a question on #sqlhelp on Twitter about the status of Query Store for a database in an Availability Group. I’ve written about Query Store and Availability Groups before so if you’re not familiar with QS behavior in an AG, check out that post first. But this question was, I think, specific to the values that shows on a read-only replica and how there were different query store settings between a primary and secondary. Let’s set it up and take a look.

Click through to learn why this may be and why you shouldn’t panic.

The Siren’s Call Of AG Readable Secondaries

Jonathan Kehayias brings up some of the reasons he tends to recommend customers keep away from using readable secondaries on Availability Groups:

A lot of the implementation questions I get for Availability Groups are generally looking at leveraging a secondary replica for High Availability or Disaster Recovery and often times also for offloading reporting or readable workloads.  Readable Secondary copies of the database are really simple at the surface conceptually, but there are a lot of additional considerations I make that could lead you to another technology entirely like Transactional Replication or Log Shipping with Standby. Let’s look at how these three stack up against each other for different data access requirements.

There are some good reasons here not to kick replication and log shipping to the curb.

What Happens When Your Secondary DAG Fails Over

Tianyu Wen explains what happens when there is a failover incident in the secondary Availability Group of a Distributed Availability Group:

If the primary replica on the secondary AG (also known as the “forwarder”) is lost and causes automatic failover to happen or manual failover is performed in the secondary AG in a DAG, there will be no data loss if the following conditions are met:
– The primary replica on the primary AG runs with no synchronization issue when the failover happens;
– The secondary AG on the DAG has a functioning secondary replica before the failover happens;
– The primary replica on the primary AG can communicate properly with the secondary replica on the secondary AG over their database mirroring endpoints.

There’s a lot of detail here, so if you are supporting Distributed Availability Groups, check it out.

Availability Groups And Read-Only Filegroups

Allan Hirt walks us through a couple of scenarios involving databases with read-only filegroups using Always On Availability Groups:

A question came across my inbox this week which I decided to investigate: what happens if you have read only filegroups as part of your database, and you want to use Always On Availability Groups? Let’s find out.
First, I have a database with two filegroups: one read write (PRIMARY) and one read only (ROFG).

Click through for a demonstration of this, as well as a different scenario in which you might want only the read-write data on the secondary.

Cross-Availability Group Login Management

David Fowler walks us through a problem about orphaned users and Availability Groups:

Now, I’m pretty sure that most of us will have been in the position where, after a fail-over we get inundated with calls, emails, Skype messages and carrier pigeon drops letting us know that so and so can no longer access the database.

When you look into it, you either find that the login never existed in the first place, so you create it or that it was there but the database user has become orphaned from it (happens when the login SID doesn’t match the SID of the database user, Adrian wrote about orphaned users in Dude where’s my access?).

You remap the orphaned user and everything is good again…  that is until the next time you failover and once again you’ll be hit with the same orphaned user problem.

Click through for the explanation and a permanent fix for this issue.

Improvements To The SQL Server Availability Group Failover Detection Utility

Rob Sewell has a few improvements to the SQL Server Availability Group Failover Detection Powershell function:

Archive the data for historical analysis

One of the production DBAs pointed out that having gathered the information, it would be useful to hold it for better analysis of repeated issues. I have added an archiving step so that when the tools runs, if there is already data in the data gathering folder, it will copy that to an archive folder and name it with the date and time that the cluster log was created as this is a good estimation of when the analysis was performed. If an archive folder location is not provided it will create an archive folder in the data folder. This is not an ideal solution though, as the utility will copy all of the files and folders from there to its own location so it is better to define an archive folder in the parameters.

There are several improvements in here, so check them out.

Root Cause Discovery For Availability Group Failovers

Sourabh Agarwal announces a new tool to determine why your Availability Group failed over:

The first step in using the utility is to configure the configuration.json file to include the location of the data files and the details of the availability group for which analysis is being done. For a correct analysis, all replicas in the availability groups needs to be listed in the configuration file.

The next step is to capture the various logs from each of the replicas and add those under the data folder. The following files are required for the analysis.

  • SQL error logs

  • Always On Availability Groups Extended Event Logs

  • System Health Extended Event Logs

  • System log

  • Windows cluster log

This looks interesting.

Lazy Log Truncation

Paul Randal explains why Virtual Log Files might remain in status 2 even after they are cleared:

Earlier this year I was sent an interesting question about why the person was seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log and log_reuse_wait_desc showed NOTHING.

I did some digging around and all I could find was an old blog post from 2013 that shows the behavior and mentions that this happens with mirroring and Availability Groups. I hadn’t heard of this behavior before but I guessed at the reason, and confirmed with the SQL Server team.

Read on for the answer.

Categories

March 2019
MTWTFSS
« Feb  
 123
45678910
11121314151617
18192021222324
25262728293031