If the primary replica on the secondary AG (also known as the “forwarder”) is lost and causes automatic failover to happen or manual failover is performed in the secondary AG in a DAG, there will be no data loss if the following conditions are met:
– The primary replica on the primary AG runs with no synchronization issue when the failover happens;
– The secondary AG on the DAG has a functioning secondary replica before the failover happens;
– The primary replica on the primary AG can communicate properly with the secondary replica on the secondary AG over their database mirroring endpoints.
There’s a lot of detail here, so if you are supporting Distributed Availability Groups, check it out.
A question came across my inbox this week which I decided to investigate: what happens if you have read only filegroups as part of your database, and you want to use Always On Availability Groups? Let’s find out.
First, I have a database with two filegroups: one read write (PRIMARY) and one read only (ROFG).
Click through for a demonstration of this, as well as a different scenario in which you might want only the read-write data on the secondary.
Now, I’m pretty sure that most of us will have been in the position where, after a fail-over we get inundated with calls, emails, Skype messages and carrier pigeon drops letting us know that so and so can no longer access the database.
When you look into it, you either find that the login never existed in the first place, so you create it or that it was there but the database user has become orphaned from it (happens when the login SID doesn’t match the SID of the database user, Adrian wrote about orphaned users in Dude where’s my access?).
You remap the orphaned user and everything is good again… that is until the next time you failover and once again you’ll be hit with the same orphaned user problem.
Click through for the explanation and a permanent fix for this issue.
Archive the data for historical analysis
One of the production DBAs pointed out that having gathered the information, it would be useful to hold it for better analysis of repeated issues. I have added an archiving step so that when the tools runs, if there is already data in the data gathering folder, it will copy that to an archive folder and name it with the date and time that the cluster log was created as this is a good estimation of when the analysis was performed. If an archive folder location is not provided it will create an archive folder in the data folder. This is not an ideal solution though, as the utility will copy all of the files and folders from there to its own location so it is better to define an archive folder in the parameters.
There are several improvements in here, so check them out.
The first step in using the utility is to configure the configuration.json file to include the location of the data files and the details of the availability group for which analysis is being done. For a correct analysis, all replicas in the availability groups needs to be listed in the configuration file.
The next step is to capture the various logs from each of the replicas and add those under the data folder. The following files are required for the analysis.
SQL error logs
Always On Availability Groups Extended Event Logs
System Health Extended Event Logs
Windows cluster log
This looks interesting.
Earlier this year I was sent an interesting question about why the person was seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log and log_reuse_wait_desc showed NOTHING.
I did some digging around and all I could find was an old blog post from 2013 that shows the behavior and mentions that this happens with mirroring and Availability Groups. I hadn’t heard of this behavior before but I guessed at the reason, and confirmed with the SQL Server team.
Read on for the answer.
I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica. The storage on the replica is slower than that on the primary (not a desired scenario, but it is what it is) and the secondary has fallen behind before, but this time it was to the point where it made more sense remove the database from the replica and re-initialize, rather than wait several hours for it to catch up. What I’m about detail is not an ideal solution. In fact, your solution should be architected to avoid this scenario entirely (storage of equal capability for all involved nodes is essential). But, stuff happens (e.g., a secondary database unexpectedly pausing), and the goal was to get the replica synchronized again with no downtime.
Click through for the demo.
At least since I started working on the SQL Server team (just after 7.0 shipped) and since then there have only been two VLF status codes:
- 0 = the VLF is not active (i.e. it can be (re)activated and overwritten)
- (1 = not used and no-one seems to remember what it used to mean)
- 2 = the VLF is active because at least one log record in it is ‘required’ by SQL Server for some reason (e.g. hasn’t been backed up by a log backup or scanned by replication)
A few weeks ago I learned about a new VLF status code that was added back in SQL Server 2012 but hasn’t come to light until recently (at least I’ve never encountered it in the wild). I went back-and-forth with a friend from Microsoft (Sean Gallardy, a PFE and MCM down in Tampa) who was able to dig around in the code to figure out when it’s used.
Read on to uncover the mysteries of the VLF status of 4.
A long time ago in a galaxy far, far away, I had to troubleshoot interesting performance issue in SQL Server. Suddenly, the CPU load on the server started to climb up. Nothing changed in terms of workload. The system still processed the same amount of requests. The execution plans of the critical queries stayed the same. Nevertheless, the CPU usage grew up slowly and steadily by a few percent per hour.
Eventually, we nailed it down. The problem occured in very busy OLTP system with very volatile data. We noticed that system performed much more I/O (logical and physical) than before. It was very strange, because nothing should have changed that day. Finally, we found that we have large number of deleted rows in the database that had not been cleaned up by ghost cleanup task.
Read on to learn what caused this mess.
Simon Su has an interesting tool available:
Now I develop a tool to analyze AG log block movement latency between replicas and create report accordingly.
Click through for more info and check it out on Github.