Category: Availability Groups

I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica. The storage on the replica is slower than that on the primary (not a desired scenario, but it is what it is) and the secondary has fallen behind before, but this time it was to the point where it made more sense remove the database from the replica and re-initialize, rather than wait several hours for it to catch up. What I’m about detail is not an ideal solution. In fact, your solution should be architected to avoid this scenario entirely (storage of equal capability for all involved nodes is essential). But, stuff happens (e.g., a secondary database unexpectedly pausing), and the goal was to get the replica synchronized again with no downtime.

Click through for the demo.

Comments closed

New(ish) VLF Status: 4

Published 2018-09-20 by Kevin Feasel

Paul Randal points out a new VLF status which can appear if you’re using an Availability Group:

At least since I started working on the SQL Server team (just after 7.0 shipped) and since then there have only been two VLF status codes:

0 = the VLF is not active (i.e. it can be (re)activated and overwritten)

(1 = not used and no-one seems to remember what it used to mean)

2 = the VLF is active because at least one log record in it is ‘required’ by SQL Server for some reason (e.g. hasn’t been backed up by a log backup or scanned by replication)

A few weeks ago I learned about a new VLF status code that was added back in SQL Server 2012 but hasn’t come to light until recently (at least I’ve never encountered it in the wild). I went back-and-forth with a friend from Microsoft (Sean Gallardy, a PFE and MCM down in Tampa) who was able to dig around in the code to figure out when it’s used.

Read on to uncover the mysteries of the VLF status of 4.

Comments closed

When Secondary Nodes Affect Primaries

Published 2018-09-13 by Kevin Feasel

Dmitri Korotkevitch shows that a readable secondary in an Availability Group can negatively impact the primary:

A long time ago in a galaxy far, far away, I had to troubleshoot interesting performance issue in SQL Server. Suddenly, the CPU load on the server started to climb up. Nothing changed in terms of workload. The system still processed the same amount of requests. The execution plans of the critical queries stayed the same. Nevertheless, the CPU usage grew up slowly and steadily by a few percent per hour.

Eventually, we nailed it down. The problem occured in very busy OLTP system with very volatile data. We noticed that system performed much more I/O (logical and physical) than before. It was very strange, because nothing should have changed that day. Finally, we found that we have large number of deleted rows in the database that had not been cleaned up by ghost cleanup task.

Read on to learn what caused this mess.

Comments closed

A Tool For Analyzing AG Replica Latency

Published 2018-09-13 by Kevin Feasel

Simon Su has an interesting tool available:

I wrote an article to discuss about data movement latency between AG groups:

Now I develop a tool to analyze AG log block movement latency between replicas and create report accordingly.

Click through for more info and check it out on Github.

Comments closed

Testing AG Read-Only Routing

Published 2018-09-10 by Kevin Feasel

Jess Pomfret shows us how we can use dbatools to test Availability Group read-only routing:

The other part I needed to set up was read-only routing, this enables SQL Server to reroute those read only connections to the appropriate replica. You can also list the read only replicas by priority if you have multiple available or you can group them to enable load-balancing.

Although this seems to be setup correctly so that connections that specify their application intent of read only will be routed to the secondary node I wanted to prove it.

Read on to see how Jess is able to prove it.

Comments closed

HA/DR With Azure SQL Database

Published 2018-08-30 by Kevin Feasel

James Serra explains the High Availability and Disaster Recovery options available when using Azure SQL Database:

High Availability (HA) – Keeping your database up 100% of the time with no data loss during common problems. Redundancy at system level, focus on failover, addresses single predictable failure, focus is on technology. SQL Server IaaS would handle this with:

Always On Failover cluster instances

Always On Availability Groups (in same Azure region)

SQL Server data files in Azure

Disaster Recovery (DR) – Protection if major disaster or unusual failure wipes out your database. Use of alternate site, focus on re-establishing services, addresses multiple failures, includes people and processes to execute recovery. Usually includes HA also. SQL Server IaaS would handle this with:

Log Shipping
Database Mirroring
Always On Availability Groups (different Azure regions)
Backup to Azure

Click through for more details.

Comments closed

Creating A Basic Availability Group

Published 2018-08-10 by Kevin Feasel

Ginger Keys shows us two methods for creating Basic Availability Groups in SQL Server 2016 and later:

AlwaysOn Basic Availability Groups (BAGs) are available with SQL Server 2016 and 2017 Standard edition. The functionality is generally the same as database mirroring (which has been deprecated). This feature replicates transactions to a database on a secondary server, and is useful for disaster recovery should something happen to the primary server.

If you have a database that requires an extra layer of protection or ‘BAG of tricks’, deploying a Basic Availability Group is useful for providing disaster recovery and high availability for the one database. Also there is major cost savings since it is not necessary to purchase SQL Enterprise edition…this can be done in Standard edition of SQL Server.

BAGs provide a failover environment for only one database, and there can only be two replicas in the group. Replication can be synchronous or asynchronous, and there is no read access, no backups and no integrity checks on the secondary. The secondary replica remains inactive unless there is a failover, and Basic AGs can remain on-premises or span from on-prem to Azure.

Read on for the two methods.

Comments closed

Improving AG Database Level Failover

Published 2018-07-26 by Kevin Feasel

Sourabh Agarwal announces improvements to Availability Groups when it comes to database level failover:

In addition to the existing checks, the new implementation has the following additional checks.

The new implementation stores and uses a historical snapshot of the database state information to decide if a failover should be initiated. The health check routine caches the database state and associated error information, for the last two executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the below mentioned error codes) exists in three consecutive runs of the health detection routine, a failover is initiated. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of the availability groups.
The new implementation checks for following additional errors. Majority of these errors are indicative of a hardware issues on the server. Please note, that this is not an exhaustive list of errors which could impact the database availability. There is an outstanding item to include error 824 to this list.

Great news from the Tiger Team.

Comments closed

HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING Wait Type

Published 2018-07-16 by Kevin Feasel

Chirag Shah explains what the HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING wait type really means:

Recently a customer reported an interesting issue, while querying against recently added readable replica, SELECT statement is shown as suspended and session is shown as waiting on HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

[…]

Upon more investigation, it appeared to be waiting on with a wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

The behavior is by design as mention in the SQL Server product documentation and applicable to all version of SQL Server that supports availability group.

Read on for the explanation.

Comments closed

Read-Scale Availability Groups

Published 2018-07-06 by Kevin Feasel

Ryan Adams explains how to create a Read-Scale Availability Group:

A Read-Scale Availability Group is a Clusterless Availability Group. It’s sole purpose and design is to scale out a read workload. More importantly is what it is not. It is NOT a High Availability or Disaster Recovery solution. Since this design has no cluster under it, you lose things like automatic failover and database level health detection. For example, You have reports that run for customers that are in your DMZ that is fire-walled off from your internal network. Opening up ports for Active Directory so that you can have a cluster means opening a ton of ephemeral ports and ports with high attack vectors. Remember the Slammer worm? This solution removes those dependencies.

Click through for the setup scripts as well as a video Ryan created of him putting it all together. As long as you recognize the trade-offs involved, this can be a nice solution to certain problems.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31