Recently a customer reported an interesting issue, while querying against recently added readable replica, SELECT statement is shown as suspended and session is shown as waiting on HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING
Upon more investigation, it appeared to be waiting on with a wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING
The behavior is by design as mention in the SQL Server product documentation and applicable to all version of SQL Server that supports availability group.
Read on for the explanation.
A Read-Scale Availability Group is a Clusterless Availability Group. It’s sole purpose and design is to scale out a read workload. More importantly is what it is not. It is NOT a High Availability or Disaster Recovery solution. Since this design has no cluster under it, you lose things like automatic failover and database level health detection. For example, You have reports that run for customers that are in your DMZ that is fire-walled off from your internal network. Opening up ports for Active Directory so that you can have a cluster means opening a ton of ephemeral ports and ports with high attack vectors. Remember the Slammer worm? This solution removes those dependencies.
Click through for the setup scripts as well as a video Ryan created of him putting it all together. As long as you recognize the trade-offs involved, this can be a nice solution to certain problems.
Yes, you may have an availability group – well done – and you may have installed SSRS on both servers. But you’ve only set up the reporting application to point to one of those? And you’ve given the link
https://<<Listener_Name>>/reportsout to the users? Head/desk. I told you at the time that SSRS doesn’t play nicely with AGs. [Nearly misposted as SSRS doesn’t play nicely with SSRS, which, while valid, isn’t the point here…]
Here’s what you need to do to fix this / make sure it doesn’t happen:
Click through to learn what you need to do to make sure there are no problems.
Microsoft released Availability Groups (AG) as a feature in SQL Server 2012. Prior to SQL Server 2016, there were two methods of adding a database to a new AG replica.
- You could provide the Add Database to Availability Group wizard a file share accessible by the primary and secondary replicas. SQL Server would run FULL and LOG backups of each database to the share and use them to restore the database(s) to each replica.
- You could manually run a FULL and LOG backup of each database, copy the backup files to each replica, and restore the databases WITH NORECOVERY.
With SQL Server 2016. Microsoft has provided a third option, Automatic Seeding. With Automatic Seeding, you specify the databases and the replicas and SQL Server will begin transferring data to each replica. The duration of the seeding process depends on the size of the database and the network bandwidth available between primary and secondary replica.
Automatic seeding isn’t perfect, but it’s quite useful.
When availability group was initially released with SQL Server 2012, the transaction log redo was handled by a single redo thread for each database in an AG secondary replica. This redo model is also called as serial redo. In SQL Server 2016, the redo model was enhanced with multiple parallel redo worker threads per database to share the redo workload. In addition, each database has a new helper worker thread for handling the dirty page disk flush IO. This new redo model is called parallel redo.
With the new parallel redo model that is the default setting since SQL Server 2016, workloads with highly concurrent small transactions are expected to achieve better redo performance. When the transaction redo operation is CPU intensive, such as when data encryption and/or data compression are enabled, parallel redo has even higher redo throughput (Redone Bytes/sec) compared to serial redo. Moreover, indirect checkpoint allows parallel redo to offload more disk IO (and IO waits for slow disk) to its helper worker thread and frees main redo thread to enumerate more received log records in secondary replica. It further speeds up the redo performance.
Read on to learn more about these two models, including positives and negatives for each and how to switch from one to the other.
First and foremost, SQL Server 2016 Service Pack 2 was just released today. There are two major improvements in it for AGs:
1. SQL Server 2016 now has full Microsoft Distributed Transaction Coordinator (DTC) support. SQL Server 2016 had partial support for DTC with one of the two scenarios (cross instance/cross platform), but not intra-instance DBs. SQL Server 2017 had both, and now that was backported so SQL Server 2016 supports all DTC scenarios with AGs. This is great news.
Click through for the other major improvement. This is in addition to yesterday’s notice regarding the distribution database.
Many enterprise customers have asked the capability to combine the usage of SQL Server replication and Always On, such that they can place replication distribution databases within an Always On AG to achieve high availability for their distribution databases, with the expectation that after doing so and when AG failover happens, SQL Server replication will continue functioning seamlessly and correctly. While the Replication publication and subscription databases can be configured to use Availability Groups, this support was lacking for the replication Distribution Databases.
SQL Server engineering team is excited to announce the new enhancement around the configuration of Replication Distribution Database in an Availability Group. This feature enhancement would be available with SQL Server 2017 CU6 and will be ported to SQL Server 2016 in a subsequent CU for SP2.
For those people using merge replication, it won’t be supported in this release.
The “Transaction Delay” value is an accumulation of the delay of all the current transaction delay in millisecond. You can see that the “Transaction Delay” counter has the same spikes as the sudden drop of the “Transactions Created/Sec”. Its spikes indicate that at those time points the AG transactions have time delay during commits. This gives us a very good start point. We can focus on the transaction delay in our AG performance troubleshooting.
So who causes the transaction delay? Is it primary replica, secondary replica, or other factors like network traffic?
As a must go-through step for performance troubleshooting we captured performance monitor logs to check how the performance behaved on both replicas. We want to find out whether there is any performance bottleneck existing in primary or secondary. For example, whether CPU usage is high when transaction delay spike happens, whether disk queue length is long, disk latency is large, etc. We expect to find something that has the same spike trend as the “Transaction Created/sec” or “Transaction Delay”. Unfortunately, we do not anything interesting. CPU usage is as low 30%, Disk speed is quite fast. No disk queue length at all. We then checked AG related counters, like the log send queue and the recovery queue as the above two links mentioned but again we do not find anything helpful.
At the endpoint, there’s a reminder that you should keep up to date on patching systems.
When I have an issue with tempdb filling up the first thing that I usually do is try to figure out exactly what the space has been allocated to.
You can quickly figure out what process has the most space allocated by using a quick query against dm_db_session_space_usage.SELECT session_id, database_id, user_objects_alloc_page_count + internal_objects_dealloc_page_count AS TotalAllocatedPages FROM sys.dm_db_session_space_usage ORDER BY TotalAllocatedPages DESC
But what if you can see that there aren’t any pages allocated to sessions? What could be taking up all the space? Well let’s have a little look and see exactly where those pages are allocated.
Click through to see David’s results and explanation.
The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred. While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task. One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time. Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open. It’s not impossible, but it does make things more difficult.
Click through to see how to do this through the UI or via T-SQL.