First and foremost, SQL Server 2016 Service Pack 2 was just released today. There are two major improvements in it for AGs:
1. SQL Server 2016 now has full Microsoft Distributed Transaction Coordinator (DTC) support. SQL Server 2016 had partial support for DTC with one of the two scenarios (cross instance/cross platform), but not intra-instance DBs. SQL Server 2017 had both, and now that was backported so SQL Server 2016 supports all DTC scenarios with AGs. This is great news.
Click through for the other major improvement. This is in addition to yesterday’s notice regarding the distribution database.
Many enterprise customers have asked the capability to combine the usage of SQL Server replication and Always On, such that they can place replication distribution databases within an Always On AG to achieve high availability for their distribution databases, with the expectation that after doing so and when AG failover happens, SQL Server replication will continue functioning seamlessly and correctly. While the Replication publication and subscription databases can be configured to use Availability Groups, this support was lacking for the replication Distribution Databases.
SQL Server engineering team is excited to announce the new enhancement around the configuration of Replication Distribution Database in an Availability Group. This feature enhancement would be available with SQL Server 2017 CU6 and will be ported to SQL Server 2016 in a subsequent CU for SP2.
For those people using merge replication, it won’t be supported in this release.
The “Transaction Delay” value is an accumulation of the delay of all the current transaction delay in millisecond. You can see that the “Transaction Delay” counter has the same spikes as the sudden drop of the “Transactions Created/Sec”. Its spikes indicate that at those time points the AG transactions have time delay during commits. This gives us a very good start point. We can focus on the transaction delay in our AG performance troubleshooting.
So who causes the transaction delay? Is it primary replica, secondary replica, or other factors like network traffic?
As a must go-through step for performance troubleshooting we captured performance monitor logs to check how the performance behaved on both replicas. We want to find out whether there is any performance bottleneck existing in primary or secondary. For example, whether CPU usage is high when transaction delay spike happens, whether disk queue length is long, disk latency is large, etc. We expect to find something that has the same spike trend as the “Transaction Created/sec” or “Transaction Delay”. Unfortunately, we do not anything interesting. CPU usage is as low 30%, Disk speed is quite fast. No disk queue length at all. We then checked AG related counters, like the log send queue and the recovery queue as the above two links mentioned but again we do not find anything helpful.
At the endpoint, there’s a reminder that you should keep up to date on patching systems.
When I have an issue with tempdb filling up the first thing that I usually do is try to figure out exactly what the space has been allocated to.
You can quickly figure out what process has the most space allocated by using a quick query against dm_db_session_space_usage.SELECT session_id, database_id, user_objects_alloc_page_count + internal_objects_dealloc_page_count AS TotalAllocatedPages FROM sys.dm_db_session_space_usage ORDER BY TotalAllocatedPages DESC
But what if you can see that there aren’t any pages allocated to sessions? What could be taking up all the space? Well let’s have a little look and see exactly where those pages are allocated.
Click through to see David’s results and explanation.
The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred. While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task. One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time. Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open. It’s not impossible, but it does make things more difficult.
Click through to see how to do this through the UI or via T-SQL.
I’ve been hearing about round-robin read-only routing ever since SQL 2016 came out but whenever I tried to test if it’s working it never seemed to be. But now I know exactly how it works and there’s a few loopholes where it may not trigger, and they’re not the documented ones you’re thinking of.
To test the limits of it you’re going to need:
- PowerShell 5.1
- Pester 4 (
Install-Module Pester -Force)
- DbData (
Install-Module DbData -Force)
I’ll explain any of the Pester and DbData bits along the way so don’t worry. They’re minor framework stuff.
There’s some good stuff here around connection pooling, so check it out.
If you happen to be managing SQL Servers with a large number of databases and availability groups, it can sometimes be difficult to keep track of which database belongs to which availability group.
sp_WhatsMyAG will tell you just that. You can either provide it with the database name and it’ll tell you the AG that your specified database belongs to or you can leave the parameter NULL and you’ll get the AG of the database whose context you’re currently in.
Click through for the script.
Jim Donahoe shows off Failover Groups in Azure SQL Database. Part 1 involves setting up a Failover Group:
In my former company, we had 22 web applications that all had connections to various databases. We had all of our databases configured for Geo-Replication already, but still if we had to failover, we had to update each connection string for the web apps along with others which became a tedious process. In came Failover Groups to the rescue! With a Failover Group, I was able to create two endpoints that stayed the same no matter which server was primary/secondary. I liked to think of these as my Availability Group Listeners as they kinda serve the same functionality: Route traffic to a node depending on if its read-only or not. Best part? It’s configured through the Azure Portal SO EASILY! You can use PowerShell as well, but for this blog post, I will walk through the creation via the Portal. I will make a separate post or attach a script at some point for the PowerShell deployment.
Before we start the configuration portion of this though, let’s take a look at how Microsoft defines what a Failover Group is. I found this definition here: “Azure SQL Database auto-failover groups (in-preview) is a SQL Database feature designed to automatically manage geo-replication relationship, connectivity, and failover at scale.” Sounds pretty interesting, right? Let’s make one!
Well, now that the easy stuff is out of the way, let’s talk about how you connect to these groups via SSMS. This is where some of the confusion happens. When I first configured a Failover Group, the first thing I tried to do was connect to the Primary server via SSMS thinking it will work just like an Always On Listener in traditional SQL Server…NEWP!
If you’re running a production database on Azure SQL Database, you might want to look at Failover Groups.
The AlwaysOn Health Events were consistent with the SQL Server Logs. I discovered that the Windows Server Failover Cluster had been down (but was back online), and that the AlwaysOn Group had failed over and back. But the culprit to my secondary replica and databases being disconnected was an endpoint issue. I found in my SQL Server logs the service account running AlwaysOn had for some reason lost its permissions to connect to the endpoint:
Click through for the entire troubleshooting process as well as the solution.
The Latency data collection functionality and the associated reports allows a database administrator to quickly discern the bottleneck in the log transport flow between the Primary and the Secondary replicas of an Availability Group. This feature does NOT answer the question “Is there latency in the Availability Group deployment?” but rather provides a way to understand why there is latency in the Availability Group Deployment. This functionality provides a way to narrow down the potential cause of latency in an Availability Group deployment.
There are some things that this report doesn’t capture, but it does give us a bit more insight.