Troubleshooting Availability Group Performance Dips

Simon Su walks us through a customer scenario where transactions per second would drop several orders of magnitude for a second, and then jump back up to normal:

The “Transaction Delay” value is an accumulation of the delay of all the current transaction delay in millisecond. You can see that the “Transaction Delay” counter has the same spikes as the sudden drop of the “Transactions Created/Sec”. Its spikes indicate that at those time points the AG transactions have time delay during commits.  This gives us a very good start point. We can focus on the transaction delay in our AG performance troubleshooting.

So who causes the transaction delay? Is it primary replica, secondary replica, or other factors like network traffic?

As a must go-through step for performance troubleshooting we captured performance monitor logs to check how the performance behaved on both replicas.  We want to find out whether there is any performance bottleneck existing in primary or secondary. For example, whether CPU usage is high when transaction delay spike happens, whether disk queue length is long, disk latency is large, etc.  We expect to find something that has the same spike trend as the “Transaction Created/sec” or “Transaction Delay”. Unfortunately, we do not anything interesting. CPU usage is as low 30%, Disk speed is quite fast. No disk queue length at all. We then checked AG related counters, like the log send queue and the recovery queue as the above two links mentioned but again we do not find anything helpful.

At the endpoint, there’s a reminder that you should keep up to date on patching systems.

Related Posts

Lazy Log Truncation

Paul Randal explains why Virtual Log Files might remain in status 2 even after they are cleared: Earlier this year I was sent an interesting question about why the person was seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log and log_reuse_wait_desc showed NOTHING. I […]

Read More

Removing A Database From An AG

Erin Stellato walks us through the process of removing a database from an Availability Group and then re-adding it later: I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica.  The storage on the replica is slower than that […]

Read More

Categories

April 2018
MTWTFSS
« Mar May »
 1
2345678
9101112131415
16171819202122
23242526272829
30