
Category: Administration

Alerting on Long-Running SQL Queries and SQL Agent Jobs

Temidayo Omoniyi sends an e-mail:

Have you ever waited an eternity for a query or a SQL Agent job to finish? This is something most data warehouse developers face daily.

Click through to see how you can use Database Mail to track long-running tasks. My primary hang-up with solutions like this is: what are you going to do about the e-mail? If there is no concrete action you can take, the most likely outcome is that you ignore the e-mail. That makes it harder to sift out the true positive you need to look into from the false positives that happen every day.
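For reference, the general shape of such an alert is a Database Mail call driven off `sys.dm_exec_requests`. This is a minimal sketch, not the article's script; the 300-second threshold, the `DBA_Alerts` profile name, and the recipient address are placeholders.

```sql
-- Sketch: e-mail a summary of any request running longer than 5 minutes.
DECLARE @body nvarchar(max);

SELECT @body = STRING_AGG(
           CONCAT('session ', r.session_id, ': ',
                  DATEDIFF(SECOND, r.start_time, SYSDATETIME()), ' sec'),
           CHAR(13) + CHAR(10))
FROM sys.dm_exec_requests AS r
WHERE r.session_id > 50   -- skip system sessions
  AND DATEDIFF(SECOND, r.start_time, SYSDATETIME()) > 300;

IF @body IS NOT NULL
    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = 'DBA_Alerts',           -- placeholder profile
         @recipients   = 'dba-team@example.com', -- placeholder address
         @subject      = 'Long-running queries detected',
         @body         = @body;
```

Scheduling that as an Agent job every few minutes is the usual pattern, which is also exactly why the false-positive concern above matters.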


Time Delay for Online Checksums in PostgreSQL

Christophe Pettus notes an upcoming change in PostgreSQL 19:

For about fifteen years the answer to “can I turn on data checksums without an initdb?” has been “not really.” pg_checksums showed up in PostgreSQL 12 and made the job survivable, but you still had to shut the cluster down. For anyone running 24×7 production, that has left the same three options: take the downtime, fail over through a checksummed replica, or live without checksums.

PostgreSQL 19 adds a fourth path. A commit from Daniel Gustafsson on April 3rd wires up online enabling and disabling of data checksums: the command completes immediately, and the cluster keeps serving traffic while a background process rewrites every heap and index page in the cluster to carry (or drop) the checksum.

Read on to see what it will do, as well as the consequences.
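For context, you can already check a cluster's current state with a one-liner. Earlier iterations of this patch exposed the toggle as a SQL-callable function; the function name below is an assumption based on those earlier versions, and the final PostgreSQL 19 interface may differ.

```sql
-- Works today: is this cluster running with data checksums?
SHOW data_checksums;

-- Hypothetical, based on earlier versions of the patch: kick off the
-- online background rewrite. The call returns immediately; the rewrite
-- of every heap and index page happens in the background.
SELECT pg_enable_data_checksums();
```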


Unplanned Failover and SQL Server on Kubernetes

Anthony Nocentino performs additional testing:

In my planned failover walkthrough, I showed what happens when you deliberately move the primary role to another replica. That’s the easy case. Now I want to show what happens when the primary pod just disappears unexpectedly, like during a node failure or a container crash. No graceful shutdown, no demotion, just gone.

I ran two test scenarios, each cycling the primary role across all three pods by force-deleting the current primary three times in a row. First, a 5GB TPC-C database idle. Then, that same 5GB database under sustained HammerDB TPC-C load. Six force-deletes total, six successful automatic failovers. I’ll walk through the error log from the promoted replica, the operator’s detection and recovery behavior, and the full timing data.

Read on to see how Anthony’s SQL Server Kubernetes operator handles when things go bump in the night.
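If you want to verify the outcome of a failover like this yourself, a quick check against the availability group DMVs shows which replica currently holds the primary role (this is a generic check, not part of Anthony's operator):

```sql
-- After a failover, confirm which replica now holds the primary role
-- and whether the others are connected and healthy.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = ars.replica_id;
```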


Using the Dedicated Admin Connection in SQL Server

Garry Bargsley breaks down the door:

It’s 2 AM. Your phone is going off. Users can’t connect to the application, and when you open SSMS to investigate, the connection spinner just keeps spinning. SQL Server is alive; you can see the process running, but it’s too overwhelmed to let you in. You need to get in there and kill something, but you can’t get a connection to do it. This is exactly the scenario the Dedicated Admin Connection (DAC) was built for. And if you haven’t set it up yet, now is the time. Because when you need it, you really need it.

Because there is a preparatory step, it’s important to run that while the instance is in a healthy state. That way, it’ll be available to you when the instance is at the edge of failure.
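That preparatory step boils down to one `sp_configure` setting, which by default restricts the DAC to local connections only:

```sql
-- Allow the DAC over the network (by default it is local-machine only).
EXEC sys.sp_configure 'remote admin connections', 1;
RECONFIGURE;

-- Then, at 2 AM, connect with the ADMIN: prefix, e.g. from sqlcmd:
--   sqlcmd -S admin:YourServerName -E
```

Only one DAC session is allowed at a time, so connect, diagnose, and disconnect.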


Syncing Logins across Failover Groups for Managed Instances

Andy Brownsword gets replicating:

Failover Groups for Managed Instances are a great option to replicate data, but they don’t replicate key instance elements – one of which is logins that live in the master database. If left unchecked, failovers leave systems unable to connect and panic ensues.

To alleviate this we’ll look at a script to synchronise logins and permissions across replicas.

Click through for a link to the script and an explanation of what’s going on with it.
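The core trick in scripts like this is preserving each login's SID and password hash so database users on the secondary still map after failover. A minimal sketch of that technique (this is the general `sp_help_revlogin`-style approach, not Andy's script):

```sql
-- Generate CREATE LOGIN statements that carry the original SID and
-- password hash, so the same login maps to the same database users
-- on the other replica.
SELECT 'CREATE LOGIN ' + QUOTENAME(name)
     + ' WITH PASSWORD = '
     + CONVERT(varchar(512), password_hash, 1) + ' HASHED'
     + ', SID = ' + CONVERT(varchar(100), sid, 1) + ';'
FROM sys.sql_logins
WHERE name NOT LIKE '##%';   -- skip internal certificate-based logins
```

Run the generated statements on the secondary; a full solution also has to cover server roles and permissions, which is where the linked script earns its keep.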


Increasing CPU Capacity or Tuning Queries

John Deardurff explains how to make a choice:

Recently, while discussing the Task Execution Model and Thread Scheduling, I was asked the following question: when our worker threads are under pressure and the instance is becoming exhausted, how can we determine whether we should increase CPU capacity or focus on query tuning?

In my brain, I thought, that is a great question, and it’s exactly the right way to think about worker thread pressure vs. real CPU starvation, especially when worker threads are getting tight. Let’s write a post.

John has a nice discussion of the trade-offs and signals associated with each approach. A third approach I might add is caching in the application(s), if applicable. This is especially useful if a significant fraction of the queries access static or nearly-static data.
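Two DMV checks help tell these cases apart. Roughly: a long runnable queue points at CPU pressure, while a growing work queue and accumulating `THREADPOOL` waits point at worker-thread exhaustion, which is often driven by blocking rather than a lack of cores. A quick sketch:

```sql
-- Per-scheduler picture: CPU pressure vs. worker starvation.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count,   -- tasks waiting for CPU: high = CPU pressure
       work_queue_count,       -- tasks waiting for a worker: high = thread starvation
       current_workers_count
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';

-- THREADPOOL waits accumulating here also point at worker-thread exhaustion.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'THREADPOOL';
```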


Write Storms and PostgreSQL

Shaun Thomas talks checkpoints:

Every database has to reconcile two uncomfortable truths: memory is fast but volatile, and disk is slow but durable. Postgres handles this tension through its Write-Ahead Log (WAL), which records every change before it happens. But the WAL can’t grow forever. At some point, Postgres needs to flush all those accumulated dirty pages to disk and declare a clean starting point. That process is called a checkpoint, and when it goes wrong, it can bring throughput to its knees.

One thing I would note is that direct-attached NVMe storage is only about one order of magnitude slower than RAM. Yes, that's still a lot slower, but the gap has closed significantly. If you have PCIe 5 NVMe drives (call that 12-14 GB/sec) and relatively slow RAM (20 GB/sec), they're getting close to on par. But once you step down from the top of the line in disk speed, you add more orders of magnitude back and everything Shaun describes becomes a problem again.
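The standard knobs for smoothing out checkpoint write storms are the checkpoint spacing and spreading parameters. The values below are illustrative starting points, not recommendations from either article:

```sql
-- Spread checkpoint I/O out over time instead of letting it arrive
-- as a burst. Values here are illustrative; tune for your workload.
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '8GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- the default since v14
SELECT pg_reload_conf();
```

Longer, more spread-out checkpoints trade a bigger WAL (and longer crash recovery) for calmer steady-state I/O, which is exactly the tension Shaun's post explores.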

Jeremy Schneider offers a follow-up involving autovacuum_cost_delay:

A few days ago, Shaun Thomas published an article over on the pgEdge blog called “Checkpoints, Write Storms, and You.” Sadly, a lot of corporate blogs don’t have comment functionality anymore. I left a few comments on LinkedIn, but overall let me say this article is a great read, and I’m always happy to see someone dive into an important and overlooked topic, present a good technical description, and include real test results to illustrate the details.

I don’t have any reproducible real test results today. But I have a good story and a little real data.

Check out both of those articles.


Accelerated Database Recovery in tempdb for SQL Server 2025

Rebecca Lewis looks into a feature:

Two weeks ago I covered the Resource Governor changes in SQL Server 2025 — specifically, capping how much tempdb data space a workload group can consume. That was the data-file side. For the log side, SQL Server 2025 now lets you enable Accelerated Database Recovery (ADR) on tempdb. Enable it and cancelled queries stop grinding, the tempdb log stops bloating, and recovery gets faster. Sounds like an easy yes — but you’ve got to read the fine print.

Click through for that fine print.
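Assuming tempdb accepts the same ADR syntax as user databases in SQL Server 2025 (an assumption worth checking against Rebecca's fine print), enabling and verifying it looks like this:

```sql
-- Hypothetical sketch: enable ADR on tempdb, assuming SQL Server 2025
-- accepts the same syntax used for user databases. Read the fine print
-- in the linked post before doing this in production.
ALTER DATABASE tempdb SET ACCELERATED_DATABASE_RECOVERY = ON;

-- Verify the setting:
SELECT name, is_accelerated_database_recovery_on
FROM sys.databases
WHERE name = 'tempdb';
```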
