Thomas Williams has a multi-part series on automated monitoring for the DBA. Part 1 is a big-picture primer:
It would be nice if every problem, now and in the future, was automatically monitored by an intuitive, cheap, flexible tool that only raised the alarm when there was a real problem, at the right time, to the right people.
However, in reality, monitoring is a game of whack-a-mole, built on an ad-hoc collection of tools and technologies, requiring know-how to interpret the often puzzling situations which may be high priority – or just background noise.
Part 2 continues the big-picture look:
Monitoring is here to stay. There’s never enough disk space, memory, CPU, or network throughput. If they ever existed, perfect conditions don’t persist, whether because of a downstream system failing, a newly discovered security issue, a change in process, an extraordinary amount of load, etc. And systems grow and change to meet new requirements, so yesterday’s monitoring may not meet tomorrow’s uptime goals.
Following on from part 1, here are some further “big picture” considerations for automated monitoring for the DBA:
Part 3 gets to ideas on instrumentation:
With some of the broader ideas out of the way, if there’s no monitoring in place and I was the “accidental”/”default” DBA, there are some metrics I’d definitely want to monitor. However, there are a ton of real-time metrics I haven’t included below, like CPU and memory use, number of current connections, and disk I/O as at right now. They’re probably not good candidates for “roll-your-own”-type monitoring.
I consider “alerts” in the table below to include an e-mail, SMS, or dashboard item that aims to trigger a response to fix the issue. The collection frequency especially is just a guide – closer to real-time might be desirable, but the trade-off is impact on servers and connected apps.
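To make the “roll-your-own” idea a little more concrete, here’s a minimal sketch (mine, not from the series) of what one of these scheduled check-and-alert jobs could look like: sample a single metric and send an e-mail when it crosses a threshold. The drive letter, threshold, SMTP host, and addresses are all placeholders – the point is the shape, which is the same for any metric you collect on a schedule (run it from SQL Agent, Task Scheduler, cron, etc.).

```python
"""Minimal sketch of a roll-your-own monitoring check: sample one metric
(free disk space) and e-mail an alert when it crosses a threshold.
Host names, addresses, and thresholds are placeholders."""

import shutil
import smtplib
from email.message import EmailMessage

# Hypothetical values - tune the threshold and the schedule; collecting too
# frequently is the trade-off mentioned above (load on servers and apps).
DRIVE = "D:\\"                      # volume hosting the database files
FREE_SPACE_THRESHOLD_GB = 20        # alert when free space drops below this
SMTP_HOST = "smtp.example.com"      # placeholder mail relay
ALERT_TO = "dba-team@example.com"   # placeholder recipient


def check_free_space(drive: str, threshold_gb: float) -> str | None:
    """Return an alert message if free space is below the threshold, else None."""
    usage = shutil.disk_usage(drive)
    free_gb = usage.free / 1024 ** 3
    if free_gb < threshold_gb:
        return (f"Low disk space on {drive}: {free_gb:.1f} GB free "
                f"(threshold {threshold_gb} GB)")
    return None


def send_alert(message: str) -> None:
    """E-mail the alert; an SMS gateway or dashboard write could go here instead."""
    msg = EmailMessage()
    msg["Subject"] = "Monitoring alert"
    msg["From"] = "monitoring@example.com"
    msg["To"] = ALERT_TO
    msg.set_content(message)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Run from a scheduled job (e.g. every 15 minutes); the frequency is a
    # guide, not a hard requirement.
    alert = check_free_space(DRIVE, FREE_SPACE_THRESHOLD_GB)
    if alert:
        send_alert(alert)
```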
Check out all three posts so far in the series.