Monitoring At Stack Overflow

Nick Craver has been driven off the bend by monitoring and we get to enjoy the fruits of it:

…but evidently some people think of other things. Those people are obviously wrong, but let’s continue. When I’m not a walking zombie after reading a 10,000 word blog post some idiot wrote, I see monitoring as the process of keeping an eye on your stuff, like a security guard sitting at a desk full of cameras somewhere. Sometimes they fall asleep–that’s monitoring going down. Sometimes they’re distracted with a doughnut delivery–that’s an upgrade outage. Sometimes the camera is on a loop–I don’t know where I was going with that one, but someone’s probably robbing you. And then you have the fire alarm. You don’t need a human to trigger that. The same applies when a door gets opened, maybe that’s wired to a siren. Or maybe it’s not. Or maybe the siren broke in 1984.

I know what you’re thinking: Nick, what the hell? My point is only that monitoring any application isn’t that much different from monitoring anything else. Some things you can automate. Some things you can’t. Some things have thresholds for which alarms are valid. Sometimes you’ll get those thresholds wrong (especially on holidays). And sometimes, when setting up further automation isn’t quite worth it, you just make using human eyes easier.

This is a really good post covering monitoring techniques at a high level and getting into specific implementations at Stack Overflow.

Related Posts

Finding Queries to Cache In-App

Brent Ozar provides guidance on the types of queries you might want to cache in your application: Question 2: Will out-of-date data really hurt? Some data absolutely, positively has to be up to the millisecond, but you’d be surprised how often I see data frequently queried out of the database when the freshness doesn’t really matter. […]

Read More

Problems Distributed Systems Experience

RJ Zaworski gives us examples of the types of problems you can run into with distributed systems: Time limits: ending the neverendingHere’s one to ponder: how long can a long-running action go on before the customer (even a very patient, very digital customer) loses all interest in the outcome?Pull up a chair. With no upper […]

Read More

Categories

November 2018
MTWTFSS
« Oct Dec »
 1234
567891011
12131415161718
19202122232425
2627282930