Aaron Bertrand checks a box:
If there has been one constant throughout my career, it’s change. As applications become more complex and we continue improving reliability, there will always be the next patch, upgrade, new replica, new cluster, and even new cloud region – or moving to the cloud in general. For complex architectures, multiple teams are often actively involved, and even more who want to be “in the know” during any changes.
We use tickets (JIRA) to track and document the work, and incidents (FireHydrant) to expose the status to internal and external customers. But these are complex systems to keep current in real-time. And while nearly everything we do is scripted, broad audiences can’t consume code – even when saturated with comments. Since multiple teams are involved, the code is scattered across disparate things like runbooks, which are not easy or desirable to combine. How can a wide range of people stay coordinated during a major change?
For more complicated tasks, I’m all-in on creating either checklists or dedicated runbooks. I have a client that uses merge replication, and every once in a while, we need to rebuild replication. In that case, we have a more detailed runbook with step-by-step instructions, but this is great for keeping track of complex processes, whether or not they go cross-team.
Also, callout to the greatest Site Reliability Engineer ever to play the game, Mario Lemieux.