Thoughts On Reliability

Stuart Moore wants to rename Site Reliability Engineering:

The word “Site” in the IT domain typically refers to either a physical location (data center site) or an application (web site); however, the heart of the definition is sociotechnical, not strictly technology. From an undated (seriously, Google?) interview with Ben Traynor, the founder of the SRE movement: “… we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment — not only the production environment, but also the development teams, the testing teams, the users, and so on.” While the previous paragraph of that interview specifically focuses on the type of work that’s being done by Google’s SRE team, these rules of engagement show that SRE’s should be concerned with the entire value stream of service delivery including not only operations, but development, testing, and ultimately the end user experience.  In, other words. SRE’s are concerned with the reliability of the whole service, not just the technical parts.

And Brent Ozar reviews Database Reliability Engineering:

Jump to page 189, the Data Replication section of Chapter 10. Campbell & Majors explain the differences between:

  • Single-leader replication – like Microsoft SQL Server’s Always On Availability Groups, where only one server can accept writes for a given database
  • No-leader replication – like SQL Server’s peer-to-peer replication, where any node can accept writes
  • Multiple-leader replication – like a complex replication topology where only 2-3 nodes can accept writes, but the rest can accept reads

The single-leader replication discussion covers pages 190-202 and does a phenomenal job of explaining the pros & cons of a system like Availability Groups. Those 12 pages don’t teach you how to design, implement, or troubleshoot an AG. However, when you’ve finished those 12 pages, you’ll have a much better understanding of when you should recommend a solution like that, and what kinds of gotchas you should watch out for.

That’s what a Database Reliability Engineer does. They don’t just know how to work with one database – they also know when certain features should be used, when they shouldn’t, and from a big picture perspective, how they should build automation to avoid weaknesses.

I can also recommend the Database Reliability Engineering book.  I’ve not seen the finished product yet (it’s buried in my reading list) but I do like it as a challenge for DBAs and developers to step up their games.

Related Posts

Meidinger’s Law

Eugene Meidinger shares his thoughts on the future: Since we are prognosticating, I want to take a guess at one of the constraints limiting the future.  I present you with Meidinger’s law: An industry’s growth is constrained by how much your junior dev can learn in two years. Let me explain. On my team, one […]

Read More

Think Logically When Debugging

Kenneth Fisher explains his debugging technique: Almost everything is made up of smaller pieces. When you are running across a problem, (well, after you check the stupid things) start breaking up what you are doing into smaller pieces. I’m going to give a simplified version of an actual example of something I did recently. No […]

Read More


December 2017
« Nov Jan »