Thoughts On Reliability

Stuart Moore wants to rename Site Reliability Engineering:

The word “Site” in the IT domain typically refers to either a physical location (data center site) or an application (web site); however, the heart of the definition is sociotechnical, not strictly technology. From an undated (seriously, Google?) interview with Ben Traynor, the founder of the SRE movement: “… we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment — not only the production environment, but also the development teams, the testing teams, the users, and so on.” While the previous paragraph of that interview specifically focuses on the type of work that’s being done by Google’s SRE team, these rules of engagement show that SRE’s should be concerned with the entire value stream of service delivery including not only operations, but development, testing, and ultimately the end user experience.  In, other words. SRE’s are concerned with the reliability of the whole service, not just the technical parts.

And Brent Ozar reviews Database Reliability Engineering:

Jump to page 189, the Data Replication section of Chapter 10. Campbell & Majors explain the differences between:

  • Single-leader replication – like Microsoft SQL Server’s Always On Availability Groups, where only one server can accept writes for a given database
  • No-leader replication – like SQL Server’s peer-to-peer replication, where any node can accept writes
  • Multiple-leader replication – like a complex replication topology where only 2-3 nodes can accept writes, but the rest can accept reads

The single-leader replication discussion covers pages 190-202 and does a phenomenal job of explaining the pros & cons of a system like Availability Groups. Those 12 pages don’t teach you how to design, implement, or troubleshoot an AG. However, when you’ve finished those 12 pages, you’ll have a much better understanding of when you should recommend a solution like that, and what kinds of gotchas you should watch out for.

That’s what a Database Reliability Engineer does. They don’t just know how to work with one database – they also know when certain features should be used, when they shouldn’t, and from a big picture perspective, how they should build automation to avoid weaknesses.

I can also recommend the Database Reliability Engineering book.  I’ve not seen the finished product yet (it’s buried in my reading list) but I do like it as a challenge for DBAs and developers to step up their games.

Related Posts

Gaining Business Understanding Through Paying Attention

Laura Ellis lays down some good tips for understanding business problems: I know this sounds somewhat silly. But, when thinking through the steps that I take to solve a business problem, I realized that I do employ a strategy. The backbone of that strategy is based on the principals of solving a word problem. Yes, […]

Read More

Wat-Provenance And Debugging Distributed Systems

Adrian Colyer reviews an interesting paper on debugging distributed systems: Why why-provenance doesn’t work Relational databases have why-provenance, which sounds on the surface exactly like what we’re looking for. Given a relational database, a query issued against the database, and a tuple in the output of the query, why-provenance explains why the output tuple was […]

Read More

Categories

December 2017
MTWTFSS
« Nov Jan »
 123
45678910
11121314151617
18192021222324
25262728293031