Thoughts On Reliability

Stuart Moore wants to rename Site Reliability Engineering:

The word “Site” in the IT domain typically refers to either a physical location (data center site) or an application (web site); however, the heart of the definition is sociotechnical, not strictly technology. From an undated (seriously, Google?) interview with Ben Treynor, the founder of the SRE movement: “… we have a bunch of rules of engagement, and principles for how SRE teams interact with their environment – not only the production environment, but also the development teams, the testing teams, the users, and so on.” While the previous paragraph of that interview specifically focuses on the type of work that’s being done by Google’s SRE team, these rules of engagement show that SREs should be concerned with the entire value stream of service delivery including not only operations, but development, testing, and ultimately the end user experience.  In other words, SREs are concerned with the reliability of the whole service, not just the technical parts.

And Brent Ozar reviews Database Reliability Engineering:

Jump to page 189, the Data Replication section of Chapter 10. Campbell & Majors explain the differences between:

  • Single-leader replication – like Microsoft SQL Server’s Always On Availability Groups, where only one server can accept writes for a given database
  • No-leader replication – like SQL Server’s peer-to-peer replication, where any node can accept writes
  • Multiple-leader replication – like a complex replication topology where only 2-3 nodes can accept writes, but the rest can accept reads

The single-leader replication discussion covers pages 190-202 and does a phenomenal job of explaining the pros & cons of a system like Availability Groups. Those 12 pages don’t teach you how to design, implement, or troubleshoot an AG. However, when you’ve finished those 12 pages, you’ll have a much better understanding of when you should recommend a solution like that, and what kinds of gotchas you should watch out for.

That’s what a Database Reliability Engineer does. They don’t just know how to work with one database – they also know when certain features should be used, when they shouldn’t, and from a big picture perspective, how they should build automation to avoid weaknesses.

I can also recommend the Database Reliability Engineering book.  I’ve not seen the finished product yet (it’s buried in my reading list) but I do like it as a challenge for DBAs and developers to step up their games.
