Betsy is a Technical Writer for Google in NYC specializing in Site Reliability Engineering. She co-authored the books Site Reliability Engineering: How Google Runs Production Systems and The Site Reliability Workbook: Practical Ways to Implement SRE. She previously wrote documentation for Google's Data Center and Hardware Operations teams in Mountain View and across its globally distributed data centers.
Stephen is a Site Reliability Engineer in Google's London office. His work on The Site Reliability Workbook: Practical Ways to Implement SRE drew on his experience introducing SRE practices to Google customers on the Customer Reliability Engineering team. He has been an SRE at Google since 2011 and previously worked on Google Ads and Google App Engine.
Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!
————————————
Do we see Site Reliability Engineering (SRE) as the future of DevOps? Definitely not. DevOps is really a broad set of principles and practices; SRE is a very specific implementation of them. The two are not mutually exclusive. You can take all of those DevOps principles and achieve them by applying SRE.
In the book Accelerate, they identified four key metrics that make for a successful team: lead time, deployment frequency (essentially batch size), time to restore (MTTR), and change failure rate. All of these metrics boil down to continuous delivery – how quickly can we make changes, and how fast can we recover when things go awry?
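To make those metrics concrete, here is a minimal sketch (my illustration, not something from the interview) of computing change failure rate and MTTR from hypothetical release records; the field names and data are invented, and lead time and deployment frequency would fall out of similar timestamps.

```python
# Hypothetical release records; field names are invented for illustration.
from datetime import datetime, timedelta

deploys = [
    {"at": datetime(2018, 9, 1, 10), "failed": False},
    {"at": datetime(2018, 9, 1, 14), "failed": True,
     "restored_at": datetime(2018, 9, 1, 14, 25)},
    {"at": datetime(2018, 9, 2, 9), "failed": False},
]

# Change failure rate: fraction of deploys that caused an incident.
failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# MTTR: mean time from a failed deploy to restoration of service.
restores = [d["restored_at"] - d["at"] for d in deploys if d["failed"]]
mttr = sum(restores, timedelta()) / len(restores)

print(f"change failure rate: {failure_rate:.0%}, MTTR: {mttr}")
```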
But we look too much at this desired outcome – we’re releasing very often, with a low rate of failure – and sometimes lose sight of how we get there. It’s that implementation space where SRE fills a gap. Applying SRE principles is like having a brake on the car; we can say, hey, there’s a corner coming up, slow down and let’s handle this safely. We’ll go slow and steady for a bit, then speed up once we’re on better ground.
We commonly hear people say, "Oh, SRE works for you, because you're a Google – a massive, web-native company." But there are SRE practices you can adopt at any size of company, even without a dedicated SRE team. There are some patterns and principles we wanted to stress, which is why we wrote both SRE books – particularly around how to manage and embrace risk, and how to establish a balanced SLO and an error budget policy. These things are fundamental to a well-run SRE team, and they're something your customers need.
Two Modes of Development: SREs don't take direct responsibility for releases, and our job isn't just to be a brake on development. In fact, we operate in two modes. The first is when we're consistently within our SLO and not consuming enough of our error budget – which means we're actually hampering our own innovation. So SRE should be advocating for increasing the speed of the pipeline. Congratulations, we're in that sweet spot that DevOps is aiming for – low-friction releases – we are a well-performing team.
But the second mode is often the default mode – and that's not stepping on the gas, it's the ability to slow down. If we're constantly running out of the error budget, then we have to slow things down – our rate of failure is simply too high; it's not sustainable. We have to do whatever it takes to make the service more reliable, and not defer that work as debt. That's the fourth attribute we want from DevOps – a low rate of failures in our production releases.
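As a rough illustration of those two modes (my sketch, not Google's actual tooling), the decision can be framed as comparing how much of the error budget has burned against how much of the SLO window has elapsed:

```python
# Compare error budget burned vs. SLO window elapsed; both are fractions 0..1.
def release_mode(budget_consumed: float, window_elapsed: float) -> str:
    if budget_consumed >= 1.0:
        return "slow down: budget exhausted, spend the effort on reliability"
    if budget_consumed < window_elapsed:
        return "speed up: burning slower than the window, room to take risks"
    return "steady: burning at or above pace, watch closely"

print(release_mode(budget_consumed=0.2, window_elapsed=0.5))   # speed up
print(release_mode(budget_consumed=1.05, window_elapsed=0.8))  # slow down
```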
Error Budgets: One of the most frequent questions we got after publishing our first book had to do with forming an error budget policy. It’s actually a concept that’s pretty easy to apply at other organizations.
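As a small worked example (the numbers here are assumed, not from the interview), an SLO translates directly into a budget of allowed failures over some window:

```python
# A 99.9% availability SLO over a month of 50M requests leaves a budget
# of 0.1% of requests that are allowed to fail.
slo_target = 0.999
monthly_requests = 50_000_000

error_budget = (1 - slo_target) * monthly_requests
failed_so_far = 12_000

print(f"Total budget:     {error_budget:,.0f} failed requests")  # 50,000
print(f"Budget remaining: {error_budget - failed_so_far:,.0f}")  # 38,000
```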
You can't get away from the fact that when it comes to instability, releases are one of the primary causes. If we stop or gate releases, the chance of a release causing a problem goes way down. If things are going fine, it's the SRE's job to call out that we're being TOO reliable – let's take more risks. And then, when we've run out of error budget, we want a policy agreed upon in advance so we can slow down the train.
We've seen this error budget policy take a number of different shapes. At one very Agile-focused company we engaged with, when they know a system isn't meeting customer expectations, developers can only pull items off the backlog that are marked as postmortem action items. Another company uses pair programming a lot, so during that error budget overage period – the second mode – they mandate that one pair must be devoted purely to reliability improvements.
Now that's not how we do it at Google – but it's totally effective. We see companies like Atlassian, IBM, and VMware all using error budgets and talking about them in public forums. One thing is for sure though – this policy needs to be explicit, in writing, agreed upon in advance, and supported by management. It's so, SO important to have this discussion before you have the incident.
Business stakeholders and execs sometimes fight for zero downtime – 100% availability. So let's say you're a mobile gaming platform. Any downtime for you means money lost and perhaps customers out the door. So, how are you going to get to 100% reliability? Your customers aren't getting 99.9% reliability out of their phones – how are you going to fix that? Once you point out that, in all likelihood, people won't even notice a small amount of downtime, you end up with a financial argument that has an obvious answer: I can spend millions of dollars for nearly no noticeable result, or accept a more reasonable availability SLO and save that money and the stress of trying to attain perfection.
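The arithmetic behind that argument is simple; a quick sketch (illustrative targets, not anyone's real SLO) shows how little downtime users actually see at a sensible target versus what it costs to chase 100%:

```python
# Downtime allowed per 30-day month at a few availability targets.
minutes_per_month = 30 * 24 * 60
for target in (0.99, 0.999, 0.9999, 1.0):
    allowed = (1 - target) * minutes_per_month
    print(f"{target:.2%} availability -> {allowed:.1f} minutes of downtime allowed")
```

At 99.9% that works out to about 43 minutes a month, which most users of a mobile platform will never notice.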
A competent SRE embraces risk. Our goal is not to slow down or stop releases; it's really about safety, not stability just for its own sake. Going back to that car analogy: if your goal is 100%, the only thing you can do is jam on the brakes, immediately. That's a terrible approach to driving if you think about it – it's not getting you where you need to be. Even pacemaker companies aren't 100% defect-free; they have a documented, acknowledged failure rate. It might be only one in a million pacemakers that fails, but it still happens, and that 99.9999% success rate is still the number they strive for.
Blameless postmortems: It’s counterproductive to blame other people, as it ends up hiding the truth. And it’s not just blaming people – it’s too easy sometimes to blame systems. This is an emotional thing – it gives us a warm fuzzy feeling to come up with one pat answer. But at that point we stop listing all the other factors that could have contributed to the problem.
We don't perform postmortems for each and every incident at Google – only when we can identify an accurate root cause that could be applicable to other outages – something we can actually learn from. We're very careful not to make never-ending lists of everything that can go wrong, and to pick out the really important things that need to be addressed. And your action items need to be balanced: some should be comprehensive, some should be structural, some should be short-term fixes, and they can't all be red-hot in priority. Let's say you have a lower-priority action item that would need to be done by another team. You might legitimately want to defer that, instead of wasting political capital trying to drop work on teams outside your direct control.
It's vitally important to keep postmortems on the radar of your senior leadership. We have a daily standup meeting in one area here at Google, where management looks over anything that's happened in the past few days. We go through each postmortem, people present the issue and the follow-up items they've identified, and management weighs in and provides feedback. This makes sure that the really important fixes are tracked through to completion.
SRE Antipatterns: The magical title change is something that crops up quite often. For example, you know how sometimes developers are given a book or sent to a training class, and then a month later they’re labeled as “Agile”? That same thing happens in the SRE world. Sometimes we see companies taking sysadmins, changing one or two superficial things, and labeling them “DevOps Engineers” or some other shiny new title. But nothing around them has really changed – incentives haven’t changed, and executives have not bought in to making changes that are truly going to be lasting and effective.
Another antipattern is charging ahead without getting that signoff from management. Executive-level engagement on the SRE model – especially the part that has teeth, SLOs and error budgets – is a critical indicator of success or failure. It's how we gauge whether we're in the first working mode (we're reliable enough, let's go faster) or the second (customers are suffering, give us the resources we need). A numerical error budget, a target that is agreed upon, and very specific consequences when that budget gets violated – all of that needs to be consistently enforced, from the top.
A lot of times we find that it doesn't take much convincing to get executives on board, but you do have to have that conversation. We talk to the leadership, who have an emotional need to see their company have a reliable product, and we help them understand it with numbers and measurements instead of gut feel. We also help them understand that once a system becomes unreliable, it can be months or even years until we can bring it back to a reliable state – these are complex systems, and it takes constant attention to keep them running smoothly.
Another antipattern that thankfully we don't see too often is where SRE becomes yet another silo, a gatekeeper. It's really important to cross-pollinate knowledge, so production knowledge is shared. If it's just one group that controls any and all production or release ownership and jealously guards that privilege, we've failed. So at Google, we do something called "Mission Control", where developers can join an SRE team for one or two quarters. It's a great way of disseminating knowledge and getting coders to see what it's like on the other side of the fence.
DiRT and GameDays: We find that it's absolutely vital to practice for failure. Netflix and others obviously have had a lot of success with the Simian Army and Chaos Monkey, where engineers whack production systems at random to test availability. We use this approach somewhat at Google with our annual, company-wide DiRT (Disaster Recovery Testing) exercises. But locally, we use something entirely theoretical, very low-key, and less intimidating for newbies – something we call a Wheel of Misfortune exercise.
It works almost like a D&D game. It's held once a week, and lots of SREs show up as onlookers, because it's fun. There's a gamemaster present, who throws the scenario – something that actually happened, not too long ago – up on a whiteboard. An SRE takes on the role of the "player", someone who's handling incident response. As they walk through how they'd troubleshoot and debug the incident, the gamemaster responds with what happens. A lot of times the gaps come down to documentation – where's the playbook for this system? What information would have helped that support team get to a root cause faster? It's great for the audience, because it's very engaging and collaborative – a great group socialization exercise. We always end up with lots of creative action items that help us down the road.
Livesite Support: We do feel that it's vital for development teams to do some kind of production support. Now, we throw around that "at least 5%" number a lot, but that's really just a generic goal. The real aim here is to break down that palace wall, the silo between developers and operations. Many people assume that at Google every team has SREs, but that's not the case. In fact, our default model is 100% developer-supported services, end to end – SREs are really used more for high-profile, public-facing, or mission-critical systems. The more dev ownership of production you can get, the more sustainably you'll be able to support production.
Reducing toil is always top of mind for us. Any team tasked with operational work will have some degree of toil – manual, repetitive work. While toil can never be completely eliminated, it can and should be watched carefully. Left unchecked, toil naturally grows over time to consume 100% of a team’s resources. So it’s vital to the health of the team to focus relentlessly on tracking and reducing toil – that’s not a luxury, it’s actually necessary for us to survive.
At Google, we keep a constant eye on that toil/project-work dichotomy, and we make sure it's capped. We've found that a well-run SRE team should be spending no more than 50% of its time on toil. A poorly run SRE team might end up with 95% toil, which leaves just 5% of its time for project work. At that point you're bankrupt – you don't have enough time to drive down your toil or eliminate the things that are causing reliability issues; you're just trying to survive, completely overwhelmed. So part of that policy you agree upon with your stakeholders must be enforcing the cap on toil, leaving at least half of your capacity for improving the quality of your services. Failing to do that is definitely an antipattern, because it leads to being overwhelmed by technical debt.
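A hypothetical sketch of what enforcing that cap can look like (the categories and hours are invented; a real team would pull these from its ticketing or time-tracking system):

```python
# Invented weekly hours by category; flag the team when toil crosses 50%.
weekly_hours = {
    "toil": {"manual pushes": 6, "ticket triage": 9, "ad-hoc requests": 7},
    "project": {"automation work": 14, "reliability fixes": 4},
}

toil = sum(weekly_hours["toil"].values())
total = toil + sum(weekly_hours["project"].values())
toil_fraction = toil / total

print(f"Toil this week: {toil_fraction:.0%}")
if toil_fraction > 0.5:
    print("Over the cap: per the agreed policy, shift effort back to project work")
```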
References:
[sre] – Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, O'Reilly Media, April 2016, ISBN-13: 978-1491929124
[ghbsre] – The Site Reliability Workbook: Practical Ways to Implement SRE, Niall Murphy, David Rensin, Betsy Beyer, Kent Kawahara, and Stephen Thorne, O'Reilly Media, August 2018, ISBN-13: 978-1492029502
[gsre] – https://landing.google.com/sre/book.html – free PDF versions of the revised [sre] text and the follow-up handbook.
[kieran] – “Managing Misfortune for Best Results”, 8/30/2018, Kieran Barry, SREcon EMEA, https://www.usenix.org/node/218852. This is a great overview of the Wheel of Misfortune exercises in simulating outages for training, and some antipatterns to avoid.