Jane Miceli of Micron and I are doing a presentation on “Postmortems With Teeth … But No Bite!” at DevOps Days in Boise. We wanted to share an article that can go into more detail than we’ll be able to fit into our 30-minute window. Enjoy!
It’s been said that a person’s character is revealed when things go wrong. So when things go wrong at your enterprise – what happens? What kind of character does your company show when the chips are down?
We’re guessing one of two things happens. First is the “outage? What outage?” type of response. It’s possible that your company has NO postmortem process; when failure happens, there might be a few words, but it’s kept informal, within the family. That’s a big mistake, for reasons we’ll go into below. The second and most common is the “rub the puppy’s nose in it” response – where the bad employee(s) that triggered the outage are named, shamed, and blamed. We’d like to spend a few minutes on why both of these common reactions are so harmful, and set you up for better success with a proven antidote – the blameless postmortem.
Why We Need Postmortems
I tell the story in my book about when I was working for an insurance company. On my way in to work, I stopped by to grab a coffee and a donut (OK, several donuts!) and took a glance at the Oregonian newspaper. I almost spit out my coffee, right there at the counter. There, at the top of the front page, was my company – right where we did NOT want to be. Someone had sent out a mailer, and it had included personal information (names, addresses, DOB, SS#). Worse, many of these mailers ended up in the wrong subscriber’s hands. It was a massive data leak, and there was no place for us to hide from it. I knew the team that had made this mistake – I even knew who’d sent out the mailer. Hmm, I thought, as I headed into the office. We’ve got a long week of damage control ahead of us. I wonder what’s going to happen to Bobby?
And that’s the interesting part. Nothing happened. There were a few high-level meetings with executives – no engineers or operators allowed in the room of course – on how to best position us and recover from the PR hits we were taking. But while nothing happened to Bobby – which was a good thing, he was just tired and had made a mistake – we didn’t learn anything from it either. No report, no knowledgebase article – it was like nothing had happened. It was only a matter of time until a tired operator triggered yet another leak of sensitive information.
This type of reaction is understandable, and it’s rooted deep in our psychology. None of us likes to look too closely at our failures or mistakes. But without understanding that mistakes and errors are a normal part of any complex system, we’re missing out on a huge opportunity to learn. And you could make a strong argument that without a postmortem process, any DevOps process is handcuffed. Winning companies that we admire – names like Amazon, Google, Etsy – all make the same mistakes that other companies make. There’s a critical difference though in how they learn from those mistakes, and how they view them.
Why We Need BLAMELESS Postmortems
A blameless postmortem focuses on identifying contributing causes of an incident, without calling out any particular individual or team for being “bad” or handling things incompetently. It assumes good intentions and that everyone acted in the proper way – given the information, capabilities and processes available at the time. By investigating more into the context behind a failure – what caused that operator to make that decision at 1:30 in the morning? – we can create safer processes.
And it’s a critical part of several companies’ DevOps implementations. Google, for example, views blameless postmortems as being a critical part of their culture – so much so that both the excellent “Site Reliability Engineering” and “The Site Reliability Workbook” have entire chapters on it. Etsy in particular has made some very profound statements on blameless postmortems:
One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!” …Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event…
Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every “mistake” is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
…We believe that this detail is paramount to improving safety at Etsy. …If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right?
There’s a great book called “Barriers and Accident Prevention” by Erik Hollnagel that deserves more reading than it gets. In it, Hollnagel says the “Bad Apple” theory above – that if we punish or remove the “bad apples” causing these failures, we’ll improve safety – is fundamentally flawed because it assumes bad motives or incompetence:
We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.
Accidents Are Emergent; Accidents Are Normal
The root fallacy here is thinking that accidents are abnormal or an anomaly. Accidents or mistakes are instead a byproduct; they are emergent, a consequence of change and the normal adjustments associated with complex systems. This is the true genius behind the SRE movement begun by Google; instead of striving for the impossible (Zero Defect meetings! Long inquisitor-type sessions to determine who is at fault and administer punishment over any failure!) – they say that errors and mistakes are going to happen, and it is going to result in downtime. Now, how much is acceptable to our business stakeholders? The more downtime (mistakes) we allow – as a byproduct of change – the faster we can innovate. But that extra few 9’s of availability – if the business insists on it – means a dramatic slowdown to any change, because any change to a complex system carries the risk of unintended side effects.
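To put rough numbers on what each extra 9 costs, here is a minimal sketch of the arithmetic. The availability targets are illustrative, not figures from Google or any particular business, and the function name is our own.

```python
# Yearly downtime budget implied by an availability target.
# The targets below are illustrative assumptions, not recommendations.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for a target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {budget:.1f} minutes of downtime per year")
```

Each additional 9 cuts the budget by a factor of ten: roughly 5,256 minutes a year at 99%, but only about 5 minutes a year at 99.999%. That shrinking budget is what every change has to fit inside.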
I’m turning to John Allspaw again as his blog post is (still) unequalled on the topic:
Of course, for all this, it is also important to mention that no matter how hard we try, this incident will happen again, we cannot prevent the future from happening. What we can do is prepare: make sure we have better tools, more (helpful) information, and a better understanding of our systems next time this happens. Emphasizing this often helps people keep the right priorities top of mind during the meeting, rather than rushing to remediation items and looking for that “one fix that will prevent this from happening next time”. It also puts the focus on thinking about what tools and information would be helpful to have available next time and leads to a more flourishing discussion, instead of the usual feeling of “well we got our fix, we are done now”.
…We want the engineer who has made an error give details about why (either explicitly or implicitly) he or she did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.
So, good postmortems don’t stop at blaming the silly / incompetent / dangerous humans, and recognize that mistakes and disasters are a normal part of doing business. Our job is to collect as much information as possible so we can provide more information to the people who need it the next time that combination of events takes place, shortening the recovery cycle.
I remember saying this when I was at Columbia Sportswear, long before I knew what a blameless postmortem was, when something went awry: “I’m OK with making mistakes. I just want to make new and different mistakes.”
Stopping At Human Causes Is Lazy
During the postmortem process, the facilitator helps the team drill down a little deeper behind human error:
… As we go along the logs, the facilitator looks out for so-called second stories – things that aren’t obvious from the log context, things people have thought about, that prompted them to say what they did, even things they didn’t say. Anything that could give us a better understanding of what people were doing at the time – what they tried and what worked. The idea here being again that we want to get a complete picture of the past and focusing only on what you can see when you follow the logs gives us an impression of a linear causal chain of events that does not reflect the reality.
Etsy didn’t invent that; this comes from the great book “Behind Human Error” by David Woods and Sidney Dekker, which distinguishes between the obvious (human) culprits and the elusive “second story” – what caused the humans involved to make a mistake:
- First story: Human error is seen as cause of failure
- Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization
- First story: Saying what people should have done is a satisfying way to describe failure
- Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did
- First story: Telling people to be more careful will make the problem go away
- Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety
The other giant in the field is Sidney Dekker, who called processes that stop at human error the “Bad Apple Theory”. The thinking goes that if we get rid of bad apples, we’ll get rid of human-triggered errors. This type of thinking is seductive, tempting. But it simply does not go far enough, and will end up encouraging less transparency. Engineers will stop trusting management, and information flow upwards will dry up. Systems will become harder to manage and unstable as less information is shared even within teams. Lacking understanding of the context behind how an incident occurred practically guarantees a repeat incident.
There Is No Root Cause (The Problem With The Five Whys)
Reading accounts about any disaster – the 1996 Everest disaster that claimed 8 lives, the Chernobyl disaster, even the Challenger explosion – there is never one single root cause. Almost always, it’s a chain of events – as Richard Cook put it, failures in complex systems require multiple contributing causes, each necessary but only jointly sufficient.
This goes against our instincts as engineers and architects, who are used to reducing complex problems down as much as possible. A single, easily avoidable root cause is comforting – we’ve plugged the mouse hole, that won’t happen again. Whew – all done! But complex systems can’t be represented as a cherry-picked list of events, a chain of dominoes; pretending otherwise means we trick ourselves into a false sense of security and miss the real lessons.
The SRE movement is very careful not to stop at human error; it’s also careful not to stop at a single root cause, which is what the famous “Five Whys” linear type drilldown encouraged by Toyota promotes. As the original SRE book put it:
This is why we focus not on the action itself – which is most often the most prominent thing people point to as the cause – but on exploring the conditions and context that influenced decisions and actions. After all there is no root cause. We are trying to reconstruct the past as close to what really happened as possible.
Who Needs To Be In The Room?
Well, you’re going to want to have at least a few people there:
- The engineer(s) / personnel most directly involved in the incident
- A facilitator
- On-call staff or anyone else that can help with gathering information
- Stakeholders and business partners
Why the engineers/operators involved? We mentioned a little earlier the antipattern of business- or executive-only discussions. You want to have the people closest to the incident telling the story as it happened. And this just happens to be the biggest counter to that “lack of accountability” static you are likely to get. John Allspaw put it best:
A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items. So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.
…Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures. We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
Why a facilitator? This is a “playground umpire”, someone who enforces the rules of behavior. This person’s job is to keep the discussion within bounds.
The Google SRE book goes into the psychology behind disasters and the role of language in great detail. But you’re going to want to eliminate the use of counterfactuals: the belief that if only we had known, had done that one thing differently, the incident would not have happened – the domino theory. Etsy is very careful to have the facilitator watch for any use of the phrases “would have”, “should have”, etc. in writeups and retrospectives:
Common phrases that indicate counterfactuals are “they should have”, “she failed to”, “he could have” and others that talk about a reality that didn’t actually happen. Remember that in a debriefing we want to learn what happened and how we can supply more guardrails, tools, and resources next time a person is in this situation. If we discuss things that didn’t happen, we are basing our discussion on a reality that doesn’t exist and are trying to fix things that aren’t a problem. We all are continuously drawn to that one single explanation that perfectly lays out how everything works in our complex systems. The belief that someone just did that one thing differently, everything would have been fine. It’s so tempting. But it’s not the reality. The past is not a linear sequence of events, it’s not a domino setup where you can take one away and the whole thing stops from unraveling. We are trying to make sense of the past and reconstruct as much as possible from memory and evidence we have. And if we want to get it right, we have to focus on what really happened and that includes watching out for counterfactuals that are describing an alternative reality.
Interestingly enough, it’s usually the main participants that are the most prone to falling into this coulda-shoulda-woulda type thinking. It’s the facilitator’s job to keep the discussion within bounds and prevent accusations / self-immolation.
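A facilitator can’t catch every slip in the room, but a written draft is easy to scan. The sketch below is our own illustration, not a tool from Etsy or Google. The phrase list is an assumption built from the examples quoted above, and the draft text is invented.

```python
import re

# Phrases drawn from the examples quoted above, plus close variants.
# This list is an assumption; tune it to your own team's language.
COUNTERFACTUAL_PATTERN = re.compile(
    r"\b(should have|could have|would have|failed to|if only)\b",
    re.IGNORECASE,
)


def find_counterfactuals(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for every counterfactual phrase found."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in COUNTERFACTUAL_PATTERN.finditer(line):
            hits.append((line_number, match.group(0).lower()))
    return hits


# Invented draft text, for illustration only.
draft = """The operator should have checked the replica first.
At 01:30 the on-call engineer ran the cleanup script on db1.
She failed to notice the hostname in the prompt."""

for line_number, phrase in find_counterfactuals(draft):
    print(f"line {line_number}: counterfactual phrase '{phrase}'")
```

A hit is a prompt to rewrite the sentence as what actually happened and why it made sense at the time, not a verdict on whoever wrote it.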
How To Do Blameless Postmortems Right
There are two great postmortem examples we often point to: the first is found in both SRE books (see the Appendix). The second – which Jane often uses – was a very prominent outage at GitLab, found here: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
A great writeup like this doesn’t come from nowhere. Likely, the teams shared a draft internally – and even had it vetted for completeness by some senior architects/engineers. The reviewers will want to make sure that the account has a detailed timeline showing the actions taken and what expectations and assumptions were made. They’ll also want to make sure the root cause analysis is deep enough, that information was broadcast appropriately, and that the action items are complete and prioritized correctly.
If you have an hour-long postmortem review, you may spend more than half of that time going over a timeline. That seems like an absurd waste of time, but don’t skip it. During a stressful event, it’s easy to misremember or omit facts. If the timeline isn’t as close as possible to what actually happened, you won’t end up with the right remediation steps. And it may also expose gaps in your logging and telemetry.
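As a small illustration of how a timeline can surface those gaps, here is a sketch that sorts events into order and flags long stretches with nothing recorded. The events, the 30-minute threshold, and the function name are all hypothetical; in practice the entries would come from your chat logs, alerts, and deploy history.

```python
from datetime import datetime, timedelta


def find_timeline_gaps(events, threshold):
    """Sort (timestamp, note) events; return them with any silent stretches.

    A silent stretch is a pair of consecutive events more than
    `threshold` apart.
    """
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (earlier, _), (later, _) in zip(ordered, ordered[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later))
    return ordered, gaps


# Hypothetical events, entered in the order people remembered them.
events = [
    (datetime(2024, 3, 5, 23, 27), "Rollback started"),
    (datetime(2024, 3, 5, 21, 0), "Alert fires: replication lag"),
    (datetime(2024, 3, 5, 21, 10), "On-call engineer acknowledges page"),
]

ordered, gaps = find_timeline_gaps(events, timedelta(minutes=30))
for timestamp, note in ordered:
    print(timestamp.strftime("%H:%M"), note)
for start, end in gaps:
    print(f"Nothing recorded between {start:%H:%M} and {end:%H:%M}")
```

A flagged stretch doesn’t mean nothing happened. It usually means people were working without leaving a trace, and that missing trace is worth asking about in the review.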
Once the timeline is set, it’s time to drill down a little deeper. Google keeps the discussion informal but always aimed at uncovering the Second Story:
This discussion doesn’t follow a strict format but is guided by questions that can be especially helpful, including: “Did we detect something was wrong properly/fast enough?”, “Did we notify our customers, support people, users appropriately?”, “Was there any cleanup to do?”, “Did we have all the tools available or did we have to improvise?”, “Did we have enough visibility?”. And if the outage continued over a longer period of time “Was there troubleshooting fatigue?”, “Did we do a good handoff?”. Some of those questions will almost always yield the answer “No, and we should do something about it”. Alerting rules, for example, always have room for improvement.
What Makes For A Good Action Item?
Action items are how you complete the loop – how you give a postmortem teeth, so to speak.
Interestingly, Etsy finds it usually comes down to making more and higher quality information available to those on the scene via metrics, logging, dashboarding, documentation, and error alerts – i.e. building a better guardrail:
There is no need (and almost certainly no time) to go into specifics here. But it should be clear what is worthy of a remediation item and noted as such. Another area that can almost always use some improvement is metrics reporting and documentation. During an outage there was almost certainly someone digging through a log file or introspecting a process on a server who found a very helpful piece of information. Logically, in subsequent incidents this information should be as visible and accessible as possible. So it’s not rare that we end up with a new graph or a new saved search in our log aggregation tool that makes it easier to find that information next time. Once easily accessible, it becomes a resource so anyone can either find out how to fix the same situation or eliminate it as a contributing factor to the current outage.
…this is not about an actor who needs better training, it’s about establishing guardrails through critical knowledge sharing. If we are advocating that people just need better training, we are again putting the onus on the human to just have to know better next time instead of providing helpful tooling to give better information about the situation. By making information accessible the human actor can make informed decisions about what actions to take.
Ben Treynor, the founder of SRE, said the following:
A postmortem without subsequent action is indistinguishable from no postmortem. Therefore, all postmortems which follow a user-affecting outage must have at least one P[01] bug associated with them. I personally review exceptions. There are very few exceptions.
Vague or massive bowling-ball sized to-do’s are to be avoided at all costs; these are often worse than no action item at all. Google and Etsy are both very careful to make sure that action items follow the SMART criteria – actionable, measurable, relevant. In fact, Google has a rule of thumb that any remediation action item should be completed in 30 days or less; if these action items linger past that, they’re revisited and either rewritten, reprioritized, or dropped.
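The 30-day rule of thumb is easy to check mechanically. The sketch below is our own illustration and not Google’s tooling. The `ActionItem` fields and the sample items are hypothetical; a real version would read from your bug tracker.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # the rule of thumb described above


@dataclass
class ActionItem:
    title: str
    opened: date
    done: bool = False


def stale_items(items, today):
    """Return open action items that have lingered longer than MAX_AGE."""
    return [
        item for item in items
        if not item.done and today - item.opened > MAX_AGE
    ]


# Hypothetical action items, for illustration only.
items = [
    ActionItem("Add alert on replication lag", date(2024, 1, 2)),
    ActionItem("Document restore procedure", date(2024, 1, 2), done=True),
    ActionItem("Show hostname in shell prompt", date(2024, 2, 1)),
]

today = date(2024, 2, 10)
for item in stale_items(items, today):
    age = (today - item.opened).days
    print(f"{item.title}: open for {age} days; rewrite, reprioritize, or drop")
```

Only the first item is flagged here. The second is already done and the third is nine days old.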
Completing the Loop
Once the report is written up and finalized – and available to all other incident responders for learning – you’re not quite done yet. Google, for example, tells a story where an engineer who caused a high-impact incident was commended and even given a small cash reward for quick mitigation:
Google’s founders Larry Page and Sergey Brin host TGIF, a weekly all-hands held live at our headquarters in Mountain View, California, and broadcast to Google offices around the world. A 2014 TGIF focused on “The Art of the Postmortem,” which featured SRE discussion of high-impact incidents. One SRE discussed a release he had recently pushed; despite thorough testing, an unexpected interaction inadvertently took down a critical service for four minutes. The incident only lasted four minutes because the SRE had the presence of mind to roll back the change immediately, averting a much longer and larger-scale outage. Not only did this engineer receive two peer bonuses immediately afterward in recognition of his quick and level-headed handling of the incident, but he also received a huge round of applause from the TGIF audience, which included the company’s founders and an audience of Googlers numbering in the thousands. In addition to such a visible forum, Google has an array of internal social networks that drive peer praise toward well-written postmortems and exceptional incident handling. This is one example of many where recognition of these contributions comes from peers, CEOs, and everyone in between.
We’ve seen a couple great examples of companies using the incident report and postmortem process to help with their DR role playing exercises, sharing incident writeups in a monthly newsletter or for group discussions. But visibly rewarding people for doing the right thing – as Google handled the situation above – is about as far as you can get from the “rub the puppy’s nose in it” antipattern. We think you’ll create a safer organization when you foster a postmortem process that encourages sharing information and understanding context – versus naming, shaming, and blaming.
- John Allspaw, “Blameless PostMortems and a Just Culture”, https://codeascraft.com/2012/05/22/blameless-postmortems/. Hands down, the best single article I’ve ever seen written on the subject.
- “Practical Postmortems at Etsy”, Daniel Schauenberg, https://www.infoq.com/articles/postmortems-etsy
- Chapter 15, Postmortem Culture: Learning From Failure (Google SRE book – https://landing.google.com/sre/sre-book/chapters/postmortem-culture/). The discussions on hindsight and outcome bias are particularly valuable. A must-read.
- Great postmortem example: https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ We love the detailed timeline here as a template.
- Sample (bogus!) postmortem entry here: https://landing.google.com/sre/sre-book/chapters/postmortem/. Note the sections on Lessons Learned: What went well, what went wrong, where we got lucky. There’s an extensive timeline and a link to supporting info (i.e. the monitoring dashboard). Impact, summary, root causes, trigger, resolution, detection. And then a list of action items and their status.
- Resilience Engineering, Hollnagel, Woods, Dekker and Cook, https://www.amazon.com/Resilience-Engineering-Concepts-David-Woods-ebook/dp/B07731DD38
- Hollnagel’s talk, On How (Not) To Learn From Accidents – https://www.uis.no/getfile.php/1322751/Konferanser/Presentasjoner/Ulykkesgransking%202010/EH_AcciLearn_short.pdf
- Sidney Dekker, The Field Guide to Understanding Human Error, https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265/
- Morgue software tool – https://github.com/etsy/morgue