Things I Learned From “Ahead In The Cloud”

A few weeks back I finally cracked open a book I’d been meaning to read for some time. Books by Amazon architects are actually quite hard to come by, unlike with Google or Microsoft. This seems to be the best-written outline I’ve seen yet of Amazon’s cloud adoption framework and some of the lessons they’ve learned in doing large-scale enterprise migrations to the cloud. I was quite shocked by some of the conclusions Stephen presented, as they challenged some of my preconceptions around lift-and-shift. I thought I’d write a little about what I learned as a kind of CliffsNotes to encourage you to check out the book.

It may seem a little odd that I, a lifelong Microsoftie, am writing a review about a book so slanted towards AWS. (And it is, sorry! If all you ever read on the cloud was this book, you’d think Amazon has the only cloud platform in existence.) But I believe the principles in this book – especially around the “halo effect” and the importance of training, and the different migration strategies that Amazon has found viable – are not specific to Amazon/AWS. If you’ve ever read Amazon’s leadership principles, they preach about being customer-focused – not competitor focused. One point of view is sideways – and leads to stagnation / me-too type thinking. The second aims to learn, and is forward thinking. Count me in on that second group. I’m an admirer of Amazon – especially with the two-pizza rule and microservices – and I think Stephen Orban has a lot of experience that everyone can learn from.

Book: “Ahead in the Cloud: Best Practices for Navigating the Future of Enterprise IT” by Stephen Orban

Sayings to Live By

  • Everyone you need to move forward with the cloud is already there, you just have to enable them.
  • Practice makes permanent.
  • All of your assumed constraints are debatable.
  • “Reform the environment and not man; being absolutely confident that if you give man the right environment, he will behave favorably.”—BUCKMINSTER FULLER
  • Use your migration as a forcing function to adopt a DevOps culture
  • “You get the culture you pay for.” – Adrian Cockcroft
  • “There’s no compression algorithm for experience.” – Bryan Landerman, Chief Technology Officer, Cox Automotive

Lift and Shift Is Not An Antipattern: Four Different Migration Paths

This is in the book, and it’s by far the best part – Stephen outlines four different paths for a migration, from a lift-and-shift approach to a full-on rearchitecture:


(Figure: the migration paths diagram, courtesy “Ahead In The Cloud”)



Is Lift and Shift a Cop-Out? This was the biggest learning point for me from the book. Before this, I’d always assumed that lift-and-shift was little more than a cop-out. Stephen makes the point that it is often the default, and best, initial choice: Most of our enterprise customers segment their applications into tranches: those that are easy to move to the cloud, those that are medium hard, and those that are hardest. They also segment applications into those that can easily be lifted and shifted into the cloud and those that need to be re-architected before moving to the cloud. … I’ve heard a lot of executives—including myself, before I learned better—suggest that they’re only moving to the cloud if they “do it right,” which usually means migrating to a cloud-native architecture. … I’ll hear from senior executives who don’t want to take any of their legacy workloads to the cloud; instead, they want to focus on developing net new architectures using serverless services like AWS Lambda.

When I was the CIO at Dow Jones several years ago, we initially subscribed to the ivory tower attitude that everything we migrated needed to be re-architected, and we had a relentless focus on automation and cloud-native features. That worked fine until we had to vacate one of our data centers in less than two months. [Stephen points out that you gain a quick budget win this way, “which tends to be in the neighborhood of 30 percent when you’re honest about your on-premises TCO.”] …GE Oil & Gas rehosted hundreds of applications to the cloud as part of a major digital overhaul. In the process, they reduced their TCO by 52 percent. Ben Cabanas, then one of GE’s most forward-thinking technology executives, told me a story that was similar to mine—they initially thought they’d re-architect everything, but soon realized that would take too long, and that they could learn and save a lot by rehosting first. …One customer we worked with rehosted one of its primary customer-facing applications in a few months to achieve a 30 percent TCO reduction, then re-architected to a serverless architecture to gain another 80 percent TCO reduction!

He makes the following arguments around lift-and-shift versus a full-on cloud native approach from the get-go:

  1. Time: rehosting takes a lot less time.
  2. Re-architecture is easier in the cloud: it becomes easier to re-architect and constantly reinvent your applications once they’re running in the cloud. “I believe the ability of these applications to perform and evolve is just as much dependent on their environment as the code or DNA that governs their behavior. The argument I’d like to make here is that the AWS cloud provides a better environment—in terms of size and diversity of services—that is well beyond what most on-premises data centers can provide.” He gives another example of applying Elasticsearch to cheaply add full-text search capabilities without an expensive and risky move to NoSQL clusters.
  3. Performance and Cost Savings: You realize some immediate benefits. Besides budget/TCO (see above), this also means better-performing apps. SSDs are 2-5x faster than spinning disks, for example – so moving a database to SSD-backed instances can yield amazing results for little to no cost. “One customer I know had an application that was in a critical period and realized there were some bad queries causing performance bottlenecks. Changing the code was too risky, so the database server was upped to an X1 instance and then ramped back down to a more reasonable instance size once the critical period was over.” (A rough sketch of that kind of temporary resize follows below.)
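To make that last point concrete, here’s a minimal sketch – my own illustration, not from the book – of temporarily scaling an RDS database up for a critical period and back down afterward with boto3. The instance identifier and instance classes are hypothetical, and an immediate resize does imply a brief restart/failover while it applies.

```python
# Hypothetical sketch: bump an RDS instance class for a critical period, then
# shrink it back down. Identifiers and classes are made up for illustration.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

def resize_db_instance(instance_id: str, instance_class: str) -> None:
    """Request a new instance class and wait until the instance is available again."""
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=instance_class,
        ApplyImmediately=True,  # don't wait for the next maintenance window
    )
    # In practice you may want to confirm the modification has started before waiting.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)

# Scale up ahead of the critical period, scale back down once it has passed.
resize_db_instance("orders-db", "db.x1.16xlarge")
# ... critical period (e.g. end-of-year billing run) runs here ...
resize_db_instance("orders-db", "db.r5.large")
```

The pattern – pay for headroom only while you need it – is exactly the kind of quick win Stephen describes getting from a plain rehost, with no code changes at all.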

Pilots and experimentation

Give your teams a hands-on, time-constrained opportunity to do something meaningful to your business with the cloud, and see what happens. Ask them to build a website, create an API for some of your data, host a wiki, or build something else tangible that fits into what your teams already do. I’m always surprised by how quickly the right motivation mixed with a bit of time pressure can lead to results. Scarcity breeds invention.

Innovation comes from experimentation, and because the cloud takes away the need for large up-front investments to try new things, there is nothing holding your team back from creating the next disruptive product in your industry. Give your team some freedom to implement existing projects in new ways.

Generally speaking, I like to see organizations start with a project that they can get results from in a few weeks. …What I’ve found most important is that organizations pick something that will deliver value to the business, but something that isn’t so important that there’s no appetite for learning. The first engineering team you put together should consist of a thorough mix of core skills—Network, Database, Linux Server, Application, Automation, Storage, and Security. The team will make some progress. It will probably look at tools like Terraform and others. It will also write some AWS CloudFormation code. The team will make mistakes. All of this is perfectly natural.
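As a flavor of what that first pilot might look like, here’s a small, hypothetical sketch (mine, not the book’s) of a team standing up something tangible – a single S3 bucket – as a CloudFormation stack driven from Python. The stack name and template are illustrative assumptions; a real pilot would grow from here.

```python
# Hypothetical pilot: provision a small piece of infrastructure (one S3 bucket)
# as a CloudFormation stack, so the team practices infrastructure-as-code from day one.
import json
import boto3

TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "PilotBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
        }
    },
    "Outputs": {"BucketName": {"Value": {"Ref": "PilotBucket"}}},
}

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="cloud-pilot",                 # illustrative stack name
    TemplateBody=json.dumps(TEMPLATE),
)
cfn.get_waiter("stack_create_complete").wait(StackName="cloud-pilot")
print("Pilot stack created")
```

Whether the team reaches for CloudFormation, Terraform, or something else matters less than the habit: everything the pilot builds is described in code and can be torn down and rebuilt at will.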

You don’t need access to capital to experiment. Throughout my career, I’ve spent countless hours trying to justify the ROI on a capital investment for resources I thought were needed for a potential product. I was rarely, if ever, able to get capacity planning right, and almost always overbuilt my infrastructure. In a few cases, it took my team longer to justify an investment than it took to build the first version of the product.

…don’t experiment too early in your journey with a project where your stakeholders demand a specific outcome. I wouldn’t advise that you start experimenting with your end-of-year billing run, for instance. A CEO I once worked for told me that it’s okay to fail, except when it isn’t. Be satisfied with incremental progress and slowly increase the number of experiments you run, but don’t outpace the organization.

Make Sure It’s Measurable: DON’T pursue an experiment until you know how to measure it. You want to spend time on the right experiments and ensure the lessons… Mature DevOps organizations also develop A/B testing frameworks that allow them to experiment on slightly different user experiences with different user cohorts in parallel to see what works best. In my brief tenure so far at Amazon, I’ve found that anyone able to think through and articulate an experiment in writing typically gets the opportunity to try it. This is a special part of our culture and a great tool for attracting and retaining innovators and builders.

Audits, ITSM and Security: Your friend, not the enemy?

Stephen points out that our old friends ITIL, ITSM, etc. are truly “old” friends – developed in a previous era to standardize the way IT operates in large enterprises. They made sense at the time, but haven’t aged well in the era of scalable resources. (They may be good at controlling costs, but is it worth it if it takes weeks to get a firewall port opened for a resource that can be spun up on demand in seconds?)

This echoes the DevSecOps movement / “Shift Left” on security, but he makes a good point:

“Audits are your friend, not your enemy. Use them to educate everyone that you’re better off with the new rules that you’re making and get feedback. Collaborate with your auditors early and often, and explain what you’re trying to accomplish. Get their input and I’m sure they’ll improve your thinking and your results… Once we illustrated that our controls were greatly improved because of the new rules we were employing around automation, our auditors became more comfortable with our future direction. By showing them early that we no longer had ownership spread across siloed teams sitting next to one another but communicating through tickets, and that the opportunity for human mistakes was much less”, resistance dropped.

One of the key points that Stephen makes is that automation can be applied to these (formerly late-stage) audit steps as well. If compliance rules are applied to infrastructure as code, “the compliance team can validate legal and security requirements every time the system is changed, rather than relying on a periodic system review”.
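Here’s a minimal sketch of what that could look like in practice – my own illustration, not something from the book: a small check run in CI against a CloudFormation template on every change, failing the build when an example policy (encryption at rest, no SSH open to the world) is violated. The rules and file layout are assumptions.

```python
# Illustrative compliance-as-code check, run on every change (e.g. as a CI step)
# against a CloudFormation template, instead of a periodic manual review.
import json
import sys

def check_template(template: dict) -> list[str]:
    """Return a list of policy violations found in the template (example rules only)."""
    findings = []
    for name, resource in template.get("Resources", {}).items():
        props = resource.get("Properties", {})
        if resource.get("Type") == "AWS::S3::Bucket" and "BucketEncryption" not in props:
            findings.append(f"{name}: S3 bucket has no default encryption")
        if resource.get("Type") == "AWS::EC2::SecurityGroup":
            for rule in props.get("SecurityGroupIngress", []):
                if rule.get("CidrIp") == "0.0.0.0/0" and rule.get("FromPort") == 22:
                    findings.append(f"{name}: SSH is open to the world")
    return findings

if __name__ == "__main__":
    with open(sys.argv[1]) as f:              # path to a CloudFormation JSON template
        violations = check_template(json.load(f))
    for violation in violations:
        print("FAIL:", violation)
    sys.exit(1 if violations else 0)          # a non-zero exit fails the pipeline
```

Managed services such as AWS Config can do the same kind of continuous evaluation against running resources; either way, the audit moves from a calendar event to a gate on every change.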

Do You Have a Cloud Center of Excellence?

Stephen wrote that one of the best decisions he made at Dow Jones was creating a CCoE to codify how their cloud strategy would work and be executed across the org. Here are some points around creating a CCoE and making sure it doesn’t become more of a hindrance than a help:

  • Makeup and where to start: I recommend putting together a team of three to five people from a diverse set of professional backgrounds. Try to find developers, system administrators, network engineers, IT operations, and database administrators. These people should ideally be open-minded and eager about how they can leverage modern technology and cloud services to do their jobs differently and take your organization into the future. …Start with the basics: roles and permissions, cost governance, monitoring, incident management, a hybrid architecture, and a security model. Over time, these responsibilities will evolve to include things like multi-account management, managing “golden” images, asset management, business unit chargebacks, and reusable reference architectures.
  • And make it metrics-oriented:
    Organizations that do this well set metrics or KPIs for the CCoE and measure progress against them. I’ve seen metrics range from IT resource utilization, to the number of releases each day/week/month as a sign of increasing agility, to the number of projects the CCoE is influencing. Couple these with a customer-service centric approach, and other business units will want to work with your CCoE because they find value and because the CCoE is a pleasure to work with.
  • Reference architecture: How can you build security and governance into your environment from the very beginning, and rely on automation to keep it up to date? If you can find and define commonalities in the tools and approaches you use across your applications you can begin to automate the installation, patching, and governance of them. You may want one reference architecture across the whole enterprise that still gives business units flexibility to add in what they need in an automated way. Alternatively, you might want multiple reference architectures for different classes or tiers of applications.
  • Start small:
    I encourage companies wanting to shift to a DevOps culture to do so in a DevOps fashion—start with small projects, iterate, learn, and improve. I encourage them to consider implementing strategies that produce commonly accepted practices across the organization, and to begin embracing the idea that, when automated, ongoing operations can be decentralized and trusted in the hands of many teams that will run what they build.
  • Don’t make the CCOE another stage gate:
    Since developers will be the ones most intimately familiar with the nuances of the system, they will likely be able to address issues the fastest. And by using automation, it is easy to methodically propagate changes and roll back or address issues before they impact customers. I encourage centralized DevOps teams to do what they can to make development teams increasingly independent, and not be in the critical path for ongoing operations/releases. …Instead of saying, “You can’t use that to do your job,” ask “What are you trying to accomplish and how can I help you be more effective?” Every time an app team implements a workaround for something the DevOps team can’t deliver, there’s an opportunity for the organization to learn how and why that happened, and decide if they should do things differently moving forward.
  • Product ownership is the end game: Ownership simply means that any individual responsible for a product or service should treat that product or service as his or her own business. Products and services can take any number of forms: a website, a mobile application, the company’s e-mail service, desktop support, a security tool, a CMS, or anything that you deliver to your customer. …I try to encourage executives to make run what you build a crucial tenet.
  • Why clearly defined roles are important: Our programs and teams have a culture that establishes tenets to help guide decisions and provide focus and priorities specific to their area. My recommendation is to define a set of cloud tenets to help guide you to the decisions that make the most sense for your organization. As one of my colleagues at AWS says, “Tenets get everyone in agreement about critical questions that can’t be verified factually.” For example—do you want application teams to have free rein and control over all the services available in AWS, or should you enforce service standards or provide additional control planes on top of AWS? … First, we broke down the silos by defining a clear IT purpose. Then we thought about the main functions needed to reach our purpose. From there we turned each function into a group by defining the group’s purpose, the group’s domains (what the group owns), and the group’s accountabilities. The next step was to break each group into sub-groups and roles which are needed to reach that group’s purpose. For every sub-group and role, we defined their purpose, domains and accountabilities, and so on.

The Disruptive Power of the Cloud, and Avoiding Lock-In

Stephen talks at length about the power of the cloud as a disruptive force, which he defines as “the on-demand delivery of information technology (IT) resources via the internet with pay-as-you-go pricing”. He mentions that since the inception of the Fortune 500 in 1955, between 20 and 50 companies fall off the list each year. Advances in technology are largely behind this steady rate of turnover, with the cloud being the most recent cause of large-scale disruption.

I particularly enjoyed the disclosure one competitor shared around FUD and vendor lock-in:

“The only way we can salvage our market share for now is to fuel [fear] because the hard truth is that we simply do not have the arsenal to counter AWS’s dominance. More importantly, we constantly bombard these messages (vendor lock-in, security, et al) with the operational executives that are still (a vast majority in large enterprises) stuck in the traditional IT thinking and their existence threatened by the cloud wave.”

Having worked for many years at organizations that would take months to (badly) implement new infrastructure, I can only agree; faced with nimble cloud competitors, we constantly got static about security and lock-in. In the age of serverless and IaC, it seems like such an anachronism.

That being said, I share some of the concerns around vendor lock-in. Stephen tries to dismiss this, somewhat glibly, countering that well-automated systems are ultimately portable:

What scares me is when companies fall into the trap of trying to architect a single application to work across multiple different cloud providers. I understand why engineers are attracted to this—it is quite an accomplishment to engineer the glue that is required to make different clouds work together. Unfortunately, this effort eats into the productivity gains that compelled the organization to the cloud in the first place… Companies that architect their applications using known automation techniques will be able to reliably reproduce their environments. This best practice is what enables them to take advantage of the elastic properties of the cloud, and will decouple the application from the infrastructure. If done well, it becomes less of a burden to move to a different cloud provider if there is a compelling reason to do so.

Using the Cloud To Fuel Innovation: (from Mark Schwartz): Most enterprises have not optimized for agility. If anything, they have optimized for efficiency – for doing what they do at the lowest cost. … I came to realize that a private cloud is not really a cloud at all, and it certainly is not a good use of company resources. One customer we work with, for example, has developed a business case around developer productivity. The customer (rightfully) believes that by migrating its data centers to AWS, and training its developers in the process, each of its 2,000 developers will be 50 percent more productive than they are today. Driven by the elimination of wait time for infrastructure provisioning—and access to more than 80 services they’d otherwise have to build/procure individually—this productivity boost will lead to an additional 1,000 years of developer capacity…each year. The customer intends to use this additional productivity to fund 100 new projects of 10 people each in an effort to find net new growth opportunities. …We’ve found that as much as 10 percent (I’ve seen 20 percent) of an enterprise IT portfolio is no longer useful, and can simply be turned off.

The real goal of your “digital transformation” – which he says is total BS! – “is not about a transformation that has a finite end state. It’s about becoming an organization that is capable of quickly deploying technology to meet business needs, regardless of where the technology comes from.”

Don’t make a common tenet-writing mistake—creating a tenet that applies to many projects and communicates virtually no information, such as, “We will have world-class cloud capabilities.” Instead, be specific – if the pain point is the ability to provision and manage cloud services as fast as consuming the “native” platform directly, try “Provision as fast as with a credit card.” Give your app teams the control and ability to consume cloud services without artificial barriers.

Leadership Is the Differentiator

From Andy Jassy (the CEO of AWS): “the single biggest differentiator between those who talk a lot about the cloud and those who have actual success is the senior leadership team’s conviction that they want to take the organization to the cloud.” He mentions the example of Jamie Miller, who in her cloud migration kickoff announced that GE was going to move 50 apps to AWS over 30 days. This was disruptive, and the team didn’t meet that aggressive goal initially – but it ended up working.

“… In my experience, it can be fatal if you don’t have the support of a single-threaded executive leader during the transition. This leadership function simply can’t be delegated. The CIO, or, at the very least, a direct report of the CIO has to lead this effort and be visible each and every day to provide direction and remove roadblocks.”

Leadership Beyond Memos: Early in my executive career, I was somewhat naive in thinking that, just because I issued a department-wide directive, everyone’s actions would follow. It wasn’t until I identified the things that were really important and communicated them over and over and over again that they started to stick. … I learned the hard way that this is, of course, not how leadership works. It wasn’t until I started to clearly articulate what was important about our strategy that the behavior of my team started to change. Before presenting a new idea or goal to my team, I had to consider how everyone fit into this strategy and how it tied back to the business and everyone’s careers. Then, I had to capitalize on every opportunity to reinforce these points. This meant talking strategy at quarterly town halls, on internal blogs, during sprint planning sessions, and using every meeting as an opportunity to relate the work being discussed back to our strategy. Sometimes it felt redundant, but the bigger your team is, the less likely each individual regularly hears from you. Remaining determined and being consistent with your communication is key.

Take Into Consideration the Audience: Stephen talks about the different motives/backgrounds of the roles you’ll be engaging with:

  • CFOs are typically attracted to lower up-front costs and the ability to pay only for what you use.
  • CMOs are typically looking to keep the company’s brand fresh and respond to changing market conditions.
  • VPs of HR will want to see that you’re looking after your staff properly and how you’re hiring for new skills.

…most of the hard work at the executive level revolved around understanding each executive’s pain points, what they wanted to get out of IT, and aligning technology to help them meet their goals. After a few months of using the cloud to deliver better results faster, we spent several months retraining the executive team and their departments to refer to us as technology instead of IT. ….I set a goal to take a few executives out for a meal each month. During the time we spent together, I did nothing but listen to their frustrations. I used what I learned to adjust our strategy, and made sure that I communicated back to them how their influence altered our direction.

Technology is Not A Cost Center: I’d argue that today’s IT executive needs to play the role of the Chief Change Management Officer (which I’ll refer to as a CCMO). Technology can no longer be viewed as something that simply supports the business. … a great way for leaders to address this friction is to give everyone on the team clarity around what will happen with their roles. …the role of the CIO and central IT is moving away from command and control, and toward line-of-business enablement. I’m also seeing some organizations … which have taken this one step further in a move toward complete decentralization, where culture and best practice serve as the forcing function that allows teams to operate independently. This trend—trading off consistency for time-to-market—is an important one.

Mainframe and Legacy Systems

The mainframe is often cited as a central point of gravity that stalls or elongates a large cloud migration. Many enterprises feel that they have few people who have the domain and technology expertise to execute a mainframe migration, and those who are still working on the mainframe can be harder to motivate for a cloud migration (though I do believe you already have the people you need). …There are three main approaches to mainframe migrations that we see customers exploring—re-hosting of workloads, batch-job migration and full re-engineering.

Metrics and KPIs

A self-motivated, self-grading team: We gave a fixed amount of resources to each line of business and held them accountable to key performance indicators (KPIs) that they set for themselves. Each technology and business owner overseeing a line of business had the ability to move resources around as their customer demand shifted, and we (the leadership team) reviewed KPIs and allocations quarterly to make any necessary changes. …These changes were hard, and there were times when I questioned our approach, thought I’d be fired, or otherwise just thought it would be easier to give up. We were constantly faced with judgment calls that we had to make with incomplete information and unknown risks.

The True Benefits Behind Microservices

….distilled down, it’s clear that their primary benefits are independent deployment and scalability. …An essential area of microservices that’s generally misunderstood is the independence aspect, or how solid the boundaries should actually be between other microservices. … the ramifications can be quite expensive from an infrastructure perspective. (he would commonly ask) – “if I whack your database, how many other teams or microservices will be impacted?”

Why Education Shouldn’t Come Last: The Halo Effect

…I’ve found that finding the ones who aren’t afraid to lead the way (attitude is just as important as aptitude, in many cases) and investing in training and enablement for everyone can be among the most effective ways to get people over their fears. …having access to a seemingly infinite amount of on-demand IT resources can change the game for any organization, provided the culture promotes the use of it. Failure is a lot less expensive when you can just spin down what didn’t work. Educating your staff can turn skeptical employees into true believers, and will make a huge difference in how quickly you’re able to leverage the cloud to deliver results to your business.

Mental health specialists say that acceptance is the first step toward recovery; and that’s totally applicable here, too. Your engineers must accept the fact that they have the ability to learn AWS cloud skills and become experts. It’s also incredibly important for technology leaders within your organization to accept this. As Stephen Orban explains, and as my tenure at Capital One shows, the talent you already have is the talent you need. These are the people who have many years of critical experience developing and running your existing systems.

Reaching Critical Mass: Experience at Capital One and with many of our customers—plus scientific study—has shown that you need to reach a critical mass of 10 percent of engineers advocating a platform before the network effect takes hold. So, scaling this learning and certification to 10 percent of your engineers is a major milestone in your journey. From here onward you get a compelling Halo Effect which starts to influence how your company is seen externally and not just internally. Engineers outside your organization who only want to work for cloud-native companies will start seriously considering working for you.

Adrian Cockcroft: An executive once told me “We can’t copy Netflix because we don’t have the people.” My response was “Where do you think they used to work? We hired them from you and got out of their way…”



Postmortems with Teeth… But No Bite!

Jane Miceli of Micron and I are doing a presentation on “Postmortems With Teeth … But No Bite!” at DevOps Days in Boise. We wanted to share an article that can go into more detail than we’ll be able to fit into our 30-minute window. Enjoy!


It’s been said that a person’s character is revealed when things go wrong. So when things go wrong at your enterprise – what happens? What kind of character does your company show when the chips are down? 

We’re guessing one of two things happens. First is the “outage? What outage?” type of response. It’s possible that your company has NO postmortem process; when failure happens, there might be a few words, but it’s kept informal, within the family. That’s a big mistake, for reasons we’ll go into below. The second and most common is the “rub the puppy’s nose in it” response – where the bad employee(s) who triggered the outage are named, shamed, and blamed. We’d like to spend a few minutes on why both of these common reactions are so harmful, and set you up for better success with a proven antidote – the blameless postmortem.

Why We Need Postmortems 

[Dave] I tell the story in my book about when I was working for an insurance company. On my way in to work, I stopped by to grab a coffee and a donut (OK, several donuts!) and took a glance at the Oregonian newspaper. I almost spit out my coffee, right there at the counter. There, at the top of the front page, was my company – right where we did NOT want to be. Someone had sent out a mailer, and it had included personal information (names, addresses, DOB, SS#). Worse, many of these mailers ended up in the wrong subscriber’s hands. It was a massive data leak, and there was no place for us to hide from it. I knew the team that had made this mistake – I even knew who’d sent out the mailer. Hmm, I thought, as I headed into the office. We’ve got a long week of damage control ahead of us. I wonder what’s going to happen to Bobby? 

And that’s the interesting part. Nothing happened. There were a few high-level meetings with executives – no engineers or operators allowed in the room of course – on how to best position us and recover from the PR hits we were taking. But while nothing happened to Bobby – which was a good thing, he was just tired and had made a mistake – we didn’t learn anything from it either. No report, no knowledgebase article – it was like nothing had happened. It was only a matter of time until a tired operator triggered yet another leak of sensitive information.

This type of reaction is understandable, and it’s rooted deep in our psychology. None of us likes to look too closely at our failures or mistakes. But without understanding that mistakes and errors are a normal part of any complex system, we’re missing out on a huge opportunity to learn. And you could make a strong argument that without a postmortem process, any DevOps process is handcuffed. Winning companies that we admire – names like Amazon, Google, Etsy – all make the same mistakes that other companies make. There’s a critical difference though in how they learn from those mistakes, and how they view them.  

Why We Need BLAMELESS Postmortems 

A blameless postmortem focuses on identifying contributing causes of an incident, without calling out any particular individual or team for being “bad” or handling things incompetently. It assumes good intentions and that everyone acted in the proper way – given the information, capabilities and processes available at the time. By investigating more into the context behind a failure – what caused that operator to make that decision at 1:30 in the morning? – we can create safer processes.

And it’s a critical part of several companies’ DevOps implementations. Google, for example, views blameless postmortems as a critical part of their culture – so much so that both the excellent “Site Reliability Engineering” and the SRE Handbook have entire chapters on it. Etsy in particular has made some very profound statements on blameless postmortems:

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!” …Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event… 

Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every “mistake” is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems. 

…We believe that this detail is paramount to improving safety at Etsy. …If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea there has to be some fear that not doing one’s job correctly could lead to punishment. Because the fear of punishment will motivate people to act correctly in the future. Right? 

There’s a great book called “Barriers and Accident Prevention” by Erik Hollnagel that deserves more reading than it gets. In it, Hollnagel argues that the “Bad Apple” theory above – that if we punish or remove the “bad apples” causing these failures, we’ll improve safety – is fundamentally flawed because it assumes bad motives or incompetence:

We must strive to understand that accidents don’t happen because people gamble and lose. 
Accidents happen because the person believes that: 
…what is about to happen is not possible, 
…or what is about to happen has no connection to what they are doing, 
…or that the possibility of getting the intended outcome is well worth whatever risk there is. 

Accidents Are Emergent; Accidents Are Normal 

The root fallacy here is thinking that accidents are abnormal or an anomaly. Accidents or mistakes are instead a byproduct; they are emergent, a consequence of change and the normal adjustments associated with complex systems. This is the true genius behind the SRE movement begun by Google; instead of striving for the impossible (Zero Defect meetings! Long inquisitor-type sessions to determine who is at fault and administer punishment over any failure!) – they say that errors and mistakes are going to happen, and they are going to result in downtime. Now, how much is acceptable to our business stakeholders? The more downtime (mistakes) we allow – as a byproduct of change – the faster we can innovate. But those extra few 9s of availability – if the business insists on them – mean a dramatic slowdown to any change, because any change to a complex system carries the risk of unintended side effects.

I’m turning to John Allspaw again as his blog post is (still) unequalled on the topic: 

Of course, for all this, it is also important to mention that no matter how hard we try, this incident will happen again, we cannot prevent the future from happening. What we can do is prepare: make sure we have better tools, more (helpful) information, and a better understanding of our systems next time this happens. Emphasizing this often helps people keep the right priorities top of mind during the meeting, rather than rushing to remediation items and looking for that “one fix that will prevent this from happening next time”. It also puts the focus on thinking about what tools and information would be helpful to have available next time and leads to a more flourishing discussion, instead of the usual feeling of “well we got our fix, we are done now”. 

…We want the engineer who has made an error give details about why (either explicitly or implicitly) he or she did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place. 

So, good postmortems don’t stop at blaming the silly / incompetent / dangerous humans; they recognize that mistakes and disasters are a normal part of doing business. Our job is to collect as much information as possible so we can provide it to the people who need it the next time that combination of events takes place, shortening the recovery cycle.

I remember saying this when I was at Columbia Sportswear, long before I knew what a blameless postmortem was, when something went awry: “I’m OK with making mistakes. I just want to make new and different mistakes.”  

Stopping At Human Causes Is Lazy 

During the postmortem process, the facilitator helps the team drill down a little deeper behind human error: 

… As we go along the logs, the facilitator looks out for so-called second stories – things that aren’t obvious from the log context, things people have thought about, that prompted them to say what they did, even things they didn’t say. Anything that could give us a better understanding of what people were doing at the time – what they tried and what worked. The idea here being again that we want to get a complete picture of the past and focusing only on what you can see when you follow the logs gives us an impression of a linear causal chain of events that does not reflect the reality. 

Etsy didn’t invent that; this comes from the great book “Behind Human Error” by David Woods and Sidney Dekker, which distinguishes between the obvious (human) culprits and the elusive “second story” – what caused the humans involved to make a mistake:

First Stories vs. Second Stories:

  • First story: Human error is seen as cause of failure. Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First story: Saying what people should have done is a satisfying way to describe failure. Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First story: Telling people to be more careful will make the problem go away. Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

The other giant in the field is Sidney Dekker, who calls processes that stop at human error the “Bad Apple Theory”. The thinking goes that if we get rid of the bad apples, we’ll get rid of human-triggered errors. This type of thinking is seductive, tempting. But it simply does not go far enough, and will end up encouraging less transparency. Engineers will stop trusting management, and the flow of information upwards will dry up. Systems will become harder to manage and less stable as less information is shared even within teams. Lacking an understanding of the context behind how an incident occurred practically guarantees a repeat incident.

There Is No Root Cause (The Problem With The Five Whys) 

Reading accounts about any disaster – the 1996 Everest disaster that claimed 8 lives, the Chernobyl disaster, even the Challenger explosion – there is never one single root cause. Almost always, it’s a chain of events – as Richard Cook put it, failures in complex systems require multiple contributing causes, each necessary but only jointly sufficient. 

This goes against our instincts as engineers and architects, who are used to reducing complex problems down as much as possible. A single, easily avoidable root cause is comforting – we’ve plugged the mouse hole, that won’t happen again. Whew – all done! But complex systems can’t be represented as a cherry-picked list of events, a chain of dominoes; pretending otherwise means we trick ourselves into a false sense of security and miss the real lessons.  

The SRE movement is very careful not to stop at human error; it’s also careful not to stop at a single root cause, which is what the famous “Five Whys” linear drill-down encouraged by Toyota promotes. As the original SRE book put it:

This is why we focus not on the action itself – which is most often the most prominent thing people point to as the cause – but on exploring the conditions and context that influenced decisions and actions. After all there is no root cause. We are trying to reconstruct the past as close to what really happened as possible. 

Who Needs To Be In The Room? 

Well, you’re going to want to have at least a few people there: 

  • The engineer(s) / personnel most directly involved in the incident 
  • A facilitator 
  • On-call staff or anyone else that can help with gathering information 
  • Stakeholders and business partners 

Why the engineers/operators involved? We mentioned a little earlier the antipattern of business- or executive-only discussions. You want to have the people closest to the incident telling the story as it happens. And, this just happens to be the biggest counter to that “lack of accountability” static you are likely to get. John Allspaw put it best: 

A funny thing happens when engineers make mistakes and feel safe when giving details about it: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items. So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.  

…Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures. We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future. 

Why a facilitator? This is a “playground umpire”, someone who enforces the rules of behavior. This person’s job is to keep the discussion within bounds.  

The Google SRE book goes into the psychology behind disasters and the role of language in great detail. But you’re going to want to eliminate the use of counterfactuals: the belief that if only we had known, had done that one thing differently, the incident would not have happened – the domino theory. Etsy is very careful to have the facilitator watch for any use of the phrases “would have”, “should have”, etc. in writeups and retrospectives:

Common phrases that indicate counterfactuals are “they should have”, “she failed to”, “he could have” and others that talk about a reality that didn’t actually happen. Remember that in a debriefing we want to learn what happened and how we can supply more guardrails, tools, and resources next time a person is in this situation. If we discuss things that didn’t happen, we are basing our discussion on a reality that doesn’t exist and are trying to fix things that aren’t a problem. We all are continuously drawn to that one single explanation that perfectly lays out how everything works in our complex systems. The belief that someone just did that one thing differently, everything would have been fine. It’s so tempting. But it’s not the reality. The past is not a linear sequence of events, it’s not a domino setup where you can take one away and the whole thing stops from unraveling. We are trying to make sense of the past and reconstruct as much as possible from memory and evidence we have. And if we want to get it right, we have to focus on what really happened and that includes watching out for counterfactuals that are describing an alternative reality. 

Interestingly enough, it’s usually the main participants that are the most prone to falling into this coulda-shoulda-woulda type thinking. It’s the facilitator’s job to keep the discussion within bounds and prevent accusations / self-immolation.  
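You can even put a little tooling behind this. The sketch below is our own illustration – not an Etsy or Google tool – of a tiny script a facilitator might run over a draft writeup to flag counterfactual phrases before the review, so they can be reworked into statements about what actually happened.

```python
# Illustrative helper: flag counterfactual language ("should have", "failed to", ...)
# in a draft postmortem so the facilitator can rework it before the review.
import re
import sys

COUNTERFACTUAL_PHRASES = [
    r"\bshould have\b", r"\bcould have\b", r"\bwould have\b",
    r"\bfailed to\b", r"\bif only\b",
]
PATTERN = re.compile("|".join(COUNTERFACTUAL_PHRASES), re.IGNORECASE)

def flag_counterfactuals(path: str) -> int:
    """Print each line containing a counterfactual phrase; return the number found."""
    hits = 0
    with open(path) as draft:
        for lineno, line in enumerate(draft, start=1):
            match = PATTERN.search(line)
            if match:
                hits += 1
                print(f"line {lineno}: '{match.group(0)}' -> {line.strip()}")
    return hits

if __name__ == "__main__":
    # Usage: python flag_counterfactuals.py postmortem-draft.txt
    sys.exit(1 if flag_counterfactuals(sys.argv[1]) else 0)
```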

How To Do Blameless Postmortems Right 

There are two great postmortem examples we often point to: the first is found in both the SRE books (see the Appendix). The second – which Jane often uses – was a very prominent outage at GitLab, found here.

A great writeup like this doesn’t come from nowhere. Likely, the teams shared a draft internally – and even had it vetted for completeness by some senior architects/engineers. The reviewers will want to make sure that the account has a detailed timeline, showing the actions taken and what expectations and assumptions were made. They’ll also want to make sure the root cause analysis goes deep enough, that information was broadcast appropriately, and that the action items are complete and prioritized correctly.

If you have an hour-long postmortem review, you may spend more than half of that time going over the timeline. That seems like an absurd waste of time, but don’t skip it. During a stressful event, it’s easy to misremember or omit facts. If the timeline isn’t as close as possible to what actually happened, you won’t end up with the right remediation steps. And it may also expose gaps in your logging and telemetry.

Once the timeline is set, it’s time to drill down a little deeper. Google keeps the discussion informal but always aimed at uncovering the Second Story: 

This discussion doesn’t follow a strict format but is guided by questions that can be especially helpful, including: “Did we detect something was wrong properly/fast enough?”, “Did we notify our customers, support people, users appropriately?”, “Was there any cleanup to do?”, “Did we have all the tools available or did we have to improvise?”, “Did we have enough visibility?”. And if the outage continued over a longer period of time “Was there troubleshooting fatigue?”, “Did we do a good handoff?”. Some of those questions will almost always yield the answer “No, and we should do something about it”. Alerting rules, for example, always have room for improvement. 

The Postmortem Playbook 

[Jane] When I started to be on call, I had a lot of questions – especially once the adrenaline rush of an incident was over, with the mounting paperwork still to come. And even worse, now I got to pick apart everything I’d done in front of an audience. Once an incident happened, I asked my manager and teammates a lot of questions. How does one really facilitate a good retrospective? What exactly does it look like?

What template do you use? 

I usually use some form of a template here. How much of it I keep often depends on the company’s or team’s tolerance for paperwork. At a minimum, I keep the intro pieces, timeline, customer impact, and action items.
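As a concrete (and purely illustrative) example, here’s a tiny generator for the kind of bare-bones skeleton Jane describes – intro, timeline, customer impact, and action items. The section names and fields are assumptions; trim or extend it to match whatever your team will actually tolerate.

```python
# Illustrative postmortem skeleton generator covering the minimum sections
# mentioned above; adapt the headings to your own template.
POSTMORTEM_SKELETON = """\
Post Mortem: {title}
Date of incident: {date}    Facilitator: {facilitator}

Summary / Intro:
  (What happened, in two or three sentences.)

Customer Impact:
  (Who was affected, for how long, and how badly.)

Timeline (all times in UTC):
  - HH:MM  event / action / source

What Went Well / Where We Got Lucky:

Contributing Factors:

Action Items (owner, priority, due date):
  -
"""

def new_postmortem(title: str, date: str, facilitator: str) -> str:
    return POSTMORTEM_SKELETON.format(title=title, date=date, facilitator=facilitator)

if __name__ == "__main__":
    print(new_postmortem("HelloWorld Outage", "2019-01-01", "Jane"))
```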

How do you start? What words do you choose? 

Here is exactly what to say at the beginning to set the expectations and rules of engagement: 

“Hi All. Thank you for coming. We’re here to have a post mortem on <Title>. This is a blameless retrospective. This isn’t a meeting to assign blame and find the next scapegoat. We want to learn something. That means we aren’t focused on what we could’ve/should’ve/would’ve done. We use neutral language instead of inflammatory language. We focus on facts instead of emotions, intent or neglect. All follow-up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow-up item. We want to capture both things we need to change, and what new genius ways we’ve stumbled upon. We even want to capture where we’ve been lucky. Our agenda is to understand these working agreements, customer impact, focus on the timeline, contributing factors to failure and action items. Everyone is expected to update documentation and participate. We value transparency, and this will be published, without individual names of course. Let’s get started….”

What does your meeting invite look like? 

Title: “Post Mortem for Incident 2019 Jan 1 at 7 UTC” or “Post Mortem for Incident 2019 Jan 1 HelloWorld Outage”

What’s in the body of the message? 


Let’s have a phone call on the retrospective related to the <Incident Title used in Subject>. 

Please forward as you see appropriate.  

Prep work should be added and filled out before the start of the retrospective [here|link] 

  • Read through the post mortem 
  • Please help add timeline details and events. Sources for timeline artifacts may be phone calls, email, texts, chats, alerts, support desk tickets, etc., with all times converted to UTC
  • Proposed action items to take 

  1. This is a blameless retrospective.
  2. We will not focus on the past events as they pertain to “could’ve”, “should’ve”, etc.
  3. All follow-up action items will be assigned to a team/individual before the end of the meeting. If the item is not going to be top priority leaving the meeting, don’t make it a follow-up item.

<Information for conference bridge> 

When is it scheduled? 

Within 2-3 business days of the end of the incident. 

What prework/homework do I do? 

As the person who was on call, immediately capture all logs – bridge logs, call/page records, alert detection times, escalation times, actions taken and the time of each action, system logs, chat logs, etc. – and put them into a timeline. There may be some time conversions to a standard date and time format for your timeline; put it all in UTC as a standard. Not all information is relevant, but it’s useful to have if called upon to add to the timeline.
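The UTC normalization is tedious but mechanical. Here’s a small illustrative sketch – ours, not part of any official playbook – of converting timestamps gathered from different sources and time zones into a single UTC format for the timeline; the formats and zones shown are assumptions.

```python
# Illustrative sketch: normalize timestamps from different sources/time zones
# into UTC for the incident timeline. Requires Python 3.9+ (zoneinfo).
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(timestamp: str, fmt: str, tz_name: str) -> str:
    """Parse a local timestamp string and render it in UTC."""
    local = datetime.strptime(timestamp, fmt).replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(ZoneInfo("UTC")).strftime("%Y-%m-%d %H:%M UTC")

# Example: a pager entry recorded in Mountain time and a chat log in Pacific time
# both land at the same point on the UTC timeline.
print(to_utc("2019-01-01 00:07", "%Y-%m-%d %H:%M", "America/Boise"))        # 2019-01-01 07:07 UTC
print(to_utc("2018-12-31 23:07", "%Y-%m-%d %H:%M", "America/Los_Angeles"))  # 2019-01-01 07:07 UTC
```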

What are the facilitator’s objectives?

  • Set expectations of blameless retrospective. 
  • Talk about impact to customers/partners. 
  • Present timeline and walk through it. 
  • Get agreement on timeline. 
  • Talk about what went well. 
  • Get agreement on action items. 
  • Assign action items to people/teams.  
  • Keep the playground fair. Do not allow a blame/shame statement to stand. 

What’s the follow up for the facilitator? 

Publish the report per your company’s policies, choosing between maximum and least privilege for access based on the context.

Send report and items to customers. 

Make sure it’s logged in the post mortem log history. Do not create a blamebase. 

Update with links to features/bugs/user stories for traceability and transparency.

What Makes For A Good Action Item?  

Action items are how you complete the loop – how you give a postmortem teeth, so to speak.  

Interestingly, Etsy finds it usually comes down to making more and higher-quality information available to those on the scene via metrics, logging, dashboarding, documentation, and error alerts – i.e. building a better guardrail:

There is no need (and almost certainly no time) to go into specifics here. But it should be clear what is worthy of a remediation item and noted as such. Another area that can almost always use some improvement is metrics reporting and documentation. During an outage there was almost certainly someone digging through a log file or introspecting a process on a server who found a very helpful piece of information. Logically, in subsequent incidents this information should be as visible and accessible as possible. So it’s not rare that we end up with a new graph or a new saved search in our log aggregation tool that makes it easier to find that information next time. Once easily accessible, it becomes a resource so anyone can either find out how to fix the same situation or eliminate it as a contributing factor to the current outage. 

…this is not about an actor who needs better training, it’s about establishing guardrails through critical knowledge sharing. If we are advocating that people just need better training, we are again putting the onus on the human to just have to know better next time instead of providing helpful tooling to give better information about the situation. By making information accessible the human actor can make informed decisions about what actions to take. 

Ben Treynor, the founder of SRE, said the following: 

A postmortem without subsequent action is indistinguishable from no postmortem. Therefore, all postmortems which follow a user-affecting outage must have at least one P[01] bug associated with them. I personally review exceptions. There are very few exceptions. 

Vague or massive bowling-ball-sized to-dos are to be avoided at all costs; these are often worse than no action item at all. Google and Etsy are both very careful to make sure that action items follow the SMART criteria – specific, measurable, achievable, relevant, and time-bound. In fact, Google has a rule of thumb that any remediation action item should be completed in 30 days or less; if these action items linger past that, they’re revisited and either rewritten, reprioritized, or dropped.

Completing the Loop 

Once the report is written up and finalized – and available to all other incident responders for learning – you’re not quite done yet. Google, for example, tells a story where an engineer who caused a high-impact incident was commended and even given a small cash reward for quick mitigation:

Google’s founders Larry Page and Sergey Brin host TGIF, a weekly all-hands held live at our headquarters in Mountain View, California, and broadcast to Google offices around the world. A 2014 TGIF focused on “The Art of the Postmortem,” which featured SRE discussion of high-impact incidents. One SRE discussed a release he had recently pushed; despite thorough testing, an unexpected interaction inadvertently took down a critical service for four minutes. The incident only lasted four minutes because the SRE had the presence of mind to roll back the change immediately, averting a much longer and larger-scale outage. Not only did this engineer receive two peer bonuses immediately afterward in recognition of his quick and level-headed handling of the incident, but he also received a huge round of applause from the TGIF audience, which included the company’s founders and an audience of Googlers numbering in the thousands. In addition to such a visible forum, Google has an array of internal social networks that drive peer praise toward well-written postmortems and exceptional incident handling. This is one example of many where recognition of these contributions comes from peers, CEOs, and everyone in between.

We’ve seen a couple great examples of companies using the incident report and postmortem process to help with their DR role playing exercises, sharing incident writeups in a monthly newsletter or for group discussions. But visibly rewarding people for doing the right thing – as Google handled the situation above – is about as far as you can get from the “rub the puppy’s nose in it” antipattern. We think you’ll create a safer organization when you foster a postmortem process that encourages sharing information and understanding context – versus naming, shaming, and blaming.  


Jane Miceli 


Today, I am a Cloud Enterprise Architect at Micron Technology. Before Micron, I most recently led a Cloud SRE team at HP Inc. I’ve got 17 years’ experience working at companies like Rockwell Automation, HP, Bodybuilding.com, Sensus (now Xylem), Silverback Learning Solutions, and now Micron. The earliest experience I’ve had at a company using the cloud was in 2010. In the 9 years since, I’ve had a lot of failures along the way. I talk about them, so others don’t repeat them and hopefully make new ones to share with me. The ones I consider true failures are the times I’ve run into the same situation and didn’t change my behavior. I endeavor to always find new ways to fail.

Dave Harrison 

I’m a Senior Application Development Manager (ADM) working for Microsoft Premier. As a development lead and project manager, I’ve spearheaded cultural revolutions in several large retail and insurance organizations making the leap to Agile and Continuous Delivery. An enthusiastic promoter of Azure DevOps, Chef, Puppet, Ansible, Docker, and all other tools – I believe very firmly that, as with Agile, the exact tool selected is less important than having the people and processes in place and ready. On a personal note, I’m the proud father of two beautiful girls and have been married to my lovely wife Jennifer for 24 years, and am based out of Portland, Oregon, USA. I enjoy fishing, reading history books, and in my spare time often wonder if I should be doing more around the house versus goofing off. I’m on LinkedIn, post to my blog semi-frequently, and – thanks to Jane! – am belatedly on Twitter too…



  1. Resilience Engineering, Hollnagel, Woods, Dekker and Cook
  2. Hollnagel’s talk, “On How (Not) To Learn From Accidents”
  3. Sidney Dekker, The Field Guide to Understanding Human Error
  4. Morgue software tool (Etsy)
  5. “Practical Postmortems at Etsy”, Daniel Schauenberg
  6. John Allspaw, “Blameless PostMortems and a Just Culture”
  7. Chapter 15, “Postmortem Culture: Learning From Failure”, Google SRE book. The discussions on hindsight and outcome bias are particularly valuable.
  8. Great postmortem example (I love the detailed timeline.)
  9. Sample (bogus!) postmortem entry. Note the sections on Lessons Learned: what went well, what went wrong, where we got lucky. There’s an extensive timeline and a link to supporting info (i.e. the monitoring dashboard); impact, summary, root causes, trigger, resolution, detection; and then a list of action items and their status.



DevOps Practices – Part 1, Spotify.

Got 10 minutes?

We’re celebrating the upcoming launch of our book by putting out a series of videos covering that thorniest of issues – culture. There’s a lot to be learned from the companies that have been able to make DevOps work.

For example, take Spotify. They’ve been able to instill a risk-friendly environment, centered around the concept of autonomous teams called squads. (There’s also tribes and guilds, but that’s another story!)




DevOps Stories – Interview with Nigel Kersten of Puppet

Nigel came to Puppet from Google HQ in Mountain View, where he was responsible for the design and implementation of one of the largest Puppet deployments in the world. At Puppet, Nigel was responsible for the development of the initial versions of Puppet Enterprise and has since served in a variety of roles, including head of product, CTO, and CIO. He’s currently the VP of Ecosystem Engineering at Puppet. He has been deeply involved in Puppet’s DevOps initiatives, and regularly speaks around the world about the adoption of DevOps in the enterprise and IT organizational transformation.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in mid 2019 and available now for pre-order!

The Deep End of the Pool

I grew up in Australia; I was lucky enough to be one of those kids that got a computer. It turns out that people would pay me to do stuff with them! So I ended up doing just that – and found myself at a local college, managing large fleets of Macs and handling a lot of multimedia and audio needs there. Very early in my career, I found hundreds of people – students and staff – very dependent on me to be The Man, to fix their problems. And I loved being the hero – there’s such a dopamine hit, a real rush! The late nights, the miracle saves – I couldn’t get enough.

Then the strangest thing happened – I started realizing there was more to life than work. I started getting very serious about music, to the point where I was performing. And I was trying a startup with a friend on the side. So, for a year or two, work became – for the first time – just work. Suddenly I didn’t want to spend my life on call, 24 hours a day – I had better things to do! I started killing off all my manual work around infrastructure and operations, replacing it with automation and scripts.

That led me to Google, where I worked for about five years. I thought I was a scripting and infrastructure ninja – but I got torn to shreds by the Site Reliability Engineers there. It was a powerful learning experience for me – I grew in ways I couldn’t have anywhere else. For starters, it was the deep end of the pool. We had a team of four managing 80,000 machines. And these weren’t servers in a webfarm – these were roaming laptops, suddenly appearing on strange networks, getting infected with malware, suffering from unreliable network connections. So we had to automate – we had no choice about it. As an Ops person, this was a huge leap forward for me – it forced me to sink or swim, really learn under fire.

Then I left for Puppet – I think I was employee #13 there – now we’re at almost 500 and growing. I’m the Chief Technical Strategist, but that’s still very much a working title – I run engineering and product teams, and handle a lot of our community evangelism and architectural vision. Really though it all comes down to trying to set our customers up for success.

Impoverished Communication

I don’t think our biggest challenge is ever technical – it’s much more fundamental than that, and it comes down to communication. There’s often a real disconnect between what executives think is true – what they are presenting at conferences and in papers – and what is actually happening on the ground. There’s a very famous paper from the Harvard Business Review back in the 70’s that said that communication is like water. Communication downwards is rarely a problem, and it works much better than most managers realize. However, open and honest communication up the chain is hard, like trying to pump water up a hill. It gets filtered or spun, as people report upwards what their manager wants to believe or what will reflect well on them – and next thing you know you have an upper management layer that thinks they are well informed but really is in an echo chamber. Just for example, take the Challenger shuttle disaster – technical data that clearly showed problems ahead of the explosion were filtered out, glossed over, made more optimistic for senior management consumption.

We see some enterprises out there struggling and it becomes this very negative mindset – “oh, the enterprise is slow, they make bad decisions, they’re not cutting edge.” And of course that’s just not true, in most cases. These are usually good people, very smart people, stuck in processes or environments where it’s difficult to do things the right way. Just for example, I was talking recently to some very bright engineers trying to implement change management, but they were completely stuck. This is a company that is about 100,000 people – for every action, they had to go outside their department to get work done. So piecemeal work was killing them – death by a thousand cuts.

Where To Start

In most larger enterprises, aiming for complete automation, end to end, is somewhat of a pipe dream – just because these companies have so many groups and siloes and dependencies. But that’s not saying that DevOps is impossible, even in shared services type orgs. This isn’t nuclear science; it’s like learning to play the piano. It doesn’t require brilliance, it’s not art – it’s just hard work. It just takes discipline and practice, daily practice.

I have the strong impression that many companies out there SAY they are doing DevOps, whatever that means – but really it hasn’t even gotten off the ground. They’re still on Square 1, analyzing and trying to come up with the right recipe or roadmap that will fit every single use case they might encounter, past present and future. So what’s the best way forward if you’re stuck in that position?

Well, first off, how much control do you have over your infrastructure? Do you have the ability to provision your VMs, self-service? If so you’ve got some more cards to play with. Assuming you do – you start with version control. Just pick one – ideally a system you already have. Even if it’s something ancient like Subversion – if that’s what you have, use it as your one single source of truth. Don’t try to migrate to the latest and greatest hipster VC system. You just need to be able to programmatically create and revert commits. Put all your shell scripts in there and start managing your infrastructure from there, as code.
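To make “programmatically create and revert commits” concrete, here’s a minimal Python sketch – the repo path, the commit message, and the choice of git are illustrative assumptions, not something Nigel prescribes:

```python
#!/usr/bin/env python3
"""Minimal sketch: treat a directory of infrastructure scripts as the single
source of truth by driving git programmatically. Repo path and messages are
hypothetical."""
import subprocess

REPO = "/srv/infra-scripts"  # hypothetical repo holding shell scripts / configs

def git(*args: str) -> str:
    """Run a git command inside the infra repo and return its output."""
    return subprocess.run(["git", "-C", REPO, *args],
                          check=True, capture_output=True, text=True).stdout

def commit_change(message: str) -> str:
    """Stage everything and record one auditable commit."""
    git("add", "--all")
    git("commit", "-m", message)
    return git("rev-parse", "HEAD").strip()

def revert_change(commit_sha: str) -> None:
    """Back out a bad change by reverting the offending commit."""
    git("revert", "--no-edit", commit_sha)

if __name__ == "__main__":
    sha = commit_change("Tune nginx worker_connections")  # example change
    print("committed", sha)
    # revert_change(sha)  # roll back if the change misbehaves
```

The wrapper itself isn’t the point; the point is that every infrastructure change becomes a commit you can audit and roll back.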

Now you’ve got your artifacts in version control and you’re using it as a single repository, right? Great – then talk to the people running deployments on your team. What’s the most painful thing about releases? Make a list of these items, pick one, and try to automate it. And always prioritize building blocks that can be consumed elsewhere. For example, don’t attempt to start by picking a snowflake production webserver and trying to automate EVERYTHING about it – you’ll just end up with a monolith of infrastructure code you can’t reuse elsewhere, and your quality needle won’t budge. No, instead you’d want to take something simple and common and create a building block out of it.

For example, time synchronization – it’s shocking, once you talk to Operations people, how something as simple and obvious as a timestamp difference between servers can cause major issues – forcing a rollback due to cascading issues, or a troubleshooting crunch because the clocks on two servers drifted out of sync and it broke your database replication. That’s literally fixed in Linux by installing a single package and a config file. But think about the reward you’ll get in terms of quality and stability with this very unglamorous but fundamental little shift.
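As a hedged illustration of turning that single-package fix into a reusable building block, here’s a rough Python sketch – it assumes Debian/Ubuntu hosts, chrony as the package, and a made-up internal NTP pool:

```python
#!/usr/bin/env python3
"""Minimal sketch of the 'time synchronization' building block: make every
Debian/Ubuntu host run chrony with the same config. Package name, config path,
and NTP pool are assumptions about the target fleet; run as root."""
import subprocess

CHRONY_CONF = """\
pool time.example.internal iburst   # hypothetical internal NTP pool
makestep 1.0 3
rtcsync
"""

def sh(cmd: list) -> None:
    subprocess.run(cmd, check=True)

def ensure_time_sync() -> None:
    sh(["apt-get", "install", "-y", "chrony"])          # install the one package
    with open("/etc/chrony/chrony.conf", "w") as f:     # drop in the shared config
        f.write(CHRONY_CONF)
    sh(["systemctl", "enable", "--now", "chrony"])      # start and persist the service

if __name__ == "__main__":
    ensure_time_sync()
```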

Take that list and work on what’s causing pain for your on-call people, what’s causing your deployments to break. The more you can automate this, the better. And make it as self-service as possible – instead of having the devs fire off an email to you, where you create a ticket, then provision test environments – all those manual chokepoints – wouldn’t it be better if the devs could call an API or click a button on a website and get a test environment spun up automatically that’s set up just like production? That’s a force multiplier in terms of improving your quality right from the get-go.
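Here’s a minimal sketch of what that self-service path could look like – it assumes Docker is available and uses a made-up production-like image name; the real thing would sit behind an API endpoint or a button:

```python
#!/usr/bin/env python3
"""Minimal sketch of a self-service test environment: one function a developer
can call (or wire behind an HTTP endpoint) to spin up a container configured
like production. Image name and env vars are illustrative assumptions."""
import subprocess
import uuid

PROD_LIKE_IMAGE = "registry.example.internal/webapp:latest"  # hypothetical image

def provision_test_env(owner: str) -> str:
    """Start an isolated, production-like container and return its name."""
    name = f"test-{owner}-{uuid.uuid4().hex[:8]}"
    subprocess.run([
        "docker", "run", "--detach", "--name", name,
        "--env", "APP_ENV=test",   # same image as prod, test configuration
        "--publish-all",           # expose ports on free host ports
        PROD_LIKE_IMAGE,
    ], check=True)
    return name

def teardown_test_env(name: str) -> None:
    """Throw the environment away when the developer is done."""
    subprocess.run(["docker", "rm", "--force", name], check=True)

if __name__ == "__main__":
    env = provision_test_env("jane")
    print("test environment ready:", env)
```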

 Now you’ve got version control, you can provision from code, you can roll out changes and roll them back. Maybe you add in inventory and discoverability of what’s actually running in your infrastructure. It’s amazing how few organizations really have a handle on what’s actually running, holistically. But as you go, you identify some goals and work out the practices you want to implement – then choose the software tool that seems the best fit.
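A rough sketch of that inventory step, under the assumption that the fleet runs Docker and is reachable over ssh (the host names are invented for the example):

```python
#!/usr/bin/env python3
"""Minimal sketch of 'know what's actually running': ask each host for its
containers and collect the results into one inventory. Host list and the use
of Docker/ssh are illustrative assumptions."""
import json
import subprocess

HOSTS = ["web-01.example.internal", "web-02.example.internal"]  # hypothetical fleet

def containers_on(host: str) -> list:
    """Ask a host (over ssh) what containers it is running, as JSON lines."""
    out = subprocess.run(
        ["ssh", host, "docker ps --format '{{json .}}'"],
        check=True, capture_output=True, text=True).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    inventory = {host: containers_on(host) for host in HOSTS}
    for host, containers in inventory.items():
        print(host, "->", [c.get("Image") for c in containers])
```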

Continuous Delivery Is The Finish Line

The end goal though is always the same. Your target, your goal, is to get as close as you can to Continuous Integration / Continuous Delivery. Aiming for continuous delivery is the most productive single thing an enterprise can do, pure and simple. There are tools around this – obviously, working for Puppet, I have my personal bias as to what’s best. But pick one, after some thought – and play with it. Start growing out your testing skills, so you can trust your release gates.
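A release gate can start as something as small as this Python sketch – the pytest suite and the deploy script are stand-ins for whatever CI/CD tool you end up choosing:

```python
#!/usr/bin/env python3
"""Minimal sketch of a release gate: only promote a build when the test suite
passes. The test command and deploy script are assumptions, not a specific
product's workflow."""
import subprocess
import sys

def tests_pass() -> bool:
    """Run the suite; a non-zero exit code closes the gate."""
    return subprocess.run(["pytest", "--quiet"]).returncode == 0

def deploy() -> None:
    subprocess.run(["./deploy.sh", "production"], check=True)  # hypothetical deploy script

if __name__ == "__main__":
    if not tests_pass():
        print("release gate closed: tests failed")
        sys.exit(1)
    deploy()
    print("release gate open: deployed")
```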

With COTS products you can’t always adopt all of these practices – but you can get pretty close, even with big-splash, multi-GB releases. For example, you can use deployment slots and script as much as you can. Yes, there’s going to be some manual steps – but the more you can automate even this, the happier you’ll be.

Over time, kind of naturally, you’ll see a set of teams appear that are using CI/CD and automation, and the company can point to these as success stories. That’s when an executive sponsor can step in and set this as a mandate, top down. But just about every DevOps success story we’ve seen goes through this pioneering phase where they’re trying things out squad by squad and experimenting – that’s a good thing. You can’t skip it, any more than a caterpillar can go straight to being a butterfly.

DevOps Teams

At first I really hated the whole DevOps Team concept – and in the long term, it doesn’t make sense. It’s actually a common failure point – a senior manager starts holding this “A” team up as an example. This creates a whole legion of haters and enemies, people working with traditional systems who haven’t been given the opportunity to change like the cool kids – the guys always off at conferences, running stuff in the cloud, blah blah. But in the short term it totally has its place. You need to attach yourself to symbols that make it clear you’re trying to change. If you try to boil the ocean or spin it out with dozens of teams, it gets diluted, your risk rises, and it could lose credibility. Word of mouth needs to be in your favor, kind of like band t-shirts for teenagers. So you can start with a small group initially for your experiments – just don’t let it stay that way too long.

But what if you DON’T have that self-provisioning authority? Well there’s ways around that as well. You see departments doing things like doing capacity planning and reserving large pools of machines ahead of time. That’s obviously suboptimal and it’s disappearing now that more people are seeing what a powerful game-changer the cloud and self-provisioned environments are. The point is – very rarely are we completely shackled and constrained when it comes to infrastructure.

Automation and Paying Off Technical Debt

It’s all too easy to get bogged down in minutiae when it comes to automation. I said earlier that DevOps isn’t art, it’s just hard work – and that’s true. But focus that hard work on the things that really matter. Your responsibility is to make sure you guard your time and that of the people around you. If you’re not careful, you’ll end up replacing this infinite backlog of manual work you have to do with an infinite amount of tasks you need to automate. That’s really demoralizing, and it really hasn’t made your life that much better!

Let’s take the example of a classic three-tier web app you have on-prem. You’ve sunk a lot of time into it, so that now it fails every six months versus every week – terrific! But for that next step – instead of trying to automate it completely end to end, which you could do – how could you change it so that it’s more service-oriented, more loosely coupled, so your maintenance drops even more and changes are less risky? Maybe building part of it as a microservice, or putting up that classic Martin Fowler strangler fig, will give you a dramatic payoff you would never get by grinding out automation for the sake of automation and never asking if there’s a better way.
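A minimal sketch of that strangler fig idea: a thin front door that sends the one migrated path to the new service and everything else to the legacy app. Hostnames, ports, and the “/orders” split are made-up assumptions:

```python
#!/usr/bin/env python3
"""Minimal sketch of a strangler-fig front door: route migrated paths to a new
microservice while everything else still hits the legacy three-tier app."""
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

LEGACY_APP = "http://legacy.example.internal:8080"   # existing monolith (hypothetical)
NEW_SERVICE = "http://orders.example.internal:9000"  # freshly extracted service (hypothetical)

def choose_backend(path: str) -> str:
    """Only traffic we have migrated goes to the new service."""
    return NEW_SERVICE if path.startswith("/orders") else LEGACY_APP

class StranglerProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = choose_backend(self.path)
        with urlopen(backend + self.path) as upstream:   # forward the request
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header("Content-Type",
                             upstream.headers.get("Content-Type", "text/plain"))
            self.end_headers()
            self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), StranglerProxy).serve_forever()
```

As more paths move over, the legacy branch handles less and less traffic until it can be retired.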

Paying off technical debt is a grind, just like paying off your credit card and paying off the mortgage. Of course you need to do that – but it shouldn’t be all you do! Maybe you’ll take some money and sink it into an investment somewhere, and get that big boost to your bottom line. So instead of mindlessly just paying off your technical debt, realize you have options – some great investment areas open to you – that you can invest part of your effort in.

Optimism Bias and Culture

This brings us right back to where we started: communication. There is a fundamental blind spot in a lot of books and presentations I see on DevOps, and it has to do with our optimism bias. DevOps started out as a grassroots, community-driven movement – led and championed by passionate people that really care about what they’re doing and why they’re doing it. Pioneers like this are a small subset of the community, though – and too often we assume ‘everyone is just like us’! What about the category a lot of people fall in – the ones who just want to show up, do their job, and then go home? If we come to them with this crusade for efficiency and productivity, it just won’t resonate with the 9 to 5 crowd. They like the job they have – they do a lot of manual changes, true, but they know how to do it, it guarantees a steady flow of work and therefore income, and any kind of change will not be viewed as an improvement – no matter how you try to sell it. You could call this “bad”, or just realize that not everyone is motivated by the same things or thinks the same way. In your approach, you may have to mix a little pragmatism in with that starry-eyed DevOps idealism – think of different ways to reach them, work around them, or wait for a strong management drive to collapse this kind of resistance.

DevOps Stories – Interview with John Weers of Micron

John Weers is Senior Manager of DevOps and Software Quality at Micron. He works to build highly capable teams that trust each other, build high quality software, deliver value with each sprint and realize there’s more to life than work.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in mid 2019 and available now for pre-order!

Kickstarting a DevOps Culture

Some initial background – I lead a team of passionate DevOps engineers and managers who are tasked with making our DevOps transformation work. While our group is only officially about 5 months old, we’ve all been working on this separately for quite a while.

About every two weeks we have a group of about 15 DevOps experts that get together and talk – we call them the “design team”.  That’s a critical touch point for us – we identify some problems in the organization, talk about what might be the best practice for them, and then use that as a base in making recommendations. So that’s how we set up a common direction and coordinate; but we each speak for and report to a different piece of the org. That’s a very good thing – I’d be worried if we were a separate group of architects, because then we’d get tuned out as “those DevOps guys”. It’s a different thing altogether if a recommendation is coming from someone working for the same person you do!

We’ve made huge strides when it comes to being more of a learning-type organization – which means, are we risk-friendly, do we favor experimentation? When there’s a problem, we’re starting to focus less on root cause and ‘how do we prevent this disaster from happening again’ – and more on, what did we learn from this? I see teams out there trying new things, experimenting with a new tool for automation – and senior management has responded favorably.

Our movement didn’t kick off with a bang. About 5 years ago, we came to the realization that our quality in my area of IT was poor. We knew quality was important, but didn’t understand how to improve it. Some of the software we were deploying was overly complex and buggy. In another area, the issue wasn’t quality but time – the manual test cycle was too long, we’re talking weeks for any release.

You can tell we’re making progress by listening to people’s conversations – it’s no longer about testing dates or coverage percentages or how many bugs we found this month, but “how soon can we get this into production?” – most of the fear is gone of a buggy release as we’ve moved up that quality curve. But it has been a gradual thing. I talked to everyone I could think of at conferences, about their experiences with DevOps. It took a lot of trial and error to find out what works with our organization. No one that I know of has hit on the magical formula right off the bat; it takes patience and a lot of experimentation.

Start With Testing

Our first effort was to target testing – automated testing, in our case using HP’s UFT and Quality Center platform. But there never was an all-hands-on-deck call to “Do DevOps!” – that did happen, but it came two years later. We had to lay down the groundwork by focusing first on quality, specifically testing.

We’re five years along now and we are making progress, but don’t kid yourself that growth or a change in mindset happens overnight. Just the phrase “Shift Left” for example – we did shift our quality work earlier in the development process by moving to unit testing and away from UI/Regression testing. We found that it decreased our bugs in production by a very significant amount.

We went through a few phases – one where we had a small army of contractors doing test automation and regression testing against the UI layer. Quality didn’t improve, because of the he-said/she-said type interactions between the developers and QA teams in their different siloes. We tried to address interactions between different applications and systems with integration testing, and again found little value. The software was just too complex. Then we reached a point where we realized the whole dynamic needed to be rethought.

So, we broke up the QA org in its entirety, assigned QA testers to each of our agile teams, and said – you guys will sink or swim as a team. Our success with regression testing went up dramatically once we could write tests along with the software as it was being developed. Once a team is accountable for their quality, they find a way of making it happen.

We got resistance and kickback from the developers, which was a little surprising. There were a lot of complaints, when we first started requiring developers to write unit tests along with their code, that it wasn’t “value added” type activity. But we knew this was something that was necessary – without unit tests, by the time we knew there was a problem in integration or functional testing, it would often be too late to fix it before it went out the door.

So, we held the line and now those teams that have a comprehensive unit testing suite are seeing very few errors being released to production.  At this point, those teams won’t give up unit testing because it’s so valuable to them.

“Shift Left” doesn’t mean throwing out all your integration and regression testing. You still need to do a little testing to make sure the user experience isn’t broken. “Shift Left” means test earlier in the process, but in my mind it also means that “our team” owns our quality.
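For readers newer to this, here’s the flavor of the fast, developer-owned unit test being described – a made-up example using Python’s standard unittest module rather than Micron’s actual stack:

```python
#!/usr/bin/env python3
"""Minimal sketch of a fast unit test written alongside the code it covers.
The pricing function is invented for illustration."""
import unittest

def volume_discount(quantity: int, unit_price: float) -> float:
    """Apply a 10% discount on orders of 100 units or more."""
    total = quantity * unit_price
    return total * 0.9 if quantity >= 100 else total

class VolumeDiscountTests(unittest.TestCase):
    def test_no_discount_below_threshold(self):
        self.assertEqual(volume_discount(99, 2.0), 198.0)

    def test_discount_at_threshold(self):
        self.assertAlmostEqual(volume_discount(100, 2.0), 180.0)

if __name__ == "__main__":
    unittest.main()
```

Tests like this run in milliseconds on every build, which is what makes owning quality inside the team practical.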

Culture and Energy are the Limiting Points

If you want to “Do DevOps” as a solo individual, you’ll fail.   You need other experts around you to share the load and provide ideas and help.  A group is stronger than any individual.

Can I say – the tool is not the problem, ever? It’s always culture and energy. What I seem to find is, we can make progress in any area that I or another DevOps expert can personally inject some energy into. If I’m visible, if I talk to people, if I can build a compelling storyline – we make rapid progress. Without it, we don’t. It’s almost like starting a fire – you can’t just crumple up some newspaper, dump some kindling on it, light a match and walk away. You’ve got to tend it, constantly add material or blow on it to get something going.

We’re spread very thin; energy and time are limited, and without injecting energy things just don’t happen. That’s a very common story – it’s not that we’re lazy, or bad, or stupid – we work very hard, but there’s so much work to be done we can’t spare the cycles to look at how we’re going about things. Sometimes, you need an outside perspective to provide that new idea, or show a different way.

Lead By Listening

One of the base principles of DevOps is to find your area of pain and devote cycles to automating it. That removes a lot of waste – human defects and errors when you’re running a deployment. But that doesn’t resonate when I work with a team that’s new to DevOps. I don’t walk in there with a stone tablet of commandments, “here’s what you should do to do DevOps”. That’s a huge turn-off.

Instead, I start by listening. I talk to each team and ask them how they go about their work – what they do, how they do it. Once we find out how things are working, we can also identify some problems – then we can come in and talk about how automation can address that problem in a way that’s specific to that team, how DevOps can make their world better. They see a better future and they can go after it.

Tools as an Incentive

I just said the tool isn’t the problem, but that doesn’t mean it’s not a critical part of the solution. I’m a techie at heart and I like a shiny new tool just as much as the next person. You can use tools as incentives to get new changes rolling. It’s a tough sell to walk into a meeting and pitch unit testing as a cure for quality issues if the tests take a long time to write. But if we talk about using Visual Studio Enterprise and how it makes unit tests simple and able to run in real time, now it becomes easier to do unit testing than to test the old way. If we can show how these tools can shrink testing from a week down to an afterthought, now we have your attention!

About a year ago, our CIO set a mandate for the entire organization to excel at both DevOps and Agile. But the architecture wasn’t defined, no tools were specified. Which is terrific – DevOps and Agile is just a way of improving what we can do for the business. We now see different teams having different tech stacks and some variation in the tools based on what their pain point is and what their customers are needing.  As a rule, we encourage alignment where it makes sense around either a technology stack or with a common leader. That provides enough alignment that teams can learn from each other and yet look for better ways of solving their issues.

The rule is that each main group in IT should favor a toolchain, but should choose software architecture that fits their business needs.  In one area, for example, the focus is on getting changes into production as fast as possible. This is the cutting edge of the blade, so automation and fast turnaround cycles are everything. For them, microservices are a terrific option and the way that their development happens – it fits the business outcomes they want.

Do You Need the Cloud?

They’ll tell you that DevOps means the cloud; you can’t do it without rapid provisioning which means scalable architecture and massive cloud-based datacenters. But we’re almost 100% on-prem. For us, we need to keep our software, especially R&D, privately hosted. That hasn’t slowed us down much.   It would certainly be more convenient to have cloud-based data centers and rapid provisioning, but it’s not required by any means.

Metrics We Care About

We focus on two things – lead time (or cycle time in the industry) and production impact. We want to know the impact in terms of lost opportunity – when the fab slows down or stops because of a change or problem. That resonates very well with management, it’s something everyone can understand.

But I tell people to be careful about metrics. It’s easy to fall in love with a metric and push it to the point of absurdity! I’ve done this several times. We’ve dabbled in tracking defects, bug counts, code coverage, volume of unit testing, number of regression tests – and all of them have a dark side, a poor behavior that gets encouraged. Just for example, let’s say we are tracking and displaying the volume of regression tests. Suddenly, rather than creating a single test that makes sense, you start to see tests getting chopped up into dozens of one-step tests so the team can hit a volume metric. With bug counts – developers can classify them as a misunderstood requirement rather than admitting something was an actual bug. When we went after code coverage, one developer wrote a unit test that would bring an entire module of code under test and ran it as one gigantic block to hit their numbers.

We’ve decided to keep it simple – we’re only going to track these 2 things – cycle time and production impact – and the teams can talk individually in their retrospectives about how good or bad their quality really is. The team level is also where we can make the most impact on quality.
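As a rough illustration of tracking just those two numbers, here’s a small Python sketch – the record format and the sample data are invented for the example:

```python
#!/usr/bin/env python3
"""Minimal sketch of the two numbers tracked here: cycle time (work started to
running in production) and production impact (minutes of lost fab time).
Record layout and sample values are illustrative assumptions."""
from datetime import datetime
from statistics import mean

# (work started, deployed to production, minutes of production impact)
changes = [
    (datetime(2018, 3, 1, 9, 0), datetime(2018, 3, 8, 16, 0), 0),
    (datetime(2018, 3, 5, 10, 0), datetime(2018, 3, 14, 11, 0), 5),
]

def cycle_time_days(started: datetime, deployed: datetime) -> float:
    """Elapsed days from starting the work to seeing it in production."""
    return (deployed - started).total_seconds() / 86400

if __name__ == "__main__":
    print("average cycle time (days):",
          round(mean(cycle_time_days(s, d) for s, d, _ in changes), 1))
    print("total production impact (minutes):", sum(i for *_, i in changes))
```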

I’ve learned a lot about metrics over the years from Bob Lewis’ IS Survivor columns.  Chief among those lessons is to be very, very careful about the conversation you have with every metric.  You should determine what success looks like, and then generate a metric that gives you a view of how your team is working.  All subsequent conversations should be around “if we’re being successful” and not “are we achieving the metric.”   The worst thing that can happen is that I got what I measured.

PMO Resistance

Sometimes we see some resistance from the BSA/PM layer. That’s usually because we’re leading with our left foot – the right way is to talk about outcomes. What if we could get code out the door faster, with a happier team, with less time testing and fewer bugs? When we lead with the desired outcome, that middle layer doesn’t resist, because we’re proposing changes that will make their lives easier.

I can’t stress this enough – focus on the business outcomes you’re looking for and eliminate everything else. Only pursue a change if the outcome fits one of those business needs.

When we started this quality initiative, our release cycle initially averaged – I wish I was exaggerating – about 300 days. We would invest a huge amount of testing at every site before we would deploy. Today, we have teams with cycle times under 10 days. But that speed couldn’t happen unless our quality had gone up. We had to beef up our communication loop with the fab so that if there’s a problem we can stop it before it gets replicated.

The Role of Communication

You can’t overstate the importance of credibility. As we create less and less impact with the changes we deploy, our relationship with our customers in the business gets better and better. Just for example, three years ago we had just gone through a disastrous communication tool patch that grounded an entire site for hours. We worked through the problems internally, and then a year later I came to a plant IT director, told them we thought the quality issues were taken care of, and enlisted their help.

Our next deployment required 5 minutes of downtime and had limited sporadic impact.  And that’s been the last real impact we’ve had during software deployment for this tool in almost 3 years – now our deployments are automated and invisible to our users. Slowly building up that credibility and a good reputation for caring about the people you’re impacting downstream has been a big part of our effort.

Cross-Functional Teams

It’s commonly accepted that for DevOps to work you must be cross-functional. We are like many other companies in that we use a Shared Services model – we have several agile teams that include development, QA roles, an infrastructure team, and Operations which handles trouble tickets from the sites – each with their own leader. This might be a pain point in many companies, but for us it’s just how we work. We’ve learned to collaborate and share the pain so that we’re not throwing work over the fence. It’s not always perfect, but it’s very workable.

For example, in my area every week we have a recap meeting which Ops leads, where they talk about what’s been happening in production and work out solutions with the dev managers in the room. In this way the teams work together and feel each other’s pain. We’re being successful and we haven’t had to break up the company into fully cross-functional groups.

Purists might object to this – we haven’t combined Development and Operations, so can we really say that we are “doing DevOps”? If it would help us drive better business outcomes, that org reshuffling would have happened. But for us, since the focus is on business outcomes, not on who we report to, our collaboration cross team is good and getting better every day. We’re all talking the same language, and we didn’t have to reshuffle. We’re all one team. The point is to focus on the business outcomes and if you need to reorg, it will be apparent when teams talk about their pain points.

If It Comes Easy, It Doesn’t Stick

Circling back to energy – sometimes I sit in my office and wish that culture was easier to change. It’d be so great if there was a single metric we could align on, or a magical technique where I could flip a switch and everyone would get it and catch fire with enthusiasm. Unfortunately, that silver bullet doesn’t exist.

Sometimes I listen to Dave Ramsey on my way in to work – he talks about changing the family tree and getting out of debt. Something he said though resonated with me – “If it comes easy, it doesn’t stick.” If DevOps came easy for us, it wouldn’t really have the impact on our organization that we need. There’s a lot of effort, thought, suffering – pain, really – to get any kind of outcome that’s worth having.

As long as you focus on the outcome, I believe DevOps is a fantastic thing for just about any organization. But, if you view it as a recipe that you need to follow, or a checklist – you’re on the wrong track already, because you’re not thinking about outcomes. If you build from an outcome that will help your business and think backwards to the best way of reaching that outcome – then DevOps is almost guaranteed to work.