Hey guys the 2018 State of DevOps report from Puppet/DORA is out! As always, those guys have done an amazing job. You owe it to yourself to download it and check it over, and pass it along.
Here’s the points I found most powerful:
- DevOps isn’t a fad; it’s proven to make companies faster and less wasteful in producing new features.
- Slower is not safer. Companies releasing every 1-6 months had abysmally slow recovery times.
- We can’t eliminate toil or manual work completely – but in low performing companies, it’s basically all we do. High-performers rarely have it make more than 30% of the workday.
- Outsourcing an entire function – like QA, or production support – remains a terrible idea. It represents a dramatic cap on innovation and ends up costing far more in delays than you’ll ever see with saved operational costs.
- “Shift Left” on security continues to grow in popularity – because it works. The best examples are where implementing it early is made as easy and effortless as possible.
More below. Check it out for yourself, it’s such great work and very easy to read!
The difference between the greats and the not-so-great continues to widen: We’ve heard executives describe DevOps as being a “buzz word” or a “fad”. Ten years into this movement, this seems more and more out of touch with reality. Companies that take DevOps seriously as part of their DNA perform better. They deploy code 46x more frequently; they’re faster to innovate (2,555 times faster lead time). And they do it more safely. Elite performers have 7x lower change failure rate, and can recover 2,604x faster.
DevOps has been proven to lead to faster innovation and change AND produce higher quality work. Honestly, does that sound like a fad to you? (I wonder sometimes if the GM and Chrysler execs in the 1970’s were saying the same thing about Toyota…)
(above image and all others copyright Puppet/DORA 2018)
Releasing infrequently for “safety” is anything but. Many organizations gate releases so they’re spread out over weeks or months, in an attempt to prevent bugs or defects. This backfires terribly; while bug rates may drop, it means their time to recover is disastrously slow. For example, companies that release every 1-6 months have the exact same MTTR – 1-6 months. (!!!!)
“When failures occur, it can be difficult to understand what caused the problem and then restore service. Worse, deployments can cause cascading failures throughout the system. Those failures take a remarkably long time to fully recover from. While many organizations insist this common failure scenario won’t happen to them, when we look at the data, we see five percent of teams doing exactly this—and suffering the consequences. At first glance, taking one to six months to recover from a system failure seems preposterous. But consider scenarios where a system outage causes cascading failures and data corruption, or when multiple unknown systems are compromised by intruders. Several months suddenly seems like a plausible timeline for full recovery.”
Toil and manual work: Elite and high performing orgs do far less manual work. Just look at the percent of people’s time wasted in low performing orgs doing things like hacking out manual configs on a VM, or smoketesting, or trying to push a deployment out the door using Xcopy. Someone on an elite, high performing company might spend 20-30% of their time doing this type of shovel work; at lower performing companies, it’s basically 100% plus of their time.
Think Twice Before You Outsource: The powerful example of Maersk shows the cost of outsourcing entire functions (like testing, or Operations) to external groups. The 2018 study proves that outsourcing an entire function leads to delays as work is batched and high-priority items wait on lower-priority work in queue. This is the famous handoff waste and directly against key DevOps principles of cross functional teams:
“Analysis shows that low-performing teams are 3.9 times more likely to use functional outsourcing (overall) than elite performance teams, and 3.2 times more likely to use outsourcing of any of the following functions: application development, IT operations work, or testing and QA. This suggests that outsourcing by function is rarely adopted by elite performers. …Misguided performers also report the highest use of outsourcing, which likely contributes to their slower performance and significantly slower recovery from downtime. When working in an outsourcing context, it can take months to implement, test, and deploy fixes for incidents caused by code problems.”
In Maersk’s case, just the top three features represented a delay cost of $7 million per week. So while outsourcing may seem to represent a chance to cut costs, data shows that the delay costs and drag on your deployment rate may far outweigh any supposed savings.
Lean product management: the survey went into some detail about the qualities of Lean Product Management that they found favorable. Here’s a snapshot:
Security by audit versus part of the lifecycle: Great thoughts on how shifting left on security is a key piece of delivery. They recommend making security easy, with frameworks of preapproved libraries, packages and toolchains, and reference examples of implementation, versus late-breaking audits and the disruption and delays that causes:
“Low performers take weeks to conduct security reviews and complete the changes identified. In contrast, elite performers build security in and can conduct security reviews and complete changes in just days. …Our research shows that infosec personnel should have input into the design of applications and work with teams (including performing security reviews for all major features) throughout the development process. In other words, we should adopt a continuous approach to delivering secure systems. In teams that do well, security reviews do not slow down the development process.”
So, that’s my book report. Loved it, as always, though I’m not onboard with everything there. For example, they’ve coined a new phrase – SDO, “Software Delivery and Operational Performance.” Sorry, but to me that’s reliability – the “R” in SRE, which has been around since 2003 in the software world. I don’t see the need for another acronym around that. And they’re splitting hairs a little when separating out automated testing from continuous testing, but I might be wrong on that.
As usual, it’s brilliant, data-driven, and really sets the pace for the entire growing movement of DevOps. LOVE, love the work that Puppet and DORA are producing – keep it up guys!