DORA 2018 State of DevOps report is out!

Hey guys, the 2018 State of DevOps report from Puppet/DORA is out! As always, those guys have done an amazing job. You owe it to yourself to download it, check it over, and pass it along.

Here are the points I found most powerful:

  1. DevOps isn’t a fad; it’s proven to make companies faster and less wasteful in producing new features.
  2. Slower is not safer. Companies releasing every 1-6 months had abysmally slow recovery times.
  3. We can’t eliminate toil or manual work completely – but in low-performing companies, it’s basically all we do. High performers rarely let it take up more than 30% of the workday.
  4. Outsourcing an entire function – like QA, or production support – remains a terrible idea. It represents a dramatic cap on innovation and ends up costing far more in delays than you’ll ever see with saved operational costs.
  5. “Shift left” on security continues to grow in popularity – because it works. The best examples are where implementing it early is made as easy and effortless as possible.

More below. Check it out for yourself, it’s such great work and very easy to read!


The difference between the greats and the not-so-great continues to widen: We’ve heard executives describe DevOps as a “buzzword” or a “fad”. Ten years into this movement, that seems more and more out of touch with reality. Companies that take DevOps seriously as part of their DNA perform better. They deploy code 46x more frequently; they’re faster to innovate (2,555x faster lead time). And they do it more safely: elite performers have a 7x lower change failure rate, and can recover 2,604x faster.

DevOps has been proven to lead to faster innovation and change AND produce higher quality work. Honestly, does that sound like a fad to you? (I wonder sometimes if the GM and Chrysler execs in the 1970’s were saying the same thing about Toyota…)

(above image and all others copyright Puppet/DORA 2018)

Releasing infrequently for “safety” is anything but. Many organizations gate releases so they’re spread out over weeks or months, in an attempt to prevent bugs or defects. This backfires terribly; while bug rates may drop, it means their time to recover is disastrously slow. For example, companies that release every 1-6 months have the exact same MTTR – 1-6 months. (!!!!)

“When failures occur, it can be difficult to understand what caused the problem and then restore service. Worse, deployments can cause cascading failures throughout the system. Those failures take a remarkably long time to fully recover from. While many organizations insist this common failure scenario won’t happen to them, when we look at the data, we see five percent of teams doing exactly this—and suffering the consequences. At first glance, taking one to six months to recover from a system failure seems preposterous. But consider scenarios where a system outage causes cascading failures and data corruption, or when multiple unknown systems are compromised by intruders. Several months suddenly seems like a plausible timeline for full recovery.”

Toil and manual work: Elite and high-performing orgs do far less manual work. Just look at the percentage of people’s time wasted in low-performing orgs doing things like hacking out manual configs on a VM, smoke testing, or trying to push a deployment out the door using Xcopy. Someone at an elite, high-performing company might spend 20-30% of their time on this type of shovel work; at lower-performing companies, it consumes essentially all of their time.


Think Twice Before You Outsource: The powerful example of Maersk shows the cost of outsourcing entire functions (like testing, or Operations) to external groups. The 2018 study shows that outsourcing an entire function leads to delays, as work is batched and high-priority items wait on lower-priority work in the queue. This is the famous handoff waste, and it runs directly against the key DevOps principle of cross-functional teams:

“Analysis shows that low-performing teams are 3.9 times more likely to use functional outsourcing (overall) than elite performance teams, and 3.2 times more likely to use outsourcing of any of the following functions: application development, IT operations work, or testing and QA. This suggests that outsourcing by function is rarely adopted by elite performers. …Misguided performers also report the highest use of outsourcing, which likely contributes to their slower performance and significantly slower recovery from downtime. When working in an outsourcing context, it can take months to implement, test, and deploy fixes for incidents caused by code problems.”

In Maersk’s case, just the top three features represented a delay cost of $7 million per week. So while outsourcing may seem to represent a chance to cut costs, data shows that the delay costs and drag on your deployment rate may far outweigh any supposed savings.

Lean product management: the survey went into some detail about the qualities of Lean Product Management that they found favorable. Here’s a snapshot:

Security by audit versus part of the lifecycle: Great thoughts on how shifting left on security is a key piece of delivery. They recommend making security easy – with frameworks of preapproved libraries, packages, and toolchains, and reference examples of implementation – versus late-breaking audits and the disruption and delays they cause:

“Low performers take weeks to conduct security reviews and complete the changes identified. In contrast, elite performers build security in and can conduct security reviews and complete changes in just days. …Our research shows that infosec personnel should have input into the design of applications and work with teams (including performing security reviews for all major features) throughout the development process. In other words, we should adopt a continuous approach to delivering secure systems. In teams that do well, security reviews do not slow down the development process.”


So, that’s my book report. Loved it, as always, though I’m not onboard with everything there. For example, they’ve coined a new phrase – SDO, “Software Delivery and Operational Performance.” Sorry, but to me that’s reliability – the “R” in SRE, which has been around since 2003 in the software world. I don’t see the need for another acronym around that. And they’re splitting hairs a little when separating out automated testing from continuous testing, but I might be wrong on that.

As usual, it’s brilliant, data-driven, and really sets the pace for the entire growing movement of DevOps. LOVE, love the work that Puppet and DORA are producing – keep it up guys!






DevOps Stories – Interview with Anne Steiner


Anne Steiner is the Vice President of Product Agility for cPrime. In her role, Anne sets up cross-team discovery cadences, scales product thinking in large organizations, and teaches and mentors stakeholders in leadership and product roles. Anne and her team have helped companies of all shapes and sizes transform from traditional project-thinking to become product-driven organizations that emphasize continuous learning. She also actively promotes building communities of practitioners in the Minneapolis/St. Paul area and frequently speaks at national and regional events. She served in the United States Marine Corps as a logistics/embarkation non-commissioned officer in the early 2000s.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!



You know, people think of the military as hierarchical, rigid – but in my experience the military is incredibly flexible and dynamic. It has to be to survive in war, and war is becoming more dynamic. Decision making keeps getting pushed down to lower and lower levels.

Just for example, look how we start with boot camp. It starts with dehumanization – with the goal of teaching people that we are all the same; nobody’s special. We take away your clothes, if you’re a guy we shave off your hair. Then we teach the lesson – you do everything as a team. The USMC sets up tasks that are impossible to complete in the time allowed alone. For example, the beds are so close together that if you’re asked to make a bed – your rack mate has to help you with one side of the lower bunk, and then you help her with your side of her bunk. The lesson is, nobody succeeds alone – in boot camp, you can be perfectly right and still get screamed at. I remember once, I made my bed perfectly; the corners were good, and I still got screamed at because I had known what needed to be done and I didn’t help my teammate. The whole process is to drill into your head – this is your family now – you must succeed as a team.

The military’s approach to requirements: Besides shared values, the concept of how orders are delivered in the military has some application to DevOps. In the military, there’s a separation of concerns between the officers who give orders and the enlisted people who carry them out – similar to the division between team members and management. These two groups have very different points of view, and misunderstandings or conflicts could hamper an operation or cost lives. To address this, nothing significant happens without a written order describing the commander’s intent. It’s a standard five-paragraph order that follows the SMEAC format – Situation, Mission, Execution, Administration/Logistics, and Command/Signal.

Now, the military doesn’t expect its people to document every possible scenario or to follow the words in the order blindly – because we need our people to make independent decisions autonomously as the situation ultimately changes mid-operation. So we don’t fill in all the details but provide the high level intent. The order describes what the commander wants to accomplish, the overall goals and the time frame – you are following orders as long as you’re following the intent and haven’t violated some other direction provided. At cPrime we do the same thing, where we teach product teams something called collaborative framing. That describes what we’re doing, why we’re doing it, and who we are doing it for. That’s pretty similar to the way orders are used in the Marines – the orders provide the high-level strategy and context, and people are allowed to fill in the implementation details later.

I wish this happened more often in the development world. We shouldn’t feel like we have to spoon feed everything to dev teams with detailed requirements – what if we just gave them the intent? We could define the operating requirements, the business goals, and allow them to figure out how to solve the problem.

You want to be told why. A lot of times we aren’t told “why”, just “what” as developers. That’s what surprised me about the military – there was never a leader that I worked with that I couldn’t ask why, in a respectful manner, and be given context. That helps you understand the mission. It always surprised me how open leadership was to questions about orders.

Now I should say – orders aren’t open to question or debate all the time. Sometimes in a crunch we need orders to be followed without question; but that’s actually not the norm, contrary to what most people think.


Keys to Success: What separates out the successful orgs? I find three traits winning organizations have in common:

  • Bold leadership that’s willing to take risks
  • A culture of agility and learning
  • Starting with a small success story

DevOps culture changes obviously come easier with smaller companies; in larger orgs you have to find a pocket where it’s okay to experiment or where a bold leader can nurture and shelter an effort. Once you get to that point where you can start telling stories – we hit this obstacle, and here we hit some snags, but look at these results – that’s where you start to see culture change. You can’t just come in the door and say “We’re going to take risks and become a learning org!”, because you haven’t proven yourself yet. I’m always looking for that right kind of leadership protection, a willingness to experiment, and a group that wants to learn and try something different. That’s your beachhead!

A Single Mission: One of the key factors I see in many successful DevOps transformations is having a legitimate set of shared measures – a shared mission. In the USMC, we have a standard mission: to make Marines, and to win battles. That’s the single mission, and if something in the orders doesn’t relate to it directly, we throw it out. In the software world, it’s not that simple. Every product has a vision, every company has a mission statement. But how many can articulate it that simply? Netflix does a great job with a shared mission, for example – their shared goals are to retain subscriptions and to increase subscriptions. Whatever you do needs to be aligned against one of those. Can you prove that your project aligns against that? Otherwise you’ll see antipatterns like IT teams saying “we have 100% uptime” – yeah, that’s great, but you’ve got a crappy product and your customers are unhappy. That’s not product thinking; product thinking means a clear, common goal that everyone can rally around.


Flexibility and Innovation: There are a lot of people out there writing books on Agile, and quite a few are well written. But if you slam one on the table, it’s not going to work like it says in the book. Then what are you going to do? The teams that are successful are the ones that can implement it – or, better yet, the parts of it they think will add value – fail, modify it to their situation, and win anyway. That’s one of the things I love about the way the Agile Manifesto was written: it is principle-based. We see a lot of organizations struggle because they bring in some “expert” who shows up with a checklist and says, no, you’re not doing Scrum unless you’re doing these things. Well, who cares, as long as you’re delivering awesome products?

As a culture, the USMC takes pride in always being asked to do more with less – to us, the “Adapt, Improvise and Overcome” mantra is a real point of pride. I think it comes in part from how we were founded. The Marine Corps has the smallest budget of the branches; there’s not a lot of money flowing through the organization. So that helps us – we realize that no matter what happens, it’s probably not going to work the first time, and we’ll adapt and change. Traditionally, in the software world, we think change is bad – a risk we have to limit. Well, change is inevitable; we should expect it – and win even if we have to come up with a new solution on the fly.

DevOps Stories – Sam Guckenheimer of Microsoft

The following content is shared from an interview with Sam Guckenheimer, product owner for Visual Studio Team Services. When people ask us “how Microsoft did it” with our DevOps transformation, we often think of the lessons Sam shared with us during our talk. There is so much to learn from here that can help other companies in making their own journey to better, faster, and safer delivery of value!

These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!



One thing I want to start with – it really annoys me when I read grandiose claims that DevOps is broken in some way. We know that’s just not the case – Gartner tells us at least half of enterprises have something going on with DevOps and they all want to do more. If you look at Agile, which began with the Agile Manifesto back in 2001 – and compare it with where it was as a movement a decade later in 2011 – well, that would look very much like where we are at today, about 10 years after DevOps first began as a concept back in 2009. The trends are really clear, and our success rate and the maturity of the tools and processes part is only going to go up.

Avoid Massive Reorgs: It’s just not true when some say you have to “blow up the organization” to make DevOps work. Change is necessary – you have to get rid of all the handoffs and the waste, and really follow the Lean model by removing the intermediaries between developers and production, and between developers and customers. But that doesn’t mean you need to make drastic moves, and that’s not how we did it at Microsoft. It can be done in an evolutionary way.

Most companies don’t have the luxury of saying, “let’s blow it up” and just jettison decades of code with their legacy applications and start over. That’s your lifeblood! I know that was true with us on the Visual Studio team; we had to go about things in a very gradual way so we didn’t threaten the jugular of our company.

Find Your North Star: Six years ago we found our North Star – how we wanted to go about delivering value using the DevOps mindset – and we pointed to it, saying – “we want to be a world class engineering organization”. Everything we’ve done since then, every major decision we’ve made, has been built around measuring our progress towards that mission.

Jez Humble has joked a few times about some companies trying to “sprinkle magical microservices fairy dust” over things to magically get cloud services architecture. I have to say – there was no fairy dust for us. It required progressive change, some very conscious hard engineering changes, and walking the walk.

Just for example, overhauling our test portfolio and moving to Git took three years. We kept deprecating and replacing older, slow tests with faster ones incrementally – sprint by sprint, test by test. Now it takes us about 7 minutes to run 70K unit tests before a developer commits to master. But the value is incredible for us – before that, we had these long-running integration tests that had never run completely green, that always required manual intervention, and that were killing our release flow.
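That incremental migration lends itself to a simple mechanical guardrail. As a sketch – the command, budget, and function names here are hypothetical illustrations, not the actual Visual Studio Team Services tooling – a pre-merge gate might run the fast suite and enforce a time budget, so slow tests get flagged rather than quietly eroding the feedback loop:

```python
import subprocess
import time

# Hypothetical pre-merge gate: run the fast unit-test suite and enforce a
# time budget. The command and the budget are assumptions for illustration.
TIME_BUDGET_SECONDS = 7 * 60  # roughly the ~7-minute bar mentioned above

def within_budget(elapsed_seconds, budget=TIME_BUDGET_SECONDS):
    """True if the suite finished inside the agreed feedback-loop budget."""
    return elapsed_seconds <= budget

def run_gated_tests(command=("pytest", "-q", "tests/unit")):
    start = time.monotonic()
    result = subprocess.run(command)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        raise SystemExit("unit tests failed - fix before committing to master")
    if not within_budget(elapsed):
        raise SystemExit(
            f"suite took {elapsed:.0f}s, over the {TIME_BUDGET_SECONDS}s "
            "budget - consider deprecating or splitting slow tests"
        )
    print(f"gate passed in {elapsed:.0f}s")
```

The point of the budget check is cultural as much as technical: it turns "keep the suite fast" from an aspiration into something the pipeline enforces, sprint by sprint.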

Everything – our refactoring from monolith to microservices, our safe deployment practices, building a lifecycle culture, even our datacenter standup automation – required a lot of work, a multi-year commitment, and persistence despite setbacks. We knew, though, where the “North Star” was, and we were committed. Our approach was: set the goal, measure the progress, and keep going until we get there.

Production Support: Shifting to a production support mindset was a big change, and of course not everyone was onboard, especially at first. We knew that would be our most important and critical win – making sure the delivery teams were onboard and happy with what was going on. We measured this as one of our first KPIs. We would do regular surveys of engineering satisfaction and go into depth about their jobs, how the tooling was supporting their jobs, how the process was supporting their jobs – and what we saw was a steady rise in satisfaction.

Just for example on this, one of the things we measured was alerting frequency – are we getting to the right person the first time? That’s something we are always watching – if you’re waking people up at 2 in the morning, it had better be the right person. We needed to make sure that we are paying attention to the things that matter to people’s lives and their satisfaction with their jobs.
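The "right person the first time" idea can be expressed as a simple ratio. Here is a minimal sketch, assuming a hypothetical incident-record shape – the field names and data are invented for illustration, not Microsoft's actual schema:

```python
# Hypothetical metric: of all paging incidents, what fraction reached the
# person who ultimately resolved it on the very first page?
incidents = [
    {"first_paged": "alice", "resolved_by": "alice"},
    {"first_paged": "bob",   "resolved_by": "carol"},  # wrong person woken up
    {"first_paged": "dana",  "resolved_by": "dana"},
]

def first_time_right_rate(records):
    """Fraction of incidents where the first page hit the eventual resolver."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["first_paged"] == r["resolved_by"])
    return hits / len(records)
```

Watching this number over time tells you whether your alert routing is respecting people's sleep, which is exactly the satisfaction concern described above.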

When you’re genuine, you get a genuine response. This all helps build that high-trust culture that Gene Kim and others have emphasized as key.

The concept of servant leadership has been a big part of our change; good managers care about their team and look for ways to make their jobs easier. That’s the Andon cord philosophy – anyone on the floor can pull that cord, stop the line if needed – and the manager comes over, the root cause is identified and rolled into the process so future incidents don’t happen. So in our case – we don’t close out livesite incidents for example until the fix is identified and in the backlog so it won’t happen again.

Setting Goals and Metrics: Our North Star remains fixed, but we are always redefining how we want to get there. Every six months we select, epic by epic, 3 or 4 goals that define success for us over the next six months, and the specific metrics that will define them. We publish these, and they’re flowed all the way up the management chain. Those goals and metrics at the epic level don’t change for those six months. Each person on the feature crews knows which epics they’re working on and can ask each sprint – what are the next few things we need to do to move the needle along these goals? They look ahead about 3 sprints in terms of what they’re trying to do – no more than that. That level of planning is key for us to make progress in an iterative way and minimize disruption.

In the beginning, we thought it was really not a big deal to figure out the metrics and focus on the right thing. It turns out that finding the right metrics is as complicated as designing the right feature. It’s really not obvious what to measure or what you’re striving for. Very frequently, you don’t have an out-of-the-box way of doing the telemetry – so you need to instrument for the business APIs you want.

A really clear example on this – one of the metrics that we’re interested in is, how many developers are working on projects that are doing continuous delivery to Azure? That’s a very hard thing to count. You have to make several leaps of instrumentation and joins in order to answer that. Asking the question clearly and getting a way of gathering data on it is a real engineering problem – and one that typically is made to sound much simpler and less of an obstacle than it really is on the web or in books about lean customer analytics.
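To make those "leaps of instrumentation and joins" concrete, here is a toy sketch of the counting problem – every name, dataset, and shape below is invented for illustration, not real telemetry. Even this simplified version needs three separately instrumented sources joined together:

```python
# Hypothetical join across three telemetry sources to answer one question:
# which developers work on projects continuously delivering to Azure?
commits = {            # developer -> project, from repo telemetry
    "alice": "checkout",
    "bob": "checkout",
    "carol": "billing",
}
cd_pipelines = {       # project -> has an automated release pipeline?
    "checkout": True,
    "billing": False,
}
azure_targets = {"checkout"}  # projects whose releases deploy to Azure

def devs_doing_cd_to_azure():
    """Join commit, pipeline, and deployment-target data into one answer."""
    return {
        dev for dev, project in commits.items()
        if cd_pipelines.get(project) and project in azure_targets
    }
```

In real life, each of those three dictionaries is its own instrumentation project with its own identity-mapping headaches, which is why the question is so much harder than it sounds.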

This goes way, way beyond your standard # of site visitors or simple generic use cases for a website. Until you start getting down to brass tacks and define what the things are that we care about as a business and why – it’s difficult to appreciate how challenging it is to come up with the right measurables.

Value Stream Mapping: I’m going to shock you a little – we don’t do normal value stream mapping here. My observation is that value stream mapping is really effective when you want to get people on the same page and build some momentum toward a DevOps movement. Once you show people – wow, it takes us 60 days to get something to production, and most of that is wait time: 5 days for approval here, 7 days for testing there – that’s great for getting everyone to see the elephant in the room. It never fails to shock people once they see how huge that bucket of idle time is!
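Those numbers translate directly into a flow-efficiency calculation. A quick sketch using the figures above – note that only the 5-day approval and 7-day testing waits come from the example; the remaining idle time is an assumption to fill out the 60 days:

```python
# Value-stream math from the example above: a 60-day lead time where most
# of the elapsed time is waiting, not working. The 36 days of "other" idle
# time is assumed for illustration.
lead_time_days = 60
wait_days = 5 + 7 + 36  # approval + testing queue + other assumed idle time
process_days = lead_time_days - wait_days  # actual hands-on-keyboard time

flow_efficiency = process_days / lead_time_days
print(f"flow efficiency: {flow_efficiency:.0%}")  # prints "flow efficiency: 20%"
```

Seeing that only a fifth of the lead time is real work is precisely the "elephant in the room" moment the mapping exercise is for.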

For us, we’re past that initial shock phase. We focus heavily on all the things that value stream mapping attacks in terms of handoffs, idle time versus process time, etc – but it is definitely not something you need to do on an ongoing basis, in my opinion.

Two Key Antipatterns: I see two key failings that sometimes trip organizations up. First, people often think in terms of formulas – you need to do X with the people, Y with the process, and Z with the tools – and treat each of these as an independent pillar you can tackle one at a time, in phases. That ends up being counterproductive, making things more complicated and drawn out, because in reality all these things are interrelated and need to be thought of together.

My advice is to fight the tendency to take a single practice, however good, and try to implement it in isolation. Think in terms of all three columns as supporting a single building together; each improvement should touch on people, process, and tools in some way and make it a little better. Focus on the quick wins – try to stairstep your maturity, building something small that quickens that release cycle and delivers feedback faster.

The second antipattern is not getting the right balance of leadership and delegation. You need obvious skin in the game from leadership, and initiative from individual practitioners. Think back to that great book “Drive” by Dan Pink, which stressed the three motivators of Autonomy, Mastery, and Purpose. You are going to need to spark people and get them enthusiastic, active, and feeling like they control their destiny – autonomy.

It’s really part art and part science, because that autonomy has to be balanced with purpose, which is driven consistently and forcefully by management. And if you look at most of the current execs at Microsoft, you will see that they practice both high empathy and deep technical engagement.

Mission is key for us, but it goes beyond just a few words or a slogan. We put up guardrails – very clear rules of the road that specify “here is what you need to do to check your code into master.” We have a very clear definition of done that is common to every team: “code delivered with tests and telemetry and deployed in production worldwide.”

This is the exact opposite of “it works on my machine” – and everyone knows it. If you’re doing new work, there’s a set of common services we provide, including sample code and documentation. So no one has to reinvent the wheel when it comes to telemetry for example – you might improve on it, but you would never have to deliver this from scratch, it’s reused from a common set of services.

DevOps and burnout – it’s a real thing

Today I woke up to the news that one of my favorite heroes in the food world, Anthony Bourdain, was found dead in his hotel room – an apparent suicide. He leaves behind an 11-year-old daughter, a longtime girlfriend, and anguishing questions that will likely never be resolved by those who loved him. It’s the second suicide I’ve heard of this week. It seems like now is a good time to talk about burnout and job stress. If “DevOps is compassion”, as John Willis is fond of saying, we really need to do a better job in our industry of protecting our people from the stress that is claiming so many lives.

Note – these aren’t my words – they come from my cowriter and good friend Knox Lively. The stories he’s telling below are real, and they expose a hidden but very real emotional and physical health catastrophe in our field. Please give this some thought, and spend more time with the ones you love.

Burnout in our industry is common, and it often impacts the brightest, most positive contributors to the team. The symptoms include feeling exhausted, cynical, or ineffective; little or no sense of accomplishment in your work; and feelings about your work negatively affecting other aspects of your life. We’ve all seen the impacts on the lives of people around us; broken families, severe depressions, and even suicide.

I think it’s safe to say that most of us have at least one story related to the topic of burnout, illness, or even death as a result of misaligned work ethics and objectives from the individual to the organizational level. I have a couple of personal anecdotes I’d like to share surrounding this topic.

The first involves a brilliant architect at a startup I worked with in Austin, Texas. Some of you may have a similar person in your organization – they are always the last person you call. Whatever the issue, whatever the time, you can count on them to fix the problem, or at least know how to fix it. This particular person was involved in a very tedious, multi-month port from one application server to another – all at the recommendation of a C-level exec who talked to another C-level exec on a plane who happened to mention that “application server X is n percent faster than application server Y”. You’ve heard these kinds of stories too, I presume? Anyway, right in the middle of this port, this person was asked by management not to ride the brand-new Ducati he’d recently purchased. For those unfamiliar, Ducati has a reputation for making incredibly powerful performance motorcycles. He was asked this because, as you can imagine, if something were to happen to him – g-d forbid – the company would be in dire straits. This was just one of the many ways his job impacted his life in an unhealthy way. He often got to wear the hero’s hat; this, coupled with his uncanny expertise and the fact that the organization had let unhealthy work habits go unchecked, meant he had unknowingly begun to build himself a prison. Neither he nor the organization is completely at fault here, as we are dealing with something even larger: a cultural problem.

The second anecdote I’ll share highlights the real dangers of stress, even when it’s handled appropriately and in accordance with current best practices. This person had everything a software engineer could ever want. He was highly paid, found great satisfaction in his job, and worked for a company he believed in – the trifecta in terms of a career. In addition to career success, he had a very happy and fulfilling personal life. He had a wife and kids, was an avid lover of the outdoors, ate right, and exercised. He seemed to have all of his bases covered in terms of a holistic, balanced life, and most would agree. None of this, however, prevented him from collapsing on his mountain bike and dying of a massive heart attack. It came as a shock to the whole company. How could this person have a heart attack? Everyone mentally checked off the boxes they’d read dozens of times for how to have a healthy heart – he met them all. The one thing that wasn’t accounted for was the great deal of stress this person had been under. Remember the earlier anecdote about the architect? This is the same person. He’d become indispensable to the company’s daily operations, which meant 60+ hour work weeks for years, because without him, things often did not get done. This, on top of the side projects he had worked on over the years, meant he spent most of his time working, with relatively little time to decompress.



In the book we’re writing, we’re proposing DevOps as a solution in part to a life spun out of control. We believe – and experience shows – that moving more work from the “have-to” drudge work to more creative, automated, and sparkly-techy side of the page will help over time in reducing our stress load and improving our quality of life. But there’s no denying it – sometimes teams get hooked on the endorphins and rush that comes from the long nights getting a release fixed and out the door.

Think of the unforgettable character of Brent in the Phoenix Project – the one person no one could live without, the single, irreplaceable point of failure – and bottleneck – for everything the team did. Brent and the people like him are locked in a state other authors have called “full catastrophe living”, a prison they’ve built themselves by failing to see the need to improve processes and share information with others. In this case, they’re getting a short-term payoff in several areas – the warm glow of feeling respected and irreplaceable, and a certain amount of job security. This heroism comes with a high price though in terms of the team’s overall capacity and ability to learn, and over time inevitably on the health of these enterprise Atlases. Convincing these heroes of the value of DevOps is often a hard and long road; often it takes a top-down commitment to change behavior to a more sustainable pattern to overcome resistance from this point of view.

This topic of burnout and stress in the tech industry is worthy of a whole other book – one we can barely begin to scratch the surface of here. In terms of reducing stress on a personal level, here are a few key tips to begin your journey to a healthier work/life balance.

The first is to talk – whether to your partner, your boss, or a therapist. Don’t assume you’re the only one under such pressure, or that talking about it is weak. We often assume that others are going through the same thing, so we should just “suck it up”. The truth is, no one knows what’s on your mind unless you tell them. Don’t let pride take you down a one-way street to loneliness and depression. Did you know loneliness, isolation, and depression are bigger killers than obesity (Malito, 2017)? This factor alone greatly increases your risk of premature death – a pretty sobering fact if you think about it.

Second, create barriers between your personal and work life. This can take various forms, such as leaving your laptop at work on the weekend – or after every workday, if you can manage it. If you must have your devices with you at all times, try not checking work email while at home, or at a minimum not within the hour before you go to bed or the hour after you wake up. These are just a few of the many ways to decouple your work and personal life.

Third, establish barriers within your organization. For most of my career I’ve been at the mercy of what I call a “push” workload – work has always come to me, whether I wanted it or not. That’s expensive on many fronts: it causes task switching, induces stress, and reflects inefficient design, or the lack of any design at all. At the heart of DevOps is the idea of designing inefficient workflows out of your organization, and the same can be done on a personal level. Find small ways to create tension – the good kind – between you and those who summon your work. Establish protocols and chains of command for who gets pinged, and when, for certain tasks. In the same vein, improve and use internal documentation tools to empower others to solve problems for themselves. These are just a few of many ways to “shield” yourself so you can stay focused and knock out more work with less effort.

For more on burnout and the impacts of stress on our environment, please see the following.


DevOps Stories – Jon Cwiak, Humana

The following content is shared from an interview with Jon Cwiak, Enterprise Cloud Platform Architect at Humana. What we loved about talking with Jon was his candor – he’s honest and upfront that Humana’s adoption of DevOps has not always been smooth, and open about the struggles and challenges they’re still facing. Along the way we learned some eye-opening insights:

  • Why having a DevOps team isn’t necessarily a bad thing
  • How to break down walls and change very traditional mindsets and siloed groups
  • How two metrics alone can tell you how healthy your organization is
  • The Humana story as a practical roadmap, from version control to configuration management to feature toggles and microservices
  • The power of laziness as a positive career trait!

We loved our talk with Jon and wanted to share his thoughts with the community. Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!


My name is Jon Cwiak – I’m an enterprise software architect on the enterprise DevOps enablement team at Humana, a large health insurance company based out of Kentucky. We are in the midst of a transformation from a traditional insurance business into what amounts to a software company specializing in wellness and population health.

Our main function is to promote the right practices among our engineering teams. So I spend a big part of each week reinforcing to groups the need for hygiene – that old cliché about going slow to go fast. Things like branching strategy, version control, configuration management, dependency management – those things aren’t sexy but we’ve got to get it right.

Some of our teams though have been doing work in a particular way for 15 years; it’s extraordinarily hard to change these indoctrinated patterns. What we are finding is, we succeed if we show we are adding value. Even with these long-standing teams, once they see how a stable release pipeline can eliminate so much repetitive work from their lives, we begin to make some progress.

We are a little different in that there was no trumpet call of “doing DevOps” from on high – instead it was crowdsourced. Over the past 5 years, different teams in the org have independently found a need to deliver products and services to the org at a faster cadence. It’s been said that software is about two things – building the right thing and building the thing right. My group’s mission is all about that second part – we provide the framework, all the tools, platforms, architectural patterns and guidance on how to deliver cheaper, faster, smarter.

The big picture that’s changed for us as a company is the realization that these big-bang, waterfall mega-events – shipping everything in 9 months or more – just don’t cut it anymore. We used to do those vast releases – a huge flow of bits like water; we called it a tsunami release. Well, just like with a real tsunami, there’s a wave of devastation after the delivery of these large platforms all at once that can take months of cleanup. We’ve changed from tsunami thinking to ripples, with much faster, more frequent releases.

When the team first started up in 2012, the first thing we noticed was that everything was manual. And I mean everything – change requests, integration activity, testing. There were lots of handoffs, lots of Conway’s Law at work.

So we started with the basics. For us, that was getting version control right – starting with basic hygiene practices, doing things in ways that decouple you from the way the software is being delivered. Just as an example, we used to label releases based on the year, quarter, and month a release was targeted for. So if suddenly a feature wasn’t needed – just complete integration hell. Lots of merges, lots of drama as we were backing things out. So we moved toward semantic versioning, where products are versioned regardless of when they’re delivered. Since this involved dozens of products and a lot of reorganization, getting version control right took the better part of 6 months for us. But it absolutely was the ground level for us being able to go fast.
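The shift Jon describes – from date-based release labels to semantic versioning – can be sketched in a few lines. This is an illustrative Python sketch, not Humana’s tooling; the `parse_semver` helper is hypothetical:

```python
def parse_semver(version: str) -> tuple:
    """Split a MAJOR.MINOR.PATCH string into a tuple of ints for comparison."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# A date-based label like "2018-Q2" describes when a release was *scheduled*,
# so pulling a feature means renaming and re-merging everything targeted at it.
# A semantic version describes the product itself, independent of delivery date:
assert parse_semver("2.10.0") > parse_semver("2.9.4")  # numeric, not lexical
```

Comparing as tuples of ints matters: a naive string comparison would put "2.9.4" after "2.10.0".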

Next up was fixing the way the devs worked. We had absolutely no confidence in the build process because it was xcopy manual deployments – so there was no visibility, no accountability, and no traceability. This worked great for the developers, but was terrible for everyone else having to struggle with “it works on my machine!” So continuous integration was the next rung on the ladder, and we started with a real enterprise build server. Getting to a common build system was enormously painful for us; don’t kid yourself that it’s easy. It exposed, application by application, all the gaps in our version control – a lot of hidden work we had to race to keep ahead of. But once the smoke cleared, we’d eliminated an entire category of work. Now version control was the source of truth, and the build server artifacts were reliable and complete. Finally, we had a repeatable build system we could trust.

The third rung of the ladder was configuration management. It took some bold steps to get our infrastructure under control. Each application had its own unique and beautiful configuration, and no two environments were alike – dev, QA, test, production, they were all different. Trying to figure out where these artifacts were and what the proper source of truth was required a lot of weekends playing “Where’s Waldo”! Introducing practices like configuration transforms gave us confidence we could deploy repeatedly and get the same behavior and it really helped us enforce some consistency. The movement toward a standardized infrastructure – no snowflakes, everything the same, infrastructure as code – has been a key enabler for fighting the config drift monster.
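As a rough illustration of what a configuration transform does (the keys and values below are hypothetical, not Humana’s actual setup): a shared base configuration is merged with a small set of per-environment overrides, so every environment is identical except where it explicitly differs:

```python
def apply_transform(base: dict, overrides: dict) -> dict:
    """Return the base config with environment-specific keys overridden."""
    merged = dict(base)      # copy so the shared base stays untouched
    merged.update(overrides)
    return merged

# One shared base config...
base = {"db_host": "localhost", "log_level": "INFO", "timeout_s": 30}

# ...and a minimal, explicit delta per environment, instead of
# four hand-edited snowflake configs drifting apart over time.
prod = apply_transform(base, {"db_host": "db.prod.internal", "log_level": "WARN"})
```

The point is that the delta is the only thing reviewed per environment; everything not overridden is guaranteed to match the base.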

The data layer has been one of the later pieces of the puzzle for us. With our move to the cloud, we can’t wait for a thumbs-up from a DBA working apart from the team. So teams are putting their databases under version control, building and generating deployable packages through DACPACs or ReadyRoll, and the data layer just becomes another part of the release pipeline. I think over time that traditional role of the DBA will change, and we’ll see each team having a data steward and possibly a database developer – it’s still a specialized need, and we need to know when a data type change will cause performance issues, for example – but the skillset itself will get federated out.

Using feature toggles changes the way we view change management. We’ve always viewed delivery as the release of something. Now we can say the deployment and the release are two different activities. Just because I deploy something doesn’t mean it has to be turned on. We used to view every release as a change, which meant we needed to manage it as a risk. Feature toggles flip the switch on this: deployments can happen early and often, and releases can happen at a different cadence that we can control, safely. What a game-changer that is!
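A minimal Python sketch of the idea – the flag store and function names are hypothetical, but it shows how deployed code can sit dark in production until a toggle releases it:

```python
# Hypothetical toggle store; in practice this would live in config
# or a flag service, so it can change without a redeploy.
FEATURE_FLAGS = {
    "new_claims_ui": False,  # deployed to production, but not yet released
    "fast_search": True,     # deployed and released
}

def is_enabled(flag: str) -> bool:
    """Deployment put the code in production; this decides whether it runs."""
    return FEATURE_FLAGS.get(flag, False)

def search(query: str) -> str:
    # Both code paths ship together; the toggle picks which one users see.
    if is_enabled("fast_search"):
        return f"fast results for {query}"
    return f"legacy results for {query}"
```

Decoupling deployment from release this way turns “releasing” into flipping a flag – which can also be reverted instantly, without rolling back a deployment.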

COTS products and DevOps totally go together. Think about it from an ERP perspective – where you need to deliver customizations to an ERP system, or whatever BI platform you’re using. The problem is, these systems weren’t designed in most cases to be delivered in an agile fashion. These are all big-bang releases, with lots of drama, where any kind of meaningful customization is near taboo because it’ll break your next release. To bridge this gap, we tell people not to change but to add – add your capabilities and customizations as a service, and then invoke them through a middleware platform. So you don’t change something that exists; you add new capabilities and point to them.

Gartner’s concept of bimodal IT is one I struggle with, quite frankly. It’s true you can’t have a one-size-fits-all risk management strategy – you don’t want a lightweight website going through the long review period you might need for a legacy mainframe system of record, for example. But the whole concept that you have this bifurcated path – one team moving at a fast pace, another core system at a glacial pace – is just a copout, I think, an excuse to avoid the modern expectations of platform delivery.

We do struggle with long-lived feature branches. It’s a recurring pain point for us – we call it the integration credit card that teams charge to, and it inevitably leads to drama at release time and some really long weekends. In a lot of cases the team knows this is bad practice and definitely wants to avoid it, but because of cross dependencies we end up with these long-lived branches. The other issue is contention, which is usually an architecture issue. We’re moving toward one repo, one build pipeline, and decomposing software down to its constituent parts to try to reduce this, but decoupling these artifacts is not an overnight kind of thing.

The big blocker for most organizations seems to be testing. Developers want to move at speed, but the way we test – usually manually – and our lack of investment in automated unit tests creates long test cycles, which in turn spawn long-lived release branches. The obvious antidote is feature toggles, which decouple deployment from delivery.

I gave a talk a few years back called “King Tut Testing” where we used Mike Cohn’s testing pyramid to talk about where we should be investing in our testing. We are still in the process of inverting that pyramid – moving away from integration testing, lessening functional testing, and fattening up that unit testing layer. A big part of the journey for us is designing architectures so that they are inherently testable, mockable. Personally, I’m more interested in test-driven design than test-driven development, because it forces me to think in terms of: how am I going to test this? What are my dependencies, and how can I fake or mock them so that the software is verifiable? The carrot I use in convincing teams to invest in unit testing is that not only is it your safety net, it’s a living, breathing definition of what the software does. So for example, when you get a new person on the team, instead of weeks of manual onboarding, you use the working test harness to introduce them to how the software behaves and give them a comfort level in making modifications safely.
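The “testable, mockable” design Jon describes can be illustrated with a small Python sketch (the `ClaimsService` and its repository are hypothetical examples, not Humana code): the service takes its dependency as a constructor argument, so a unit test can substitute a fake and never touch a real database:

```python
from unittest.mock import Mock

class ClaimsService:
    """Depends on an abstract repository rather than a concrete database,
    so the dependency can be faked in a fast unit test."""
    def __init__(self, repo):
        self.repo = repo

    def total_paid(self, member_id: str) -> float:
        return sum(c["amount"] for c in self.repo.claims_for(member_id))

# The mock stands in for the database. The test runs in milliseconds and
# doubles as living documentation: total_paid sums the member's claim amounts.
repo = Mock()
repo.claims_for.return_value = [{"amount": 120.0}, {"amount": 80.0}]
assert ClaimsService(repo).total_paid("M123") == 200.0
```

Designing for the test first (what do I depend on, how do I fake it?) is what pushes the dependency out to the constructor in the first place.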

The books don’t stress enough how difficult this is. There’s just not the ROI to support creating a fully functional set of tests with a brownfield software package in most cases. So you start with asking, where does this hurt most? – using telemetry or tools like SonarQube. And then you invest in slowing down, then stopping the bleeding.

Operations support in many organizations tends to be more about resource utilization and cost accounting – how do I best utilize this support person so he’s 100% busy? And we have ticketing systems that create a constant stream of work and activity for Operations. The problem with this siloed thinking is that the goal is no longer developing the best software possible and providing useful feedback; it’s now closing a ticket as fast as possible. We’re shifting that model with our move to microservices, to teams that own the product and are responsible for maintaining and supporting it end to end.

Lots of vendors are trying to sell DevOps In A Box – buy this product, and magic will happen. But they don’t like to talk about all the unsexy things that need to be done to make DevOps successful – four years to clean up version control, for example. It’s kind of a land grab right now with tooling – some of those tools work great in unicorn space but not so well with teams that have been relying on long-lived feature branches.

Every year we do an internal DevOps Day, and that’s been so great for us in spreading enthusiasm. I highly recommend it. The subject of the definition of DevOps inevitably comes up. We like Donovan Brown’s definition and that’s our standard – one of the things I will add is, DevOps is an emergent characteristic. It’s not something you buy, not something you do. It’s something that emerges from a team when you are doing all the right things behind the scenes, and these practices all work together and support each other.

There are lots of metrics to choose from, but two stand out – and they’re not new or shocking: lead time and cycle time. Those two are the standard we always fall back on, and the only way we can tell if we’re really making progress. They won’t tell us where we have constraints, but they do tell us which parts of the org are having problems. We go after those with every fiber of our effort. There are other line-of-sight metrics, but those two are dominant in determining how things are going.

We do value stream analysis and map out our cycle time, our wait time, and our handoffs. It’s an incredibly useful tool – a bucket of cold water right to the face – because it exposes the ridiculous amount of effort being wasted on doing things manually. That exercise has been critical in helping prove why we need to change the way we do things. It’s specific and quantitative – people see the numbers and immediately get why waiting two weeks for someone to push a button is unacceptable. Until they see the numbers, it always seems to be emotional.
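A toy value-stream calculation along these lines (the steps and numbers are invented for illustration, not Humana’s data) shows how quickly the waste jumps out once you quantify touch time versus wait time:

```python
# Hypothetical value-stream steps: (name, touch hours, wait hours)
steps = [
    ("code review",      2, 16),
    ("manual deploy",    1, 80),  # two work-weeks waiting for a button push
    ("manual test pass", 8, 40),
]

touch = sum(t for _, t, _ in steps)   # hours spent actually adding value
wait = sum(w for _, _, w in steps)    # hours spent waiting on handoffs
efficiency = touch / (touch + wait)   # fraction of elapsed time adding value

print(f"touch={touch}h wait={wait}h efficiency={efficiency:.0%}")
```

Single-digit process efficiency is exactly the kind of number that moves the conversation from emotional to quantitative.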

A consistent definition of done – well, we’re getting there. Giving people 300-page binders, or a checklist, or templated tasks so developers have to check boxes – we’ve tried them all, and they’re just not sustainable. The model that seems to work is one where the team is self-policing, where a continuous review is happening of what other people on the team are doing. That kind of group accountability is so much better than any checklist. You have to be careful, though – it’s successful only if the culture treats these reviews as a learning opportunity, a public speaking opportunity, a chance to show and tell. In the wrong culture, peer reviews or code demos become a kind of group beat-down where we criticize and nitpick other people’s investment.

A DevOps team isn’t an antipattern like people say. Centralizing the work is not scalable – that is definitely an antipattern. But I love the mission our team has: enabling other groups to go faster. It’s kind of like being a consulting team – architectural guidance, consulting, and practices. It’s incredibly rewarding to help foster this growing culture within our company; we’re seeing a kind of organic center of excellence spring up.

What I like to tell people is, be like the best developers out there, and be incredibly selfish and lazy. If you’re selfish, you invest in yourself – improving your skillset, in the things that will give you a long-term advantage. If you’re lazy, you don’t want to work harder than you have to. So you automate things to save yourself time. Learning and automation are two very nice side effects of being lazy and selfish, and it’s a great survival trait!