Starting and Scaling DevOps in the Enterprise – review

As many of you know, I’m a huge fan of the work Gary Gruver has done – in particular his book “Leading the Transformation” on his experiences at HP trying to transform a very traditional enterprise. (See my earlier mention of his book on this blog, here.) His newest work is out – Starting and Scaling DevOps in the Enterprise. I am recommending it very highly to all my customers that are following DevOps! I think it’s unique – by far the best I’ve read so far when it comes to putting together specific metrics and the questions you’ll need to ask when setting your priorities.

Gary notes that there are three types of work in an enterprise:

  1. New work – Creating new features or integrating/building new applications
    1. New work can’t be optimized (too much in flux)
    2. Best you can hope for here is to improve the feedback loop so you’re not wasting time polishing features that are not needed (50%+ in most orgs!)
  2. Triage – finding the source of defects and resolving
    1. Here DevOps can help by improving the level of automation. Smaller batch sizes mean fewer changes to sort through when bugs crop up.
  3. Repetitive – provisioning environments, building, testing, configuring the database or firewall, etc.
    1. More frequent runs, smaller batches, feedback loop improved. All the DevOps magic really happens in #2 and #3 above as these are the most repetitive tasks.

Notice that across the three types above, the issues could be in one of five places:

  1. Development
    1. Common pain point here is Waterfall planning (i.e. a bloated, aging requirements inventory)
  2. Building Test Environments
    1. Procurement hassles across server, storage, networking, firewall. Lengthy handoffs between each of these teams and differing priorities.
    2. Horror story – 250 days for one company to attempt to host a “Hello World” app. It took them just 2 hours on AWS!
  3. Testing and Fixing Defects – typically QA led
    1. Issues here with repeatability of results (i.e. false positives caused by the test harness, environment, or deployment process)
    2. Often the greatest pain point, due to reliance on manual tests causing lengthy multi-week test cycles, and the time it takes to fix the defects discovered.
  4. Production Deployment – large, cross org effort led by Ops
  5. Monitoring and Operations

The points above are why you can’t just copy the rituals from one org to another. For any given company, your pain points could be different.


So, how do we identify the exact issue with YOUR specific company?

  1. Development (i.e. Requirements)
    1. Metrics:
      1. What % of time is spent in planning and documenting requirements?
      2. How many man-hours of development work are currently in the inventory for all applications?
      3. What % of delivered features are being used by customers and fit the expected results?
    2. An important note here – organizations often commit 100% of dev resources to address work each sprint. This is terrible as a practice and means that the development teams are too busy meeting preset commitments to respond to changes in the marketplace or discoveries during development. The need here is for education – to teach the business to be reasonable in what they expect, and how to shape requirements so they describe the actual minimum functionality needed to support their business decisions. (Avoid requirements bloat due to overzealous business analysts/PMs, for example!)

  2. Provisioning environments
    1. Metrics:
      1. How much time does it take to provision environments (on avg)?
      2. How many environments are requested per month/sprint?
      3. % of time these environments require manual fixing before they are complete
      4. % of defects associated with non-code – i.e. environments, deployments, data layer, etc.
    2. The solution here for provisioning pinch points is infrastructure as code. Here there is no shortcut other than developers and IT/operations working together to build a working set of scripts to recreate environments and maintaining them jointly. This helps with triage as changes to environments now show up clearly in source control, and prevents DEV-QA-STG-PROD anomalies as it limits variances between environments.
    3. It’s critical here for Dev and Ops to use the same tool to identify and fix issues. Otherwise, expect strong us-vs-them backlash and friction.
    4. This requires the organization to have a strong investment in tooling and think about their approach – especially with simulators/emulators for companies doing embedded development.
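
To make the "limits variances between environments" point concrete, here is a minimal sketch of what comparing two environment definitions could look like once they live in source control. It is only an illustration: the `diff_environments` helper and the settings in `qa` and `prod` are hypothetical names I made up, and real infrastructure-as-code tooling will have its own formats.

```python
def diff_environments(baseline, other):
    """Return {setting: (baseline_value, other_value)} for every setting that differs.

    A setting missing from one side is reported as None on that side.
    """
    drift = {}
    for key in sorted(set(baseline) | set(other)):
        left, right = baseline.get(key), other.get(key)
        if left != right:
            drift[key] = (left, right)
    return drift


# Hypothetical environment definitions, as they might be checked in to source control.
qa = {"os_image": "ubuntu-22.04", "db_version": "14.2", "firewall_ports": [443]}
prod = {"os_image": "ubuntu-22.04", "db_version": "13.9", "firewall_ports": [443, 8443]}

drift = diff_environments(qa, prod)
for setting, (qa_value, prod_value) in drift.items():
    print(f"{setting}: QA={qa_value!r} PROD={prod_value!r}")
```

An empty result would mean the two definitions match. Anything else is a candidate explanation for a defect that only shows up in one environment.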

  3. Testing
    1. Metrics
      1. What is the time it takes to run a full set of tests?
      2. How repeatable are these? (i.e. what’s the % of false errors)
      3. What % of defects are found with testing (either manual, automated, or unit testing)?
      4. What is the time it takes to approve a release?
      5. What’s the frequency of releases?
    2. In many organizations this is the most frequent bottleneck – the absurd amount of time it takes to complete a round of tests with a reasonable expectation the release will work as designed. These tests must run in hours, not days.
    3. You must choose a well-designed automation framework.
    4. Development is going to have to change their practices so the code they write is testable. And they’ll need to commit to making build stability a top priority – bugs are equal in priority to (if not higher than) tasks/new features.
    5. This is the logical place to start for most organizations. Don’t just write a bunch of automated tests – instead, start with just a few automated Build Acceptance Tests (BATs) that will provide a base level of stability. Watch these carefully.
      1. If the tests reveal mostly issues with the testing harness, tweak the framework.
      2. If the tests are finding mostly infrastructure anomalies, you’ll need to create a set of post-deployment tests to check on the environments BEFORE you run your gated coding acceptance test. (i.e. fix the issues you have with provisioning, above).
      3. If you’re finding coding issues or anomalies – congrats, you’re in the sweet spot now!
    6. Horror story here – one company boasted of thousands of automated tests. However, these were found to be neither stable nor maintainable, and had to be junked.
    7. Improve and augment these BATs over time so your trunk gradually moves closer to near-production, release-ready quality.
      1. Issue – what about that “hot” project needed by the business (which generally arrives with a very low level of quality due to high pressure)?
        1. Here the code absolutely should be folded into the release, but not exposed to the customer until it fits the new definition of done: “All the stories are signed off, automated testing in place and passing, and no known open defects.”
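
The "watch these carefully" advice for the Build Acceptance Tests boils down to a simple decision rule, which can be sketched as below. This is my own illustration, not something from the book: it assumes each red run has already been given a root cause by hand, and the category names and the `recommend` helper are hypothetical.

```python
from collections import Counter

# Next step for each failure source, paraphrasing the three cases listed above.
NEXT_STEP = {
    "harness": "tweak the test framework",
    "infrastructure": "add post-deployment environment checks before the gated tests",
    "code": "sweet spot: the tests are finding real coding issues",
}


def dominant_failure_source(failure_sources):
    """Return the most common failure source, or None if nothing failed."""
    counts = Counter(failure_sources)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def recommend(failure_sources):
    """Map the dominant failure source to a suggested next step."""
    source = dominant_failure_source(failure_sources)
    if source is None:
        return "no failures recorded"
    return NEXT_STEP.get(source, "unclassified: triage by hand")


# Hypothetical root causes recorded for one week of red BAT runs.
week = ["infrastructure", "harness", "infrastructure", "code", "infrastructure"]
print(recommend(week))
```

The point of the sketch is that the suite's value depends on what it is actually catching. If most red builds trace back to the harness or the environments, adding more tests only adds noise.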

  4. Release to Production
    1. If a test cycle takes 6 weeks to run, and management approval takes one day – improving this part just isn’t worth it. But if you’re trying to do multiple test cycles a week and this is the bottleneck, absolutely address this with managers that are lagging in their approval or otherwise not trusting the gated testing you’re doing.
    2. Metrics
      1. Time and effort to release to production
      2. Number of issues found categorized by source (code, environment, deployment process, data, etc)
      3. Number of issues total found in production
      4. MTTR – mean time to restore service
      5. # of green builds a day
      6. Time to recover from a red build
      7. % of features requiring rework before acceptance
      8. Amt of effort to integrate code from the developers into a buildable release
    3. For #1-4 – Two areas that can help here are feature toggling (which you’ll be using anyway), and canary releases where key pieces of new functionality are turned on for a subset of users to “test in production.”
    4. For #5-6 – here Continuous Integration is the healer. This is where you avoid branching by versioning your services (and even the database – see the book Refactoring Databases by Scott Ambler and Pramod Sadalage).
    5. For #7-8 – If you’re facing a lot of static here, a scrum/agile coach will likely help significantly.
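
Feature toggles and canary releases are easier to picture with a little code. The sketch below shows one common way to do it, assuming users are assigned to a stable bucket by hashing their ID. The `TOGGLES` table, the feature names, and the `in_canary` and `is_enabled` helpers are all hypothetical, and a real toggle service would add auditing, overrides, and a kill switch.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically decide whether a user falls in a feature's canary group."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


# Hypothetical toggle table: feature name -> percent of users who see it.
# 0 means the code is deployed but hidden from every customer.
TOGGLES = {"new_checkout": 10, "dark_launch_search": 0}


def is_enabled(feature: str, user_id: str) -> bool:
    """Unknown features are treated as switched off."""
    return in_canary(user_id, feature, TOGGLES.get(feature, 0))


users = [f"user-{n}" for n in range(1000)]
exposed = sum(is_enabled("new_checkout", u) for u in users)
print(f"{exposed} of {len(users)} users see new_checkout")
```

Because the bucket comes from a hash rather than a random draw, a given user gets the same answer on every request, and raising the percentage only ever adds users to the group. A toggle set to 0 is also how that "hot" project can ship in the release without being exposed until it meets the definition of done.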


So – how to win, once you’ve identified the pain points? You begin by partitioning the issue:

  • Break off pieces that are tightly coupled from those that are not developed/tested/deployed as a unit (e.g. HR or Purchasing processes).
  • Segment these into business critical and non-business critical.
  • Split these into tightly coupled monoliths with common code sharing requirements vs microservices (small, independent teams a la Amazon). The reality is – in most enterprises there are very valid reasons why these applications were built the way they are. You can’t ignore this complexity, much as we’d like to say “microservices everywhere!”

I really admire Gary’s very pragmatic approach, as it doesn’t try to accomplish large, difficult things all at once but focuses on winnable wars at a company’s true pain points. Instead of trying to force large, tightly coupled organizations to work like loosely coupled orgs – you need to understand the complex systems and determine together how to release code more frequently without sacrificing quality. Convince these teams of DevOps principles.


Culture and Agile – GM and its failed attempt to mimic Toyota

I had a good friend of mine – Mark Taylor – recommend some listening material recently on GM. I’ve been fascinated with Toyota since I first started learning about Agile development practices, and this podcast definitely was worth the time to listen. It’s a fascinating story. Why was Toyota so willing to be so open and revealing with one of its biggest competitors – GM – on its higher quality production processes? Turns out there’s a lot more to making cars than just an assembly line.

This isn’t just history. All successful companies hit a moment of complacency. For people who are interested in improving the quality of their working life – whatever the field – there are some real lessons here. (And, if you’re still not convinced, think of all the billions of your taxpayer dollars that had to go into bailing out American car companies after they went bankrupt!)

Some thoughts I had – in outline form – from this:

  • Culture Matters (are your teams top down or horizontal?)
    • “Back home in Fremont, GM supervisors ordered around large groups of workers. At the Takaoka plant, people were divided into teams of just four or five, switched jobs every few hours to relieve the monotony, and a team leader would step in to help whenever anything went wrong.”
  • Stopping The Line With Defects (how do you handle bugs?)
    • “I can’t remember any time in my working life where anybody asked for my ideas to solve the problem. And they literally want to know. And when I tell them, they listen, and then suddenly they disappear, and somebody comes back with the tool that I just described. It’s built, and they say try this. Under the Toyota system, everyone’s expected to be looking for ways to improve the production process all the time, to make the workers’ job easier and more efficient, to shave extra steps and extra seconds off each worker’s job. To spot defects in the cars and the causes of those defects. This is the Japanese concept of kaizen, continuous improvement. When a worker makes a suggestion that saves money, he gets a bonus of a few hundred dollars or so…. And if you look around the Toyota plant, you can see the result of all those improvements. Hanging shelves that travel along with the car and the worker, carrying the parts and bolts they need within easy reach. Special cushions they throw into the car frames when they have to kneel inside. Workers’ tasks have been streamlined to the fewest possible steps, each step timed down to the second.”
    • In contrast, in GM plants, workers could never stop the line – because they’re lazy, you know? “So now we tell the plant floor, don’t you worry about the production volume. You worry about quality. The last thing we want is to have a lot of defects flowing down the line that we have to repair later.”
  • It Takes Brains – You Can’t Just Mimic
    • (after a failed transplant) “For this workforce, there were no trips to Japan, no tearful sushi parties. And from the start, workers were skeptical…. This was one of the biggest differences between Fremont and Van Nuys. Van Nuys hadn’t been shut down. Turns out it’s a lot easier to get workers to change if they’ve lost their jobs, and then you offer them back. Without that, many union members just saw the Toyota system as a threat.”
    • “…much of the Japanese system happened off the factory floor, it answered something that had never quite made sense to {one of the managers}. Why had Toyota been so open with GM in showing its operations? We didn’t understand this bigger picture thing. All of our questions were focused on the floor, you know? The assembly plant. What’s happening on the line. That’s not the real issue. The issue is, how do you support that system with all the other functions that have to take place in the organization?”
    • “I remember one of the GM managers was ordered from a very senior level– it came from a vice president– to make a GM plant look like NUMMI. And he said, I want you to go there with cameras and take a picture of every square inch. And whatever you take a picture of, I want it to look like that in our plant. There should be no excuse for why we’re different than NUMMI, why our quality is lower, why our productivity isn’t as high, because you’re going to copy everything you see. Immediately, this guy knew that was crazy. We can’t copy employee motivation. We can’t copy good relationships between the union and management. That’s not something you can copy, and you can’t even take a photograph of it.”
  • It’s Not Just The Assembly Line
    • “The team concept stressed continuous improvement. If a team got a shipment of parts that didn’t fit, they’d alert their bosses, who’d then go to the suppliers to fix the problem. Sometimes they’d realize the problem was in the part’s design, and Toyota engineers would go back to the drawing board and remake the part to address the problem workers were having on the assembly line. All the departments in the company worked together. …. But Ernie’s suppliers had never operated in a system like that. If he asked for fixes, they blew him off. And if he called Detroit and asked them to redesign a part that wasn’t working, they’d ask him, why was he so special? They didn’t have to change it for any other plant. Why should they change it for him?”
  • The High Cost of Complacency
    • “One of the ironies of GM was that in the moment it went bankrupt, it was probably a better company than it had ever been. In the factories, they had really dramatically closed the productivity gap that they had had for many, many years. And on the new products, they have much better quality. So the company that failed was actually doing better than it had ever done. But it was too late, and that’s really sort of hard to forgive– that if you take 30 years to figure it out, chances are you’re going to get run over. And they got run over.”
    • “They sold junk for a while. Just any kind of piece of crap they could roll out there, they did. And they paid a tremendous price for it. And even when they turned the corner in quality, people didn’t trust them. They’d say, well, gee, they’re building a good car now. Why aren’t they buying them?”

Give Yourself Nine Months to Fail.

(Note – this is a Greatest Hits posting from my previous blog. Enjoy!)

Babies aren’t born in one month.

Implementing Scrum Means Making Mistakes. Lots and Lots of Mistakes.

When I started at my current employer – even after nine months as a team lead – I had very little to boast about by way of making change. I remember hearing a presentation from another manager that had the title, “Keeping The Lights On” – WOW! – and honestly that was how I felt about my job. Keeping the lights on, reacting to events – not getting ahead of them, and not able to control them. I was very disconnected from the work my team was doing. This changed as we moved out developers that were not contributing to the team and not being transparent about their work; and, as we got new projects coming in, I could cherry-pick the fun ones and start participating in writing specifications and deploying solutions. Beyond taking on new work, though, Agile is the biggest reason why I’m still around. Without it, I’d be like the manager at my previous company – completely isolated from the daily work my team is doing, trying to defend our existence without the facts I need to prove that we’re delivering value.

I started thinking about my company – which seems to love mountains – and how every company’s definition of Agile is a little different. At the keynote I met an old compatriot – we had worked together on a failed Agile project. Everyone hated the DSUs (daily stand-ups), which were 15+ minutes long; there was no target in sight since releases were pushed out to “never”; and we went through constant rewrites as the technical team kept refactoring working code to get it “perfect”… it was a case study in how to do Agile wrong. After 18 months of development, they had to scrap the entire project and outsource it to an offshore team – not one line of code ever saw the light of day. I believe a big reason why we failed was that we tried to change everything at once – and the team never gelled or considered itself invested in the outcome. In contrast, by doing things step by step – and rolling back when things weren’t working – we were successful in my current assignment. The path below took almost two years to implement – but it was done with the team setting the pace, and almost by accident we reached our goals.

I started out by talking about the fears I felt after a few months on the job. Overwhelmed, disconnected. I said, “I feel at times like I’m not as much in control as I need to be. I’m not in command of all the facts I need to support my case. I don’t have enough visibility of what’s going on across the organization. I’m not giving my team all the tools and resources they need to thrive. And I’m not providing enough proof of delivering value aligned with my company’s priorities.”