DevOps

Starting and Scaling DevOps in the Enterprise – review

As many of you know, I'm a huge fan of the work Gary Gruver has done – in particular his book "Leading the Transformation" on his experiences at HP trying to transform a very traditional enterprise. (See my earlier mention of his book on this blog, here.) His newest work is out – Starting and Scaling DevOps in the Enterprise – and I'm recommending it very highly to all my customers who are following DevOps! I think it's unique – by far the best I've read so far when it comes to putting together specific metrics and the questions you'll need to ask when setting your priorities.

Gary notes that there are three types of work in an enterprise:

  1. New work – Creating new features or integrating/building new applications
    1. New work can't be optimized (it's too much in flux).
    2. The best you can hope for here is to improve the feedback loop so you're not wasting time polishing features that aren't needed (50%+ of features in most orgs!)
  2. Triage – finding the source of defects and resolving them
    1. Here DevOps can help by improving the level of automation. Smaller batch sizes mean fewer changes to sort through when bugs crop up.
  3. Repetitive – provisioning environments, building, testing, configuring the database or firewall, etc.
    1. More frequent runs, smaller batches, and an improved feedback loop. All the DevOps magic really happens in #2 and #3 above, as these are the most repetitive tasks.

Notice that for the three types above, the issues could sit in one of five places:

  1. Development
    1. A common pain point here is waterfall planning – i.e. a bloated, aging requirements inventory.
  2. Building Test Environments
    1. Procurement hassles across server, storage, networking, and firewall teams – lengthy handoffs between each of these teams, all with differing priorities.
    2. Horror story – 250 days for one company to attempt to host a “Hello World” app. It took them just 2 hours on AWS!
  3. Testing and Fixing Defects – typically QA led
    1. Issues here with repeatability of results (i.e. false positives caused by the test harness, environment, or deployment process)
    2. Often the greatest pain point, due to reliance on manual tests causing lengthy multi-week test cycles, and the time it takes to fix the defects discovered.
  4. Production Deployment – large, cross org effort led by Ops
  5. Monitoring and Operations

The points above are why you can’t just copy the rituals from one org to another. For any given company, your pain points could be different.

 

So, how do we identify the exact issue with YOUR specific company?

  1. Development (i.e. Requirements)
    1. Metrics:
      1. What % of time is spent in planning and documenting requirements?
      2. How many man-hours of development work are currently in the inventory for all applications?
      3. What % of delivered features are being used by customers and meet the expected results?
    2. An important note here – organizations often commit 100% of dev resources to work each sprint. This is a terrible practice and means the development teams are too busy meeting preset commitments to respond to changes in the marketplace or discoveries made during development. The need here is for education – teaching the business to be reasonable in what it expects and to shape requirements down to the actual minimum functionality needed to support its business decisions. (Avoid requirements bloat from overzealous business analysts/PMs, for example!)

  2. Provisioning environments
    1. Metrics:
      1. How much time does it take to provision environments (on average)?
      2. How many environments are requested per month/sprint?
      3. % of time these environments require manual fixing before they are complete
      4. % of defects associated with non-code – i.e. environments, deployments, data layer, etc.
    2. The solution for provisioning pinch points is infrastructure as code (a minimal sketch follows this list). There is no shortcut here other than developers and IT/operations working together to build a working set of scripts to recreate environments, and maintaining them jointly. This helps with triage, since changes to environments now show up clearly in source control, and it prevents DEV-QA-STG-PROD anomalies by limiting variance between environments.
    3. It's critical here for Dev and Ops to use the same tool to identify and fix issues. Otherwise you get a strong us-vs-them backlash and friction.
    4. This requires the organization to invest heavily in tooling and to think through its approach – especially around simulators/emulators for companies doing embedded development.
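To make the "environments as code" idea concrete, here's a minimal, hypothetical sketch: the environment is described as data that lives in source control, and an idempotent script converges the machine toward that description instead of someone patching it by hand. The environment name, package list, and converge logic below are illustrative assumptions, not any particular tool's API – real teams would express this in Chef, Puppet, Ansible, Terraform, or similar.

```python
# Hypothetical "infrastructure as code" sketch. Everything here is illustrative;
# the point is that the desired state is data in source control, so changes to
# it show up in the same history as the application code.
ENVIRONMENT_SPEC = {
    "name": "qa-web-01",                      # assumed environment name
    "packages": {"nginx", "openjdk-11-jre"},  # assumed package list
    "open_ports": {80, 443},
}

# Pretend inventory of what is actually on the box right now.
current_state = {"packages": {"nginx"}, "open_ports": {80}}

def converge(spec, state):
    """Act only on drift between the spec and reality (idempotent apply)."""
    for pkg in sorted(spec["packages"] - state["packages"]):
        print(f"[{spec['name']}] would install package: {pkg}")
    for port in sorted(spec["open_ports"] - state["open_ports"]):
        print(f"[{spec['name']}] would open firewall port: {port}")

converge(ENVIRONMENT_SPEC, current_state)
```

Because the spec is versioned, a DEV/QA/STG/PROD difference becomes a diff you can read, not a mystery you have to reverse-engineer.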

  3. Testing
    1. Metrics
      1. What is the time it takes to run a full set of tests?
      2. How repeatable are these? (i.e. what’s the % of false errors)
      3. What % of defects are found with testing (manual, automated, or unit testing)?
      4. What is the time it takes to approve a release?
      5. What’s the frequency of releases?
    2. In many organizations this is the most frequent bottleneck – the absurd amount of time it takes to complete a round of tests with a reasonable expectation the release will work as designed. These tests must run in hours, not days.
    3. You must choose a well-designed automation framework.
    4. Development is going to have to change its practices so the code it writes is testable. And it'll need to commit to making build stability a top priority – bugs are equal in priority to (if not higher than) tasks and new features.
    5. This is the logical place to start for most organizations. Don't just write a bunch of automated tests – start with just a few automated Build Acceptance Tests (BATs) that provide a base level of stability (a small sketch follows this list). Watch these carefully.
      1. If the tests reveal mostly issues with the testing harness, tweak the framework.
      2. If the tests are finding mostly infrastructure anomalies, you’ll need to create a set of post-deployment tests to check on the environments BEFORE you run your gated coding acceptance test. (i.e. fix the issues you have with provisioning, above).
      3. If you’re finding coding issues or anomalies – congrats, you’re in the sweet spot now!
    6. Horror story here – one company boasted of thousands of automated tests. However, these turned out to be neither stable nor maintainable, and had to be junked.
    7. Improve and augment these BATs over time so your trunk gradually moves closer to release-ready, near-production quality.
      1. Issue – what about that "hot" project needed by the business (which generally arrives with a very low level of quality due to high pressure)?
        1. Here the code absolutely should be folded into the release, but not exposed to the customer until it fits the new definition of done: “All the stories are signed off, automated testing in place and passing, and no known open defects.”
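To show how small that first cut of Build Acceptance Tests can be, here's a hedged pytest-style sketch. It checks the environment first, so an infrastructure anomaly fails fast and separately from a code defect, then runs a couple of base-level acceptance checks. The base URL, endpoints, and expected content are invented for the example.

```python
# Hypothetical Build Acceptance Tests (pytest-style). The URL, endpoints, and
# expected content below are placeholders invented for illustration.
import requests

BASE_URL = "http://qa-web-01.example.internal"   # assumed QA environment

def test_environment_is_reachable():
    """Post-deployment environment check: if this fails, the problem is
    provisioning or deployment - not the application code."""
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200

def test_login_page_renders():
    """Base-level acceptance: the build isn't 'good' unless the most
    critical page comes back at all."""
    resp = requests.get(f"{BASE_URL}/login", timeout=5)
    assert resp.status_code == 200
    assert "Sign in" in resp.text                # assumed page content

def test_core_transaction_happy_path():
    """One small happy-path business transaction, deliberately kept tiny."""
    resp = requests.get(f"{BASE_URL}/api/accounts/12345", timeout=5)
    assert resp.status_code == 200
    assert resp.json().get("accountId") == "12345"
```

If runs of a suite like this mostly flag the harness or the environment rather than the code, that tells you where to invest next – exactly as the list above describes.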

  4. Release to Production
    1. If a test cycle takes 6 weeks to run and management approval takes one day, improving this part just isn't worth it. But if you're trying to do multiple test cycles a week and this is the bottleneck, absolutely address it with the managers who are lagging in their approvals or otherwise not trusting the gated testing you're doing.
    2. Metrics
      1. Time and effort to release to production
      2. Number of issues found categorized by source (code, environment, deployment process, data, etc)
      3. Number of issues total found in production
      4. MTTR – mean time to restore service
      5. # of green builds a day
      6. Time to recover from a red build
      7. % of features requiring rework before acceptance
      8. Amount of effort to integrate code from the developers into a buildable release
    3. For #1-4 – two areas that can help here are feature toggling (which you'll be using anyway) and canary releases, where key pieces of new functionality are turned on for a subset of users to "test in production" (see the sketch after this list).
    4. For #5-6 – here Continuous Integration is the healer. This is where you avoid branching by versioning your services (and even the database – see the book Refactoring Databases by Scott Ambler and Pramod Sadalage).
    5. For #7-8 – if you're facing a lot of static here, a scrum/agile coach will likely help significantly.
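For the feature-toggle and canary idea above, here's a minimal hypothetical sketch: the new code ships dark inside the release, and a flag plus a rollout percentage decide who actually sees it. The flag name, percentage, and bucketing scheme are assumptions for illustration only.

```python
# Hypothetical feature-toggle / canary sketch. The flag name and percentage
# are invented; the point is that code ships in the release but exposure is
# controlled per user, so "done but not exposed" work can ride along safely.
import hashlib

FLAGS = {
    "new-statement-ui": {"enabled": True, "canary_percent": 5},
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    """True if this user falls inside the canary slice for the flag."""
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the user id so the same user gets a stable yes/no answer.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["canary_percent"]

# Usage: route a small, stable slice of users through the new code path.
for user in ["alice", "bob", "carol"]:
    ui = "new statement UI" if is_enabled("new-statement-ui", user) else "old UI"
    print(f"{user}: {ui}")
```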

 

So – how to win, once you’ve identified the pain points? You begin by partitioning the issue:

  • Break off the pieces that are tightly coupled – developed/tested/deployed as a unit – from those that aren't (e.g. HR or Purchasing processes).
  • Segment these into business critical and non-business critical.
  • Split these into tightly coupled monoliths with common code-sharing requirements vs. microservices (small, independent teams a la Amazon). The reality is that in most enterprises there are very valid reasons why these applications were built the way they are. You can't ignore this complexity, much as we'd like to say "microservices everywhere!"

I really admire Gary's very pragmatic approach: it doesn't try to accomplish large, difficult things all at once but focuses on winnable wars at a company's true pain points. Instead of trying to force large, tightly coupled organizations to work like loosely coupled orgs, you need to understand the complex systems involved and determine together how to release code more frequently without sacrificing quality – and convince those teams of DevOps principles along the way.

DevOps – Where to Start

I had a friend come to me the other day with what seems to be a simple ask. His company, a large banking enterprise, is looking into DevOps. So where should he start in building awareness?

Some context here – my friend is a programmer, a lifelong developer with high-level black belt skills in a variety of languages. He doesn’t want this to become a full time job for him – he loves coding and application development/architecture in particular. He just wants some resources to pass along.

Here’s my response:

A few words first on what DevOps is…

If you're just getting started, there's a decent Microsoft site out here at this address, including a great series of videos that introduce what DevOps is and means, with some really rich content in the footnotes for next steps.

In practical terms, DevOps means making sure your release pipeline from a development workstation to production is as smooth and automated as possible. So that implies:

  1. Infrastructure as Code: You have your infrastructure written out as a recipe, and it's rebuilt each time you push out code. Following a template enforces consistency – it's the only sane way to handle things. The big players in this space to date are Chef and Puppet, and maybe Octopus.
  2. Testing: Your testing is as rigorous as possible. This means that when you do a release, no person needs to look at anything but the exceptions where there are failures – your releases are gated, so if there are major bugs you'll catch them early and prevent a release to production. This means integration and unit testing, using things like Selenium for the UI layer (a bare-bones example follows this list).
  3. Release Management: When developers check in code, it's continuously integrated and released. Note – this is mostly IDE based. I believe MSFT has best-in-class tooling here, especially built on top of VSTS releases, where essentially it becomes fire and forget – a checkbox. (Remember when Agile used to be hard?)
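Since the testing point above calls out Selenium for the UI layer, here's a bare-bones example of the kind of UI smoke check that would sit in such a suite. The URL, page title, and element ids are placeholders, and it assumes a chromedriver is available – a sketch of the shape, not a prescription.

```python
# Minimal Selenium UI smoke test. The URL, title, and element ids are
# placeholders; assumes chromedriver is installed and on the PATH.
from selenium import webdriver
from selenium.webdriver.common.by import By

def test_login_form_is_present():
    driver = webdriver.Chrome()
    try:
        driver.get("https://qa.example.com/login")   # placeholder QA URL
        assert "Login" in driver.title               # assumed page title
        # The element ids below are illustrative only.
        assert driver.find_element(By.ID, "username") is not None
        assert driver.find_element(By.ID, "password") is not None
    finally:
        driver.quit()
```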

For my open source friends – the big players in the industry right now come from the Linux community. So start your learning efforts with Chef (https://learn.chef.io/) and Puppet (https://puppet.com/download-learning-vm). Ansible is also a hot name. You can download VMs and start playing with them, or run these tools in the cloud – free with Azure – and since they're Linux based, they're very easy on the $.

OK, That’s Great. Now What?

Well, if you want to tackle this, and you’re book oriented, I would recommend the following:

The “Gang of Four” Books:

  1. Get "The Phoenix Project" by Gene Kim. This is great in particular for you executive/leadership types. Think about leaving this on the desk of decision-makers you know if you need support for your DevOps initiative.
  2. Another very practical book is "Leading the Transformation" by Gary Gruver and Tommy Mouser. This is a much more grounded account of how one leader found a way around serious organizational constraints – yes, your efforts will make enemies if handled badly! – by chaining the transformation to specific business (not technical) phased objectives. A must read.
  3. You developers out there should already have "Continuous Delivery" by Jez Humble and David Farley on your bookshelves. It's a modern classic and explains why developers should be 100% on board with the release management and continuous delivery movements.
  4. IT people need to pick up a copy of "Visible Ops" by Gene Kim – very prescriptive and outstanding at basing your transformation on key IT/operations KPIs. Non-fuzzy, clear, short and sweet at about 100 pages in a little booklet. I love it.

Yeah, I’m more into videos. Books are so 90’s, dude.

OK, well, do you have 12 minutes? Check out this intro with Donovan Brown, and an excellent three-part series on Release Management – Part 1 (overview), Part 2 (RM architecture), and Part 3 (release pipelines). Outstanding, and it will give you a nice overview of setting up Continuous Integration and build pipelines.

My blog has some links on "All Happy Families Are Alike", "Devopoly", "Cats and Dogs Living Together", and "The Five Dysfunctions of DevOps". These are lengthy, but put together they'll give anyone a good overview of the Phoenix Project and Visible Ops.

Now We’re Getting Started…

Well that’s enough to at least whet the appetite.

Here are the three things I'd like you to come away with:

  1. DevOps is a big effort; you will need help. You can't do it purely grassroots. It will require strong commitment from management and the understanding that this will require both time and money. If you feel that you lack that level of commitment, manage expectations or scrub the effort until the conditions are more favorable. Likely, you will need some experienced help to form a roadmap and get buy-in, and a coach/mentor so the first few months go smoothly. You will also need to commit time and effort to mastering and maintaining your code for both testing and building out your infrastructure. (Hopefully, your releases themselves will be mostly code-free.)
  2. Build maturity through better testing. Your gated releases are going to need a high level of assurance that your builds are functional. So building up your QA maturity is one big investment that will pay huge dividends in avoiding production mishaps and environmental anomalies that come through manual deployment methods.
  3. Infrastructure as Code is where it's at. As long as environments are manually provisioned, you have a vector for errors and time-sucking anomalies. Once you start writing out environments as recipes and moving away from manual patching toward destroying/rebuilding environments along with your production releases – you'll never, ever go back. It rocks!

Thanks guys, hope this is helpful to you with those first few steps on your journey!

Portland 2016 DevOps day – wow, thanks!

Guys, had SUCH a blast last Friday at the DevOps roadshow!

Here are some pix. I really owe Monu Bambroo, Derrick Cawthorn, and the amazing Donovan Brown for coming down and carving out their time to spread awareness of DevOps and the answers we have here at Microsoft for this sea change.

If you’re interested in more, give me a holler. We do have that workshop on “DevOps Fundamentals” that in three days goes through setting up a complete release pipeline – way cool!

 

Some link goodness for you:

  1. Donovan's site: http://donovanbrown.com/ – search for DevOps; there's something for everybody here. For example, here's a post describing how he went about setting up a demo for a group in New Zealand using Docker, Ubuntu Linux, Visual Studio, Selenium, etc. Way cool! There's another good link here on how Deployment Slots play into your DevOps pipeline, another on "how many vendors does it take to implement DevOps?", one on triggering a rollback based on user feedback during a release, and one on where PowerShell DSC fits in.

  2. Dave Harrison's site is here: www.driftboatdave.com. I've got some links there on "All Happy Families Are Alike", "Devopoly", "Cats and Dogs Living Together", and "The Five Dysfunctions of DevOps". These are lengthy, but put together they'll give anyone a good overview of the Phoenix Project and Visible Ops.

  3. Last, may I recommend Channel 9? Here's a 12-minute intro with Donovan Brown, and an excellent three-part series on Release Management – Part 1 (overview), Part 2 (RM architecture), and Part 3 (release pipelines). Outstanding, and it will give you a nice overview of what we covered during DevOps Day in setting up Continuous Integration and build pipelines.

DevOpoly!

This is the fourth of a series on DevOps. The first focused on the three ways explored in the Phoenix Project, and I stuck in some thoughts from the Five Dysfunctions of a Team by Lencioni. The second discussed the lessons taught by GM’s failure in adopting Toyota’s Lean processes with their NUMMI plant. The third went through some great lessons I’ve learned from “Visible Ops” by Gene Kim.

"The single largest improvement an IT organization can benefit from is implementing repeatable system builds. This can't be done without first managing change and having an accurate inventory. When you convert a person-centric and heavily manual process to a quick and repeatable mechanism, the reaction is always positive. Even a partially automated release/build process greatly improves the ability for individuals to be freed from firefighting and focus on their areas of real value. And by making it more efficient to rebuild than repair, you also get much faster system recovery and significantly reduced downtime." (Joe Judge, Adero)

So I am putting together a presentation for PADNUG tomorrow on DevOps. I’ve reworked this presentation like three times, and I’ve never been very happy with it. Let’s just say Steve Jobs would have rolled his eyes at something like this:

Look at that crap above. I mean, there’s information here – but way too MUCH information. There’s no way any audience is going to absorb this. I’ll lose them halfway through the second bullet point.

So, I was struggling with this a few weeks ago, trying to come up with a better idea. And I was watching my kids play Monopoly. And I started to think – since there’s no recipe for DevOps, and you can choose your own course, and some amount of it is up to chance or your individual circumstances – well, isn’t that a game? (And isn’t that a more fun way of learning than using an endless stream of bullet points?)

So, DevOpoly was born!

Let’s take a look at this in blocks shall we?

  • MTTR – Mean Time to Repair. This indicates how robust you are – how quickly you can respond and react to an issue. (A toy calculation of MTTR and MTBF follows this list.)
  • Stakeholder Signoff – this comes after you inventory your applications; instituting any change management policy and change window will require the business to sign off.
  • Inventory Apps – listing applications, servers, systems, and services in tiers. This is a prereq for getting your problem children identified and frozen; see below.
  • CAB Weekly Meetings – I used to think these were a complete and total waste of time. In fact, several books I have claim that they don't measurably reduce defects and that they slow down development – bureaucracy at its worst. But Gene Kim swears by them – he thinks they're a base-level requirement for a change management culture.
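Since MTTR here (and MTBF a few squares down) are the numbers you'll end up dashboarding, here's a toy sketch of how both fall out of plain incident records. The timestamps are made up; the definitions used are simply average repair time per incident and average operating time between one restoration and the next failure.

```python
# Toy MTTR / MTBF calculation from incident records (made-up data).
from datetime import datetime

incidents = [
    # (failure detected, service restored)
    (datetime(2016, 11, 1, 9, 0),  datetime(2016, 11, 1, 10, 30)),
    (datetime(2016, 11, 8, 14, 0), datetime(2016, 11, 8, 14, 45)),
    (datetime(2016, 11, 20, 2, 0), datetime(2016, 11, 20, 5, 0)),
]

# MTTR: average hours from detection to restoration.
repair_hours = [(end - start).total_seconds() / 3600 for start, end in incidents]
mttr = sum(repair_hours) / len(repair_hours)

# MTBF: average hours of normal operation between one restoration and the next failure.
gaps = [(incidents[i + 1][0] - incidents[i][1]).total_seconds() / 3600
        for i in range(len(incidents) - 1)]
mtbf = sum(gaps) / len(gaps)

print(f"MTTR: {mttr:.1f} hours   MTBF: {mtbf:.1f} hours")
```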

  • Versioned Patches – Putting any software patches into source control
  • Security Auditing – having controls that are visible, verifiable, regularly reported
  • Configuration Management – Infrastructure as Code, a key part of implementing repeatable system builds, using software like Puppet, Chef, Octopus etc.
  • Golden Build – The end goal and the building block of a release library, a set of ‘golden builds’ that are verifiable and QA’d. The length of time that these builds stay stable is another metric helpful in determining reliability of your apps.

  • Feed to Trouble Ticket – creating a system where any changes – authorized or unauthorized – show up in a trouble ticket for first responders to access. The % success rate of first-response diagnosis is a key metric for DevOps.
  • Dashboarding – creating visibility around these metrics (see stage 3 of the Phoenix Project post) is the only way you'll know if you're making progress – and the only way to secure management support.
  • Form RM Team – this is part of moving more staff away from firefighting and toward the early stages of the release process. Mature, capable orgs have more personnel assigned to protect quality early on versus catching defects late.

 

  • MTBF – Mean Time Between Failures. As configuration management knocks out snowflake servers and fragile artifacts are frozen, this number should go up.
  • Automated Release – creating a release management pipeline of dev bits from DEV-QA-STG-PROD, with as much automated signoff as possible using automated tests, is a great step forward.
  • Gated Builds – See above, but having functional/integration testing and unit tests run on checkin is key to prevent failures.
  • Continuous Integration – bound up with testing and the RM cycle – having any dev changes get checked in and validated and merged safely with other development changes. (And, remember, CI means the barest amount of release branching possible. It’s a tough balance.)

  • Eliminate Access – Actually I don’t know many devs (besides the true cowboys) that really WANT access to production. But, removing access to all but change managers is a key step. And when you’re done with that…
  • Electrify the Fence – Have change policy known and discipline the (inevitable) slow learners. Not fire them. Maybe have a few “disappear” in suspicious accidents, to warn the others!
  • Monitor Changes – use some software (like Tripwire, maybe?) to monitor any and all changes to the servers (a toy sketch of the idea follows this list).
  • Server to Admin Ratio – Typically this is a 15:1 ratio – but for high performing orgs with an excellent level of change management, 100:1 or greater is the norm.
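The "Monitor Changes" square is essentially file-integrity monitoring. Here's a toy sketch of the idea – hash the files you care about and compare against a stored baseline so any change, authorized or not, surfaces. The paths and baseline filename are placeholders; a real shop would use a product like Tripwire rather than a script like this.

```python
# Toy file-integrity check: hash watched files and diff against a baseline.
# Paths and the baseline file are placeholders; this illustrates the idea
# behind tools like Tripwire, not a replacement for them.
import hashlib
import json
from pathlib import Path

WATCHED = [Path("/etc/nginx/nginx.conf"), Path("/etc/ssh/sshd_config")]  # examples
BASELINE_FILE = Path("baseline.json")

def fingerprint(paths):
    """Map each existing watched file to the SHA-256 of its contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in paths if p.exists()}

def check():
    baseline = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    current = fingerprint(WATCHED)
    for path, digest in current.items():
        if path in baseline and baseline[path] != digest:
            # In a real setup this would feed straight into a trouble ticket.
            print(f"CHANGE DETECTED (authorized or not): {path}")
    BASELINE_FILE.write_text(json.dumps(current, indent=2))

if __name__ == "__main__":
    check()
```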

  • Document Policy – Writing out the change management policy is a key to electrifying the fence and preventing the org from slipping back into bad habits.
  • Rebuild Not Repair – With a great release library of golden builds and a minimal amount of unique configs and templates, infrastructure is commonly rebuilt – not patched and limping along.

  • Find Fragile Artifacts – once you've done your systems inventory, you can document the systems that have the lowest uptime, the highest impact to the business when they're down, and the most expensive infrastructure.
  • Enforce Change Window – Set a change window for each set of your applications, and freeze any and all changes outside of that window. It must be documented and stakeholders must provide signoff.
  • Soft Freeze Fragile Systems – these fragile artifacts have to be frozen, one by one, until the environments can be safely replicated and maintained. This soft freeze shouldn't last long – only until the systems are under configuration management/IaC.

  • Accountability – the first of the two failure points in any change effort: true commitment and accountability from each person involved.
  • Firefighting Tax – Less than 5% of time spent in firefighting is a great metric to aim for. Most organizations are at about 40%.
  • Management Buy-In – DevOps can be started as a grassroots effort, but for it to be successful- it must have solid buy-in from the top. Past a pilot effort, you must secure management approval by publicizing your dashboards and key metrics.

Anyway, this was fun. I have some cards on the way for both the Gene Kim Chest – yes, not Jez Humble, but I’m thinking about it – and Chance. Lots of chance in the whole DevOps world.

(I tried this back in August with Life but it never worked by the way.)

 

 

“All Happy Families Are Alike” – Visible Ops by Gene Kim review

This is the third of a series of three posts I’ve done on DevOps recently. The first focused on the three ways explored in the Phoenix Project, and I stuck in some thoughts from the Five Dysfunctions of a Team by Lencioni. The second discussed the lessons taught by GM’s failure in adopting Toyota’s Lean processes with their NUMMI plant. This one will go through some great lessons I’ve learned from a terrific – and very short and readable – little book entitled “Visible Ops” by Gene Kim. Please, order this book (just $17 on Amazon!) and give it some thought.

"The single largest improvement an IT organization can benefit from is implementing repeatable system builds. This can't be done without first managing change and having an accurate inventory. When you convert a person-centric and heavily manual process to a quick and repeatable mechanism, the reaction is always positive. Even a partially automated release/build process greatly improves the ability for individuals to be freed from firefighting and focus on their areas of real value. And by making it more efficient to rebuild than repair, you also get much faster system recovery and significantly reduced downtime." (Joe Judge, Adero)

I was always struck by the phrase from Tolstoy – "All happy families are alike; every unhappy family is unhappy in its own way." It turns out that's true of DevOps as well. Successful companies have some very common threads in terms of IT:

  • High service levels and availability
    • Mean Time To Repair (MTTR)
    • Mean Time Between Failures (MTBF)
  • High throughput of effective change
    • Change success rate >99% (for example, Amazon with 1500+ changes a week)
  • Tight collaboration between dev, Ops/IT, QA team, and security auditors
    • Controls are visible, verifiable, regularly reported
  • Low amount of unplanned work
    • <5% of time spent firefighting – typical is 40%
  • Systems highly automated and hands-free
    • Server to System Admins ratio 100:1 or greater (typical 15:1)

 

So what are the common factors among the "happy families" that have this highly efficient, repeatable RM culture?

  • A change management culture
    • Management by fact versus belief
    • All changes go through a formal change management process
      • “The only acceptable number of unvetted change is zero.”
      • “Change management is important to us, because we are always one change away from being a low performer.”
      • “Perceptions of nimbleness and speed are a delusion if you are tied down in firefighting.”
      • “The biggest failure in any process engineering effort is accountability and true management commitment to the process.”
  • No voodoo – causality over gut feel
    • Trouble ticket systems – inside each ticket are all scheduled changes and all detected changes for the system.
      • This leads to 90% first fix rate and 80% success rate in initial diagnosis
  • Human Factors Come First in Continual Improvement
    • Strong desire to find production variance early
    • Controls to find variance, preventative and detective.

Every unhappy family, though, is unhappy in its own way. You'll hear sayings like the following in these "DevOps won't work for us, we're unique and special" type organizations:

  • “80% of our outages are due to changes – and 80% of the time we take in implementing a repair is trying to find that change” – Gartner
  • Data and continual improvement takes a back seat to intuition, gut feel, highly skilled IT Ops staff
  • SLA not met
  • “Most of our work is caused by self-inflicted problems and uncontrolled changes. Each sprint I start with a blank slate, and each sprint ends with 50% of my development firepower getting sucked away into firefighting.”
  • Infrastructure is repaired, not rebuilt – "priceless works of art"
  • System failures happening at worst possible time, IT’s rep is damaged
  • Changes have a long fuse
  • One change can undo a series of changes

So how does an unhappy family move towards becoming more functional? Gene Kim has broken it down into four logical steps.

  • Phase 1 – Stabilize the Patient
    • Freeze changes outside maintenance window
    • First responders have all change related data at hand
  • Phase 2 – Find the Problem Child
    • Inventory your systems and identify systems with low change success, high repair time, high downtime business impact
  • Phase 3 – Grow your Repeatable Build Library
  • Phase 4 – Enable continuous Improvement

In a little more detail:

  • Phase 1 – Stabilize The Patient
    • The goal at the start of this step is to allow the highest possible change throughput with the least amount of bureaucracy possible. No rubber stamping; the change request tracking system feeds info to first responders; ensure a solid backup plan.
    • Inventory applications and identify stakeholders and systems
    • Document new change management policy and change window with stakeholders
    • Institute weekly change management meetings
    • Eliminate access to all but authorized change managers
    • Electrify the fence with instrumentation, monitoring
      • you’ll be shocked at what you find!
      • this prevents the org from falling back into bad old habits – like a rock climber with a ratchet and rope
    • Failure Points
      • We won’t be able to get anything done!
      • The business pays us to make changes. Not to sit in boring CM meetings.
      • We trust our own people – they’re professionals and don’t need micromanaging.
      • We already tried that – it didn’t work
      • We believe there are no unauthorized changes.
  • Phase 2 – Find The Problem Children
    • Analyze assets, find fragile artifacts (use list from Phase 1)
    • Must be fast. Can’t freeze changes forever.
    • Soft freeze, where truly urgent changes during this period go through CAB.
    • Failure Points
      • Pockets of knowledge and proficiency
      • Servers are snowflakes – irreplaceable artifacts of mission critical infrastructure
  • Phase 3 – Grow Your Repeatable Build Library
    • Create an RM team. (This shifts staff to pre-production activities.)
    • Take fragile artifacts in priority – create golden builds stored in software library
    • Separation of roles – devs have no access to production
    • Amount of unplanned changes (and related work) further drops
    • # of unique configurations in deployment drops, increasing server/admin ratio
    • Mitigates the "patch and pray" dilemma – updates are integrated into the RM process so patches are tested and safely rolled out
  • Phase 4 – Enable Continuous Improvement
    • This has to do with gathering metrics and measuring improvement along three lines – release, controls, and resolution.

  • Release – how efficiently and effectively can we generate and provision infrastructure?
    • Time to provision known good builds
    • Number of turns to a known good build
    • Shelf life of a build
    • % of systems that match known good builds
    • % of builds with security signoff
    • # of fast-tracked builds
    • Ratio of Release Engineers to System Admins
  • Controls – how effectively do we make good change decisions that keep infrastructure available, predictable, and secure? (A small sketch after the Resolution list below computes a few of these.)
    • # of changes authorized per week
    • # of actual changes made per week
    • # of unauthorized changes
    • Change success rate
    • Changes submitted vs changes reviewed
    • Number of service-affecting outages
    • Number of emergency changes or “special” changes
    • Change management overhead (measure bureaucracy, lower is better!)
  • Resolution – when things go wrong, how effectively do we diagnose and resolve issues?
    • MTTR – Mean Time To Repair
    • MTBF – Mean Time Between Failure
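To close the loop on the controls metrics, here's a small, hypothetical sketch that turns a week of change records into a few of the numbers above – changes authorized, unauthorized changes detected, and change success rate. The records themselves are invented.

```python
# Toy weekly controls report from change records (the data is invented).
changes = [
    {"id": "CHG-101", "authorized": True,  "succeeded": True},
    {"id": "CHG-102", "authorized": True,  "succeeded": False},
    {"id": "CHG-103", "authorized": False, "succeeded": True},   # detected, never vetted
    {"id": "CHG-104", "authorized": True,  "succeeded": True},
]

authorized = [c for c in changes if c["authorized"]]
unauthorized = [c for c in changes if not c["authorized"]]
success_rate = sum(c["succeeded"] for c in authorized) / len(authorized)

print(f"changes authorized this week: {len(authorized)}")
print(f"unauthorized changes detected: {len(unauthorized)}")   # the only acceptable number is zero
print(f"change success rate: {success_rate:.0%}")
```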