DevOps

DevOpoly!

This is the fourth in a series on DevOps. The first focused on the Three Ways explored in The Phoenix Project, and I stuck in some thoughts from The Five Dysfunctions of a Team by Lencioni. The second discussed the lessons taught by GM’s failure in adopting Toyota’s Lean processes at their NUMMI plant. The third went through some great lessons I’ve learned from “Visible Ops” by Gene Kim.

“The single largest improvement an IT organization can benefit from is implementing repeatable system builds. This can’t be done without first managing change and having an accurate inventory. When you convert a person-centric and heavily manual process to a quick and repeatable mechanism, the reaction is always positive. Even a partially automated release/build process greatly improves the ability for individuals to be freed from firefighting and focus on their areas of real value. And by making it more efficient to rebuild than repair, you also get much faster system recovery and significantly reduced downtime.” (Joe Judge, Adero)


So I am putting together a presentation for PADNUG tomorrow on DevOps. I’ve reworked this presentation like three times, and I’ve never been very happy with it. Let’s just say Steve Jobs would have rolled his eyes at something like this:

Look at that crap above. I mean, there’s information here – but way too MUCH information. There’s no way any audience is going to absorb this. I’ll lose them halfway through the second bullet point.

So, I was struggling with this a few weeks ago, trying to come up with a better idea. And I was watching my kids play Monopoly. And I started to think – since there’s no recipe for DevOps, and you can choose your own course, and some amount of it is up to chance or your individual circumstances – well, isn’t that a game? (And isn’t that a more fun way of learning than using an endless stream of bullet points?)

So, DevOpoly was born!

Let’s take a look at this in blocks, shall we?

  • MTTR – Mean Time to Repair. This indicates how robust you are, how quickly you can respond and react to an issue.
  • Stakeholder Signoff – this is after you inventory your applications – instituting any change management policy and change window will require the business to provide signoff.
  • Inventory Apps – listing applications, servers, systems and services in tiers. This is a prereq for getting your problem children identified and frozen, see below.
  • CAB Weekly Meetings – I used to think these were a complete and total waste of time. In fact, several books I have claim that they don’t measurably reduce defects and that they slow down development – bureaucracy at its worst. But Gene Kim swears by them – he thinks they’re a base-level requirement for a change management culture.
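
The MTTR number above is just the average gap between detecting an issue and resolving it. As a minimal sketch in Python (the incident records and field names here are invented for the example, not from any real ticketing system):

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Average detection-to-resolution duration across resolved incidents."""
    durations = [i["resolved"] - i["detected"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)

# Two hypothetical incidents: one took 2 hours to repair, one took 1 hour.
incidents = [
    {"detected": datetime(2015, 3, 1, 9, 0),  "resolved": datetime(2015, 3, 1, 11, 0)},
    {"detected": datetime(2015, 3, 5, 14, 0), "resolved": datetime(2015, 3, 5, 15, 0)},
]
print(mean_time_to_repair(incidents))  # 1:30:00
```

Tracking this number over time is what tells you whether the change management work is actually making you more robust.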

  • Versioned Patches – Putting any software patches into source control
  • Security Auditing – having controls that are visible, verifiable, regularly reported
  • Configuration Management – Infrastructure as Code, a key part of implementing repeatable system builds, using software like Puppet, Chef, Octopus etc.
  • Golden Build – The end goal and the building block of a release library, a set of ‘golden builds’ that are verifiable and QA’d. The length of time that these builds stay stable is another metric helpful in determining reliability of your apps.
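
One way to think about a golden build is as a manifest of known-good fingerprints that running systems can be checked against. Here’s a toy sketch of that idea (the paths and contents are hypothetical; real tools like Tripwire or Puppet do far more than this):

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """SHA-256 of a file's contents - one entry in a 'golden build' manifest."""
    return hashlib.sha256(content).hexdigest()

def drift(golden_manifest, current_files):
    """Files whose current contents no longer match the golden build.

    golden_manifest: {path: expected sha256 from the release library}
    current_files:   {path: bytes as read from the running server}
    """
    return sorted(path for path, expected in golden_manifest.items()
                  if fingerprint(current_files.get(path, b"")) != expected)

golden = {"etc/app.conf": fingerprint(b"port=80\n")}
print(drift(golden, {"etc/app.conf": b"port=80\n"}))    # []
print(drift(golden, {"etc/app.conf": b"port=8080\n"}))  # ['etc/app.conf']
```

The longer a server stays at an empty drift list, the more stable (and rebuildable) your builds are.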

  • Feed to Trouble Ticket – Creating a system where any changes – authorized or unauthorized – show up in the trouble ticket for first responders to access. The first-response diagnosis success rate is a key metric for DevOps.
  • Dashboarding – creating visibility around these metrics (see stage 3 of the Phoenix Project post) is the only way you’ll know if you’re making progress – and securing management support.
  • Form RM Team – This is part of the process in moving more staff away from firefighting and early in the release process. Mature, capable orgs have more personnel assigned to protect quality early on versus catching defects late.
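
The Feed to Trouble Ticket idea can be sketched roughly like this: when an incident is opened, attach every change applied in the preceding window so first responders start with the likely culprits. The record shapes below are invented for illustration:

```python
from datetime import datetime, timedelta

def open_ticket(incident, change_log, window=timedelta(hours=24)):
    """Open a trouble ticket pre-loaded with every change (authorized or
    not) applied in the window before the incident was detected."""
    suspects = [c for c in change_log
                if timedelta(0) <= incident["detected"] - c["applied"] <= window]
    suspects.sort(key=lambda c: c["applied"], reverse=True)  # newest first
    return {"incident": incident, "recent_changes": suspects}

changes = [
    {"id": "CHG-101", "applied": datetime(2015, 3, 1, 22, 0), "authorized": True},
    {"id": "CHG-102", "applied": datetime(2015, 3, 2, 2, 0),  "authorized": False},
    {"id": "CHG-099", "applied": datetime(2015, 2, 20, 9, 0), "authorized": True},
]
ticket = open_ticket({"detected": datetime(2015, 3, 2, 8, 0)}, changes)
print([c["id"] for c in ticket["recent_changes"]])  # ['CHG-102', 'CHG-101']
```

Note the unauthorized change surfaces first; most outages trace back to a recent change, so this is exactly what the first responder needs in front of them.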


  • MTBF – Mean Time Between Failures. As configuration management knocks out snowflake servers and fragile artifacts are frozen, this number should go up.
  • Automated Release – creating a release management pipeline of dev bits from DEV-QA-STG-PROD, with as much automated signoff as possible using automated tests, is a great step forward.
  • Gated Builds – See above, but having functional/integration testing and unit tests run on checkin is key to prevent failures.
  • Continuous Integration – bound up with testing and the RM cycle – having any dev changes get checked in and validated and merged safely with other development changes. (And, remember, CI means the barest amount of release branching possible. It’s a tough balance.)
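
The gated DEV/QA/STG/PROD promotion above might look something like this sketch, where a build advances only while every gate at the current stage passes. The stage names and gate checks are illustrative, not any particular RM tool’s API:

```python
STAGES = ["DEV", "QA", "STG", "PROD"]

def promote(build, gates):
    """Walk a build through the pipeline, stopping at the first failed gate."""
    reached = []
    for stage in STAGES:
        if not all(check(build) for check in gates.get(stage, [])):
            break  # a gate failed: the build goes no further
        reached.append(stage)
    return reached

# Hypothetical gates: unit tests to enter DEV, integration tests for QA,
# stakeholder signoff for STG; PROD has no extra gate in this sketch.
gates = {
    "DEV": [lambda b: b["unit_tests_pass"]],
    "QA":  [lambda b: b["integration_tests_pass"]],
    "STG": [lambda b: b["stakeholder_signoff"]],
}
good = {"unit_tests_pass": True, "integration_tests_pass": True, "stakeholder_signoff": True}
bad  = {"unit_tests_pass": True, "integration_tests_pass": False, "stakeholder_signoff": True}
print(promote(good, gates))  # ['DEV', 'QA', 'STG', 'PROD']
print(promote(bad, gates))   # ['DEV']
```

The point is that a defect stops the line at the earliest stage that can catch it, instead of being passed downstream.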

  • Eliminate Access – Actually I don’t know many devs (besides the true cowboys) that really WANT access to production. But, removing access to all but change managers is a key step. And when you’re done with that…
  • Electrify the Fence – Make the change policy known and discipline the (inevitable) slow learners. Not fire them. Maybe have a few “disappear” in suspicious accidents, to warn the others!
  • Monitor Changes – Use some software (like Tripwire maybe?) to monitor any and all changes to the servers.
  • Server to Admin Ratio – Typically this is a 15:1 ratio – but for high performing orgs with an excellent level of change management, 100:1 or greater is the norm.
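
Monitoring changes and electrifying the fence boil down to comparing what actually happened on the servers against what was authorized, and when. A hedged sketch (the change window and record fields are assumptions for the example):

```python
from datetime import datetime, time

CHANGE_WINDOW = (time(1, 0), time(5, 0))  # assumed 1-5 AM maintenance window

def violations(detected, authorized_ids, window=CHANGE_WINDOW):
    """Everything the electrified fence should zap: changes nobody
    authorized, plus authorized changes applied outside the window."""
    start, end = window
    bad = []
    for c in detected:
        if c["id"] not in authorized_ids:
            bad.append((c["id"], "unauthorized"))
        elif not (start <= c["applied"].time() <= end):
            bad.append((c["id"], "outside change window"))
    return bad

detected = [
    {"id": "CHG-201",    "applied": datetime(2015, 3, 2, 2, 30)},  # fine
    {"id": "CHG-202",    "applied": datetime(2015, 3, 2, 14, 0)},  # midday!
    {"id": "hotfix-jim", "applied": datetime(2015, 3, 2, 2, 0)},   # who approved this?
]
print(violations(detected, {"CHG-201", "CHG-202"}))
# [('CHG-202', 'outside change window'), ('hotfix-jim', 'unauthorized')]
```

Run something like this against the feed from your monitoring tooling and, as noted below, you’ll be shocked at what you find.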

  • Document Policy – Writing out the change management policy is a key to electrifying the fence and preventing the org from slipping back into bad habits.
  • Rebuild Not Repair – With a great release library of golden builds and a minimal amount of unique configs and templates, infrastructure is commonly rebuilt – not patched and limping along.

  • Find Fragile Artifacts – Once you’ve done your systems inventory, you can document the systems that have the lowest uptime, the highest impact to the business when they’re down, and the most expensive infrastructure.
  • Enforce Change Window – Set a change window for each set of your applications, and freeze any and all changes outside of that window. It must be documented and stakeholders must provide signoff.
  • Soft Freeze Fragile Systems – These fragile artifacts have to be frozen, one by one, until the environments can be safely replicated and maintained. The soft freeze shouldn’t last long – only until the systems are brought under configuration management/IaC.

  • Accountability – the first of the two big failure points in any process change: true commitment and accountability from each person involved.
  • Firefighting Tax – Less than 5% of time spent in firefighting is a great metric to aim for. Most organizations are at about 40%.
  • Management Buy-In – DevOps can be started as a grassroots effort, but for it to be successful, it must have solid buy-in from the top. Past a pilot effort, you must secure management approval by publicizing your dashboards and key metrics.

Anyway, this was fun. I have some cards on the way for both the Gene Kim Chest – yes, not Jez Humble, but I’m thinking about it – and Chance. Lots of chance in the whole DevOps world.

(I tried this back in August with Life but it never worked by the way.)


“All Happy Families Are Alike” – Visible Ops by Gene Kim review

This is the third of a series of three posts I’ve done on DevOps recently. The first focused on the three ways explored in the Phoenix Project, and I stuck in some thoughts from the Five Dysfunctions of a Team by Lencioni. The second discussed the lessons taught by GM’s failure in adopting Toyota’s Lean processes with their NUMMI plant. This one will go through some great lessons I’ve learned from a terrific – and very short and readable – little book entitled “Visible Ops” by Gene Kim. Please, order this book (just $17 on Amazon!) and give it some thought.

“The single largest improvement an IT organization can benefit from is implementing repeatable system builds. This can’t be done without first managing change and having an accurate inventory. When you convert a person-centric and heavily manual process to a quick and repeatable mechanism, the reaction is always positive. Even a partially automated release/build process greatly improves the ability for individuals to be freed from firefighting and focus on their areas of real value. And by making it more efficient to rebuild than repair, you also get much faster system recovery and significantly reduced downtime.” (Joe Judge, Adero)

I was always struck by the phrase from Tolstoy – “All happy families are alike; every unhappy family is unhappy in its own way.” Turns out that’s true of DevOps as well. Successful companies share some very common threads in terms of IT:

  • High service levels and availability
    • Mean Time To Repair (MTTR)
    • Mean Time Between Failures (MTBF)
  • High throughput of effective change
    • Change success rate >99% (for example, Amazon with 1500+ changes a week)
  • Tight collaboration between dev, Ops/IT, QA team, and security auditors
    • Controls are visible, verifiable, regularly reported
  • Low amount of unplanned work
    • <5% of time spent firefighting – typical is 40%
  • Systems highly automated and hands-free
    • Server to System Admins ratio 100:1 or greater (typical 15:1)


So what are the common factors among the “happy families” that have this highly efficient, repeatable RM culture?

  • A change management culture
    • Management by fact versus belief
    • All changes go through a formal change management process
      • “The only acceptable number of unvetted change is zero.”
      • “Change management is important to us, because we are always one change away from being a low performer.”
      • “Perceptions of nimbleness and speed are a delusion if you are tied down in firefighting.”
      • “The biggest failure in any process engineering effort is accountability and true management commitment to the process.”
  • No voodoo – causality over gut feel
    • Trouble ticket systems – inside each ticket are all scheduled changes and all detected changes associated with the system.
      • This leads to 90% first fix rate and 80% success rate in initial diagnosis
  • Human Factors Come First in Continual Improvement
    • Strong desire to find production variance early
    • Controls to find variance, preventative and detective.

Every unhappy family, though, is unhappy in its own way. You’ll hear sayings like the following in these “DevOps won’t work for us, we’re unique and special” type organizations:

  • “80% of our outages are due to changes – and 80% of the time we take in implementing a repair is trying to find that change” – Gartner
  • Data and continual improvement takes a back seat to intuition, gut feel, highly skilled IT Ops staff
  • SLA not met
  • “Most of our work is caused by self-inflicted problems and uncontrolled changes. Each sprint I start with a blank slate, and each sprint ends with 50% of my development firepower getting sucked away into firefighting.”
  • Infrastructure is repaired, not rebuilt – “priceless works of art”
  • System failures happening at worst possible time, IT’s rep is damaged
  • Changes have a long fuse
  • One change can undo a series of earlier changes

So how does an unhappy family move towards becoming more functional? Gene Kim has broken it down into four logical steps.

  • Phase 1 – Stabilize the Patient
    • Freeze changes outside maintenance window
    • First responders have all change related data at hand
  • Phase 2 – Find the Problem Child
    • Inventory your systems and identify systems with low change success, high repair time, high downtime business impact
  • Phase 3 – Grow your Repeatable Build Library
  • Phase 4 – Enable continuous Improvement

In a little more detail:

  • Phase 1 – Stabilize The Patient
    • The goal at the start of this phase is to allow the highest possible change throughput with the least bureaucracy possible. No rubber stamping; the change request tracking system feeds info to first responders; ensure a solid backup plan.
    • Inventory applications and identify stakeholders and systems
    • Document new change management policy and change window with stakeholders
    • Institute weekly change management meetings
    • Eliminate access to all but authorized change managers
    • Electrify the fence with instrumentation, monitoring
      • you’ll be shocked at what you find!
      • this prevents org from falling back into bad old habits, like a rock climber with a ratchet and rope
    • Failure Points
      • We won’t be able to get anything done!
      • The business pays us to make changes. Not to sit in boring CM meetings.
      • We trust our own people – they’re professionals and don’t need micromanaging.
      • We already tried that – it didn’t work
      • We believe there are no unauthorized changes.
  • Phase 2 – Find The Problem Children
    • Analyze assets, find fragile artifacts (use list from Phase 1)
    • Must be fast. Can’t freeze changes forever.
    • Soft freeze, where truly urgent changes during this period go through CAB.
    • Failure Points
      • Pockets of knowledge and proficiency
      • Servers are snowflakes – irreplaceable artifacts of mission critical infrastructure
  • Phase 3 – Grow Your Repeatable Build Library
    • Create an RM team. (Shifts staff to pre-prod activities)
    • Take fragile artifacts in priority – create golden builds stored in software library
    • Separation of roles – devs have no access to production
    • Amount of unplanned changes (and related work) further drops
    • # of unique configurations in deployment drops, increasing server/admin ratio
    • Mitigates the “patch and pray” dilemma – updates are integrated into the RM process so patches can be tested and safely rolled out
  • Phase 4 – Enable Continuous Improvement
    • This has to do with gathering metrics and measuring improvement along three lines – release, controls, and resolution.

  • Release – how efficiently and effectively can we generate and provision infrastructure?
    • Time to provision known good builds
    • Number of turns to a known good build
    • Shelf life of a build
    • % of systems that match known good builds
    • % of builds with security signoff
    • # of fast-tracked builds
    • Ratio of Release Engineers to System Admins
  • Controls – how effectively do we make good change decisions that keep infrastructure available, predictable and secure?
    • # of changes authorized per week
    • # of actual changes made per week
    • # of unauthorized changes
    • Change success rate
    • Changes submitted vs changes reviewed
    • Number of service-affecting outages
    • Number of emergency changes or “special” changes
    • Change management overhead (measure bureaucracy, lower is better!)
  • Resolution – when things go wrong, how effectively do we diagnose and resolve issues?
    • MTTR – Mean Time To Repair
    • MTBF – Mean Time Between Failures
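
As a sketch of how the Controls numbers above might be rolled up from raw change records (the field names are invented for the example; a real CMDB would feed this):

```python
def weekly_controls(changes):
    """Roll up the 'Controls' metrics for one week of change records."""
    made = len(changes)
    authorized = sum(1 for c in changes if c["authorized"])
    succeeded = sum(1 for c in changes if c["succeeded"])
    return {
        "changes_made": made,
        "authorized": authorized,
        "unauthorized": made - authorized,
        "success_rate_pct": round(100.0 * succeeded / made, 1),
    }

# A hypothetical week: three authorized changes (one failed) and one
# unauthorized change that happened to work.
week = [
    {"authorized": True,  "succeeded": True},
    {"authorized": True,  "succeeded": True},
    {"authorized": True,  "succeeded": False},
    {"authorized": False, "succeeded": True},
]
print(weekly_controls(week))
# {'changes_made': 4, 'authorized': 3, 'unauthorized': 1, 'success_rate_pct': 75.0}
```

Numbers like these, trended week over week on a dashboard, are what Phase 4 is really about.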

The Five Dysfunctions of DevOps

I remember laughing at the American car companies in the 80’s that – panicked by the unmatched quality coming out of Toyota – sent spies and emissaries out to Japan to emulate what was being done in the factories. They were given complete access, took it back to America – and it fell flat on its face. The Japanese product managers implementing Lean in the manufacturing floors snickered that they were copying the image of Buddha without the spirit. How could they ever implement something they didn’t understand? In part, those American car company manufacturers missed the essence of kata, or continuous improvement through repetition. By neglecting culture, any tool or process they tried just ended up in the same dead end.

I just finished reading The Phoenix Project by Gene Kim et al. – oddly enough, on a trip out to Phoenix – and found myself wanting to smack myself in the forehead. There are so MANY things about DevOps that I did not understand, even a year ago. It would have made the last five years of my life immeasurably smoother if I had understood the principles and thoughts behind what I was trying to do. (Insert cargo cult joke here.)

We don’t have a DevOps Manifesto yet – and one is badly needed. In the meantime, we have two books that sum things up. If you haven’t read the Old Testament and New Testament of DevOps – that’s The Phoenix Project and Continuous Delivery by Jez Humble – you are missing out. The Phoenix Project is ¾ management-speak – a hero leader who steps in and methodically saves a failing company; you know the old story of the guy on a horse pointing majestically at the sunset? But here’s the thing – if you want to convince CxO-type people of the importance of DevOps, you need to read this book. It speaks the language of management, so it will help you tell a story your management and the CIO would want to hear. And buried in its pages is some real depth.

Creating Flow – the Three Ways

I used to show a picture of Hillsboro, Oregon on the 26 during rush hour. This is, no one will argue, a fully utilized freeway. But, is it efficient?

The key to the Toyota Lean principles – and Kanban and Agile and everything else – is creating that flow. That means a buffer in every day, in every week, where we have space to think about how work is done – not just the what.

The Phoenix Project lays out each of the Three Ways as a “what” (the principle) and a “how” (the practices that support it):

The First Way: Maximizing flow with small batch sizes and intervals of work, never passing defects downstream, global goals.

  • Continuous build, integration and deployment
  • Creating environments on demand
  • Limiting Work in Process
  • Building systems that are safe to change

The Second Way: Establishing quality at the source. Constant feedback from right to left, ensuring we prevent problems from happening again and enabling faster detection/recovery.

  • Stopping the line when builds/tests fail
  • Fast automated test suites
  • Shared goals/pain between Dev and IT
  • Pervasive production telemetry showing if customer goals are met

The Third Way: Creating a culture that fosters experimentation and risk.

  • High-trust culture versus command and control
  • Allocating >20% of Dev/Ops cycles towards nonfunctional requirements
  • Constant reinforcement of DevOps CoE and improvement kata


This is really illuminating. For example, think of the “stopping the line” item above for the Second Way. How many times did I – in previous assignments – take bugs from the previous release and kick them to last in order, behind the fun stuff I really wanted to work on? Even in smaller teams of three – where I thought “we’ll never step on each other’s toes” – how many integration issues did we have right before important demos? And by neglecting automated testing, how many defects did I end up passing downstream – creating systems that were inherently difficult to change?

Dysfunction and DevOps – The Importance of culture

This has been mentioned before in my posts – but notice (courtesy of Puppet from a study done by Westrum in 2004) the illuminating chart of the three types of organizations:

Now notice the Five Dysfunctions of a Team by Patrick Lencioni:

  1. Absence of trust (unwilling to be vulnerable within the group)
  2. Fear of conflict (seeking artificial harmony over constructive passionate debate)
  3. Lack of commitment (feigning buy-in for group decisions, creating ambiguity)
  4. Avoidance of accountability (ducking the responsibility to call peers on counterproductive behavior, which sets low standards)
  5. Inattention to results (focusing on personal success, status and ego before the team)


Notice anything interesting? Let’s match up the sick organization on the far left – the power-based Pathological one – with that list of five dysfunctions:

So now we get a glimmer of light on why organizations with high-performing IT departments tend to be high-performing organizations – and why the reverse is also true: a sick IT shop, one enslaved to the business or at the mercy of a cowboy group of developers, is a good indicator of underperformance. Companies that embrace DevOps as a culture tend to be high-trust and risk-friendly. They’re not afraid of differing opinions or radical ideas like Netflix’s Evil Chaos Monkey. People tend to waste less energy taking potshots at other teams/departments – and pay more attention to the common shared goal.

As the Phoenix Project brings out, the relationship between a CEO and a CIO is like a dysfunctional marriage – both sides feel powerless and held hostage by the other. This is true of Dev and Ops as well – and I’ve been in that sick marriage more than once. The essence of the book is forming a strong bond where the union becomes much closer by sharing goals and work based on company needs.

Other Thoughts from The Book

Common Agile Myths

  1. DevOps is just automation or infrastructure as code (no, that’s the tool – it’s part of it, but not the whole)
  2. DevOps replaces Agile (DevOps is meant to complete Agile – where bits aren’t just going into QA, but out the door to production)
  3. DevOps replaces ITIL/ITSM (It embodies ITIL concepts)
  4. DevOps means NoOps (DevOps means a truly empowered, nonsiloed Ops)
  5. DevOps is only for startups
  6. DevOps is only for open source software


On #5 and #6 above, this comes down to “We can’t do DevOps, because we’re special/unique/a snowflake.” But think of some of the companies today that are leading in the DevOps world and where they were just a few years ago:

  • Amazon until 2001 ran on OBIDOS content delivery service, dangerous and problematic to maintain
  • Twitter struggled to scale frontend monolithic Ruby on Rails system – took multiple years to rewrite
  • LinkedIn in 2011 six months after IPO had to freeze features for massive overhaul
  • Etsy in 2009 was “living in a sea of their own engineering filth”
  • Facebook in 2009 at breaking point, staff continually firefighting, releases painful and dangerous


I must read The Goal by Eli Goldratt. I love the thought – there is always a constraint or bottleneck in any organization (men, material, machines) that dictates the output of the entire system. Until you create a system that manages the flow of work to the constraint, the constraint is constantly wasted – and likely drastically underutilized. Technical debt skyrockets, and you can’t deliver to the business at full capacity. A following step is to exploit the constraint – where it’s not allowed to waste any time, ever. It should never be waiting on anything, and it should always be working on the highest priority commitment IT made to the rest of the enterprise.

DevOps and the game of Life.

Remember the old game of Life?

Pretty discouraging by the way. You work, you work, go to college / don’t go to college, pick up fellow travelers/family members – and at the end they add up your score. You either end up in a nice big house or a smaller one, and – what? Is that “winning”? Valuing things just by the money you’ve earned along the way, or the house you get with creaky knees and an enlarged prostate – well, that seems pretty empty to me.

The last time I gave a presentation on DevOps, I remember thinking how short I came up. I was talking about how certain cultures are very resistant to change. Most of the audience were dyed-in-the-wool developers, and had no problems jumping on the DevOps bandwagon. But they were frustrated at the lack of power they had to change the culture they were in. I remember making some noises about “keeping on trying” and the like.

I can say that I have seen even very resistant cultures change over time. And there have been some great articles on building up a community of practice on DevOps from the ground up. So, freethinking a little, I went through that blog post on guerrilla-type subversive DevOps efforts, combined it with the excellent writeup on some anti-patterns, and tried to make a game of it.

I was only mildly successful. See below – that’s as far as I’m getting for now. It’s a pretty lame game. Needs some work. But, there it is…

Between this and the articles I’m looking through on our new Release Management capabilities – it’s busy times here! Hope you are doing well as well.


DevOps – Cats and Dogs Living Together!

“DevOps is a worthwhile investment to make. I think sometimes people hesitate, because it can mean some time away from your regular duties to invest in new skills, to start doing test-driven development, to move your infrastructure over to a configuration management system like Puppet. But what I’ve found is, it’s really a worthwhile investment that not only improves the quality of what’s being built, but the job satisfaction of everyone who’s doing it.” (Bess Sadler, Digital Library Systems at Stanford University)

Until a few months ago, I wasn’t sure what DevOps was in reality, and I was a little suspicious of the term. Wasn’t ALM enough? I was kicking my bits out to QA every day using continuous integration – what did it matter if it took a few weeks more to get out the door to production? That was the Ops people’s responsibility, not mine!

Without knowing it, I was stumbling up against a major shortcoming of ALM. Ken Schwaber said the purpose of Agile was to produce potentially releasable software as quickly as possible. And we took that and ran with it. I remember one project where we thought we made AMAZING progress when we kicked out a very robust and interactive survey site for our customers in only two sprints – getting it out the door to QA in record time. But what did that really mean for our customers, if it took Ops 10 weeks to build out the environments to support it on production? As devs, we failed to treat Operations and infrastructure as equal partners in the process – and we missed an opportunity that cost the business valuable time to market.


“To Heck With Them, I’ll Do It Myself” – the Rise of Shadow-IT

As a result of this friction, something called “Shadow-IT” has arisen over the past few years. If provisioning software can take weeks or months, any thought of Agile development becomes a pipe dream. Meanwhile, it’s become increasingly easy for developers to spin up VMs and entire QA/PROD environments in the cloud. When a developer is faced with an apparently stubborn and slow-moving Ops organization, it’s understandable that some have decided to take on the job of building out production environments themselves, doing an end-run around Operations.

There’s some psychology at work here. Fundamentally developers have a core set of values and a background that’s very different from Operations/IT. Developers are focused on agility – producing features as quickly as possible and getting them out the door. Devs are evaluated by their ability to produce, and produce quickly. Operations teams typically want to make a plan and execute to a plan – they are evaluated by their track record with stability and security. Stability and agility are on opposite sides of the spectrum, my friends.

As a developer I was hurting myself by not thinking more about the end goal. And put yourself in the shoes of that poor Ops guy who’s handed an extensive manual setup script six days before launch. If that first war-room deployment is a train wreck with lots of on-the-fly adjustments and long hours, how eager will he be to work with me in the future? This knocks CI right out of the picture, since he’ll think, “If I deploy more things faster to prod, won’t I be shooting myself in the foot?”

It comes down to our definition of done. We aren’t done when it builds on our machine and we huck it over the fence. As software craftsmen, our job is to deliver value – and value is software running in production, period. So, for DevOps to work, it’s more than just RM or tooling. We as developers need to be more interested in the end product, and IT needs to be more interested in the beginning.

I believe the above attitude – which I have fallen into several times in my career – is very shortsighted. Operations and tracking down performance bugs or environmental anomalies is a specialized set of tasks – and one that I’m ill-suited for. I grit my teeth and walk through Perfmon logs when I have to – but it’s not a core skillset. Shadow-IT is NOT the answer in improving our time to market long term – and as we called out above, that return trip of information and feedback on usage is suddenly missing, crippling our development efforts.

DevOps Comes on the Scene

Back about six years ago, in 2009, a group of people noted that the shortened release cycle was causing a lot of friction – it exposed a gap between the people writing the software and those responsible for deploying it to production:

The old way of doing things with monthly and quarterly releases worked OK, but with daily releases under CI, the Operations side of things was falling behind. Some key points exposed by the above graphic:

  1. We needed tooling to automate releases as much as possible – including tracing and approvals
  2. The old way of “it works on my machine” – or being surprised when sites weren’t available, being told reactively by SCOM or, worse, by external customers – was a no-go.
  3. We were having problems closing the feedback loop – that’s the top part of the cycle. Especially with disconnected business owners, it was VITAL to receive as many specific metrics as possible – from Ops, from usage patterns on the app itself – so we’d know what features to prioritize for the next sprint.

To fill the gaps, the concept that came to be known as DevOps came into focus:

Let’s break this down by People, Process and Tools:

  • People
    • Stronger collaboration not just with Ops teams but also QA, BA’s and end user community
    • Bugs are knocked down immediately not left to mounting technical debt
    • Stronger focus on continuous learning – versus unattainable “zero defect” or unproductive manual Change Board meetings
  • Process
    • Unit testing and functional/integration testing becomes a part of each release cycle.
    • Incident management feeds bugs back to dev team
    • Release branching simplified and feature branching verboten
  • Tools
    • Release management to handle moving bits out the door
    • Tools to handle provisioning of environments – infrastructure as code

As you can see, the assumption many people have when they hear the word “DevOps” – “Oh, that means RM” – isn’t anywhere close to complete. It’s like saying that ALM is having your code in source control. And the goal we have in mind isn’t some kind of unrealistic paradise where we all sit around the campfire telling stories and roasting marshmallows. As I heard recently from an Agile architect, “DevOps does not mean Operations and Devs hugging each other.” What he meant was, the point is getting it out into production as quickly as possible – and getting as many metrics back as possible. It’s more about efficiency than it is about making friends.

So there’s a lot to that formal definition of DevOps – “The collaboration of IT Operations and developers in deploying new software to benefit the business.”


The Importance of Culture

So if you want to “be DevOps”, don’t create a separate team and don’t search for some magic process or methodology. …your operations team to start speaking up and working with your development team about what could be done to reduce the operational complexity of the software. Figure out how to make your software easier to configure, easier to deploy, and easier to operate in general. Likewise, your development teams need to seriously listen to the operations group. They need to treat the operations team’s concerns just like they would treat any end-user feature request or bug request. After all, your operations team is simply another type of user of the development team’s software. (from http://www.devopsonwindows.com/it-takes-dev-and-ops-to-make-devops/ )

Changing culture is hard. For many developers, their individual influence won’t be enough to make headway in this change. (Although, do check out the Phoenix Project and the great Damien Edwards video “You Can’t Change Culture, but You Can Change Behavior and Behavior Becomes Culture” in my links below.) Based on personal experience, I believe DevOps does not work as a purely grass-roots, bottom-up effort. You will need a CxO-level advocate or an architect as a champion to move ahead.

Start out with a realistic view of what’s possible/achievable given the organizational makeup. Flat organizations do have a better track record of success. And keep in mind the important elements of culture your organization will have to demonstrate:

  • Collaboration
  • No-blame postmortems
  • A high-trust culture
  • Experimentation and risk-taking
  • A strong focus on continuous improvement
  • Good information flow and bridging between teams

I talked yesterday evening with someone whose architect realized their cloud-based infrastructure was costing them thousands of dollars monthly. He told the development team that they were going to take DevOps seriously – and that their environments would be completely disposable. Every night they would tear their machines down to a bare minimum – and every morning they’d rebuild them again. It worked brilliantly. That’s the kind of game-changer you need on your side.
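The “disposable environments” idea above can be sketched as a toy model. Everything here – the spec, the `Cloud` class – is hypothetical and stands in for a real provisioning API; the point it illustrates is that once environments are rebuilt nightly from a declarative spec, rebuilding is always cheaper than repairing.

```python
# Toy model of disposable environments: everything is rebuilt from a
# declarative spec, so "repair" is just "tear down and rebuild".
# ENV_SPEC and Cloud are hypothetical illustrations, not a real API.

ENV_SPEC = {
    "web": {"size": "small", "count": 2},
    "db": {"size": "large", "count": 1},
}

class Cloud:
    """Stand-in for a real IaaS API (your provider's SDK, Chef, Puppet...)."""
    def __init__(self):
        self.machines = {}

    def tear_down(self):
        # Nightly: delete every machine; nothing is precious.
        self.machines.clear()

    def rebuild(self, spec):
        # Morning: recreate everything from the spec.
        for role, cfg in spec.items():
            for i in range(cfg["count"]):
                self.machines[f"{role}-{i}"] = cfg["size"]

cloud = Cloud()
cloud.rebuild(ENV_SPEC)
cloud.tear_down()          # end of day: zero running machines, zero cost
cloud.rebuild(ENV_SPEC)    # next morning: an identical environment
print(sorted(cloud.machines))  # ['db-0', 'web-0', 'web-1']
```

Because the spec – not any hand-configured machine – is the source of truth, a drifted or broken environment is never debugged; it is simply replaced.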

A study by Westrum in 2004 divided organizations into three general categories, based on their orientation and base characteristics. Is your company Pathological, Bureaucratic, or Generative?

(courtesy Puppet Labs)

If your organization is Pathological, the best advice I can give is to focus on what you can do, and keep your own house in order. Have CI and RM in place – but know in advance that your odds of instituting meaningful DevOps are fairly low. These types of organizations do not lend themselves well to the cross-team collaboration that will be vital to your success. Bureaucratic organizations are relatively friendlier, but there will still be a significant amount of inertia for you to overcome; forming a DevOps Center of Excellence or bringing in outside consultants to help sway decision-makers may be a difference-maker here. If you’re in a Generative organization, congratulations – your pilot efforts will likely be the first step on a successful road to DevOps that will make your work life as a developer much, much happier. Automating tedious tasks and not drowning in firefighting or unplanned work means a happier YOU.

History reveals some seemingly counterintuitive facts about DevOps adoption rates. Curiously enough, larger organizations, and Windows-based shops versus open-source ones, have a higher rate of success in adopting DevOps; looser open-source or smaller teams may not have the consistent level of discipline it takes to make DevOps stick. (Maybe too much freedom is a bad thing?) And some of the best success stories come from the worst starting points. Pain is a powerful catalyst for change. If things are going well, or even just OK, there’ll likely be little impetus for something as far-reaching as DevOps.

In discussing with management the key points of DevOps, stress the following benefits:

  • Our ability to ship features quickly and speed of deployment will rise dramatically.
  • We’ll be able to react more quickly to your feedback as stakeholders and incorporate it into our feature prioritization.
  • We’ll be able to recover from failed deployments quickly and have smooth rollbacks.
  • Happy cows make better cheese. DevOps practices increase employee satisfaction, and a healthier, happier team makes for a stronger IT org – which in turn means a more profitable company.

If that doesn’t work, mention metrics. They may not care specifically about RM – but they will care if it drops your defect rate by 50% and increases their profits. And you can always mention the scare story of Knight Capital, which – by not instituting release management and DevOps practices earlier – lost roughly $440 million in about 45 minutes in August 2012. Studies show that high-performing IT orgs are twice as likely to exceed profitability, market share, and productivity goals. There’s a strong correlation between continuous delivery practices, robust IT organizations, and productivity.

There are other ways to change culture. I’ve heard of companies seating engineers and ops next to each other, hiring consultants or mediators to streamline implementation, and having devs share pager duty.

A healthy team runs on trust: good feedback loops, cross-functional collaboration, shared responsibilities, learning from failure. You’ll have a happier, easier, less stressful life as a developer. What’s not to love?

 

From Zero to DevOps in 180 Days

Here’s a sample recipe you could put together in trying to implement DevOps in your company. We’ll be realistic and pragmatic here, and assume that the first step must be taken by us, the developers – and after a trial period we’ll have Operations teams chiming in for a second phase:

First Phase: Getting Your House in Order (Month 1)

  • Devs take first step
  • Basic release management
    • Automatic provisioning of clients / config management
    • Continuous Integration
    • Beginnings of automated testing, version control
    • Elimination of feature branches
  • Peer reviews begin (NOT external change approval)

 

Second Phase (Month 3)

  • No-blame postmortems
  • Published dashboarding of automated test coverage
  • Regression bugs identified by tests fixed immediately
  • Toolset experimentation begins
  • DevOps CoE publishes metrics and lessons learned / goals
  • Visibility is key

 

Phase 3: Ops-centric (Month 6)

  • Proactive monitoring, analytics, incident resolution
  • DevOps awareness / training seminars
  • Beginning of automating pain points
  • Next up – Build-Measure-Learn or Hypothesis-Driven Development
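As a tiny illustration of the proactive-monitoring bullet above, here’s a sketch of a health-check poller. The URL and timeout are placeholders – real monitoring belongs in a proper tool like Application Insights – but the shape is the same: poll an endpoint, and treat anything other than a prompt 200 as a failure.

```python
import urllib.request

def check_health(url, timeout=5):
    """Return True only if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # DNS failures, timeouts, and non-2xx responses all count as "down".
        return False

# An unreachable host is simply reported as unhealthy (.invalid never resolves):
print(check_health("http://nonexistent.invalid/health"))  # False
```

Run something like this on a schedule, record the results, and you have the raw data your incident-resolution and MTTD metrics come from.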

 

Wrapping Things Up

We haven’t talked about tooling yet. I’m going to walk through Chef and Puppet Labs integration with Visual Studio in a future blog post, and do the same for deploying bits using the new Release Manager application integrated with VS. I have another post coming on Application Insights as well, which will help you expose your test coverage and website performance to your business partners. All of these things are important – but I believe the people and process side of things is MUCH harder and more fundamental to DevOps than any particular tool you end up selecting.

DevOps means replacing what WAS a vicious, nasty cycle of recriminations and backbiting with a virtuous cycle. Devs work closely with IT early on, and stability improves. As stability improves, IT performance improves, and you start getting meaningful feedback and metrics to prime your backlog. The necessary first step is treating QA and Operations as equal partners, early in the process, and getting them involved in the design. Your Chef environment deployments must be built out along with your code bits, early on – give your Ops partners the time they need to scale out so they can do their job.

So here’s some closing thoughts on how to make your journey to DevOps as smooth as possible:

Don’t form a single DevOps team. That doesn’t mean you shouldn’t begin with a cross-functional pilot team as a proof of concept – that’s actually a good idea – but forming a single standing DevOps team misses the point. From Jez Humble’s blog: “The DevOps movement addresses the dysfunction that results from organizations composed of functional silos. Thus, creating another functional silo that sits between dev and ops is clearly a poor way to try and solve those problems.” (i.e., use a cross-functional team)

Go slow to go fast. Recognize IT as an investment, and secure management support as a precursor. Start small and grow – gather data, and iterate. Gauge user response, collect metrics on things like MTTR/MTTD, and rinse and repeat.
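For the metrics piece, MTTR is simple enough to compute yourself from an incident log long before you have fancy dashboards. A minimal sketch – the incident timestamps here are made up for illustration:

```python
from datetime import datetime

# Hypothetical incident log: (detected, resolved) timestamp pairs.
incidents = [
    (datetime(2015, 3, 1, 9, 0), datetime(2015, 3, 1, 9, 30)),   # 30 min
    (datetime(2015, 3, 5, 14, 0), datetime(2015, 3, 5, 15, 30)),  # 90 min
]

def mttr_minutes(incidents):
    """Mean Time To Repair: average minutes from detection to resolution."""
    total_seconds = sum((resolved - detected).total_seconds()
                        for detected, resolved in incidents)
    return total_seconds / len(incidents) / 60

print(mttr_minutes(incidents))  # 60.0
```

Track this number release over release – the trend line, not the absolute value, is what tells you whether your iterating is actually paying off.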

Encourage experimentation. Propose a change and promise to reevaluate in six months or some other time limit. Make postmortems blameless. Practice root cause analysis (think of the “Five Whys”), and keep a detailed log of events without fear of punishment. Don’t level personal criticism at anyone, and don’t take feedback personally. Managers, it’s up to you to create a culture where it’s safe to fail.

Choose your tools wisely. There’s a variety of RM and infrastructure tooling out there, and each have pros and cons. Whatever you do, make sure the entire group has input on the tool of choice so you have their buy-in. You’ll want an integrated toolset based on loosely coupled platforms – one that can automate all those formerly painful manual steps, and where you can provide visibility and transparency through application monitoring.

Take that Ops handshake seriously. We need to think as devs about how Operations will monitor and support the software we produce. And pay attention to the little things – keep your promises, foster open communication. I have seen in multiple war rooms good and bad examples of how to act when things go wrong. If you behave predictably and calmly when the wheels come off, your Ops team will begin to take your words of partnership seriously.

Don’t go it alone. Incubate your DevOps movement with a Center of Excellence. Fold in your business analysts and business owner, and track metrics. Hold seminars or DevOps awareness brown bags. Some of the resources in the links below are very helpful – again, I’m referring to the Phoenix Project book in particular.

If you’re having trouble being taken seriously, think about bringing in some help. I worked at one organization for several years as a development lead and became very frustrated at our slow rate of adoption of ALM and DevOps. Once we brought in some Microsoft resources, they were able to hold workshops and chart out a clear path for us that was reasonable in scope and fit our business and cultural background. Within 18 months, you wouldn’t have recognized the changes in the development teams – we had an integrated set of tools and processes that the developers loved, and got our releases out the door much more quickly to the delight of our customers. Having that independent third voice in the room really made all the difference for us in getting our dev maturity level off the ground. Good luck!

 

Link Goodness – For Devs (specific implementation details)

  1. Excellent courses available here – http://www.microsoftvirtualacademy.com/training-courses/devops-an-it-pro-guide
  2. And here: http://www.microsoftvirtualacademy.com/training-courses/assessing-and-improving-your-devops-capabilities
  3. And get this book – it’s becoming a standard, along with the Phoenix Project from Gene Kim. https://www.safaribooksonline.com/library/view/continuous-delivery-reliable/9780321670250/
  4. This demo is an excellent walkthru- http://blogs.msdn.com/b/visualstudioalm/archive/2014/11/11/using-release-management-vso-service-to-manage-releases.aspx
  5. Channel9 videos on DevOps – http://channel9.msdn.com/Tags/edge-devops
  6. A nifty walkthrough of Puppet integration with Visual Studio Online – http://channel9.msdn.com/Shows/Edge/Edge-Show-110-Puppet-on-Azure
  7. Brian Keller does a 10-minute walkthrough of RM deployments (excellent for getting started with infrastructure as code) – http://channel9.msdn.com/Events/Visual-Studio/Connect-event-2014/214
  8. Great list of resources – http://www.itproguy.com/top-2014-microsoft-devops-learning-resources/
  9. Books that everyone appeared to quote, and that seem very influential, are the Phoenix Project by Gene Kim et al. and Jez Humble’s book on Continuous Delivery. Of the two, if you just want a good yarn, start with the Phoenix Project. For a more technical approach, Jez’s book is excellent.
  10. http://www.microsoftvirtualacademy.com/training-courses/azure-resource-manager-devops-jump-start
  11. Great blog site for DevOps – http://www.donovanbrown.com/

For Business People / Architects (more general or on culture)

  1. A GREAT video on culture changing – “You Can’t Change Culture, But You Can Change Behavior, and Behavior Becomes Culture” http://vimeo.com/51120539 – Damon Edwards
  2. http://www.computerworld.com/article/2851974/microsoft-study-finds-everybody-wants-devops-but-culture-is-a-challenge.html
  3. http://www.citeworld.com/article/2115209/development/what-is-devops.html
  4. “There’s often a gap between devs and operators – devs motivated by innovation, pushing the envelope, and system admins tasked with security/stability. I think of DevOps as the bridge of that gap… since we’ve invested more in Ops, been a lot easier for us to get our services out there and actually delivered…. Expensive and timeconsuming to rollout a new feature, even if meticulously scripted often didn’t get it right – discrepancies between dev/prod systems.” – Bess Sadler, Stanford https://www.youtube.com/watch?v=L9V8oEaZ71I
  5. The PuppetLabs blog on Devops has a wealth of culture change information: http://puppetlabs.com/blog-categories/devops
  6. Puppet Labs 2014 whitepaper detailing org benefits of DevOps – this is very thorough: http://puppetlabs.com/sites/default/files/2014-state-of-devops-report.pdf
  7. MS site on DevOps – http://blogs.technet.com/b/devops/
  8. The rise of Shadow IT – “A very tangible, negative side effect of this situation is Shadow IT, where developers create their own ways to deploy and run their solutions, in order to maintain the required speed of change and flexibility, and to respond to ever growing demands. Cloud technologies with easy-deploy options and pay-as-you-go offers foster Shadow IT, which frequently leads to development and operations teams drifting apart.”
  9. VM’s and hands-on labs: http://aka.ms/ALMVMs
  10. Great article on DevOps including a checklist and summary: http://www.devopsonwindows.com/it-takes-dev-and-ops-to-make-devops/. I love their article on removing config files, and the checklist rocks.
  11. A nice interview with Yelp on how they were able to institute change, focusing on listening, being humble, and understanding the patterns your org has in place before advocating widespread, permanent changes – http://puppetlabs.com/blog/change-agents-it-operations-what-it-takes
  12. Gene Kim – how do we Better Sell DevOps? http://vimeo.com/65548399
  13. I also like the no horse crap video https://www.youtube.com/watch?v=g-BF0z7eFoU