Only YOU Can Make This The Best Anniversary Evuh.

Anniversaries for us are kinda a big deal. For example, this year I went to Sur La Table (pronounced “surla Taaahhhhb”) and loaded up on expensive French lasagna pans in the vain hope of finally topping Jennifer. Nope, she crushes me with a fiberglass fly fishing rod and a 1930s antique Royal typewriter. Dammit!

So my idea for the 22nd anniversary is to somehow convince Jennifer that I need a van-mural-style painting on the wall of our home. Think something like this:

The actual idea for this came from Big Hero 6, with Fred’s painting. To me this was the funniest part of what was a very, VERY good movie. I’m not going to lie, I really want this on my wall.

… or really just about anything from this post.

Anyway, people, send me your cash today so we can make this happen. Think about the tears of joy in my wife’s eyes as she casts her gaze upon a 6′ x 10′ reproduction of me on a sabertooth tiger, carrying the shrunken heads of my enemies or something. If you contribute $1000, you will be named a Gold Sponsor, meaning you can have your face painted into the background of this timeless work of art. For example, as part of my fearsome Pyramid of Skulls, or (for you ladies) maybe as a buxom waif clutching my legs as I mow down my enemies with my battleax, Gortha the Soulhammer. You get the idea. Contribute today!

Microsoft Hearts Linux. No, really, we do!

When I used to think about Linux, biased as I was as a Windows developer, I thought about posts like this (courtesy of Penny Arcade):

Yes, all the way back in the dust of time – say, 2002. Well, things have changed a lot since then. We have all kinds of great Linux integration points. And I must say – from my experiments and fumbling around with Linux – I remain very impressed with it as an ecosystem. I can’t think of a better foundation for running lean, mean DevOps: Chef and Puppet run natively on it, and setup couldn’t be easier. And I love the control. All that CLI goodness takes me back to my first days writing programs on my old C64.

My good friends Jeremy Rule and Rich Maines wrote recently on Microsoft embracing Linux: “Did you know that 20% of virtual machines on Azure are Linux today? Azure also provides first-class support for Java, Node.js, Python, Ruby, and PHP and our recently announced Azure cloud security tools will work both on premise and in Amazon’s cloud.”

I know I’ve personally had to step up my game in terms of embracing Linux as an OS – including the LAMP stack (Linux-Apache-MySQL-PHP) – and Java. We can’t afford to be insular and think that MSFT provides the best-in-class tool for every scenario. And on the Linux side of things, combining your development with the Visual Studio IDE means you’ll have the enterprise-level debugging and repeatability your team has been craving. Wow!

Anyway, as I’ve left behind my prejudices and misconceptions about what Linux and OS programming is like, I’ve become a better programmer and a much better architect. I’m glad the old adversarial ways are behind us; cooperation and integration are the new game, and they remove a lot of limitations that used to hurt our customers. Nicely done, guys!

Other Points

  • A great article on 31 things successful people have done to form positive life habits. Three that popped out: getting enough sleep (no screen time for 1 hour before bed), treating waking up early and working out as nonnegotiable, and eating the frog (doing the most important things first).
  • A great comic on the long road to confronting climate change, highly recommended.

Pomodoro. Maybe start small?

Saw a great blog post today on boosting productivity using Pomodoro. This works very well for programmers, especially when we’re faced with multiple tasks coming at us all at once, or with a major project we’re having trouble breaking into little pieces. It integrates very well with Agile and Kaizen-based principles and techniques. And it couldn’t be simpler – all you need is a timer. The idea is that by working in 25-minute chunks of focus time, each followed by a break, you’ll get more done each day with less churn.

(from Wikipedia)

One of my life goals is to write a novel. Actually getting off my rear end and writing is proving to be something of a challenge. Today I downloaded Simple Pomodoro to my phone from the Google Play Store and linked it to Google Keep in a few simple steps. Now I’m churning through pages 25 minutes at a time – my goal is six “focus times” a day. AMAZING progress! Give it a try.
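
If you want to see what a day of six focus times actually looks like, here’s a minimal Python sketch that lays out the schedule. It assumes the classic defaults described on Wikipedia (25-minute focus blocks, 5-minute short breaks, a longer break after every four blocks); the 15-minute long break, the 6:00 AM start and the `pomodoro_schedule` name are my own illustrative choices, not anything Simple Pomodoro itself does.

```python
from datetime import datetime, timedelta

def pomodoro_schedule(start, sessions, focus_min=25, short_break_min=5,
                      long_break_min=15, long_break_every=4):
    """Return (label, start, end) tuples for a run of focus blocks and breaks."""
    blocks = []
    t = start
    for i in range(1, sessions + 1):
        end = t + timedelta(minutes=focus_min)
        blocks.append((f"focus {i}", t, end))
        t = end
        if i == sessions:
            break  # no break after the final focus block
        # Every `long_break_every`-th focus block earns a longer break.
        pause = long_break_min if i % long_break_every == 0 else short_break_min
        end = t + timedelta(minutes=pause)
        blocks.append(("break", t, end))
        t = end
    return blocks

# Six focus times, starting at an arbitrary 6:00 AM.
day = pomodoro_schedule(datetime(2024, 1, 1, 6, 0), sessions=6)
for label, s, e in day:
    print(f"{s:%H:%M}-{e:%H:%M}  {label}")
```

With these defaults, six focus times (150 minutes of writing) take a little over three hours including breaks, running from 06:00 to 09:05.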

DevOpoly!

This is the fourth in a series on DevOps. The first focused on the Three Ways explored in The Phoenix Project, and I stuck in some thoughts from The Five Dysfunctions of a Team by Lencioni. The second discussed the lessons taught by GM’s failure to adopt Toyota’s Lean processes with their NUMMI plant. The third went through some great lessons I’ve learned from “Visible Ops” by Gene Kim.

“The single largest improvement an IT organization can benefit from is implementing repeatable system builds. This can’t be done without first managing change and having an accurate inventory. When you convert a person-centric and heavily manual process to a quick and repeatable mechanism, the reaction is always positive. Even a partially automated release/build process greatly improves the ability for individuals to be freed from firefighting and focus on their areas of real value. And by making it more efficient to rebuild than repair, you also get much faster systems downtime and significantly reduced downtime.” (Joe Judge, Adero)

So I am putting together a presentation for PADNUG tomorrow on DevOps. I’ve reworked this presentation like three times, and I’ve never been very happy with it. Let’s just say Steve Jobs would have rolled his eyes at something like this:

Look at that crap above. I mean, there’s information here – but way too MUCH information. There’s no way any audience is going to absorb this. I’ll lose them halfway through the second bullet point.

So a few weeks ago I was struggling with this, trying to come up with a better idea, and I was watching my kids play Monopoly. I started to think: since there’s no recipe for DevOps, and you can choose your own course, and some of it comes down to chance or your individual circumstances – well, isn’t that a game? (And isn’t that a more fun way of learning than an endless stream of bullet points?)

So, DevOpoly was born!

Let’s take a look at this in blocks, shall we?

  • MTTR – Mean Time to Repair. This indicates how robust you are, how quickly you can respond and react to an issue.
  • Stakeholder Signoff – This comes after you inventory your applications; instituting any change management policy and change window requires signoff from the business.
  • Inventory Apps – listing applications, servers, systems and services in tiers. This is a prereq for getting your problem children identified and frozen, see below.
  • CAB Weekly Meetings – I used to think these were a complete and total waste of time. In fact, several books I own claim that they don’t measurably reduce defects and only slow down development – bureaucracy at its worst. But Gene Kim swears by them, and he considers them a baseline requirement for a change management culture.

  • Versioned Patches – Putting any software patches into source control
  • Security Auditing – having controls that are visible, verifiable, regularly reported
  • Configuration Management – Infrastructure as Code, a key part of implementing repeatable system builds, using software like Puppet, Chef, Octopus, etc.
  • Golden Build – The end goal and the building block of a release library, a set of ‘golden builds’ that are verifiable and QA’d. The length of time that these builds stay stable is another metric helpful in determining reliability of your apps.
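
To make the golden build idea a little more concrete, here’s a minimal Python sketch of one of the metrics Gene Kim lists: the percentage of systems that match a known good build. It assumes each server’s configuration can be boiled down to a dict of package names and versions; the `GOLDEN_WEB` manifest, the server names and every version number are hypothetical.

```python
# Hypothetical golden build manifest: package name -> pinned version.
GOLDEN_WEB = {"nginx": "1.24.0", "openssl": "3.0.13", "app": "2.7.1"}

def drift(actual, golden):
    """Return {package: (expected, found)} for every package that differs.

    A missing package shows up as found=None; an extra one as expected=None.
    """
    diffs = {}
    for pkg in golden.keys() | actual.keys():
        expected, found = golden.get(pkg), actual.get(pkg)
        if expected != found:
            diffs[pkg] = (expected, found)
    return diffs

def percent_matching(servers, golden):
    """Percentage of servers whose configuration exactly matches the golden build."""
    matching = sum(1 for cfg in servers.values() if not drift(cfg, golden))
    return 100.0 * matching / len(servers)

servers = {
    "web01": {"nginx": "1.24.0", "openssl": "3.0.13", "app": "2.7.1"},
    "web02": {"nginx": "1.24.0", "openssl": "3.0.13", "app": "2.7.1"},
    "web03": {"nginx": "1.22.1", "openssl": "3.0.13", "app": "2.7.1", "vim": "9.0"},
}

for name, cfg in servers.items():
    print(name, drift(cfg, GOLDEN_WEB) or "matches golden build")
print(f"{percent_matching(servers, GOLDEN_WEB):.0f}% of systems match")
```

In a rebuild-not-repair shop, anything `drift` flags (web03 here) gets rebuilt from the golden build rather than hand-patched back into shape.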

  • Feed to Trouble Ticket – Creating a system where any changes – authorized or unauthorized – show up in the trouble ticket for first responders to see. The success rate of first responders’ initial diagnosis is a key DevOps metric.
  • Dashboarding – Creating visibility around these metrics (see stage 3 of the Phoenix Project post) is the only way you’ll know whether you’re making progress, and it’s how you’ll secure management support.
  • Form RM Team – This is part of moving more staff away from firefighting and toward the early stages of the release process. Mature, capable orgs assign more personnel to protecting quality early on than to catching defects late.
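
The “Feed to Trouble Ticket” idea is easiest to see in code. Below is a minimal Python sketch, assuming changes arrive as simple records with a host, a timestamp and an authorized flag (say, from a change request system plus a change-detection tool). The record shape, the 24-hour lookback and all names and data are hypothetical. It attaches every recent change on the affected host to the ticket, unauthorized ones first, so the first responder starts with the most likely culprits.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Change:
    host: str
    when: datetime
    summary: str
    authorized: bool  # False = detected on the box but never approved

@dataclass
class Ticket:
    host: str
    opened: datetime
    related_changes: list = field(default_factory=list)

def enrich_ticket(ticket, changes, lookback=timedelta(hours=24)):
    """Attach recent changes on the ticket's host: unauthorized first, newest first."""
    window_start = ticket.opened - lookback
    recent = [c for c in changes
              if c.host == ticket.host and window_start <= c.when <= ticket.opened]
    recent.sort(key=lambda c: c.when, reverse=True)  # newest first
    recent.sort(key=lambda c: c.authorized)          # stable sort: unauthorized (False) first
    ticket.related_changes = recent
    return ticket

changes = [
    Change("db01", datetime(2024, 1, 1, 9, 0), "kernel patch", True),
    Change("db01", datetime(2024, 1, 1, 13, 30), "my.cnf edited by hand", False),
    Change("web01", datetime(2024, 1, 1, 13, 0), "app 2.7.1 deployed", True),
    Change("db01", datetime(2023, 12, 30, 8, 0), "log rotation tweak", True),
]
ticket = enrich_ticket(Ticket("db01", datetime(2024, 1, 1, 14, 0)), changes)
for c in ticket.related_changes:
    flag = "" if c.authorized else "UNAUTHORIZED "
    print(f"{c.when:%Y-%m-%d %H:%M}  {flag}{c.summary}")
```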

  • MTBF – Mean Time Between Failures. As configuration management knocks out snowflake servers and fragile artifacts are frozen, this number should go up.
  • Automated Release – creating a release management pipeline of dev bits from DEV-QA-STG-PROD, with as much automated signoff as possible using automated tests, is a great step forward.
  • Gated Builds – See above, but running unit, functional and integration tests on every check-in is key to preventing failures.
  • Continuous Integration – bound up with testing and the RM cycle – having any dev changes get checked in and validated and merged safely with other development changes. (And, remember, CI means the barest amount of release branching possible. It’s a tough balance.)
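
Here’s a minimal Python sketch of the gated promotion idea: a build moves from DEV to QA to STG to PROD only while every gate for its current stage passes, and it stops at the first failure. The stage names come from the list above; the gate functions are hypothetical stand-ins for whatever test suites your real release tooling would actually run.

```python
from typing import Callable, Dict, List

STAGES = ["DEV", "QA", "STG", "PROD"]

# A gate takes a build id and returns True if the build may leave that stage.
Gate = Callable[[str], bool]

def promote(build: str, gates: Dict[str, List[Gate]]) -> str:
    """Promote `build` as far as its gates allow; return the last stage reached."""
    reached = STAGES[0]
    for current, nxt in zip(STAGES, STAGES[1:]):
        checks = gates.get(current, [])  # a stage with no gates passes automatically
        if not all(check(build) for check in checks):
            print(f"{build}: gate failed in {current}, promotion stopped")
            return reached
        reached = nxt
        print(f"{build}: promoted {current} -> {nxt}")
    return reached

# Hypothetical gates; in practice these would invoke real test suites.
def unit_tests(build: str) -> bool:
    return True

def integration_tests(build: str) -> bool:
    return True

def functional_tests(build: str) -> bool:
    return not build.endswith("-bad")  # simulate a failing build

gates = {
    "DEV": [unit_tests],
    "QA": [integration_tests, functional_tests],
    "STG": [functional_tests],
}

promote("app-2.7.1", gates)      # reaches PROD
promote("app-2.7.2-bad", gates)  # stops in QA
```

The design choice worth noticing is that nothing reaches PROD by hand: the only path forward is through the gates, which is what makes the automated signoff trustworthy.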

  • Eliminate Access – Actually I don’t know many devs (besides the true cowboys) that really WANT access to production. But, removing access to all but change managers is a key step. And when you’re done with that…
  • Electrify the Fence – Make the change policy known and discipline the (inevitable) slow learners. Not fire them. Maybe have a few “disappear” in suspicious accidents, to warn the others!
  • Monitor Changes – Use some software (like Tripwire maybe?) to monitor any and all changes to the servers.
  • Server to Admin Ratio – Typically this is a 15:1 ratio – but for high performing orgs with an excellent level of change management, 100:1 or greater is the norm.
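
Tripwire is a real product with a lot more to it, but the core of file-integrity monitoring fits in a few lines. This minimal Python sketch (not Tripwire’s implementation, just the basic idea) records a SHA-256 baseline of a set of watched files and later reports anything added, removed or modified. The files in the demo are throwaway temp files; their names and contents are made up.

```python
import hashlib
import tempfile
from pathlib import Path

def snapshot(paths):
    """Map each existing file to the SHA-256 digest of its contents."""
    digests = {}
    for p in map(Path, paths):
        if p.is_file():
            digests[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest()
    return digests

def compare(baseline, current):
    """Report files added, removed, or modified since the baseline snapshot."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(p for p in baseline.keys() & current.keys()
                           if baseline[p] != current[p]),
    }

# Demo on throwaway files: take a baseline, then change one file and add another.
with tempfile.TemporaryDirectory() as d:
    conf, hosts, extra = Path(d, "app.conf"), Path(d, "hosts"), Path(d, "cron.d")
    conf.write_text("max_connections=100\n")
    hosts.write_text("10.0.0.5 db01\n")
    watched = [conf, hosts, extra]
    baseline = snapshot(watched)  # `extra` doesn't exist yet, so it's skipped

    conf.write_text("max_connections=500\n")         # an unlogged hand edit
    extra.write_text("0 3 * * * /opt/cleanup.sh\n")  # a file nobody approved
    report = compare(baseline, snapshot(watched))

print({kind: [Path(p).name for p in files] for kind, files in report.items()})
```

Feed a report like this into the trouble-ticket system and you’ve got the detected-change half of “Feed to Trouble Ticket” for free.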

  • Document Policy – Writing out the change management policy is a key to electrifying the fence and preventing the org from slipping back into bad habits.
  • Rebuild Not Repair – With a great release library of golden builds and a minimal amount of unique configs and templates, infrastructure is commonly rebuilt – not patched and limping along.

  • Find Fragile Artifacts – Once you’ve done your systems inventory, you can document the systems that have the lowest uptime, the highest impact to the business when they’re down, and the most expensive infrastructure.
  • Enforce Change Window – Set a change window for each set of your applications, and freeze any and all changes outside of that window. It must be documented and stakeholders must provide signoff.
  • Soft Freeze Fragile Systems – These fragile artifacts have to be frozen, one by one, until the environments can be safely replicated and maintained. The soft freeze can’t last long; it should end once the systems are under configuration management/IaC.
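
Enforcing a change window is mostly policy and discipline, but the check itself is simple. Here’s a minimal Python sketch, assuming one weekly window per application group plus a soft-freeze list of fragile systems where only CAB-approved emergency changes get through. The group names, the windows and the `change_allowed` helper are all hypothetical.

```python
from datetime import datetime, time

# Hypothetical change windows: app group -> (weekday, start, end); Monday == 0.
CHANGE_WINDOWS = {
    "billing": (5, time(2, 0), time(6, 0)),     # Saturday 02:00-06:00
    "intranet": (2, time(20, 0), time(23, 0)),  # Wednesday 20:00-23:00
}
SOFT_FROZEN = {"billing"}  # fragile systems currently under soft freeze

def change_allowed(group, when, cab_emergency=False):
    """Return (allowed, reason) for a proposed change to `group` at `when`."""
    if cab_emergency:
        return True, "CAB-approved emergency change"
    if group in SOFT_FROZEN:
        return False, "soft freeze: route urgent changes through the CAB"
    weekday, start, end = CHANGE_WINDOWS[group]
    if when.weekday() == weekday and start <= when.time() < end:
        return True, "inside change window"
    return False, "outside change window"

print(change_allowed("intranet", datetime(2024, 1, 3, 21, 0)))  # a Wednesday evening
print(change_allowed("intranet", datetime(2024, 1, 4, 21, 0)))  # a Thursday
print(change_allowed("billing", datetime(2024, 1, 6, 3, 0)))    # frozen, even in-window
print(change_allowed("billing", datetime(2024, 1, 6, 3, 0), cab_emergency=True))
```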

  • Accountability – One of the two biggest failure points in any change effort (the other is management commitment). It takes true commitment and accountability from each person involved.
  • Firefighting Tax – Less than 5% of time spent in firefighting is a great metric to aim for. Most organizations are at about 40%.
  • Management Buy-In – DevOps can start as a grassroots effort, but to succeed it must have solid buy-in from the top. Past a pilot effort, you must secure management approval by publicizing your dashboards and key metrics.

Anyway, this was fun. I have some cards on the way for both the Gene Kim Chest – yes, not Jez Humble, but I’m thinking about it – and Chance. Lots of chance in the whole DevOps world.

(I tried this back in August with the game of Life, but it never worked, by the way.)

“All Happy Families Are Alike” – Visible Ops by Gene Kim review

This is the third of a series of three posts I’ve done on DevOps recently. The first focused on the Three Ways explored in The Phoenix Project, and I stuck in some thoughts from The Five Dysfunctions of a Team by Lencioni. The second discussed the lessons taught by GM’s failure to adopt Toyota’s Lean processes with their NUMMI plant. This one goes through some great lessons I’ve learned from a terrific – and very short and readable – little book entitled “Visible Ops” by Gene Kim. Please order this book (just $17 on Amazon!) and give it some thought.

“The single largest improvement an IT organization can benefit from is implementing repeatable system builds. This can’t be done without first managing change and having an accurate inventory. When you convert a person-centric and heavily manual process to a quick and repeatable mechanism, the reaction is always positive. Even a partially automated release/build process greatly improves the ability for individuals to be freed from firefighting and focus on their areas of real value. And by making it more efficient to rebuild than repair, you also get much faster systems downtime and significantly reduced downtime.” (Joe Judge, Adero)

I was always struck by the opening line of Tolstoy’s Anna Karenina – “All happy families are alike, every unhappy family is unhappy in its own way.” Turns out that’s true of DevOps as well. Successful companies share some very common threads in IT:

  • High service levels and availability
    • Mean Time To Repair (MTTR)
    • Mean Time Between Failures (MTBF)
  • High throughput of effective change
    • Change success rate >99% (for example, Amazon with 1500+ changes a week)
  • Tight collaboration between dev, Ops/IT, QA team, and security auditors
    • Controls are visible, verifiable, regularly reported
  • Low amount of unplanned work
    • <5% of time spent firefighting – typical is 40%
  • Systems highly automated and hands-free
    • Server-to-system-admin ratio of 100:1 or greater (typical is 15:1)
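
Those thresholds are easy to turn into a scorecard. The minimal Python sketch below checks a team’s raw numbers against the “happy family” targets listed above (change success rate above 99%, under 5% of time spent firefighting, 100 or more servers per admin). The `OpsStats` fields are my own assumption about what you’d collect, and the example figures are invented, chosen to land on the “typical” 40% and 15:1 numbers.

```python
from dataclasses import dataclass

@dataclass
class OpsStats:
    changes_attempted: int
    changes_succeeded: int
    hours_worked: float
    hours_firefighting: float
    servers: int
    admins: int

def scorecard(s):
    """Compare raw ops numbers against the Visible Ops high-performer targets."""
    success_rate = 100.0 * s.changes_succeeded / s.changes_attempted
    firefighting = 100.0 * s.hours_firefighting / s.hours_worked
    ratio = s.servers / s.admins
    return {
        "change success rate %": (round(success_rate, 1), success_rate > 99.0),
        "firefighting %": (round(firefighting, 1), firefighting < 5.0),
        "servers per admin": (round(ratio, 1), ratio >= 100),
    }

# Invented numbers for a "typical" shop.
typical = OpsStats(changes_attempted=200, changes_succeeded=180,
                   hours_worked=1600, hours_firefighting=640,
                   servers=150, admins=10)
for metric, (value, on_target) in scorecard(typical).items():
    print(f"{metric:22} {value:>6}  {'OK' if on_target else 'needs work'}")
```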

So what do the “happy families” with these highly efficient, repeatable RM cultures have in common?

  • A change management culture
    • Management by fact versus belief
    • All changes go through a formal change management process
      • “The only acceptable number of unvetted change is zero.”
      • “Change management is important to us, because we are always one change away from being a low performer.”
      • “Perceptions of nimbleness and speed are a delusion if you are tied down in firefighting.”
      • “The biggest failure in any process engineering effort is accountability and true management commitment to the process.”
  • No voodoo – causality over gut feel
    • Trouble ticket systems – each ticket includes all scheduled changes and all detected changes on the system.
      • This leads to 90% first fix rate and 80% success rate in initial diagnosis
  • Human Factors Come First in Continual Improvement
    • Strong desire to find production variance early
    • Preventive and detective controls to find variance

Every unhappy family, though, is unhappy in its own way. You’ll hear sayings like the following in these “DevOps won’t work for us, we’re unique and special” organizations:

  • “80% of our outages are due to changes – and 80% of the time we take in implementing a repair is trying to find that change” – Gartner
  • Data and continual improvement takes a back seat to intuition, gut feel, highly skilled IT Ops staff
  • SLA not met
  • “Most of our work is caused by self-inflicted problems and uncontrolled changes. Each sprint I start with a blank slate, and each sprint ends with 50% of my development firepower getting sucked away into firefighting.”
  • Infrastructure is repaired, not rebuilt – “priceless works of art”
  • System failures happen at the worst possible time, damaging IT’s reputation
  • Changes have a long fuse
  • One change can undo a whole series of changes

So how does an unhappy family move towards becoming more functional? Gene Kim has broken it down into four logical steps.

  • Phase 1 – Stabilize the Patient
    • Freeze changes outside maintenance window
    • First responders have all change related data at hand
  • Phase 2 – Find the Problem Child
    • Inventory your systems and identify systems with low change success, high repair time, high downtime business impact
  • Phase 3 – Grow your Repeatable Build Library
  • Phase 4 – Enable Continuous Improvement

In a little more detail:

  • Phase 1 – Stabilize The Patient
    • Goal: allow the highest possible change throughput with the least bureaucracy possible. No rubber stamping; the change request tracking system feeds info to first responders; ensure a solid backup plan.
    • Inventory applications and identify stakeholders and systems
    • Document new change management policy and change window with stakeholders
    • Institute weekly change management meetings
    • Eliminate access to all but authorized change managers
    • Electrify the fence with instrumentation, monitoring
      • you’ll be shocked at what you find!
      • this keeps the org from falling back into bad old habits, like a rock climber’s ratchet and rope
    • Failure Points
      • We won’t be able to get anything done!
      • The business pays us to make changes. Not to sit in boring CM meetings.
      • We trust our own people – they’re professionals and don’t need micromanaging.
      • We already tried that – it didn’t work
      • We believe there are no unauthorized changes.
  • Phase 2 – Find The Problem Children
    • Analyze assets, find fragile artifacts (use list from Phase 1)
    • Must be fast. Can’t freeze changes forever.
    • Soft freeze, where truly urgent changes during this period go through CAB.
    • Failure Points
      • Pockets of knowledge and proficiency
      • Servers are snowflakes – irreplaceable artifacts of mission critical infrastructure
  • Phase 3 – Grow Your Repeatable Build Library
    • Create an RM team (this shifts staff toward pre-production activities)
    • Take fragile artifacts in priority order – create golden builds stored in a software library
    • Separation of roles – devs have no access to production
    • Amount of unplanned changes (and related work) further drops
    • # of unique configurations in deployment drops, increasing server/admin ratio
    • Mitigates the “patch and pray” dilemma: patches are integrated into the RM process so they can be tested and safely rolled out
  • Phase 4 – Enable Continuous Improvement
    • This has to do with gathering metrics and measuring improvement along three lines – release, controls, and resolution.
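
Phase 2’s “find the problem children” step is essentially a ranking exercise. Here’s a minimal Python sketch, assuming you already have the Phase 1 inventory with a few numbers per system (change success rate, repair time, business impact). The equal-weight score is purely my own illustrative heuristic; Visible Ops doesn’t prescribe a formula, and the systems and figures are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SystemRecord:
    name: str
    change_success_rate: float  # 0.0-1.0, from change records
    mttr_hours: float           # mean time to repair
    impact: int                 # business impact of downtime, 1 (low) to 5 (high)

def fragility(s, max_mttr):
    """Illustrative score in [0, 1]: higher means more fragile."""
    failure_rate = 1.0 - s.change_success_rate
    repair = s.mttr_hours / max_mttr if max_mttr else 0.0
    impact = (s.impact - 1) / 4
    # Equal weights are an arbitrary choice; tune them to your business.
    return (failure_rate + repair + impact) / 3

def problem_children(inventory, top=3):
    """Names of the `top` most fragile systems, most fragile first."""
    max_mttr = max(s.mttr_hours for s in inventory)
    ranked = sorted(inventory, key=lambda s: fragility(s, max_mttr), reverse=True)
    return [s.name for s in ranked[:top]]

inventory = [
    SystemRecord("payroll-db", 0.70, 12.0, 5),
    SystemRecord("intranet-web", 0.95, 1.0, 1),
    SystemRecord("order-api", 0.85, 6.0, 4),
    SystemRecord("build-server", 0.90, 3.0, 2),
]
print(problem_children(inventory, top=2))
```

Whatever scoring you use, the output is the soft-freeze list: the systems at the top get frozen first and get golden builds first in Phase 3.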

  • Release – how efficiently and effectively can we generate and provision infrastructure?
    • Time to provision known good builds
    • Number of turns to a known good build
    • Shelf life of a build
    • % of systems that match known good builds
    • % of builds with security signoff
    • # of fast-tracked builds
    • Ratio of Release Engineers to System Admins
  • Controls – how effectively do we make good change decisions that keep infrastructure available, predictable and secure?
    • # of changes authorized per week
    • # of actual changes made per week
    • # of unauthorized changes
    • Change success rate
    • Changes submitted vs changes reviewed
    • Number of service-affecting outages
    • Number of emergency changes or “special” changes
    • Change management overhead (measure bureaucracy, lower is better!)
  • Resolution – when things go wrong, how effectively do we diagnose and resolve issue?
    • MTTR – Mean Time To Repair
    • MTBF – Mean Time Between Failure
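
Both resolution metrics fall straight out of an outage log. Here’s a minimal Python sketch, assuming each outage for a single system is recorded as a (start, end) pair. Definitions vary from shop to shop; here MTBF is taken as the average uptime between the end of one outage and the start of the next. The timestamps are invented.

```python
from datetime import datetime, timedelta

def mttr(outages):
    """Mean Time To Repair: average outage duration."""
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)

def mtbf(outages):
    """Mean Time Between Failures: average uptime between consecutive outages."""
    ordered = sorted(outages)
    gaps = [next_start - prev_end
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("MTBF needs at least two outages")
    return sum(gaps, timedelta()) / len(gaps)

outages = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 11, 0)),    # 2h outage
    (datetime(2024, 3, 8, 14, 0), datetime(2024, 3, 8, 14, 30)),  # 30m outage
    (datetime(2024, 3, 20, 2, 0), datetime(2024, 3, 20, 3, 30)),  # 90m outage
]
print("MTTR:", mttr(outages))
print("MTBF:", mtbf(outages))
```

For this log, MTTR works out to 1 hour 20 minutes and MTBF to a little over nine days. As configuration management and golden builds take hold, you want the first number going down and the second going up.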