Uncategorized

DevOps Stories – Interview with Nigel Kersten of Puppet

Nigel came to Puppet from Google HQ in Mountain View, where he was responsible for the design and implementation of one of the largest Puppet deployments in the world. At Puppet, Nigel was responsible for the development of the initial versions of Puppet Enterprise and has since served in a variety of roles, including head of product, CTO, and CIO. He’s currently the VP of Ecosystem Engineering at Puppet. He has been deeply involved in Puppet’s DevOps initiatives, and regularly speaks around the world about the adoption of DevOps in the enterprise and IT organizational transformation.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in mid 2019 and available now for pre-order!

The Deep End of the Pool

I grew up in Australia; I was lucky enough to be one of those kids that got a computer. It turns out that people would pay me to do stuff with them! So I ended up doing just that – and found myself at a local college, managing large fleets of Macs and handling a lot of multimedia and audio needs there. Very early in my career, I found hundreds of people – students and staff – very dependent on me to be The Man, to fix their problems. And I loved being the hero – there’s such a dopamine hit, a real rush! The late nights, the miracle saves – I couldn’t get enough.

Then the strangest thing happened – I started realizing there was more to life than work. I started getting very serious about music, to the point where I was performing. And I was trying a startup with a friend on the side. So, for a year or two, work became – for the first time – just work. Suddenly I didn’t want to spend my life on call, 24 hours a day – I had better things to do! I started killing off all my manual work around infrastructure and operations, replacing it with automation and scripts.

That led me to Google, where I worked for about five years. I thought I was a scripting and infrastructure ninja – but I got torn to shreds by the Site Reliability Engineers there. It was a powerful learning experience for me – I grew in ways I couldn’t have anywhere else. For starters, it was the deep end of the pool. We had a team of four managing 80,000 machines. And these weren’t servers in a webfarm – these were roaming laptops, suddenly appearing on strange networks, getting infected with malware, suffering from unreliable network connections. So we had to automate – we had no choice about it. As an Ops person, this was a huge leap forward for me – it forced me to sink or swim, really learn under fire.

Then I left for Puppet – I think I was employee #13 there – now we’re at almost 500 and growing. I’m the Chief Technical Strategist, but that’s still very much a working title – I run engineering and product teams, and handle a lot of our community evangelism and architectural vision. Really though it all comes down to trying to set our customers up for success.

Impoverished Communication

I don’t think our biggest challenge is ever technical – it’s much more fundamental than that, and it comes down to communication. There’s often a real disconnect between what executives think is true – what they are presenting at conferences and in papers – and what is actually happening on the ground. There’s a very famous paper from the Harvard Business Review back in the 70’s that said that communication is like water. Communication downwards is rarely a problem, and it works much better than most managers realize. However, open and honest communication up the chain is hard, like trying to pump water up a hill. It gets filtered or spun, as people report upwards what their manager wants to believe or what will reflect well on them – and next thing you know you have an upper management layer that thinks they are well informed but really is in an echo chamber. Just for example, take the Challenger shuttle disaster – technical data that clearly showed problems ahead of the explosion were filtered out, glossed over, made more optimistic for senior management consumption.

We see some enterprises out there struggling and it becomes this very negative mindset – “oh, the enterprise is slow, they make bad decisions, they’re not cutting edge.” And of course that’s just not true, in most cases. These are usually good people, very smart people, stuck in processes or environments where it’s difficult to do things the right way. Just for example, I was talking recently to some very bright engineers trying to implement change management, but they were completely stuck. This is a company that is about 100,000 people – for every action, they had to go outside their department to get work done. So piecemeal work was killing them – death by a thousand cuts.

Where To Start

In most larger enterprises aiming for complete automation, end to end, is somewhat of a pipe dream – just because these companies have so many groups and siloes and dependencies. But that’s not saying that DevOps is impossible, even in shared services type orgs. This isn’t nuclear science, it’s like learning to play the piano. It doesn’t require brilliance, it’s not art – it’s just hard work. It just takes discipline and practice, daily practice.

I have the strong impression that many companies out there SAY they are doing DevOps, whatever that means – but really it hasn’t even gotten off the ground. They’re still on Square 1, analyzing and trying to come up with the right recipe or roadmap that will fit every single use case they might encounter, past present and future. So what’s the best way forward if you’re stuck in that position?

Well, first off, how much control do you have over your infrastructure? Do you have the ability to provision your VM’s, self-service? If so you’ve got some more cards to play with. Assuming you do – you start with version control. Just pick one – ideally a system you already have. Even if it’s something ancient like Subversion – if that’s what you have, use it as your one single source of truth. Don’t try to migrate to latest and greatest hipster VC system. You just need to be able to programmatically create and revert commits. Put all your shell scripts in there and start managing your infrastructure from there, as code.

Now you’ve got your artifacts in version control and you’re using it as a single repository, right? Great – then talk to the people running deployments on your team. What’s the most painful thing about releases? Make a list of these items, and pick one and try to automate it. And always prioritize building blocks that can be consumed elsewhere. For example, don’t attempt to start by picking a snowflake production webserver and trying to automate EVERYTHING about it – you’ll just end up with a monolith of infrastructure code you can’t reuse elsewhere, your quality needle won’t budge. No, instead you’d want to take something simple and in common and create a building block out of it.

For example, time synchronization – it’s shocking, once you talk to Operations people, how something so simple and obvious as a timestamp difference between servers can cause major issues – forcing a rollback due to cascading issues or a troubleshooting crunch because the clocks on two servers drifted out of synch and it broke your database replication. That’s literally fixed in Linux by installing a single package and config. But think about the reward you’ll get in terms of quality and stability with this very unglamorous but fundamental little shift.

Take that list and work on what’s causing pain for your on-call people, what’s causing your deployments to break. The more you can automate this, the better. And make it as self-service as possible – instead of having the devs fire off an email to you, where you create a ticket, then provision test environments – all those manual chokepoints – wouldn’t it be better to have the devs have the ability to call an API or click on a website button and get a test environment spun up automatically that’s set up just like production? That’s a force multiplier in terms of improving your quality right at the get-go.

Now you’ve got version control, you can provision from code, you can roll out changes and roll them back. Maybe you add in inventory and discoverability of what’s actually running in your infrastructure. It’s amazing how few organizations really have a handle on what’s actually running, holistically. But as you go, you identify some goals and work out the practices you want to implement – then choose the software tool that seems the best fit.

Continuous Delivery Is The Finish Line

The end goal though is always the same. Your target, your goal is to get as close as you can to Continuous Integration / Continuous delivery. Aiming for continuous delivery is the most productive single thing an enterprise can do, pure and simple. There’s tools around this – obviously working for Puppet I have my personal bias as to what’s best. But pick one, after some thought – and play with it. Start growing out your testing skills, so you can trust your release gates.

With COTS products you can’t always adopt all of these practices – but you can get pretty close, even with big-splash, multi-GB releases. For example, you can use deployment slots and script as much as you can. Yes, there’s going to be some manual steps – but the more you can automate even this, the happier you’ll be.

Over time, kind of naturally, you’ll see a set of teams appear that are using CI/CD, and automation, and the company can point to these as success stories. That’s when an executive sponsor can step in and set this as a mandate, top down. But just about every DevOps success story we’ve seen goes through this pioneering phase where they’re trying things out squad by squad and experimenting – that’s a good thing. You can’t skip this, no more than a caterpillar can go right to being a butterfly.

DevOps Teams

At first I really hated the whole DevOps Team concept – and in the long term, it doesn’t make sense. It’s actually a common failure point – a senior manager starts holding this “A” team up as an example. This creates a whole legion of haters and enemies, people working with traditional systems who haven’t been given the opportunity to change like the cool kids – the guys always off at conferences, running stuff in the cloud, blah blah. But in the short term it totally has its place. You need to attach yourself to symbols that makes it clear you’re trying to change. If you try to boil the ocean or spin it out with dozens of teams, it gets diluted and your risk rises, it could lose credibility. Word of mouth needs to be in your favor, kind of like band t-shirts for teenagers. So you can start with a small group initially for your experiments – just don’t let it stay that way too long.

But what if you DON’T have that self-provisioning authority? Well there’s ways around that as well. You see departments doing things like doing capacity planning and reserving large pools of machines ahead of time. That’s obviously suboptimal and it’s disappearing now that more people are seeing what a powerful game-changer the cloud and self-provisioned environments are. The point is – very rarely are we completely shackled and constrained when it comes to infrastructure.

Automation and Paying Off Technical Debt

It’s all too easy to get bogged down in minutiae when it comes to automation. I said earlier that DevOps isn’t art, it’s just hard work – and that’s true. But focus that hard work on the things that really matter. Your responsibility is to make sure you guard your time and that of the people around you. If you’re not careful, you’ll end up replacing this infinite backlog of manual work you have to do with an infinite amount of tasks you need to automate. That’s really demoralizing, and it really hasn’t made your life that much better!

Let’s take the example of a classic three-tier web app you have onprem. And you’ve sunk a lot of time into it so that now it fails every week versus every 6 months – terrific! But for that next step – instead of trying to automate it completely end to end, which you could do – how could you change it so that its more service oriented, more loosely coupled, so your maintenance drops even more and changes are less risky? Maybe building part of it as a microservice, or putting up that classic Martin Fowler strangler fig, will give you this dramatic payoff you would never get with grinding out automation for the sake of automation and never asking if there’s a better way.

Paying off technical debt is a grind, just like paying off your credit card and paying off the mortgage. Of course you need to do that – but it shouldn’t be all you do! Maybe you’ll take some money and sink it into an investment somewhere, and get that big boost to your bottom line. So instead of mindlessly just paying off your technical debt, realize you have options – some great investment areas open to you – that you can invest part of your effort in.

Optimism Bias and Culture

This brings us right back to where we started, communication. There is a fundamental blind spot in a lot of books and presentations I see on DevOps, and it has to do with our optimism bias. DevOps started out as a grassroots, community driven movement – led and championed by passionate people that really care about what they’re doing, why they’re doing it. Pioneers like this are a small subset of the community though – but too often we assume ‘everyone is just like us’! What about the category a lot of people fall in – the ones who just want to show up, do their job, and then go home? If we come to them with this crusade for efficiency and productivity, it just won’t resonate with the 9 to 5 crowd. They like the job they have – they do a lot of manual changes, true, but they know how to do it, it guarantees a steady flow of work and therefore income, and any kind of change will not be viewed as an improvement – no matter how you try to sell it. You could call this “bad”, or just realize that not everyone is motivated by the same things or thinks the same way. In your approach, you may have to mix a little bit of pragmatism in with that DevOpsy-starry eyed idealism – think of different ways to reach them, work around them, or wait for a strong management drive to collapse this kind of resistance.

DevOps Stories – Interview with John Weers of Micron

John Weers is Senior Manager of DevOps and Software Quality at Micron. He works to build highly capable teams that trust each other, build high quality software, deliver value with each sprint and realize there’s more to life than work.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in mid 2019 and available now for pre-order!

Kickstarting a DevOps Culture

Some initial background – I lead on a team of passionate DevOps engineers/managers who are tasked with making our DevOps transformation work. While our group is only officially about 5 months old, we’ve all been working this separately for quite a while.

About every two weeks we have a group of about 15 DevOps experts that get together and talk – we call them the “design team”. That’s a critical touch point for us – we identify some problems in the organization, talk about what might be the best practice for them, and then use that as a base in making recommendations. So that’s how we set up a common direction and coordinate; but we each speak for and report to a different piece of the org. That’s a very good thing – I’d be worried if we were a separate group of architects, because then we’d get tuned out as “those DevOps guys”. It’s a different thing altogether if a recommendation is coming from someone working for the same person you do!

We’ve made huge strides when it comes to being more of a learning-type organization – which means, are we risk-friendly, do we favor experimentation? When there’s a problem, we’re starting to focus less on root cause and ‘how do we prevent this disaster from happening again’ – and more on, what did we learn from this? I see teams out there trying new things, experimenting with a new tool for automation – and senior management has responded favorably.

Our movement didn’t kick off with a bang. About 5 years ago, we came to the realization that our quality in my area of IT was poor. We knew quality was important, but didn’t understand how to improve it. Some of the software we were deploying was overly complex and buggy. In another area, the issue wasn’t quality but time – the manual test cycle was too long, we’re talking weeks for any release.

You can tell we’re making progress by listening to people’s conversations – it’s no longer about testing dates or coverage percentages or how many bugs we found this month, but “how soon can we get this into production?” – most of the fear is gone of a buggy release as we’ve moved up that quality curve. But it has been a gradual thing. I talked to everyone I could think of at conferences, about their experiences with DevOps. It took a lot of trial and error to find out what works with our organization. No one that I know of has hit on the magical formula right off the bat; it takes patience and a lot of experimentation.

Start With Testing

Our first effort was to target testing – automated testing, in our case using HP’s UFT and Quality Center platform. But there never was an all-hands-on-deck, call to “Do DevOps!” – that did happen, but it came two years later. We had to lay down the groundwork by focusing first on quality, specifically testing.

We’re five years along now and we are making progress, but don’t kid yourself that growth or a change in mindset happens overnight. Just the phrase “Shift Left” for example – we did shift our quality work earlier in the development process by moving to unit testing and away from UI/Regression testing. We found that it decreased our bugs in production by a very significant amount.

We went through a few phases – one where we had a small army of contractors doing test automation and regression testing against the UI layer. Quality didn’t improve, because of the he-said/she-said type interactions between the developers and QA teams in their different siloes. We tried to address interactions between different applications and systems with integration testing, and again found little value. The software was just too complex. Then we reached a point where we realized the whole dynamic needed to be rethought.

So, we broke up the QA org in its entirety, and assigned QA testers on each of our agile teams and said – you guys will sink or swim as a team. Our success with regression testing went up dramatically, once we could write tests along with the software as it was being developed. Once a team is accountable for their quality, they find a way of making it happen.

We got resistance and kickback from the developers, which was a little surprising. There was a lot of complaint when we first started requiring developers to write unit tests along with their code of it not being “value added” type activity. But we knew this was something that was necessary – without unit tests, by the time we knew there was a problem in integration or functional testing, it would often be too late to fix it in time before it went out the door.

So, we held the line and now those teams that have a comprehensive unit testing suite are seeing very few errors being released to production. At this point, those teams won’t give up unit testing because it’s so valuable to them.

“Shift Left” doesn’t mean throwing out all your integration and regression testing. You still need to do a little testing to make sure the user experience isn’t broken. “Shift Left” means test earlier in the process, but in my mind it also means that “our team” owns our quality.

Culture and Energy are the Limiting Points

If you want to “Do DevOps” as a solo individual, you’ll fail. You need other experts around you to share the load and provide ideas and help. A group is stronger than any individual.

Can I say – the tool is not the problem, ever? It’s always culture and energy. What I seem to find is, we can make progress in any area that I or another DevOps expert can personally inject some energy into. If I’m visible, if I talk to people, if I can build a compelling storyline – we make rapid progress. Without it, we don’t. It’s almost like starting a fire – you can’t just crumple up some newspaper, dump some kindling on it, light a match and walk away. You’ve got to tend it, constantly add material or blow on it to get something going.

We’re spread very thin; energy and time are limited, and without injecting energy things just don’t happen. That’s a very common story – it’s not that we’re lazy, or bad, or stupid – we work very hard, but there’s so much work to be done we can’t spare the cycles to look at how we’re going about things. Sometimes, you need an outside perspective to provide that new idea, or show a different way.

Lead By Listening

One of the base principles of DevOps is to find your area of pain and devote cycles into automating it. That removes a lot of waste, human defects, errors when you’re running a deployment. But that doesn’t resonate when I work with a team that’s new to DevOps. I don’t walk in there with a stone tablet of commandments, “here’s what you should do to do DevOps”. That’s a huge turn-off.

Instead, I start by listening. I talk to each team ask them how they go about their work, what they do, how they do it. Once we find out how things are working, we can also identify some problems – then we can come in and we can talk about how automation can address that problem in a way that’s specific to that team, how DevOps can make their world better. They see a better future and they can go after it.

Tools as an Incentive

I just said the tool isn’t the problem, but that doesn’t mean it’s not a critical part of the solution. I’m a techie at heart and I like a shiny new tool just as much as the next person. You can use tools as incentives to get new changes rolling. It’s a tough sell to walk into a meeting and pitch unit testing as a cure to quality issues if they take a long time to write. But if we talk about using Visual Studio Enterprise and how it makes unit tests simple and it’s able to run them real time, now it becomes easier to do unit testing than to test the old way. If we can show how these tools can shrink testing to be an afterthought instead of a week, now we have your attention!

About a year ago, our CIO set a mandate for the entire organization to excel at both DevOps and Agile. But the architecture wasn’t defined, no tools were specified. Which is terrific – DevOps and Agile is just a way of improving what we can do for the business. We now see different teams having different tech stacks and some variation in the tools based on what their pain point is and what their customers are needing. As a rule, we encourage alignment where it makes sense around either a technology stack or with a common leader. That provides enough alignment that teams can learn from each other and yet look for better ways of solving their issues.

The rule is that each main group in IT should favor a toolchain, but should choose software architecture that fits their business needs. In one area, for example, the focus is on getting changes into production as fast as possible. This is the cutting edge of the blade, so automation and fast turnaround cycles are everything. For them, microservices are a terrific option and the way that their development happens – it fits the business outcomes they want.

Do You Need the Cloud?

They’ll tell you that DevOps means the cloud; you can’t do it without rapid provisioning which means scalable architecture and massive cloud-based datacenters. But we’re almost 100% on-prem. For us, we need to keep our software, especially R&D, privately hosted. That hasn’t slowed us down much. It would certainly be more convenient to have cloud-based data centers and rapid provisioning, but it’s not required by any means.

Metrics We Care About

We focus on two things – lead time (or cycle time in the industry) and production impact. We want to know the impact in terms of lost opportunity – when the fab slows down or stops because of a change or problem. That resonates very well with management, it’s something everyone can understand.

But I tell people to be careful about metrics. It’s easy to fall in love with a metric and push it to the point of absurdity! I’ve don’t this several times. We’ve dabbled in tracking defects, bug counts, code coverage, volume of unit testing, number of regression tests – and all of them have a dark side or poor behavior that is encouraged. Just for example, let’s say we are tracking and displaying volume of regression tests. Suddenly, rather than creating a single test that makes sense, you start to see tests getting chopped up into dozens of tests with one step in them so the team can hit a volume metric. With bug counts – developers can classify them as misunderstood requirement rather than admitting something was an actual bug. When we went after code coverage, one developer wrote a unit test that would bring the entire module of code under test and ran that as one gigantic block to hit their numbers.

We’ve decided to keep it simple – we’re only going to track these 2 things – cycle time and production impact – and the teams can talk individually in their retrospectives about how good or bad their quality really is. The team level is also where we can make the most impact on quality.

I’ve learned a lot about metrics over the years from Bob Lewis’ IS Survivor columns. Chief among those lessons is to be very, very careful about the conversation you have with every metric. You should determine what success looks like, and then generate a metric that gives you a view of how your team is working. All subsequent conversations should be around “if we’re being successful” and not “are we achieving the metric.” The worst thing that can happen is that I got what I measured.

PMO Resistance

Sometimes we see some resistance from the BSA/PM layer. That’s usually because we’re leading with our left foot – the right way is to talk about outcomes. What if we could get code out the door faster, with a happier team, with less time testing, with less bugs? When we lead with the desired outcome, that middle layer doesn’t resist, because we’re proposing changes that will make their lives easier.

I can’t stress this enough – focus on the business outcomes you’re looking for and eliminate everything else. Only pursue a change if the outcome fits one of those business needs.

When we started this quality initiative, initially our release cycle averaged – I wish I was exaggerating – about 300 days. We would invest a huge amount of testing at every site before we would deploy. Today, we have teams with cycle times under 10 days. But that speed couldn’t happen unless our quality had gone up. We had to beef up our communication loop with the fab so if there was a problem we can stop it before it gets replicated.

The Role of Communication

You can’t overstate credibility. As we create less and less impact with changes we deploy, our relationship with our customers in the business gets better and better. Just for example, three years ago we had just gone through a disastrous communication tool patch that had grounded an entire site for hours. We worked through the problems internally and then I came to a plant IT director a year later and told them that we thought the quality issues were taken care of and enlisted their help.

Our next deployment required 5 minutes of downtime and had limited sporadic impact. And that’s been the last real impact we’ve had during software deployment for this tool in almost 3 years – now our deployments are automated and invisible to our users. Slowly building up that credibility and a good reputation for caring about the people you’re impacting downstream has been a big part of our effort.

Cross-Functional Teams

It’s commonly accepted that for DevOps to work you must be cross-functional. We are like many other companies in that we use a Shared Services model – we have several agile teams that include development, QA roles, an infrastructure team, and Operations which handles trouble tickets from the sites – each with their own leader. This might be a pain point in many companies, but for us it’s just how we work. We’ve learned to collaborate and share the pain so that we’re not throwing work over the fence. It’s not always perfect, but it’s very workable.

For example, in my area every week we have a recap meeting which Ops leads, where they talk about what’s been happening in production and work out solutions with the dev managers in the room. In this way the teams work together and feel each other’s pain. We’re being successful and we haven’t had to break up the company into fully cross-functional groups.

Purists might object to this – we haven’t combined Development and Operations, so can we really say that we are “doing DevOps”? If it would help us drive better business outcomes, that org reshuffling would have happened. But for us, since the focus is on business outcomes, not on who we report to, our collaboration cross team is good and getting better every day. We’re all talking the same language, and we didn’t have to reshuffle. We’re all one team. The point is to focus on the business outcomes and if you need to reorg, it will be apparent when teams talk about their pain points.

If It Comes Easy, It Doesn’t Stick

Circling back to energy – sometimes I sit in my office and wish that culture was easier to change. It’d be so great if there was a single metric we could align on, or a magical technique where I could flip a switch and everyone would get it and catch fire with enthusiasm. Unfortunately, that silver bullet doesn’t exist.

Sometimes I listen to Dave Ramsey on my way in to work – he talks about changing the family tree and getting out of debt. Something he said though resonated with me – “If it comes easy, it doesn’t stick.” If DevOps came easy for us, it wouldn’t really have the impact on our organization that we need. There’s a lot of effort, thought, suffering – pain, really – to get any kind of outcome that’s worth having.

As long as you focus on the outcome, I believe DevOps is a fantastic thing for just about any organization. But, if you view it as a recipe that you need to follow, or a checklist – you’re on the wrong track already, because you’re not thinking about outcomes. If you build from an outcome that will help your business and think backwards to the best way of reaching that outcome – then DevOps is almost guaranteed to work.

Achieving DevOps – the back story

In writing the book “Achieving DevOps“, we threw away easily as many words as we ended up keeping. I wish space would have allowed us to talk in more depth about waste, Mission Command, and some other principles that we could only skim over at best.

We talk about this in the book as well – but we’re so much in debt to the bright people out there and the lasting work they’ve done. Not all of these were directly referenced in the book, but all influenced us. We didn’t have room for them in the book, but we figure this might be a nice starting point.

In doing our research – which was something we were only able to pull away from with regret and a few sledgehammer whacks by our publisher – some books stood out as being especially amazing. These, I’ve put below with the book cover as an active hyperlink – you can go right to Amazon and buy it from there. (We don’t get paid in any way for this. It’s just to help give back a little.)

But really, the best books I’ve already talked about in my post on “Where To Start?”

OK, on to the hotlinks:

Chapter 2 – Ratcheting Change

[robha] – “A Counterintuitive Strategy for Building a Daily Exercise Habit”, Rob Hardy. Medium.com, 7/21/2017. https://betterhumans.coach.me/a-counterintuitive-strategy-for-building-a-lifelong-exercise-habit-13471da4e49d. A great article that first got us thinking about bright lines and activation energy.
[bjfth] – “Tiny Habits”, BJ Fogg, Stanford University, 1/1/2018. https://www.tinyhabits.com/
[bjthgs] – “Find a good spot in your life”, BJ Fogg. Stanford University, 1/1/2018. https://www.tinyhabits.com/good-spot
[jclat] – “Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones”, James Clear. Avery, 10/16/2018. ISBN-10: 0735211299, ISBN-13: 978-0735211292
[duhigg] – “The Power of Habit: Why We Do What We Do in Life and Business”, Charles Duhigg. Random House, 1/1/2014. ISBN-10: 081298160X, ISBN-13: 978-0812981605
[baume] – “Willpower: Rediscovering the Greatest Human Strength”, Roy Baumeister and John Tierney. Penguin Books, 8/28/2012. ISBN-10: 0143122231, ISBN-13: 978-0143122234
[jclub] – “Do Things You Can Sustain”, James Clear. https://jamesclear.com/upper-bound

Chapter 2 – Kanban

[hanselman] – “Maslow’s Hierarchy of Needs of Software Development”, Scott Hanselman. Hanselman.com, 1/8/2012. https://www.hanselman.com/blog/MaslowsHierarchyOfNeedsOfSoftwareDevelopment.aspx
[ferriss] – “The 4-Hour Workweek: Escape 9-5, Live Anywhere, and Join the New Rich”, Timothy Ferriss, December 2019, ISBN-10: 9780307465351, ISBN-13: 978-0307465351
[drift2] – My original writeup on Timothy Fenriss’ book – https://driftboatdave.com/2014/09/02/being-busy-is-a-form-of-laziness/
[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003
[forsgren] – “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, Nicole Forsgren PhD, Jez Humble, Gene Kim. IT Revolution Press, 3/27/2018. ISBN-10: 1942788339, ISBN-13: 978-1942788331

Chapter 2 – Reliability First

[treynor] – “Keys to SRE”, Ben Treynor. SRECon 2014, 5/30/2014. https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre
[srex] – “Resources”, unattributed author(s). Google. https://landing.google.com/sre/resources.html – The Google SRE resource page. Many times you can find some of the O’Reilly SRE books free as a download here.
[lunney] – “Postmortem Action Items: Plan the Work, Work the Plan”, John Lunney, Sue Lueder, Betsy Beyer. ;login, Spring 2017, Vol 42 No 1. https://storage.googleapis.com/pub-tools-public-publication-data/pdf/3eeb4c1d9073ca5910e49f5252cb3cf648487ac2.pdf. This is an outstanding doc for anyone looking to learn from how Google handles postmortems. Note the great checklist on action items post event.
[Hixson] “The Systems Engineering Side of Site Reliability Engineering”, David Hixson, Betsy Beyer – ;login, June 2015, Vol 40, No 3. https://www.usenix.org/system/files/login/articles/login_june_08_hixson.pdf
[toil] – “Invent more, toil less”, Betsy Beyer, Brendan Gleason, Dave O’Connor, Vivek Rau. ;login, Fall 2016, Vol 41, #3. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45765.pdf
[toil3] – “Repairing network hardware at scale with SRE principles”, James O’Keeffe. Google, 8/1/2018. https://cloudplatform.googleblog.com/2018/08/repairing-network-hardware-at-scale-with-sre-principles.html. Another toil reduction case study dealing with repairing network hardware.
[log60] – “Invent More, Toil Less”, Betsy Beyer, Brendan Gleasan, Dave O’Connor, Vivek Rau. Google, 8/1/2016. https://www.usenix.org/system/files/login/articles/login_fall16_08_beyer.pdf. A very good expansion on the Toil sections in the original [sre] book.
[sre] – “Site Reliability Engineering: How Google Runs Production Systems”, Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff, O’Reilly Media; 4/16/2016, ISBN-10: 149192912X, ISBN-13: 978-1491929124
[vargo3] – “SRE vs. DevOps: competing standards or close friends?”, Seth Vargo. Google Cloud Platform Blog, 5/8/2018. https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html
[lfj] – “SLIs, SLOs, SLAs, oh my!”, Liz Fong-Jones, Seth Vargo. YouTube, 3/8/2018. https://youtu.be/tEylFyxbDLE A great explanation of the use of metrics at Google. The entire series is very entertaining and a must-watch for SRE fans.
[okee] – “Repairing network hardware at scale with SRE principles”, James O’Keefe. Google Cloud Platform Blog, 8/1/2018. https://cloudplatform.googleblog.com/2018/08/repairing-network-hardware-at-scale-with-sre-principles.html For those interested in more details on how Google goes about automating its hardware so they are managed as a fleet – the classic “cattle vs pets” – this is one of the best discussions we’ve seen to date.
[srew] – “The Site Reliability Workbook: Practical Ways to Implement SRE”, edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, Stephen Thorne. O’Reilly Media, 8/4/2018. ISBN-10: 1492029505, ISBN-13: 978-1492029502. The earlier SRE book was outstanding; this is better, as it’s much more applicable outside of Google’s specific use case. Loved the contents; I just wish we’d been aware of this resource earlier in our research. The section on toil is particularly good, filled with practical tips for toil reduction based on real case studies.

Chapter 3 – Continuous Integration

[naik] – “Enabling Trunk Based Development with Deployment Pipelines”, Vishal Naik. Thoughtworks, 10/17/2015. https://www.thoughtworks.com/insights/blog/enabling-trunk-based-development-deployment-pipelines
fowlfb] – “FeatureBranch”, Martin Fowler. MartinFowler.com, 9/3/2009. https://martinfowler.com/bliki/FeatureBranch.html
[gitf] – “Understanding the GitHub flow”, unattributed author(s). GitHub, 11/30/2017. https://guides.github.com/introduction/flow/. The excellent GitHub Flow doc itself.
[fowlft] – “FeatureToggle”, Martin Fowler. MartinFowler.com, 10/29/2010. https://martinfowler.com/bliki/FeatureToggle.html
[hodft] – “Feature Toggles (aka Feature Flags)”, Pete Hodgson, MartinFowler.com, 10/9/2017. https://martinfowler.com/articles/feature-toggles.html . A more in depth discussion than [fowlft].
[boodm] – “How Chromium Works”, Aaron Boodman. Medium, 9/22/2015. https://medium.com/@aboodman/in-march-2011-i-drafted-an-article-explaining-how-the-team-responsible-for-google-chrome-ships-c479ba623a1b . Google Chrome was built using frequent checkins to mainline. – “So, how are the wheels still on the bus? In short: no branches, runtime switches, tons of automated testing, relentless refactoring, and staying very close to HEAD of our dependencies.”
[chen1] – “Stop cherry-picking, start merging, Part 1: The merge conflict”, Raymond Chen. Microsoft Developer blog, 3/12/2018. https://blogs.msdn.microsoft.com/oldnewthing/20180312-00/?p=98215
[chen2] – “Stop cherry-picking, start merging, Part 2: The merge conflict that never happened (but should have)”, Raymond Chen. Microsoft Developer blog, 3/13/2018. https://blogs.msdn.microsoft.com/oldnewthing/20180313-00/?p=98225. Both articles are good references on the potential downsides of cherry-picking – so common in Git. As he points out, it could blow up, or worse it could not blow up, leading to issues silently building up and propagating under the surface. This is good to keep in mind but hardly a universal law – the Azure DevOps team uses cherry-picking heavily in rolling out urgent bugfixes.
[ringms] – “Explore how to progressively expose your Azure DevOps extension releases in production to validate, before impacting all users”, Willy-Peter Schaub and others. Microsoft Docs, 4/25/2018. https://docs.microsoft.com/en-us/azure/devops/articles/phase-rollout-with-rings?view=azure-devops. A good overview on ring deployments at Microsoft and limiting the “blast radius”.
[dora2015] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2015. https://puppetlabs.com/2015-devops-report
[forsgren] – “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, Nicole Forsgren PhD, Jez Humble, Gene Kim. IT Revolution Press, 3/27/2018. ISBN-10: 1942788339, ISBN-13: 978-1942788331
[dora2017] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2017. https://puppetlabs.com/2017-devops-report
[kief] – “Infrastructure as Code: Managing Servers in the Cloud”, Kief Morris. O’Reilly Media, 6/27/2016. ISBN-10: 1491924357, ISBN-13: 978-1491924358
[thoms] – “Release Flow: How We Do Branching on the VSTS Team”, Edward Thomson. MSDN Blogs, 4/19/2018. https://blogs.msdn.microsoft.com/devops/2018/04/19/release-flow-how-we-do-branching-on-the-vsts-team/
[buchw] – “A Git Workflow for Continuous Delivery”, William Buchwalter. Microsoft TechNet, 6/26/2016. https://blogs.technet.microsoft.com/devops/2016/06/21/a-git-workflow-for-continuous-delivery/
[fowlbr] – “BranchByAbstraction”, Martin Fowler. MartinFowler.com, 1/7/2014. https://martinfowler.com/bliki/BranchByAbstraction.html
[wpbstf] – “Explore how to manage branching strategies with a DevOps mindset in Team Foundation Version Control (TFVC)”, Willy-Peter Schaub and others. Microsoft Docs, 4/24/2018. https://docs.microsoft.com/en-us/azure/devops/articles/effective-tfvc-branching-strategies-for-devops?view=vsts Some very solid recommendations here: start with a simple strategy, use a consistent naming convention, and two by-now familiar mantras: encourage consistent peer reviews and gated checkins with automated testing.
[newm] “Building Microservices: Designing Fine-Grained Systems”, Sam Newman. O’Reilly Media; 2/20/2015. ISBN-10: 1491950358, ISBN-13: 978-1491950357
[daws1] – “7 Signs You’re Mastering Continuous Integration”, Brian Dawson. DevOps.com, 7/18/2018. https://devops.com/7-signs-youre-mastering-continuous-integration/ .

Chapter 3 – Shift Left on Testing

[dora2017] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2017. https://puppetlabs.com/2017-devops-report
[clean] – “Clean Code: A Handbook of Agile Software Craftsmanship”, Robert C Martin. Prentice Hall, 8/11/2008. ISBN-10: 9780132350884, ISBN-13: 978-0132350884
[feathers] – “Working Effectively with Legacy Code”, Michael Feathers. Prentice Hall, 10/2/2004. ISBN-13: 978-0131177055, ISBN-10: 9780131177055. A true masterpiece. Most of us are not blessed with greenfield type projects; I can’t think of many people that wouldn’t benefit greatly from reading this book and understanding how to better tame that monolith looming in the background.
[refactmf] – “Refactoring: Improving the Design of Existing Code”, Martin Fowler. Addison-Wesley Signature Series, 11/30/2018. ISBN-10: 0134757599, ISBN-13: 978-0134757599
[crisp] – “Agile Testing: A Practical Guide for Testers and Agile Teams”, Lisa Crispin, Janet Gregory. Addison-Wesley Professional, 1/9/2009. ISBN-10: 9780321534460, ISBN-13: 978-0321534460
[crisp2] – “More Agile Testing: Learning Journeys for the Whole Team”, Lisa Crispin, Janet Gregory. Addison-Wesley Professional, 10/16/2014. ISBN-10: 9780321967053, ISBN-13: 978-0321967053
[14pt] – “Dr. Deming’s 14 Points for Management”, unattributed author(s). ASQ.org, https://deming.org/explore/fourteen-points
[forsgren] – “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, Nicole Forsgren PhD, Jez Humble, Gene Kim. IT Revolution Press, 3/27/2018. ISBN-10: 1942788339, ISBN-13: 978-1942788331
[freem] – “Growing Object-Oriented Software, Guided by Tests”, Steve Freeman, Nat Pryce. Addison-Wesley Professional, 10/22/2009. ISBN-10: 9780321503626, ISBN-13: 978-0321503626
[mesz] – “xUnit Test Patterns: Refactoring Test Code”, Gerard Meszaros. Addison-Wesley, 5/31/2007. ISBN-10: 9780131495050, ISBN-13: 978-0131495050. Particularly good in its discussion about dummy objects, fake obj, stubs, spies, and mocks.
[dbnm] – “No more excuses”, Donovan Brown. Donovanbrown.com, 12/12/2016. http://donovanbrown.com/post/no-more-excuses. Our personal battle cry when it comes to “asking for permission” to write unit tests.
[cohnx] – “The Forgotten Layer of the Test Automation Pyramid”, Mike Cohn. Mountain Goat Software, 12/17/2009. https://www.mountaingoatsoftware.com/blog/the-forgotten-layer-of-the-test-automation-pyramid
[williams] – “The Costs and Benefits of Pair Programming”, Alistair Cockburn, Laurie Williams, 1/1/2001. https://collaboration.csc.ncsu.edu/laurie/Papers/XPSardinia.PDF
[gucks] – “Moving 65,000 Microsofties to DevOps on the Public Cloud”, Sam Guckenheimer, 8/3/2017. https://www.visualstudio.com/learn/moving-65000-microsofties-devops-public-cloud/
[shahxr] – “Shift Left to Make Testing Fast and Reliable”, Munil Shah. Microsoft Docs, 11/8/2017. https://www.visualstudio.com/learn/shift-left-make-testing-fast-reliable/. A must-read for any serious QA devotee.
[shahyt] – “Combining Dev and Test in the Org”, Munil Shah. YouTube, 10/24/2017. https://www.youtube.com/watch?v=tj5mfW_gtRU. Microsoft’s decision to move to a single engineering organization where testing and development are unified was a game-changer.
[fowlbu] – “UnitTest”, Martin Fowler. MartinFowler.com, 5/5/2014. https://martinfowler.com/bliki/UnitTest.html
[fowltp] – “TestPyramid”, Martin Fowler, MartinFowler.com, 5/1/2012. https://martinfowler.com/bliki/TestPyramid.html
[cohn] – “Testing Pyramids & Ice-Cream Cones”, Alister Scott. Watirmelon, unknown date. https://watirmelon.blog/testing-pyramids/
[nonderminism] – “Eradicating Non-Determinism in Tests”, Martin Fowler, 4/14/2011. https://martinfowler.com/articles/nonDeterminism.html
[ddt] – “Defect Driven Testing: Your Ticket Out the Door at Five O’Clock”, Jared Richardson. Dzone.com, 8/4/2010. https://dzone.com/articles/defect-driven-testing-your . Note his thoughts on combating bugs, which tend to come in clusters, with what he calls ‘testing jazz’ – thinking in riffs with dozens of tests checking an issue like invalid spaces in input.
[stiny] – “You Are Your Software’s Immune System!”, Matt Stine. DZone.com, 7/20/2010. https://dzone.com/articles/you-are-your-softwares-immune
[molteni] – “Giving Up on test-first development”, Luca Molteni. iansommerville, 3/17/2016. http://iansommerville.com/systems-software-and-technology/giving-up-on-test-first-development/ The author found TDD unsatisfying because it encouraged conservatism, focused on detail vs structure, and didn’t catch data mismatches – which he later elaborated with other weak points, including reliance on a layered architecture, agreed upon success criteria, and a controllable operating environment. We disagree with most of his objections but agree with the cautionary note that there is no single universal engineering method that works in every and all cases.
[martin] – “The Three Laws of TDD”, Robert Martin. ButUncleBob.com, unknown date. http://butunclebob.com/ArticleS.UncleBob.TheThreeRulesOfTdd
[martin3] – “When TDD doesn’t work.”, Robert Martin. The Clean Code Blog, 4/30/2014. https://8thlight.com/blog/uncle-bob/2014/04/30/When-tdd-does-not-work.html
[humbleobj1] – “Refactoring code that accesses external services”, Martin Fowler. MartinFowler.com, 2/17/2015. https://martinfowler.com/articles/refactoring-external-service.html A great implementation of Humble Object and refactoring based on Bounded Contexts in this article.
[gruvle] – “Start and Scaling Devops in the Enterprise”, Gary Gruver. BookBaby, 12/1/2016. ISBN-10: 1483583589, ISBN-13: 978-1483583587
[gruv] – “Leading the Transformation: Applying Agile and DevOps Principles at Scale”, Gary Gruver, Tommy Mouser. IT Revolution Press, 8/1/2015. ISBN-10: 1942788010, ISBN-13: 978-1942788010. An in depth exploration of how HP was able to pull itself out of the mud of long test cycles – even with a labyrinth of possible hardware combinations.

Chapter 3 – Definition of Done, Family Dinner Code Reviews

[wieg] – “Humanizing Peer Reviews”, Karl Wiegers, Addison-Wesley, 11/2/2001, ISBN-13: 978-0201734850
[agilep1] – “The Joy of Peer Reviews (Part 1 – Code)”, The Agile Pirate, 4/14/2011, http://theagilepirate.net/archives/117
[agilep2] – “The Joy of Peer Reviews (Part 2 – Documentation)”, Simon Cromarty. The Agile Pirate, 5/24/2011. http://theagilepirate.net/archives/399 Both are excellent articles, including some nice simple checklists as a sample. “…remember the goal of a review is to share improvement opportunities, not for lazy coders to have someone else find their bugs for them or for staff to step on each other.”
[scrumdod] – “Walking Through a Definition of Done”, Ian Mitchell. Scrum.org, 5/31/2017. https://www.scrum.org/resources/blog/walking-through-definition-done
[joshi] – “Better Code Reviews”, Vaidehi Joshi. BetterCode.Reviews, unknown date. http://www.bettercode.reviews/ More antipatterns from informal survey. The comments are very insightful…
[kemp2] – “Giving better code reviews”, Joel Kemp. Medium, 1/24/2016. https://medium.com/@mrjoelkemp/giving-better-code-reviews-16109e0fdd36 – a plea for more than a brief glance by reviewers.
[codac] – “Code Review Etiquette”, unattributed author(s). Codacy.com, 10/20/2016. https://blog.codacy.com/code-review-etiquette-da212a7454c – some basic etiquette.
[cdhpr] – “Code Reviews: Just Do It”, Jeff Atwood. Coding Horror blog, 1/21/2006. https://blog.codinghorror.com/code-reviews-just-do-it. A true classic!
[jaimc] – “10 facts about code reviews and quality”, unattributed author(s). Codacy.com, 12/15/2016. https://blog.codacy.com/10-facts-about-code-reviews-and-quality-c5adf2e869fe
[schi1] – “Running an Effective Code Review”, Esther Schindler. CIO.com, 12/22/2008. https://www.cio.com/article/2431557/developer/running-an-effective-code-review.html – They noted that even if you are only “spot checking” some of the code being checked in there’s a measurable increase in quality.
[schi2] – “How NOT to Run a Code Review”, Esther Schindler. CIO.com, 12/22/2008. https://www.cio.com/article/2431553/developer/how-not-to-run-a-code-review.html – esp like the comments from Oliver Cole on the psychology behind criticism.
[sm10] – “10 tips to guide you toward effective peer code review”, unattributed author(s). Smartbear.com, unknown date. https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/
[mcd10] – “10 Principles of a Good Code Review”, Jason McDonald. Dev.to, 12/6/2017. https://dev.to/codemouse92/10-principles-of-a-good-code-review-2eg – A very nice 15 point checklist.
[atul] – “The Checklist”, Atul Gawande. New Yorker Magazine, 12/10/2007. https://www.newyorker.com/magazine/2007/12/10/the-checklist – Why do pilots use checklists as a standard prereq for any flight, while developers rarely or never use them?
[ibmrl] – “11 proven practices for more effective, efficient peer code review”, Jason Cohen. IBM, 1/25/2011. https://www.ibm.com/developerworks/rational/library/11-proven-practices-for-peer-review/ – Outstanding article.
[jarma] – “Giving and Receiving Great Code Reviews”, Sam Jarman. dev.to, 6/25/2017. https://dev.to/samjarman/giving-and-receiving-great-code-reviews – love the 6 specific questions the author looks for on pull requests.
[gruvle] – “Start and Scaling Devops in the Enterprise”, Gary Gruver. BookBaby, 12/1/2016. ISBN-10: 1483583589, ISBN-13: 978-1483583587. There’s definitely a lot of books out there with better quality graphics and a slicker presentation than this book. There are none with better content. Gary Gruver wrote one classic in “Leading the Enterprise“, about his transformation efforts at HP. This book is more of a workbook, and after reading it you’ll be in a much better position to analyze the flow of value. I can’t say enough about this book, though I tried on my blog… I’ve read it about three times, every six months or so, and always learn something new. You need to have it in your library.
[tylerh] – Interview with Tyler Hardison by Dave Harrison, see Appendix.
[terrja] – “Doing Terrible Things To Your Code”, Jeff Atwood. Coding Horror blog, 7/30/2015. https://blog.codinghorror.com/doing-terrible-things-to-your-code/ – the assumptions we commonly make as programmers about ‘simple’ thinks like names, dates, geography, gender, addresses etc – all are often wrong, and any good tester/mentor can and should expose them in the review process. Or, your users will…
[nonnenberg] – “Top ten pull request review mistakes”, Scott Nonnenberg. ScottNonnenberg.com, 1/25/2017. https://blog.scottnonnenberg.com/top-ten-pull-request-review-mistakes/
[rodgw] – “Why I Have Given Up on Coding Standards”, Richard Rodger. RichardRodger.com, 11/3/2012. http://www.richardrodger.com/2012/11/03/why-i-have-given-up-on-coding-standards/#.WxWABPZFxPY . The statement about power-mad architects definitely rings home with some of our past experiences.
[kief] – “Infrastructure as Code: Managing Servers in the Cloud”, Kief Morris. O’Reilly Media, 6/27/2016. ISBN-10: 1491924357, ISBN-13: 978-1491924358
The concept of a family dinner can be seen in several places – notably the Death and Company cookbook (“Death & Co: Modern Classic Cocktails”, Kaplan/Fauchald. Ten Speed Press, 1/1/2014. 978-1607745259) and in the Netflix series Chef’s Table (Christina Tosi, Season 4 Episode 1), and the osmosis learning process used by Bestia as described in https://www.huffingtonpost.com/2015/06/23/family-meal-restaurant_n_7566654.html. The way thoughts are asked at these dinners is very informative – for example, “I don’t think the Perry’s Tot and Sherry are playing well together. What about Old Tom?” “What if you split the rye whiskey with something lower proof?” “It’s nice but needs a bump.” This is a great practical example of kindergarten rules – informative, helpful, and not a personal attack.

Chapter 4 – Blameless Postmortems

[dora2016] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2016. https://puppetlabs.com/2016-devops-report
[lenci] – “The Five Dysfunctions of a Team: A Leadership Fable”, Patrick Lencioni. Jossey-Bass, 4/11/2002. ISBN-13: 978-0787960759, ISBN-10: 0787960756. Great on Audible, and some really thought-provoking content. I often turn back to this; and as mentioned in the book, it pairs up evenly with the Westrum study made famous by DORA.
[westrum] – “A typology of organisational cultures”, Ron Westrum. BMJ Quality & Safety, 2004;13:ii22-ii27, https://qualitysafety.bmj.com/content/13/suppl_2/ii22
[doran] – “There’s a S.M.A.R.T. Way to Write Management’s Goals and Objectives”, Doran, G. T. Management Review, Vol. 70, Issue 11, 1/1/1981. https://community.mis.temple.edu/mis0855002fall2015/files/2015/10/S.M.A.R.T-Way-Management-Review.pdf
[victorops] – “VictorOps Guide to Blameless Post-mortems”, unattributed author(s). VictorOps, 9/30/2014. https://www.slideshare.net/VictorOps/victor-ops-guide-to-blameless-post-mortems A great slideshare on how to set up and run a blameless postmortem.
[docaf65] – “DevOps Cafe Episode 65 – John interviews Damon”, John Willis, Damon Edwards. DevOps Café, 12/15/2015. http://devopscafe.org/show/2015/12/15/devops-cafe-episode-65-john-interviews-damon.html
[dekaiz] – “DOES15 – Damon Edwards – DevOps Kaizen Practical Steps to Start & Sustain a Transformation”, Damon Edwards. DevOps Enterprise Summit 2015, YouTube, 11/5/2015. https://www.youtube.com/watch?v=RT542sffJpM
[sre] – “Site Reliability Engineering: How Google Runs Production Systems”, Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff, O’Reilly Media; 4/16/2016, ISBN-10: 149192912X, ISBN-13: 978-1491929124. Appendix D has an excellent sample postmortem.
[allspaw] – “Blameless PostMortems and a Just Culture”, John Allspaw. Code as Craft / Etsy, 5/22/2012. https://codeascraft.com/2012/05/22/blameless-postmortems/ John Allspaw’s seminal post on how “blameless postmortems” actually work at Etsy. Note how they openly discuss attribution bias and how they plan to counter it.
[forsgren] – “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, Nicole Forsgren PhD, Jez Humble, Gene Kim. IT Revolution Press, 3/27/2018. ISBN-10: 1942788339, ISBN-13: 978-1942788331
[zwieb] – “Beyond Blame: Learning From Failure and Success”, Dave Zwieback. O’Reilly Media, 10/29/2015. ISBN-10: 1491906413, ISBN-13: 978-1491906415
[dickerson] – “Etsy’s Winning Secret: Don’t Play The Blame Game!”, Owen Thomas. Business Insider, 5/15/2012. http://www.businessinsider.com/etsy-chad-dickerson-blameless-post-mortem-2012-5
[dekker] – “Behind Human Error”, Sidney Dekker, David Woods. CRC Press, 9/30/2010. ISBN-13: 978-0754678342, ISBN-10: 0754678342. Etsy and other companies has mentioned this book and its discussion of First Stories vs Second Stories many times.
[schauenberg] – “Practical Postmortems at Etsy”, Daniel Schauenberg. InfoQ, 8/22/2015. https://www.infoq.com/articles/postmortems-etsy
[pullen] – “5 Whys – how we conduct blameless post-mortems after something goes wrong”, Noel Pullen, Hootsuite, http://code.hootsuite.com/blameless-post-mortems/ An excellent war story about how Google handles failure recovery and postmortems.
[milste] – “How to Run a 5 Whys (With Humans, Not Robots)”, Dan Milstein. The Lean Startup Conference, YouTube, 1/27/2013, https://www.youtube.com/watch?v=78qzrXIPn5Q
[niseq] – “Why Etsy engineers send company-wide emails confessing mistakes they made”, Max Nisen. Quartz, 9/18/2015. https://qz.com/504661/why-etsy-engineers-send-company-wide-emails-confessing-mistakes-they-made/
[malpas] – “Fallible Humans”, Ian Malpass. Indecorous.com, 7/20/2014, http://indecorous.com/fallible_humans/ An excellent case study on how failure is handled in practice at Etsy.
[fostps] – “Tool: Foster psychological safety”, unattributed author(s). re:work, Google, https://rework.withgoogle.com/guides/understanding-team-effectiveness/steps/foster-psychological-safety/
[dekkt] – “The Field Guide to Understanding Human Error”, Sidney Dekker. CRC Press; 6/28/2006. ISBN-10: 0754648265, ISBN-13: 978-0754648260
[harll] – “What blameless really means”, Jessica Harllee. JessicaHarllee.com, 3/10/2014. http://www.jessicaharllee.com/notes/what-blameless-really-means/ – One SRE’s thoughts about how blameless postmortems actually work in practice at Etsy.
[macri] – “Morgue: Helping Better Understand Events by Building a Post Mortem Tool”, Bethany Macri. DevOpsDays.org, Vimeo, 10/18/2013. https://vimeo.com/77206751 – How and why the Morgue postmortem tool was created at Etsy. This tool is publicly available on GitHub; see https://github.com/etsy/morgue
[joao] – InfoQ, “How Etsy Deploys More Than 50 Times a Day”, João Miranda. InfoQ Magazine, 3/17/2014. https://www.infoq.com/news/2014/03/etsy-deploy-50-times-a-day
[allspcf] – “Counterfactual Thinking, Rules, and The Knight Capital Accident”, John Allspaw. KitchenSoap.com, 10/29/2013. https://www.kitchensoap.com/2013/10/29/counterfactuals-knight-capital/ The best discussion I’ve seen to date on the Knight Capital disaster and the role of counterfactuals in our analysis.
[dbtmph] – “The Man Who Tried to Stop Pearl Harbor”, David J. Castello. The Daily Beast, 7/12/2016. https://www.thedailybeast.com/the-man-who-tried-to-stop-pearl-harbor . The story of George Elliott and his failure to prevent Pearl Harbor; a fantastic example of a “second story” long kept hidden.
[harrym] – “A Rough Patch”, Brian Harry. MSDN, 11/25/2013. https://blogs.msdn.microsoft.com/bharry/2013/11/25/a-rough-patch/ . One of the best real-world examples I’ve seen of a true blameless postmortem that has teeth, following several very high-visibility outages. “Either I’m going to get increasingly good at apologizing to fewer and fewer people or we’re going to get better at this. I vote for the latter.”
[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003
[arist] –”What Google Learned From Its Quest to Build the Perfect Team”, Charles Duhigg, 2/25/2016, NY Times Magazine, https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html

Chapter 4 – Hypothesis Driven Development

[donovan] – “Stop Getting Stuff Done After You Said You Couldn’t”, Donovan Brown. donovanbrown.com, 3/17/2017. http://donovanbrown.com/post/Stop-Getting-Stuff-Done-After-You-Said-You-Couldnt
[pokert] – “When is it OK to Fold Aces?”, Malcolm Clark. PokerTube.com, 6/22/2016. https://www.pokertube.com/article/when-is-it-ok-to-fold-aces
[pokerns] – “Would You Fold Pocket Aces Postflop In This Spot?”, Martin Harris. PokerNews, 5/8/2017. https://www.pokernews.com/strategy/would-you-fold-pocket-aces-postflop-in-this-spot-27861.htm . The source for the pocket aces fold story comes from this article.
[standish] – “Standish Group 2015 Chaos Report – Q&A with Jennifer Lynch”, Stéphane Wojewoda, Shane Hastie. InfoQ, 10/4/2015. https://www.infoq.com/articles/standish-chaos-2015 – From 2011-2015, the number of “successful” vs challenged/failed projects held rock steady at about 29%. Interestingly, the smaller the project was, the greater its chance of success; small projects had a 62% success rate versus only a 2-6% chance for grand/large sized projects.
[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003. It quotes Ronny Kohavi at MSFT as saying that after evaluating well-designed and executed experiments, only 1/3rd of features were successful at improving the key metric they were targeting!
[highsmith] – “Agile Project Management: Creating Innovative Products”, Jim Highsmith. Addison-Wesley, 1/1/2009. ISBN-13: 978-0321658395
[jjuicbo] – “Interview: Jim Johnson of the Standish Group”, Deborah Hartmann Preuss. InfoQ, 8/25/2006. https://www.infoq.com/articles/Interview-Johnson-Standish-CHAOS
[lean] – “Lean Enterprise: How High Performance Organizations Innovate at Scale”, Jez Humble, Joanne Molesky, Barry O’Reilly. O’Reilly Media, 1/3/2015. ISBN-10: 1449368425, ISBN-13: 978-1449368425. Excellent section by Ash Maury on “Running Lean” on traditional PMO orgs clash with hypothesis-driven development, and separately on the OODA loop.
[siddharta] – “The biggest waste in software development”, Siddharta X. Tools For Agile blog, 3/26/2010. http://toolsforagile.com/blog/archives/260/the-biggest-waste-in-software-development
[fowldich] – “UtilityVsStrategicDichotomy”, Martin Fowler. MartinFowler.com, 7/29/2010. https://martinfowler.com/bliki/UtilityVsStrategicDichotomy.html – Deciding when to buy versus build is a hard decision; Martin Fowler splits this up by asking if it drives actual value for the customer – or if it’s a utility function.
[morec] – “Lean and fast — using A3 to save your program”, John A. Moreci. Project Management Institute, 10/26/2014. https://www.pmi.org/learning/library/lean-fast-using-a3-save-program-9270
[lindea] – “Early Amazon: Shopping cart recommendations”, Greg Linden. Glinden.blogspot, 4/25/2006. http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html . A great account of an A/B test saving a valuable new feature early on for Amazon.
[kimbre] – “An Interview with Jez Humble on Continuous Delivery, Engineering Culture, and Making Decisions”, Kimbre Lancaster. split.io, 8/16/2018. https://www.split.io/blog/jez-humble-interview-decisions-2018/
[harris] – “Using feature flags in your app release management strategy”, Richard Harris. App Developer Magazine, 4/19/2018. https://appdevelopermagazine.com/5983/2018/4/16/Using-feature-flags-in-your-app-release-management-strategy/

Chapter 4 – Value Stream Mapping

[teams] – “Team of Teams: New Rules of Engagement for a Complex World”, Stanley McChrystal. Portfolio, 5/12/2015. ISBN-10: 1591847486, ISBN-13: 978-1591847489. The second most influential book we read, besides “The Power of Habit”. Highly recommended either printed or on Audible; it’s a fast read, and amazingly insightful.
[ohno] – “Toyota Production System: Beyond Large-Scale Production”, Taiichi Ohno. Productivity Press; 3/1/1988, ISBN-10: 0915299143, ISBN-13: 978-0915299140
[shingo] – “A Study of the Toyota Production System: From an Industrial Engineering Viewpoint (Produce What Is Needed, When It’s Needed)”, Shigeo Shingo, Andrew P. Dillon. Productivity Press; 10/1/1989. ISBN-10: 9780915299171, ISBN-13: 978-0915299171
[popp] – “Implementing Lean Software Development: From Concept to Cash”, Mary and Tom Poppendieck. Addison-Wesley Professional, 9/17/2006. ISBN-10: 0321437381, ISBN-13: 978-0321437389
[jeffmu] – “The Multitasking Myth”, Jeff Atwood. Coding Horror Blog, 9/27/2006. https://blog.codinghorror.com/the-multi-tasking-myth/
[liker] – “The Toyota Way: 14 Management Principles from the World’s Greatest Manufacturer”, Jeffrey K. Liker, McGraw-Hill Education; 1/7/2004, ISBN-10: 0071392319, ISBN-13: 978-0071392310
[devcaf65] – “DevOps Cafe Episode 62 – Mary and Tom Poppendieck”, Damon Edwards, John Willis. DevOps Café, 8/16/2015. http://devopscafe.org/show/2015/8/16/devops-cafe-episode-62-mary-and-tom-poppendieck.html
[willis] – “DevOps Culture (Part 1)”, John Willis. IT Revolution, 5/1/2012. https://itrevolution.com/devops-culture-part-1/ This is an extremely influential blog; I found myself turning back to it many times.

Chapter 5 – Small Cross Functional Teams

[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003
[domenic] – “Making Work Visible: Exposing Time Theft to Optimize Work & Flow”, Dominica Degrandis, 11/14/2017, IT Revolution Press; ISBN-10: 1942788150, ISBN-13: 978-1942788157
[mcchryst] – “Team of Teams: New Rules of Engagement for a Complex World”, Stanley McChrystal. Portfolio, 5/12/2015. ISBN-10: 1591847486, ISBN-13: 978-1591847489
[rother] – “Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results”, Mike Rother. McGraw-Hill Education, 8/4/2009. ISBN-10: 0071635238, ISBN-13: 978-0071635233

Chapter 5 – Configuration Management and Infrastructure as Code

[rbias] – “The History of Pets vs Cattle and How to Use the Analogy Properly”, Randy Bias. CloudScaling.com, 9/29/2016. http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
[kief] – “Infrastructure as Code: Managing Servers in the Cloud”, Kief Morris. O’Reilly Media, 6/27/2016. ISBN-10: 1491924357, ISBN-13: 978-1491924358
[cern] -“Are your servers PETS or CATTLE?”, Simon Sharwood. The Register, 3/18/2013. https://www.theregister.co.uk/2013/03/18/servers_pets_or_cattle_cern/
[guckiac] – “What is Infrastructure as Code?”, Sam Guckenheimer. Microsoft Docs, 4/3/2017. https://docs.microsoft.com/en-us/azure/devops/learn/what-is-infrastructure-as-code
[russd] – “It Takes Dev and Ops to Make DevOps”, Russ Collier. DevOpsOnWindows.com, 7/26/2013. http://www.devopsonwindows.com/it-takes-dev-and-ops-to-make-devops/
[puppiac] – “Infrastructure as code”, unattributed author(s). Puppet, unknown date. https://puppet.com/solutions/infrastructure-as-code – A great overview with videos of why IAC is so important
[newm] “Building Microservices: Designing Fine-Grained Systems”, Sam Newman. O’Reilly Media, 2/20/2015. ISBN-10: 1491950358, ISBN-13: 978-1491950357
[yevg] – “Terraform: Up and Running: Writing Infrastructure as Code”, Yevgeniy Brikman. O’Reilly Media, 3/27/2017. ISBN-10: 1491977086, ISBN-13: 978-1491977088
[sre] – “Site Reliability Engineering: How Google Runs Production Systems”, Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff, O’Reilly Media; 4/16/2016, ISBN-10: 149192912X, ISBN-13: 978-1491929124
[gruvle] – “Start and Scaling Devops in the Enterprise”, Gary Gruver. BookBaby, 12/1/2016. ISBN-10: 1483583589, ISBN-13: 978-1483583587

Chapter 5 – Security As Part of the Lifecycle

[payne] – “DevOps and Security: 5 Principles for DevSecOps”, Jeffrey Payne. TechWell, 8/3/2018. https://www.techwell.com/techwell-insights/2018/08/devops-and-security-5-principles-devsecops
[corman] – “DevOps Cafe Episode 63 – Josh Corman”, DevOps Café, 9/2/2015, http://devopscafe.org/show/2015/9/2/devops-cafe-episode-63-josh-corman.html
[rugged] – “The Rugged Manifesto”, unattributed author(s). RuggedSoftware.org, 1/1/2010. https://www.ruggedsoftware.org
[thmodel] – “Threat Modeling”, unattributed author(s). Microsoft Security Engineering, unknown date. https://www.microsoft.com/en-us/securityengineering/sdl/threatmodeling Microsoft’s approach to Threat Modeling and security as part of the lifecycle.
[douglci] – “Learn how to add continuous security validation to your CI/CD pipeline”, Mike Douglas and others. Microsoft Docs, 4/25/2018. https://docs.microsoft.com/en-us/vsts/articles/security-validation-cicd-pipeline?view=vsts
[prieur] – “ALM and DevOps – Secure and Deliver with Rugged DevOps”, Jean-Marc Prieur, Sam Guckenheimer. MSDN, 1/1/2016. https://msdn.microsoft.com/en-us/magazine/mt790188.aspx?f=255&MSPPError=-2147217396
[gucksec] – “Security In Your Continuous Integration Pipeline”, Sam Guckenheimer. WhiteSource, YouTube, 8/30/2017. https://www.youtube.com/watch?v=C1CPN0ArZJs
[gotim] – “A Definition of Done for DevSecOps”, Gene Gotimer. TechWell, 5/8/2018. https://www.techwell.com/techwell-insights/2018/05/definition-done-devsecops
[reed] – “Want rugged DevOps? Team up your release and security engineers”, J Paul Reed. TechBeacon, unknown date. https://techbeacon.com/want-rugged-devops-team-your-release-security-engineers
[barth] – “Deflating news: Bouncy Castle BKS-V1 keystore files not adequately protected”, Bradley Barth. SC Media, 3/19/2018. https://www.scmagazine.com/deflating-news-bouncy-castle-bks-v1-keystore-files-not-adequately-protected/article/751885/
[owasp] – “OWASP Periodic Table of Vulnerabilities”, unattributed author(s). OWASP, 2/12/2016. https://www.owasp.org/index.php/OWASP_Periodic_Table_of_Vulnerabilities#tab=Periodic_Table_of_Vulnerabilities
[zanel] – “DevSecOps: How to Use DevOps to Make You More Secure”, Zane Lackey. IT Revolution, 8/26/2018. https://itrevolution.com/devsecops-zane-lackey/

Chapter 5 – Automated Jobs and Dev Production Support

[maun] – “Rundeck Helps Ticketmaster Reshape Operations”, unattributed author(s). Rundeck.org, 1/1/2015. http://rundeck.org/stories/mark_maun.html – Note the strong objections by both developers and Operations (costs, risks, SOX and security compliance, straightjacketed solution sets and loss of control). This resistance dropped on both sides as a lengthy pilot period proved that runbooks provided both simplicity and auditable, repeatable, and traceable action steps that simplified troubleshooting.
[pagr] – “Incident Response”, unattributed author(s). PagerDuty, unknown date. https://response.pagerduty.com/ An excellent documentation hub on how to handle initial response.
[mulkey2] – “DevOps Cafe Episode 61 – Jody Mulkey”, John Willis, Damon Edwards. DevOps Café, 7/27/2015. http://devopscafe.org/show/2015/7/27/devops-cafe-episode-61-jody-mulkey.html
[newm] “Building Microservices: Designing Fine-Grained Systems”, Sam Newman. O’Reilly Media; 2/20/2015. ISBN-10: 1491950358, ISBN-13: 978-1491950357
[sharma] – “The DevOps Adoption Playbook: A Guide to Adopting DevOps in a Multi-Speed IT Enterprise”, Sanjeev Sharma. Wiley, 2/28/2017. ISBN-10: 9781119308744, ISBN-13: 978-1119308744
[gruvle] – “Start and Scaling Devops in the Enterprise”, Gary Gruver, BookBaby, 12/1/2016. ISBN-10: 1483583589, ISBN-13: 978-1483583587
[sre] – “Site Reliability Engineering: How Google Runs Production Systems”, Niall Richard Murphy, Betsy Beyer, Chris Jones, Jennifer Petoff, O’Reilly Media; 4/16/2016, ISBN-10: 149192912X, ISBN-13: 978-1491929124

Chapter 6 – Metrics and Monitoring

[babb] – “Fly-Fishin’ Fool: The Adventures, Misadventures, and Outright Idiocies of a Compulsive Angler”, James Babb. Lyons Press; 4/1/2005. ISBN-10: 1592285937, ISBN-13: 978-1592285938
[theart] – “The Art of Monitoring”, James Turnbull. Amazon Digital Services LLC, 6/8/2016. ASIN: B01GU387MS. Perhaps the best overall discussion we’ve seen of monitoring and a very good, explicit implementation of the ELK stack to handle aggregation and dashboarding. See my blog post for more on this outstanding work.
[guckenheimer2] – “Moving 65,000 Microsofties to DevOps on the Public Cloud”, Sam Guckenheimer. Microsoft Docs, 8/3/2017. https://docs.microsoft.com/en-us/azure/devops/devops-at-microsoft/moving-65000-microsofties-devops-public-cloud
[hawthorne] – “The Hawthorne effect”, Tom Hindle. The Economist, 11/3/2008. https://www.economist.com/news/2008/11/03/the-hawthorne-effect
[baer] – “How Changing One Habit Helped Quintuple Alcoa’s Income”, Drake Baer. Business Insider, 4/19/2014. https://www.businessinsider.com/how-changing-one-habit-quintupled-alcoas-income-2014-4
[popp4] – “DevOps Cafe Episode 62 – Mary and Tom Poppendieck”, John Willis, Damon Edwards. DevOps Café, 8/16/2015. http://devopscafe.org/show/2015/8/16/devops-cafe-episode-62-mary-and-tom-poppendieck.html
[visible] – “The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps”, Kevin Behr, Gene Kim, George Spafford. Information Technology Process Institute, 6/15/2005. ISBN-10: 0975568612, ISBN-13: 978-0975568613. We wish this short but powerful book was better known. Like Continuous Delivery”, it’s aged well – and most of its precepts still hold true. It resonates particularly well with IT managers and Operations staff.
[rayg2] – “Customer focus and making production visible with Raygun”, Damian Brady. Channel9, 2/8/2018. https://channel9.msdn.com/Shows/DevOps-Lab/Customer-focus-and-making-production-visible-with-Raygun?WT.mc_id=dlvr_twitter_ch9
[hubbard] – “How to Measure Anything: Finding the Value of Intangibles in Business”, Douglas Hubbard. Wiley Publishing, 3/17/2014. ISBN-10: 9781118539279, ISBN-13: 978-1118539279
[turnbull] – “DevOps Cafe Episode 70 – James Turnbull”, John Willis, Damon Edwards. DevOps Café, 10/26/2016. http://devopscafe.org/show/2016/10/26/devops-cafe-episode-70-james-turnbull.html
[cockr] – “DevOps Cafe Episode 50 – Adrian Cockcroft”, John Willis, Damon Edwards. DevOps Café, 7/22/2014. http://devopscafe.org/show/2014/7/22/devops-cafe-episode-50-adrian-cockcroft.html. I love this interview in part for Adrian calling out teams that are stuck in analysis paralysis – and the absurdity of not giving teams self-service environment provisioning. “First I ask… are you serious?”
[julian] – “Practical Monitoring: Effective Strategies for the Real World”, Mike Julian. O’Reilly Media, 11/23/2017. ISBN-10: 1491957352, ISBN-13: 978-1491957356. I think this may actually be a little better than “The Art of Monitoring” – though that’s also a book we loved and found value in – just because there’s less of a narrow focus on the ELK stack.
[habit] – “The Power of Habit: Why We Do What We Do in Life and Business”, Charles Duhigg. Random House, 1/1/2014. ISBN-10: 081298160X, ISBN-13: 978-0812981605
[bejtlich] – “The Practice of Network Security Monitoring: Understanding Incident Detection and Response”, Richard Bejtlich. No Starch Press, 7/15/2013. ISBN-10: 1593275099, ISBN-13: 978-1593275099

Chapter 6 – Feature Flags and Continuous Delivery

[mugrage] – “It’s Not Continuous Delivery if You Can’t Deploy Right Now”, Ken Mugrage. InfoQ, 7/20/2018. https://www.infoq.com/presentations/cd-deployment-pipelines
[danno] – “The Journey to Continuous Delivery”, Dan North. InfoQ Magazine, 4/10/2018. https://www.infoq.com/presentations/cd-business-agility . Dan advocates not attempting to boil the ocean, but choosing one tasty, low-lying project to go after that drives real business value.
[cd] – “Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation”, Jez Humble, David Farley. Addison-Wesley Professional, 8/6/2010. ISBN-10: 9780321601919, ISBN-13: 978-0321601919.
This book made a giant splash when it first came out and is still having great impact to this day. I reviewed it on my blog; suffice to say, it’s a must-have.
[kimbre] – “An Interview with Jez Humble on Continuous Delivery, Engineering Culture, and Making Decisions”, Kimbre Lancaster. split.io, 8/16/2018. https://www.split.io/blog/jez-humble-interview-decisions-2018/
[harris] – “Using feature flags in your app release management strategy”, Richard Harris. App Developer Magazine, 4/19/2018. https://appdevelopermagazine.com/5983/2018/4/16/Using-feature-flags-in-your-app-release-management-strategy/
[patang] – “Best of Velocity: Move Fast and Ship Things – Facebook’s Operational and Release Processes”, Girish Patangay. O’Reilly Media, YouTube, 9/9/2013. https://www.youtube.com/watch?v=dDf2t-E_Ea8&feature=youtu.be&t=11m20s A great 18-minute detail going over Facebook’s implementation of feature flags and safely introducing changes.
[baker] – “Feature Flag-Driven Development”, Justin Baker. LaunchDarkly, 11/7/2015. https://launchdarkly.com/blog/feature-flag-driven-development/ A very readable overview, including some nifty graphical descriptions of the different use cases with FF.
[harmes] – “Flipping Out”, Ross Harmes. 12/2/2009, Flickr. http://code.flickr.net/2009/12/02/flipping-out/ . A very influential post (if short!) that describes Flickr’s release patterns and use of feature flags.
[ldbp] – “Best Practices”, unattributed author(s). GitHub, 3/5/2018. https://github.com/launchdarkly/featureflags/blob/master/5%20-%20Best%20Practices.md – answers common questions around naming and usage conventions, and the importance of giving access to non-devs.
[lduc] – “Use Cases”, unattributed author(s). LaunchDarkly.com, unknown dates. https://launchdarkly.com/use-cases/?utm_source=launchdarkly_blog&utm_medium=organic
[ffio] – “Open Source Resources”, unattributed author(s). FeatureFlags.IO, unknown dates. http://featureflags.io/resources/ – An outstanding documentation and guidance hub.
[bird] – “Feature Toggles are one of the worst kinds of Technical Debt”, Jim Bird. SwReflections.Blogspot, 8/6/2014. http://swreflections.blogspot.com/2014/08/feature-toggles-are-one-of-worst-kinds.html. It’s hard to argue with Jim’s list of risks: that feature flags are meant to be short-lived and represent technical debt if left untended; if overused they can become an antipattern. Once again, there are no silver bullets.
[ds2014] – “Knightmare: A DevOps Cautionary Tale”, Doug Seven. DougSeven.com, 4/7/2014. https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/ Absolutely chilling. This is the story of how a company with nearly $400 million in assets went bankrupt in 45 minutes, all because of a failed deployment. Here the real issue wasn’t the reliance on feature flags – it’s what wasn’t there, better automation around configuration, a well-rehearsed deployment cycle, and robust testing. Feature flags can’t compensate for the lack of automation and good process, or believing that handing off a written set of instructions is repeatable and foolproof. As Google says, “hope is not a strategy.”
[garve] – “Better development with Feature Flags”, Leonard Garvey. reinteractive.com, 10/28/2014. https://reinteractive.com/posts/220-better-development-with-feature-flags Leonard describes a few more little-known benefits of feature flags – including it makes collaboration with other developers easier, reduces the risk of conflicting code, and provides the ability to roll out immature features with less risk. We wouldn’t use the exact code implementation he describes today, but the principles hold true.
[travisci] – “Using Feature Flags to Ship Changes with Confidence”, Mathias Meyer. Travis-CI.com, 3/4/2014. https://blog.travis-ci.com/2014-03-04-use-feature-flags-to-ship-changes-with-confidence. How one company uses feature flags to enable CI, including some nice implementation details using Ruby.
[mfbl] – “FeatureToggle”, Martin Fowler. MartinFowler.com, 10/29/2010. https://martinfowler.com/bliki/FeatureToggle.html
[bakerx] – “Enterprise Requirements for Managing Feature Flags”, Justin Baker. LaunchDarkly, 3/4/2016. https://blog.launchdarkly.com/enterprise-requirements-for-managing-feature-flags/ A nice overview of how to manage the lifecycle of feature flags so they don’t become technical debt.
[wang4] – “Microsoft’s Abel Wang on the Key to Implementing Advanced DevOps: Feature Flags”, Becky Nagel. Visual Studio Magazine, 2/5/2018. https://visualstudiomagazine.com/articles/2018/02/02/advanced-devops.aspx?m=1 We’re huge fans of LaunchDarkly at Microsoft.
[medx] – “Edith Harbaugh, LaunchDarkly”, unattributed author(s). Medium DFJ Posts, 2/8/2018. https://medium.com/dfj-vc/edith-harbaugh-launchdarkly-3cadf0123f15 . “Everyone talks about knowing real customer needs, but every customer will tell you something different. I want to know what people actually want and build that, rather than build stuff that nobody wants.”
[hodgx] – “Progressive Experimentation with Feature Flags”, Buck Hodges. Microsoft Docs, 11/13/2017. https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/progressive-experimentation-feature-flags A very detailed overview of how Microsoft has applied feature flags with Azure DevOps.
[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003. From a 2009 John Allspaw letter to Flickr, page 173.

Chapter 6 – Disaster Recovery and Gamedays

[dyn1] – “The Dynatrace Unbreakable Pipeline in Azure DevOps and Azure? Bam!”, Abel Wang. AbelSquidHead.com, 8/3/2018. https://abelsquidhead.com/index.php/2018/08/03/the-dynatrace-unbreakable-pipeline-in-Azure DevOps-and-azure-bam/ We would have loved to have gone into much more detail around self-healing CD pipelines and especially the advances made by Dynatrace. Monitoring as Code as a concept is rapidly growing in popularity; we love the application of using automated monitoring for a more viable go/no go decision, and having monitoring (monspec) files kept in source control right next to the other infrastructure and source code of the project.
[dyn2] – “Unbreakable DevOps Pipeline: Shift-Left, Shift-Right & Self-Healing”, Andreas Grabner. DynaTrace, 2/9/2018. https://www.dynatrace.com/news/blog/unbreakable-devops-pipeline-shift-left-shift-right-self-healing/ A great walkthrough of implementing an unbreakable CD pipeline, in this case using AWS Lambda functions and Dynatrace. Andreas makes a great case for applying the Shift-Left movement to monitoring as code.
[dop65] – “DevOps Cafe Episode 65 – John interviews Damon”, John Willis, Damon Edwards. DevOps Café, 12/15/2015. http://devopscafe.org/show/2015/12/15/devops-cafe-episode-65-john-interviews-damon.html A great discussion about the antipatterns around the releases and the dangerous illusion of control that many managers suffer from. In one company, they had less than 1% of CAB submittals rejected – out of 2,000 approved. Those that were rejected often had not filled out the correct submittal form! As Damon brought out, all this activity was three degrees removed from the keyboard – those making the approvals really had very little idea of what was actually going on. [dop65]
[dri2] – “Monitoring, and Why It Matters To You”, Dave Harrison. driftboatdave.com, 4/4/2017. https://driftboatdave.com/2017/04/04/monitoring-and-why-it-matters-to-you/ A more complete discussion of the vicious vs virtuous cycle described in this section, along with some specific examples from Etsy’s groundbreaking work around monitoring.
[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003. There’s an excellent story by Heather Mickman of Target about what it took to yank an antique process centered around what they called the TEAP-LARB form. “The surprising thing was that no one knew, outside of a vague notion that we needed some sort of governance process. Many knew that there had been some sort of disaster that could never happen again years ago, but no one could remember exactly what that disaster was.”
[forsgren] – “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, Nicole Forsgren PhD, Jez Humble, Gene Kim. IT Revolution Press, 3/27/2018. ISBN-10: 1942788339, ISBN-13: 978-1942788331
[dora2017] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2017. https://puppetlabs.com/2017-devops-report
[mcchrystal] – “Team of Teams: New Rules of Engagement for a Complex World”, Stanley McChrystal. Portfolio, 5/12/2015. ISBN-10: 1591847486, ISBN-13: 978-1591847489. The author notes that top-down decisionmaking (as with CAB meetings) has the effect of sapping firepower and initiative; this was echoed by Brian Blackman and Anne Steiner in their interviews in the Appendix section. The military has learned the limitations of higher command, and strives not to command more than is necessary or plan beyond the circumstances that can be foreseen. Orders are given that define and communicate the intent, but the execution strategy is often left up to the individual units.
[catafl] – “CatastrophicFailover”, Martin Fowler. MartinFowler.com, 3/7/2005. https://martinfowler.com/bliki/CatastrophicFailover.html . A vivid description of a cascading failure and the complexities associated with event-driven architectures that informed the failure Alex experienced in this section.
[matr] – “Making Matrixed Organizations Successful with DevOps: Tactics for Transformation in a Less Than Optimal Organization”, Gene Kim. IT Revolution DevOps Enterprise Forum 2017. https://itrevolution.com/book/making-matrixed-organizations-successful-devops/ A good discussion on how and why to form a cross-functional team, starting with the leadership level.
[gruvle] – “Start and Scaling Devops in the Enterprise”, Gary Gruver. BookBaby, 12/1/2016. ISBN-10: 1483583589, ISBN-13: 978-1483583587

Chapter 7 – Microservices

[newm] “Building Microservices: Designing Fine-Grained Systems”, Sam Newman. O’Reilly Media; 2/20/2015. ISBN-10: 1491950358, ISBN-13: 978-1491950357. SUCH a great book, definitely on my top 3 list on this subject.
[bbom] – “Big Ball of Mud”, Brian Foote and Joseph Yoder. University of Illinois at Urbana-Champaign, 6/26/1999. http://www.laputan.org/mud/mud.html Based on a presentation at the Fourth Conference on Patterns Languages of Programs 1997, the original and very well known “big ball of mud” paper.
[yarrow] – “The Org Charts Of All The Major Tech Companies”, Jay Yarrow. Business Insider, 6/29/2011, https://www.businessinsider.com/big-tech-org-charts-2011-6
[manu] – “The Google Doodler”, Manu Cornet. Ma.nu, 2011. http://ma.nu/about/aboutme/2013.07.15_theartofdoing_googler_doodler.pdf
[feathers] – “Working Effectively with Legacy Code”, Michael Feathers. Prentice Hall, 10/2/2004. ISBN-13: 978-0131177055, ISBN-10: 9780131177055
[fowl2] – “Microservices”, James Lewis and Martin Fowler. MartinFowler.com, 3/25/2014. https://martinfowler.com/articles/microservices.html
[yegge] – “Stevey’s Google Platforms Rant”, Steve Yegge. Gist.github.com, 1/11/2011. https://gist.github.com/chitchcock/1281611 – a now legendary rant about platforms by a software architect that worked early on at both Google and Amazon. Steve did NOT get fired for his little “reply all” oopsie, shockingly – which tells you a lot about the positive traits of Google’s culture right there.
[dign] – “Little Things Add Up”, Larry Dignan. Baseline Magazine, 10/19/2005. http://www.baselinemag.com/c/a/Projects-Management/Profiles-Lessons-From-the-Leaders-in-the-iBaselinei500/3 – “Small teams are fast… and don’t get bogged down. … each group assigned to a particular business is completely responsible for it… the team scopes the fix, designs it, builds it, implements it and monitors its ongoing use.”
[sfowl] – “Production-Ready Microservices: Building Standardized Systems Across an Engineering Organization”, Susan Fowler. O’Reilly, 12/1/2016. ISBN-10: 1491965975, ISBN-13: 978-1491965979. Susan points out that there’s always a balance between speed and safety; the key is to start with a clear goal in mind. Her thoughts around alerts and dashboarding are very well thought out. Even better, it hits perhaps the one true weak point of microservices right on the head; the need for governance. She found it most effective to have a direct pre-launch overview with the development team going over the design on a whiteboard; within ten minutes, it will become apparent if the solution was truly production-ready. If you have only one book to read on microservices – this is it.
[conw2] – “How Do Committees Invent?”, Melvin Conway. MelConway.com, 4/1/1968. http://www.melconway.com/Home/Committees_Paper.html – The original paper as submitted by Melvin Conway. Famously the Harvard Business Review rejected Melvin’s original paper due to lack of proof; Datamation ended up publishing it in April 1968, and Fred Brook’s classic book “The Mythical Man-Month” made it famous. Rarely has such a small splash made such a big ripple.
[nacha] – “The Influence of Organizational Structure On Software Quality: An Empirical Case Study”, Nachiappan Nagappan, Brendan Murphy, Victor Basili, and Nachi Nagappan. Microsoft Research, 1/1/2008. https://www.microsoft.com/en-us/research/publication/the-influence-of-organizational-structure-on-software-quality-an-empirical-case-study/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F70535%2Ftr-2008-11.pdf – A very nice metrics-based backup to what we read in “The Mythical Man-Month”, as shown with the troubled Windows Vista release at Microsoft. Here in a recap of that disastrous release, the researchers found that the structure of the organization was the most relevant predictor of failure-prone applications – versus traditional KPIs like churn, complexity, coverage, and bug counts. We suspect that this paper and others like it influenced the decision by Microsoft to upend the structure of their program teams for Azure DevOps and Bing.
[grint] – “Splitting the organization and integrating the code: Conway’s law revisited”, Rebecca Grinter, James D. Herbsleb. ACM Digital Library, 5/22/1999. https://dl.acm.org/citation.cfm?id=302455. Interestingly, while the Nachiappan study above mentioned that globally distributed teams didn’t perform worse than collocated teams, this paper says the opposite – collocated teams are better functioning than globally distributed. It turns out that when you control for team size, both are correct: the greatest limiting factor was that old enemy, communications overhead. In other words, it doesn’t seem to matter as much if a team is collocated vs distributed, as long as we cap the size to that magical 5-12 number.
[lightst] – “The Only Good Reason to Adopt Microservices”, Vijay Gill. LightStep.com, 7/19/2018. https://lightstep.com/blog/the-only-good-reason-to-adopt-microservices/
[kimbre] – “An Interview with Jez Humble on Continuous Delivery, Engineering Culture, and Making Decisions”, Kimbre Lancaster. split.io, 8/16/2018. https://www.split.io/blog/jez-humble-interview-decisions-2018/
[fami] – “Microservices, IoT, and Azure: Leveraging DevOps and Microservice Architecture to deliver SaaS Solutions”, Bob Familiar. Apress, 10/20/2015. ISBN-10: 9781484212769, ISBN-13: 978-1484212769. The best book we’ve seen out there on IoT in the Microsoft space, by a long shot. Bob Familiar does a terrific job of explaining IoT and microservices in context.
[fowl4] – “StranglerApplication”, Martin Fowler. MartinFowler.com, 6/29/2004. https://www.martinfowler.com/bliki/StranglerApplication.html
[narum] – “Strangler Pattern”, Masashi Narumoto and Mike Wasson. Microsoft Docs, 6/22/2014, https://docs.microsoft.com/en-us/azure/architecture/patterns/strangler A good quick overview of how we can use the strangler pattern to chip away and eventually deprecate a massive legacy app. Mike Wasson in particular may be one of the best technical writers we’ve got at Microsoft.
[calca] – “Building Products at SoundCloud —Part I: Dealing with the Monolith”, Phil Calcado. Soundcloud, 6/11/2014. https://developers.soundcloud.com/blog/building-products-at-soundcloud-part-1-dealing-with-the-monolith
[hodg1] – “Azure DevOps: From Monolith to Cloud Service”, Buck Hodges. YouTube, 10/24/2017. https://www.youtube.com/watch?v=9frodP5xLxk&feature=youtu.be A nice discussion of how Azure DevOps made the switch to microservices, including maintaining consistency between an on-premises product and the hosted multi-tenant service, how they tackled that tough backend problem, and starting over with telemetry.
[hodg2] – “From Monolith to Cloud Service”, Buck Hodges. Microsoft Docs, 11/8/2017. https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/monolith-cloud-service?WT.mc_id=linkedin . Starting from a position much like Ben’s team does, with a good use of version control but little else – no telemetry, no agile or scrum, no live-site support or on-call experience, Buck walks us through turning an onprem monolith into a microservice-based, cloud-native service with Azure DevOps.
[hodg3] – “Patterns for Resiliency in the Cloud”, Buck Hodges. Microsoft Docs, 11/8/2017. https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/patterns-resiliency-cloud . Cloud native architecture really means resilient architecture, and distributed computing makes tracking down a root cause a frustrating and sometimes multi-week endeavor – yes, even with feature flags. Buck explores the Circuit Breaker originally implemented by Netflix and how it’s used with Azure DevOps to degrade gracefully, and their use of throttling as limits are approached with SQL Xevents.
[evans] – “Domain-Driven Design: Tackling Complexity in the Heart of Software”, Eric Evans. Addison-Wesley Professional, 8/30/2003. ISBN-10: 0321125215, ISBN-13: 978-0321125217. This is the gold standard, and should be required reading for anyone considering microservices – or indeed just plain well-defined systems architecture.
[driftx] – “Practical Microservices”, Dave Harrison. driftboatdave.com, 9/7/2017. https://driftboatdave.com/2017/09/07/mtx-2017-practical-microservices-directors-cut/ . The original blog post and references that influenced this chapter.
[amund] – “Microservice Architecture: Aligning Principles, Practices, and Culture”, Mike Amundsen, Matt McLarty, Ronnie Mitra, Irakli Nadareishvili. O’Reilly Media, 8/5/2016. ISBN-10: 1491956259, ISBN-13: 978-1491956250. A great discussion on Domain Driven Design in chapter 5, along with a great practical breakdown of handling one workstream and defining service boundaries using DDD of a sample company.
[lewis] – “GOTO 2015 • How I Finally Stopped Worrying and Learnt to Love Conway’s Law”, James Lewis. GOTO 2015 Chicago conference, YouTube, 7/15/2015. https://www.youtube.com/watch?v=l1tyfb5we7I There’s a few great examples where they knew the org was not capable of the change needed – and designed a system that would fit it (square peg in square hole!) instead of dictating how the design should work in a perfect, idealistic world.
[shconw] – “Randy Shoup on Microservices, the Reality of Conway’s Law, and Evolutionary Architecture”, Daniel Bryant. InfoQ, 7/3/2015. https://www.infoq.com/interviews/randy-shoup-microservices Randy uses his experience from Google and eBay to talk about why monoliths aren’t necessarily as evil as we often think they are.
[vaugh] – “Implementing Domain-Driven Design”, Vaughn Vernon. Addison-Wesley, 2/16/2013. ISBN-10: 0321834577, ISBN-13: 978-0321834577. This is the best applied and in-depth discussion we’ve seen of Eric’s groundbreaking work around decomposition and finding domain boundaries.
[newmpr] – “Principles Of Microservices”, Sam Newman. YouTube, 11/1/2015, https://www.youtube.com/watch?v=PFQnNFe27kU. Sam goes through the underlying principles behind microservices, and then attempts to resolve the tension in a core issue with microservices – how independent can they truly be as part of a whole?
[qamr] – “Using Microservices Architecture to Break Your Vendor Lock-in”, unattributed author(s). QArea, unknown date. https://qarea.com/blog/using-microservices-architecture-to-break-your-vendor-lock-in – Google is famous for buying or relying on COTS or OS libraries – but making sure that any interactions are through a shell that they can control and modify. This article discusses the negative cycle when we overrely on vendors and how it increases the fragility of our systems – and how they have broken this vendor lockin using Golang microservices.
[caval] – “Our journey to microservices: mono repo vs multiple repositories”, Avi Cavale. Shippable.com, 6/2/2016. http://blog.shippable.com/our-journey-to-microservices-and-a-mono-repository Shippable started their effort with multiple repositories, and ended up making the switch over to a single repository: “The only thing you really give up with a mono repo is the ability to shut off developers from code they don’t contribute to. There should be no reason to do this in a healthy organization with the right hiring practices. Unless you’re paranoid… or named Apple.”
[netfl1] – “Adopting Microservices at Netflix: Lessons for Architectural Design”, Tony Mauro. Nginx.com, 2/19/2015. https://www.nginx.com/blog/microservices-at-netflix-architectural-best-practices/ – A very good overview of Adrian Cockroft’s series of talks and thinking on microservices and the lessons he learned at Netflix.
[goto2014] – “GOTO 2014 • Migrating to Cloud Native with Microservices”, Adrian Cockroft. YouTube, 12/15/2014. https://www.youtube.com/watch?v=DvLvHnHNT2w – the original video on Netflix and microservices that was the source for the article above.
[nginx2014] – “Fast Delivery”, Adrian Cockcroft. Nginx, YouTube, 12/2/2014. https://youtu.be/5qJ_BibbMLw – Adrian points out that Netflix from the beginning favored a fine-grained, loosely coupled architecture. This fed into every one of the four key capabilities Adrian finds vital to deliver at scale – allowing autonomy and the freedom to innovate and make fast decisions; getting answers using big data analytics to explore alternatives and evaluate success; relying on the cloud to remove the latency around spinning up new resources; and eliminating coordination latency by folding everyone needed to deploy and support a service into a single team.
[gehan] – “Want to develop great microservices? Reorganize your team”, Neil Gehani. Mesosphere, unknown date. https://techbeacon.com/want-develop-great-microservices-reorganize-your-team – He calls a cross functional delivery team of 6-12 people a “build-and-run” team, which we kind of like.
[kimgb] – “Going big with DevOps: How to scale for continuous delivery success”, Gene Kim. TechBeacon.com, unknown date. https://techbeacon.com/going-big-devops-how-scale-continuous-delivery-success . We love the Target story because it’s one of those inspiring dumpster-fire-to-paradise redemption accounts.
[brooks] – “The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition”, Frederick P. Brooks Jr. Addison-Wesley Professional, 8/12/1995. ISBN-10: 9780201835953, ISBN-13: 978-0201835953

Chapter 7 – One Mission

[lond] – “To Build a Fire, and Other Stories”, Jack London. Reader’s Digest Association, 1/1/1994. ISBN-10: 0895775832, ISBN-13: 978-0895775832
[dweck] – “Mindset: The New Psychology of Success”, Carol Dweck. Random House, 2/28/2006. ISBN-10: 1400062756, ISBN-13: 978-1400062751
[popov] – “Fixed vs. Growth: The Two Basic Mindsets That Shape Our Lives”, Maria Popova. BrainPickings.org, 1/29/2014. https://www.brainpickings.org/2014/01/29/carol-dweck-mindset/ Love the BrainPickings site and its fabulous content.
[nigel2] – “Why are we all such hypocrites when it comes to DevOps?”, Nigel Kersten. SpeakerDeck, 10/17/2017. https://speakerdeck.com/nigelkersten/why-are-we-all-such-hypocrites-when-it-comes-to-devops – A great presentation by Nigel Kersten on impoverished communication. He covers optimism bias (which is more likely when you lack experience, believe you have more control/influence than you actually do, and think negative events are unlikely). I also love the point he makes on our own skewed view of others – that we often attribute other’s behavior/skillsets as unchangeable, whereas we excuse our own as being caused by external factors (traffic was terrible today, I’m at stress from home, etc)
[hbr] – “Up and Down the Communications Ladder”, Bruce Harriman. Harvard Business Review, 9/1/1974. https://hbr.org/1974/09/up-and-down-the-communications-ladder – The original source of the presentation by Nigel, based on a 1969 study. We’ll call out one key point – that the feedback program must not be an endcap, but product visible results.
[habit] – “The Power of Habit: Why We Do What We Do in Life and Business”, Charles Duhigg. Random House, 1/1/2014. ISBN-10: 081298160X, ISBN-13: 978-0812981605
[ohwm] – “Workplace Management”, Taiichi Ohno. McGraw-Hill Education, 12/11/2002. ISBN-10: 9780071808019, ISBN-13: 978-0071808019
[sharma] – “The DevOps Adoption Playbook: A Guide to Adopting DevOps in a Multi-Speed IT Enterprise”, Sanjeev Sharma. Wiley, 2/28/2017. ISBN-10: 9781119308744, ISBN-13: 978-1119308744
[russd] – “It Takes Dev and Ops to Make DevOps”, Russ Collier. DevOpsOnWindows.com, 7/26/2013. http://www.devopsonwindows.com/it-takes-dev-and-ops-to-make-devops/
[cumm2017] – “DevOpsDays Boston 2017 – KEYNOTE: Settlers of DevOps”, Rob Cummings. YouTube, 10/20/2017, https://www.youtube.com/watch?v=woSoQq3UkAc. The Boston 2017 keynote to DevOps Days, with the outstanding Settlers and Town Planners model. He dismantles the appallingly stupid Bimodal IT theory, and we love Rob’s very succinct and beautiful definitions of what DevOps is about: “I want to deliver customer value faster and more humanely.”
[tdoh] – “The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations”, Gene Kim, Patrick Dubois, John Willis, Jez Humble. IT Revolution Press, 10/6/2016, ISBN-10: 1942788002, ISBN-13: 978-1942788003. Chapter 16 by Steve Bella and Karen Whitley Bell is outstanding as a case study of ING Netherlands; it may be the best chapter in the entire book.
[wardl] – “On Pioneers, Settlers, Town Planners and Theft”, Simon Wardley. Gardeviance.org, 3/13/2015. https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html – The original source of the now famous three-phase DevOps growth model.
[teams] – “Team of Teams: New Rules of Engagement for a Complex World”, Stanley McChrystal. Portfolio, 5/12/2015. ISBN-10: 1591847486, ISBN-13: 978-1591847489
[lean] – “Lean Enterprise: How High Performance Organizations Innovate at Scale”, Jez Humble, Joanne Molesky, Barry O’Reilly. O’Reilly Media, 1/3/2015. ISBN-10: 1449368425, ISBN-13: 978-1449368425. For large enterprises attempting big-picture changes, this is the best book out there that we’ve found to date. Very pragmatic, numbers-centric and a huge influence on the contents of this book.
[bung] – “Mission Command: An Organizational Model for Our Time”, Stephen Bungay. Harvard Business Review, 11/2/2010. https://hbr.org/2010/11/mission-command-an-organizat Mission Command embraces a conception of leadership which unsentimentally places human beings at its center.
[reine] – “The Principles of Product Development Flow: Second Generation Lean Product Development”, Donald Reinertsen. Celeritas Publishing, 1/1/2009. ISBN-10: 1935401009, ISBN-13: 978-1935401001
[kimbg] “The Other Side of Innovation: Solving the Execution Challenge”, Vijay Govindarajan, Chris Trimble. Harvard Business Review, 9/2/2010. ISBN-10: 1422166961, ISBN-13: 978-1422166963
[perkin] – “Structuring for Change: The Dual Operating System”, Neil Perkin. Medium.com, 4/11/2017. https://medium.com/building-the-agile-business/structuring-for-change-the-dual-operating-system-78fa3a3d3da3
[kotte] – “Accelerate: Building Strategic Agility for a Faster-Moving World”, John P. Kotter. Harvard Business Review Press, 4/8/2014. ISBN-10: 1625271743, ISBN-13: 978-1625271747. Kotter describes here what we now call a “virtual” cross functional team, which he calls a ‘dual operating system’ – combining the entrepreneurial capability of a network with the organizational efficiency of traditional pyramid-like hierarchy, and argues that one compliments the other.
[dam41] – “You Can’t Change Culture, But You Can Change Behavior, and Behavior Becomes Culture”, Damon Edwards. DevOpsDays.org, Vimeo, 10/10/2012. http://vimeo.com/51120539 . An awesome discussion on culture change and how our behavior – and the standards we set – causes ripple effects.
[sagat] – “Why DevOps Matters: Practical Insights on Managing Complex & Continuous Change”, unattributed author(s). Saugatuck Technology, 10/1/2014. http://aka.ms/os09me A Microsoft-sponsored study that has some nice data driven insights.
[eliz] – “Change Agents of Ops: What it Takes”, Eliza Earnshaw. Puppet, 11/6/2014. http://puppetlabs.com/blog/change-agents-it-operations-what-it-takes A very punchy interview with Sam Eaton, the director of engineering operations at Yelp.
[kimx] – “How do we Better Sell DevOps?”, Gene Kim. DevOpsDays.org, Vimeo, 5/6/2013. http://vimeo.com/65548399 – A great presentation, describing the business benefits derived from DevOps.
[chamor] – “4 Ways to Create a Learning Culture on Your Team”, Tomas Chamorro-Premuzic, Josh Bersin. Harvard Business Review, 7/12/2018. https://hbr.org/2018/07/4-ways-to-create-a-learning-culture-on-your-team – Covers how leaders shouldn’t wait or be dependent on employer-provided training, but instead lead by example in demonstrating curiosity and sharing learning; reinforce positive learning behavior (including providing meaningful critical feedback), and looking for hungry minds in your interviewing process.
[woodw] – “Moving 65,000 Microsofties to DevOps with Visual Studio Team Services”, Martin Woodward, https://youtu.be/W6dqrvb-Yyw?t=4391. A fuller walkthrough of the Azure DevOps team’s transformation, start to finish.
[dora2017] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2017. https://puppetlabs.com/2017-devops-report
[dora2018] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2018. https://puppetlabs.com/2018-devops-report
[kissl2] – “Transforming to a Culture of Continuous Improvement”, Courtney Kissler, DevOps Enterprise Summit 2014 presentation, https://www.youtube.com/watch?v=0ZAcsrZBSlo
[forsgren] – “Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations”, Nicole Forsgren PhD, Jez Humble, Gene Kim. IT Revolution Press, 3/27/2018. ISBN-10: 1942788339, ISBN-13: 978-1942788331. We particularly enjoyed the introduction by Courtney Kissler.
[nflpd] – “Adopting Microservices at Netflix: Lessons for Team and Process Design”, Tony Mauro. Nginx, 3/10/2015. https://www.nginx.com/blog/adopting-microservices-at-netflix-lessons-for-team-and-process-design/ A very good article, covering Netflix’s use of the OODA loop in optimizing for speed versus efficiency, and creating a high-freedom, high-responsibility culture with less process.
[walkr] – “Resilience Thinking: Sustaining Ecosystems and People in a Changing World”, Brian Walker, David Salt. Island Press, 8/22/2006. ISBN-10: 9781597260930, ISBN-13: 978-1597260930
[doj1] – “DevOps Dojo”, unattributed author(s). Chef, 4/10/2018. https://blog.chef.io/2018/04/10/fulfilling-the-need-for-continuous-improvement-with-devops-dojos/
[targy3] – “DevOps At Target: Year 3”, Heather Mickman. IT Revolution, YouTube, 11/28/2016. https://www.youtube.com/watch?v=1FMktLCYukQ&app=desktop Heather describes the storming/norming/performing process we’ve seen elsewhere with successful DevOps initiatives – starting in 2012, with change agents appearing and kickstarting a grassroots DevOps transformation; then a gradual uplift as senior leaders took up the torch and provided the muscle and focus needed to build out better architecture.
[damb] – “Target CIO explains how DevOps took root inside the retail giant”, Damon Brown. EnterprisersProject.com, 1/16/2017. https://enterprisersproject.com/article/2017/1/target-cio-explains-how-devops-took-root-inside-retail-giant More on Target’s use of DevOps Dojos to overcome hurdles, from the CIO directly.
[rach] – “Target Rebuilds its Engineering Culture, Moves to DevOps”, Rachael King. Wall Street Journal, 10/19/2015. https://blogs.wsj.com/cio/2015/10/19/target-rebuilds-its-engineering-culture-moves-to-devops/ The subject of the Dojo keeps coming up as a critical catalyst in the Target use case.
[eliz] – “DevOps and Change Agents: Common Themes”, Eliza Earnshaw. Puppet, 12/3/2014. https://puppet.com/blog/devops-and-change-agents-common-themes
[srew] – “The Site Reliability Workbook”, Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, and Stephen Thorne. A terrific resource, especially the discussion in Chapter 6 on toil.
[schauso] – “Sharing our experience of self-organizing teams”, Willy Schaub. Microsoft Developer Blog, 12/2/2016. https://blogs.msdn.microsoft.com/visualstudioalmrangers/2016/12/02/sharing-our-experience-of-self-organizing-teams/ This and Brian Harry’s article below describe one of the most innovative – and insane-sounding! – team building exercises that ended up being much less disruptive, and wildly successful, than Microsoft first thought.
[bharryso] – “Self forming teams at scale”, Brian Harry. Microsoft Developer Blog, 7/24/2015. https://blogs.msdn.microsoft.com/bharry/2015/07/24/self-forming-teams-at-scale/
[bjaaso] – “Agile principles in practice”, Aaron Bjork. Microsoft Docs, 5/30/2018. https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/agile-principles-in-practice

Chapter 7 – DevOps and Leadership

[vinc2] – “DevOps and Leadership”, Ron Vincent. LinkedIn, 4/2/2018. https://www.linkedin.com/pulse/devops-leadership-ron-vincent/ This is the original source article for the section above. Ron also had an excellent post on eliminating waste that we encourage you to take the time to read; it’s one of the best (and shortest) writings we’ve seen on a very important topic.
[brit] – “Taylorism”, unattributed author(s). Encyclopaedia Britannica, unknown date. https://www.britannica.com/science/Taylorism
[neot] – “Neo Taylorism or DevOps Anti Patterns”, John Willis. IT Revolution, 10/23/2012. https://itrevolution.com/neo-taylorism-or-devops-anti-patterns
[origi] – “The Origin of Society”, unattributed author(s). Modern Matriarchal Societies, unknown date. http://mmstudies.com/top-down .
[finv] – “DevOps and Finance”, Ron Vincent. LinkedIn, 12/16/2017. https://www.linkedin.com/pulse/devops-finance-ron-vincent/ .
[liker] – “The Toyota Way to Lean Leadership: Achieving and Sustaining Excellence through Leadership Development”, Jeffrey Liker, Gary Convis. McGraw-Hill Education, 11/7/2011. ISBN-10: 0071780793; ISBN-13: 978-0071780797.
[dora2017] – “Annual State of DevOps Report”, unattributed author(s). Puppet Labs, 2017. https://puppetlabs.com/2017-devops-report
[reine] – “The Principles of Product Development Flow: Second Generation Lean Product Development”, Donald Reinertsen. Celeritas Publishing, 1/1/2009. ISBN-10: 1935401009, ISBN-13: 978-1935401001

Chapter 8 – The End of the Beginning

[lewpm] – “Project management non-best-practices”, Bob Lewis. InfoWorld, 9/26/2006. https://www.infoworld.com/article/2636977/techology-business/project-management-non-best-practices.html
[mezak] – “The Origins of DevOps: What’s in a Name?”, Steve Mezak. DevOps.com, 1/25/2018. https://devops.com/the-origins-of-devops-whats-in-a-name/ A nice overview of the beginnings of the DevOps movement, including the seminal presentations given in 2008 and 2009 by Andrew Schafer, Patrick Debois, John Allspaw, and Paul Hammond.
[net] – New English Translation of Ecclesiastes 3:22. NET Bible Noteless, Kindle edition, 8/26/2005. ASIN: B0010XIA8K
[shunryu] – “Zen Mind, Beginner’s Mind: Informal Talks on Zen Meditation and Practice”, Shunryu Suzuki. Shambhala Library, 10/10/2006. ISBN-10: 9781590302675, ISBN-13: 978-1590302675

Appendix – Aaron Bjork

[bjork] – “Agile At Microsoft”, Aaron Bjork. Microsoft Visual Studio, YouTube, 10/2/2017. https://www.youtube.com/watch?v=-LvCJpnNljU This is the best explanation I’ve seen of “The Microsoft Story”, and it’s packed with information; a must-watch.
[wang2] – “VSLive! Keynote: Abel Wang Details Microsoft’s Painful DevOps Journey”, Abel Wang. Visual Studio Magazine, 8/17/2018. https://visualstudiomagazine.com/articles/2018/08/17/abel-wang-devops.aspx. There’s a great snapshot and explanation of the bug cap in this article, as well as other background behind the MS story.

Appendix – Betsy Beyer, Stephen Thorne

[sre] – “Site Reliability Engineering: How Google Runs Production Systems”, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy. O’Reilly Media, 4/1/2016. ISBN-10: 9781491929124, ISBN- 13: 978-1491929124
[ghbsre] – “The Site Reliability Workbook: Practical Ways to Implement SRE”, Niall Murphy, David Rensin, Betsy Beyer, Kent Kawahara, Stephen Thorne. O’Reilly Media, 8/1/2018. ISBN-10: 1492029505, ISBN-13: 978-1492029502
[kieran] – “Managing Misfortune for Best Results”, Kieran Barry. SREcon EMEA, 8/30/2018. https://www.usenix.org/node/218852 . This is a great overview of the Wheel of Misfortune exercises in simulating outages for training, and some antipatterns to avoid.

Appendix – John-Daniel Trask

[rayg] – https://raygun.com/ – The official Raygun site.
[hansrag] – “Managing Errors across platforms with RayGun.io”, Scott Hanselman, John-Daniel Trask. Hanselminutes.com, 5/22/2014. https://hanselminutes.com/421/managing-errors-across-platforms-with-raygunio
[ch9rg] – “Handling billions of exceptions with .NET & Raygun.io”, John-Daniel Trask. Channel 9, 3/5/2015. https://channel9.msdn.com/Events/dotnetConf/2015/Handling-billions-of-exceptions-with-NET–Raygunio
[ch9boyd] – “DevOps at LightSpeed, lessons we learned from building a Raygun”, Jeremy Boyd, John-Daniel Trask. Channel9, 9/6/2013. https://channel9.msdn.com/Events/TechEd/NewZealand/2013/DEV302
[qzdm] – “Domino’s stock has outperformed Google, Facebook, Apple, and Amazon this decade”, Chase Purdy. Quartz, 3/22/2017. https://qz.com/938620/dominos-dpz-stock-has-outperformed-google-goog-facebook-fb-apple-aapl-and-amazon-amzn-this-decade/

Appendix – John Weers

[issurv] – IS Survivor, Bob Lewis. http://issurvivor.com/ . This is a great site John recommended that we enjoyed very much, especially on process and change management.

Appendix – Rob England

[itskept] – The IT Septic, Rob England, http://www.itskeptic.org/blog. See Rob’s great articles around DevOps and thoughts on his blog.
[garthyp] – “Gartner Hype Cycle”, unattributed author(s). Gartner, unknown date. https://www.gartner.com/technology/research/methodologies/hype-cycle.jsp

Appendix – Sam Guckenheimer

[guck2] – “DevOps at Microsoft”, Sam Guckenheimer. Microsoft Docs, 7/8/2018. https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/ A very good doc hub and overview for those who want to know ‘How Microsoft did it’, broken down by practice.

DevOps Stories – Interview with John-Daniel Trask of Raygun

The following content is shared from an interview with John-Daniel Trask, co-founder and CEO of Raygun, a New Zealand-based company that specializes in error, crash, and performance monitoring. John-Daniel (or JD) started out with repairing PCs out of college, to working as a developer, to finally starting several very successful businesses, including what became Mindscape and its very successful monitoring product, Raygun.

We covered a lot of ground here, and we think you’ll love the following thoughts:

Is a DevOps team really such a bad thing?
Why forcing your devs to go to an event booth might be a very good thing
When is a “requirement” not really a requirement?
Starting from scratch, with nothing – where would you start?
What’s the golden ticket to get funding and support for your requests and projects?

And last but not least – “it’s not the big that eat the small, it’s the fast that eat the slow!”

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

Is DevOps culture first? Well I definitely run into a lot of zealots who swing one side or another. Some people pound the table and say that DevOps is nothing about tools, that it’s all culture and fluffy stuff. These are usually the same people who think a DevOps team is an absolute abomination. Others say it’s all about automation and tooling.

Personally, I’m not black and white on it. I don’t think you can go and buy DevOps in a box; I also don’t think that “as long as we share the same psychology, we’ve solved DevOps.” Let’s take the whole idea of a DevOps team being an antipattern for example. For us it’s not that simple – it’s very easy, on a 16-person startup, to say that a DevOps team is a horrible idea. Well, of COURSE you’d think that, for you cross team communication is as easy as turning around in your chair! But let’s take a larger enterprise, 50,000 people or so, with hundreds of engineering teams. You can’t just hand down “we’re doing DevOps” as an edict and it’s solved. In that case, I have seen a DevOps team be a very successful as a template, something that helps spread the good word by example and train up individual engineering teams to adopt DevOps (this is an actual situation we saw with a top 10 sized software company and it worked very well for them).

What’s a common blind spot you see with many programmers? It’s quite shocking how little empathy there is by most software engineers for their actual end users. You would think the stereotypical heads-down programmer would be a dinosaur, last of a dying breed, but it’s still a very entrenched mindset. I sometimes joke that for most software engineers, you can measure their entire world as being the distance from the back of their head to the front of their monitor. There’s a lack of awareness and even care about things like software breaking for your users, or a slow loading site. No, what we care about is – how beautiful is this code that I’ve written, look how cool this algorithm is that I wrote.

We sometimes forget that it all comes down to human beings. If you don’t think about that first and foremost, you’re really starting off on the wrong leg.

One of the things I like about Amazon is the mechanisms they have to put their people closer to the customer experience. We try to drive that at Raygun too. We often have to drag developers to events where we have a booth. Once they’re there, the most amazing thing happens – we have a handful of customers come by and they start sharing about how amazing they think the product is. You start to see them puff out their chests a little – life is good! And the customers start sharing a few things they’d like to see – and you see the engineers start nodding their heads and thinking a little. We find those engineers come back with a completely different way of solving problems, where they’re thinking holistically about the product, about the long-term impact of the changes they’re making. Unfortunately, the default behavior is still to avoid that kind of engagement, it’s still out of our comfort zone.

Using Personas to Weed Out Red Herrings: I don’t know if we talk enough in our industry about weeding out bad feedback. We often get requests from our customers to do things like dropping a data grid with RegEx on a page. That’s the kind of request that comes from the nerdiest of the nerds – and if we were to take that seriously, think of the opportunity cost and what it would do to our own UX!

We weed out requests like this by using personas. For our application, we think in terms of either a CEO, a tech lead, or an operator. Each has their own persona and backstory, and we’ve thought out their story end to end and how they want to work with our software.

For the CXO level, the VP’s, the directors – these are people who understand their whole business hinges on the quality of their software. They need to keep this top of mind at the very top levels of decision making. For this person, there are graphs and charts showing this strategic level fault and UX information, all ready to drop into reports to the executive board. Then there’s the mid-tier – these are your tech leads, the Director of Engineering – they need to know both high level strategic 30K foot information, and a summary of key issues. The cutting edge though is that third tier, your developer or operator. This person needs to have information when something goes bump in the night. For them, you have stack traces, profiling raw data, user request waterfalls. Without that information, troubleshooting becomes totally a stab in the dark.

Lots of companies use personas, I know. They’re really critical to filter out noise and focus on a clear story that will thrill your true user base.

How can error and crash reporting make for a better performing business? Most of the DevOps literature and thinking I see focuses entirely on build pipelines, platform automation, the deployment story, and that’s the end of it. Monitoring and checking your application’s real-world performance and correcting faults usually just gets a token mention, very late in the game. But after you deploy, the story is just beginning!

I hate to say this – but I think we’re still way behind the times when it comes to having true empathy with our end users. It’s surprising how entrenched that mindset of monitoring being an afterthought or a bolt-on can be. Sometimes we’ll meet with customers and they’ll say that they just aren’t using any kind of monitoring, that it’s not useful for them. And we show them that they’re having almost 200,000 errors a day – impacting 25,000 users each day with a bad experience. It’s always a much, much larger number than they were expecting – by a factor of 10 sometimes. Yet somehow, they’ve decided that this isn’t something they should care about. A lot of these companies have great ideas that their customers love – but because the app crashes nonstop, or is flaky, it strangles them. You almost get the thinking that a lot of people would really rather not know how many problems there really are with what they’re building.

Yet time and again, we see companies that really care about their customers excel. Let’s say I take you back in time to 2008, and I give you $10,000 to invest in any company you want. Are you going to put that into Microsoft, Apple, Google, or Domino’s Pizza? Well guess what – Dominos has kicked the butt of all those big tech companies with their market cap growth rate! The answer is in their DNA – they devote all their attention into ensuring their customers have a great experience. Their online ordering experience is second to none. And that all comes from them being customer obsessive, paying attention to finding where that experience is subpar and fixing it. It’s never a coincidence that customer centric companies consistently outperform and dominate.

Source: https://www.theatlas.com/charts/S18QCJyhe

What’s forced us as an industry to change and driven a better user experience is Google, believe it or not. They started publishing a lot of research and data around application errors, performance, and prioritizing well performing sites. This democratized things that data scientists were just starting to figure out themselves. And it seemed like overnight, a lot of people cared very much that their website not be dog slow – because otherwise, it wouldn’t be on the first page results of a web search, and their sales would tank. But folks often didn’t care about performance or the end user experience – until Google forced us to.

What would you say to the company that is starting from ground zero when it comes to DevOps? I’m picturing here a shop where they take ZIP files and remote desktop onto VM’s and copy-paste their deployments. If that’s the case – I like to talk about what are the small things you could put into place that would dramatically improve the quality of life on the team. These are big impact, low cost type improvements. So where would I start?

First would come automating the deployments. Just in reliability alone, that’s a huge win. Suddenly I have real peace of mind. I can roll out releases and roll them back with a single button push, and it’s totally repeatable as a process. If I’m an oncall engineer, being able to roll out a patch through a deployment process that runs automatically at 3 a.m. is a world of difference from manually pushing assets.
The second thing I would do is set up some basic metrics with a tool like StatsD. You don’t need to allocate a person to spend several days – it’s a Friday afternoon kind of thing to start with. When you start tracking something – anything! – and put it up on the wall that’s when people start to get religion. We saw this ourselves with our product – once we put up some monitors with some of the things coming from StatsD, like the number of times users were logging in and login failures. And it was like watching an ultrasound monitor of your child. People started gathering around, big smiles on their faces – things were happening, and they felt this connection between what they were doing and their baby, out there in the big bad old world. Right away some of that empathy gap started to close.
Third would come crash reporting. There’s just no excuse not to put this into place – it takes like ten minutes, and it cuts out all that waste and thrash in troubleshooting and fuels an improvement culture.

How do we communicate in the language of business? What I wish more engineering teams understood is how to communicate in the language of business. I’m not asking developers to get an MBA in their off hours – but please TRY to frame things in terms of dollars, economic impact, or cost to the customer. Instead we say, “this shiny new thing looks like it could be helpful”. It’s no wonder engineering talent often feel like the business won’t allow them to get the tools they want – it’s like you’re speaking another language to the folks with the check book.

There’s a reason why we often have to beg to get our priorities on the table from the business. We haven’t earned the trust yet to get “a seat at the table”, plain and simple. We tend to be very maxed out, overwhelmed, and we’re pretty cavalier with our estimates around development. This reflects technology – which is fast moving, there’s so much to learn, and it’s not in a stable state. But when engineers hem and haw about their estimates or argue for prioritizing pet projects that are solely tech-driven, it makes us look unreliable as a partner in the business. We haven’t learned yet to use facts and tie our decisions into saving money or getting an advantage in the market.

Always keep this in mind – any business person can make the leap to dollars. But if you’re making an argument and you are talking about code – that’s a bridge too far. It’s too much to expect them to make that jump from code to customer to dollars. If you tell me you need React 16, that won’t sell. But if you say 10% of your customers will have a better experience because of this new feature – any business person can look at that and make the connection, that could be 5,000 customers that are now going to have a better experience. You don’t have to be Bill Gates to figure out that’s a good move!

Let’s get down to brass tacks – how do I make this monitoring data actionable? We wouldn’t think about putting planes in the air without a black box – some way of finding out after something goes wrong what happened, and why. That’s what crash monitoring is, and it’s incredibly actionable. You know the health of your deployment cycle, you can respond faster when changes are introduced that degrade that customer experience.

Let’s say you are seeing 100,000 errors a month. Once you group them by root cause, that overwhelming blizzard of problems gets cut down to size which is smaller than you’d think. You may have 1,000 distinct errors, but only 10 actual, honest-to-goodness bugs. Then you break it down by user, and that’s when things really settle out. You might find that one user is using a crappy browser extension that’s blocking half your scripts – that isn’t an issue really, and not one you can fix for them. But then there’s that one error that’s happened only 500 times – but it’s hitting 250 of your customers. That’s a different story! So you’re shifting your conversation already from how many errors you’re seeing to the actual number of customers you’re impacting – that’s a more critical number, and one that everyone from your CEO down understands. And it’s actionable. You can – and you should – take those top 2 or 3 bugs and drop it right into your dev queue for the next sprint.

This isn’t rocket science, and it isn’t hard. Reducing technical debt and improving speed is just a matter of listening to what your own application is telling you. By nibbling away on the stuff that impacts your customers the most, you end up with a hyper reliable system and a fantastic experience, the kind that can change the entire game. One company we worked with started to just take the top bug or two off their list every sprint and it was dramatic – in 8 weeks, they reduced the number of impacted customers by 96%!

Think about that – a 96% reduction in two months. Real user monitoring, APM, error and crash reporting – this stuff isn’t rocket science. But think about how powerful a motivator those kinds of gains are for behavioral change in your company. Data like that is the golden ticket you need to get support from the very top levels of your company.

One of my early mentors was Rod Drury, who founded Xero right here in Wellington, New Zealand. He says all the time: “It’s not the big that eat the small, it’s the fast that eat the slow”. That’s what DevOps is about – making your engineering team as reliably fast as possible. To get fast, you have to have a viable monitoring system that you pay close attention to. Monitoring is as close as you can get in this field to scratching your own itch.

What about building versus buying a monitoring system? I’ll admit that I’m biased on the subject, running a SAAS-based monitoring business. But I do find it head-scratching when I talk to people that are trying to build their own. I ask them, “how many people are you putting on this?” And they tell me – oh, 4 people, say a six-month project. And then I say, “what are their names?” They look at me funny, and ask why – I tell them, “I’ve had 40 people working on this for 5 years – apparently now I could fire them and hire your people!” Back in 2005, it made total sense to roll your own, since so much of the stuff we use nowadays didn’t exist. But the times have changed. Even self-hosting as its issues. Let’s say you decide to go down the ELK stack route. Well, that means running a fairly large elastic instance, which is not a set-and-forget type system. It’s a pain in the ass to manage, and it’s not a trivial effort.

To me it also is answering the wrong question. There’s one question that should be the foundation for any decision an engineering team makes: does this create value for our customer? Is our customer magically better off because we made the decision to build our own? I think – for most companies – probably building a robust monitoring system has little or nothing to do with answering that question. It ends up being a distraction, and they spend far more to get less viable information.

Etsy says “if it moves, track it.” Do you agree – should customers track everything? I’m pragmatic on this – if you’re small, tracking everything makes sense. Where it goes wrong is where the sheer amount of data clogs our decision making.

Then folks start to think about sampling data. However, what I often see is someone sitting in a chair, looking off into the distance and says – “yeah, I think about 10% of the data would give us enough”. Rarely do we see them breaking out Excel and talking about what would be statistically significant – people tend to make gut calls. Many of us have forgotten statistics, but there is a lot of really great mathematics that help you make better decisions – like calculating what a statistically significant sampling rate might be.

If you’re tracking everything you possibly could with real user monitoring for example, it can be a real thicket – a nightmare, there’s so many metric streams. You trip over your own shoelaces when something goes wrong – there’s just so much detail, you can’t find that needle in the haystack quickly. This is where you need both aggregate and raw data – to see high level aggregates and spot trends, but then be able to drill in and find out why something happened at the subatomic level. We still see too many tools out there that offer that great strategic view and it’s a dead end – you know something happened, but you can’t find out exactly what’s wrong.

Any closing thoughts? I never get tired of tying everything back to the customer, to the end user experience. It’s so imperative to everything you’re doing. There is literally no software written today for any reason other than providing value to humans. Even machine to machine, IOT systems are still supporting a human being ultimately.

Human beings are the center of the universe. But you wouldn’t know that by the way we’re treated by most of the software written for us. Great engineers and great executives grasp that. They know that to humans, the interface is the system – everything else simply does not matter in the end. So they never let anything get in the way of improving the end user experience.

References:

https://raygun.com/ – official Raygun site
https://hanselminutes.com/421/managing-errors-across-platforms-with-raygunio – May 22, 2014 podcast interview with Scott Hanselman and John-Daniel Trask on Raygun
https://channel9.msdn.com/Events/dotnetConf/2015/Handling-billions-of-exceptions-with-NET–Raygunio – March 5, 2015 Channel9 Interview with John-Daniel Trask on how Raygun handles billions of exceptions
https://channel9.msdn.com/Events/TechEd/NewZealand/2013/DEV302 – TechEd New Zealand 2013, “DevOps at LightSpeed, lessons we learned from building a Raygun”, 9/6/2013, by Jeremy Boyd, John-Daniel Trask
Dominos Pizza story and Raygun – https://qz.com/938620/dominos-dpz-stock-has-outperformed-google-goog-facebook-fb-apple-aapl-and-amazon-amzn-this-decade/

Defining DevOps Is Impossible…

Defining DevOps – what it should be, and if it should even be done – has become a surprising controversy over the past five years. The “godfather” of DevOps himself, Patrick Debois, famously resists any kind of formal definition. He thinks it comes down to culture:

The Devops movement is built around a group of people who believe that the application of a combination of appropriate technology and attitude can revolutionize the world of software development and delivery. The demographic seems to be experienced, talented 30-something sysadmin coders with a clear understanding that writing software is about making money and shipping product. More importantly, these people understand the key point – we’re all on the same side! All of us – developers, testers, managers, DBAs, network technicians, and sysadmins – are all trying to achieve the same thing: the delivery of great quality, reliable software that delivers business benefit to those who commissioned it. [debois]

This is great, but it’s hardly definitive. Just look at the Agile Manifesto for example. This gave us a definition of what Agile is (or more correctly, how it behaves) and the guiding principles behind it. Most have stood the test of time; more importantly, it’s a firm stake in the ground. We learn as much from the holes and the understressed points as we do from the things that have stuck over time. The Agile Manifesto and its underlying principles have been one of the most impactful and successful set of concepts in the 21^st century in most organizations. DevOps is very much just an extension of Agile; it’s incomprehensible to us that we should deviate from this successful model and pretend that a DevOps Manifesto should be considered impossible or too formulaic.

For us personally, we find an exact definition of DevOps in terms of what it is to be elusive. Not for the lack of trying – many very experienced and brilliant people have taken stabs at it over the years. One of our most prominent thought leaders, Gene Kim, has defined it in the past as:

The emerging professional movement that advocates a collaborative working relationship between Development and IT Operations, resulting in the fast flow of planned work (i.e., high deploy rates), while simultaneously increasing the reliability, stability, resilience and security of the production environment. [kim3]

… a very good definition; it captures the elements of partnership between dev and Ops, starts with people, and ends with the results – a fast flow of work and increased quality and stability.

Some Other DevOps Definitions

Wikipedia offers this definition:

DevOps (a clipped compound of “development” and “operations”) is a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops). The main characteristic of the DevOps movement is to strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. DevOps aims at shorter development cycles, increased deployment frequency, more dependable releases, in close alignment with business objectives.^[i]

In “DevOps: A Software Architect’s Perspective” the authors define DevOps as:

DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality.

Gartner has offered this definition:

DevOps represents a change in IT culture, focusing on rapid IT service delivery through the adoption of agile, lean practices in the context of a system-oriented approach. DevOps emphasizes people (and culture), and seeks to improve collaboration between operations and development teams. DevOps implementations utilize technology — especially automation tools that can leverage an increasingly programmable and dynamic infrastructure from a life cycle perspective. [gartner]

From Damon Edwards;

DevOps is… an umbrella concept that refers to anything that smooths out the interaction between development and operations. [damon]

The Agile Admin:

DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support. DevOps is also characterized by operations staff making use many of the same techniques as developers for their systems work. [agileadmin]

And from Rob England, aka the IT Skeptic – who we interviewed earlier in this book:

DevOps is agile IT delivery, a holistic system across all the value flow from business need to live software. DevOps is a philosophy, not a method, or framework, or body of knowledge, or *shudder* vendor’s tool. DevOps is the philosophy of unifying Development and Operations at the culture, system, practice, and tool levels, to achieve accelerated and more frequent delivery of value to the customer, by improving quality in order to increase velocity. [itskeptic]

Or Ken Mugrage of ThoughtWorks:

“DevOps: A culture where people, regardless of title or background, work together to imagine, develop, deploy and operate a system.” [mug2]

Putting all these definitions together, we’re starting to see a common thread around how important it is to prepare the ground and consider the role of culture. Perhaps Adam Jacobs takes a position similar to ours, saying that the exact definition may be best described by behavior:

“DevOps is a cultural and professional movement. The best way to describe devops is in terms of patterns and anti-patterns.”

… which is exactly what we’ve tried to do in our upcoming book.

Other definitions can be found in the sidebar “Some Other DevOps Definitions”. Suffice to say, there’s lots to choose from, and we won’t tell you what’s best. We can tell you our favorite though – from Donovan Brown:

“DevOps is the union of people, process, and products to enable continuous delivery of value to our end users.” [donovan]

This is exactly the right order of things and there’s not a wasted word; we can’t improve on it. This is what’s used in Microsoft as a single written definition, as it reflects what we want and value out of DevOps. Having that single definition of truth published and visible helps keep everyone on the same page and thinking more holistically.

It seems unlikely that community consensus on a single unified definition of DevOps will ever happen. The purist, engineer part of us hates this; but as time went on we realized from our research and interviews that this apparent gap was ultimately not important, and in fact was beneficial. At one conference, we remember the speakers asking the crowd how many of them were ‘doing Agile’ – about 300 hands went up, the entire audience. Then the speakers asked, a little condescendingly, “OK, now which of you are doing it right?” – and three people kept their hands up, who were then ridiculed for being bold-faced liars!

At the time we remember feeling a little shocked that so few were adhering to the stone tablets brought down from the mountain by Ken Schwaber and company. Now, we realize how shortsighted and rigid that point of view was. Agile should never have been thought of as a list of checkmarks or a single foolproof recipe. It’s likely that most of those 300 people in the audience were better off because of adopting some parts of scrum or Agile in the form of transparency and smaller amounts of work, and were building on that success. That’s far more important than ‘doing it right’.

The same holds true with DevOps and the principles behind continuous delivery. No single definition of ‘doing DevOps right’ exists, and it likely never will. What we realized in gathering information from this book was that this gap is fundamentally not important. A global definition of DevOps isn’t possible or helpful; your definition of DevOps however is VERY important. Put some thought into what DevOps means in specific terms for your specific situation, defining it, and making it explicit. Having that discussion as a group and coming up with your own definition – or piggybacking on one of the above thoughts – is time well worth spending. Over time you’ll find that the exact “what” shrinks as you focus more on the “why” and the “how” of continually improving your development processes to drive more business value and feedback quality.

…But We Do Know How It Behaves

A Manifesto is a public declaration of policy, a declaration or announcement; here we’re standing on the shoulder of the giants that came up with the brilliant Agile Manifesto – easily the most groundbreaking and impactful set of principles in the software development field in the past thirty years.

Since the Agile Manifesto was written in 2001, we’ve learned some fault points and pitfalls with Agile implementations and its guiding principles^[1]:

Strengths	Weaknesses
Processes and tools placed in second position to the makeup of the team, including direct communications and self-organization	Continuous delivery is (rightly) stressed; most development shops would ignore this with complex branching structures and rigid gates, causing long integration periods and infrequent releases.
Priorities are set by the business with regular checkpoints	“Deliver working software frequently” sabotaged by gap in covering QA/test and lack of automation/maturity in Operations, and siloed traditional org structures
Excessive documentation and lengthy requirements gathering shelved in favor of responding to change	Exploding technical debt caused by ignoring the principle “continuous attention to technical excellence and good design” in favor of velocity
Sprint retrospectives, daily scrums, and other artifacts showing accountability and transparency	Agile practices work best with small units and don’t address either epic/strategic level planning (besides “responding to change”) and how to scale effectively in large organizations. (SAFe is making great headway in addressing this)
Time boxed development periods followed by releases – the shorter the better (1-4 weeks)
High trust teams (“give motivated individuals the environment and support they need, and trust them to get the job done.”)
Simplicity – the art of maximizing the amount of work not done – addresses the recurring shame of the software industry: most features as delivered are not used or do not meet business requirements
Reflection (tuning and adjusting) key to building a learning / iteratively improving culture

For being (as of this writing) nearly 20 years old, this set of principles has weathered amazingly well. Most problems we’ve seen to date have been caused by misapplications, not with the thinking of the original architects themselves.

You’ll notice though that we didn’t stop there. However far-seeing and visionary the original signers were of the Agile Manifesto, there were some gaps exposed over the past ten years that need to be addressed. For starters, the Agile Manifesto favored individual interactions over processes and tools (ironic, since to many “Agile” has become synonymous with a tool, Version control, and a process, Daily Scrums and Retrospectives!) Agile was wildly successful in creating tightly focused development teams with a good level of trust-based interactions; the pendulum may have swung too far to the right on the fourth principle on “responding to change”. Companies have had varied success in scaling Agile beyond small working teams; we’ve seen heinous practices like 20-person drum circles and endless daily scrums combined with a complete lack of strategy – doing the wrong thing sprint by sprint with massive amounts of thrash. It is completely possible to execute Agile with a strategic vision and with a good level of planning; this is covered in more depth in a previous chapter.

This first point was more of a flaw in application by companies that misunderstood (or took too far) Agile principles. The second most dangerous flaw in the manifesto however was in what was only mentioned once and is most often overlooked – quality. Consistently, across almost all Agile implementations, we see teams struggling with the outcome of a tight focus on velocity in the form of managing technical debt. Some even have proposed adding a fourth point to the Project Management Triangle of functionality, time, and resources, something we and several others disagree with. (It tends to muddy the waters and imply that quality is a negotiating point with project managers and can be adjusted; successful teams from the days of Lean Manufacturing on build quality into the process as early as possible in the pipeline as their way of doing business; its built-in as a design factor into all project plans and time estimates.) Scrapping excessive documentation and over-specced requirements was a masterstroke; as we have seen, too many orgs have misinterpreted this as meaning “no documentation” and shortchanged QA teams and testing during project crunches, and left their software in a nonworking state for much of the development process.

The third flaw, that tight myopic focus on the development of code, is what DevOps is designed to resolve – which is why DevOps has been called “the second decade of Agile”. We’ve discussed this at length earlier, but we’ll say it again – if it isn’t running, in production, with feedback to the delivery team – it doesn’t count. Agile was meant to deliver working software to production, where continual engagement with stakeholders/product owners would fine-tune features and reduce waste due to misunderstandings. Yet it addressed only software development teams, not the critical last mile of the marathon where software releases are tested, delivered to production, and monitored.

And so, in an attempt to resolve this problem of the “last mile” – along comes DevOps, sprinting to the forefront about ten years after the Agile manifesto was written. (What will we be calling this in ten years more, I wonder?) While the exact definition of DevOps remains in flux – and likely will remain so for some time – there’s a very clear vision of the evils DevOps is attempting to resolve. Stephen Nelson-Smith put it very frankly:

“Let’s face it – where we are right now sucks. In the IT industry, or perhaps to be more specific, in the software industry, particularly in the web-enabled sphere, there’s a tacit assumption that projects will run late, and when they’re delivered (if they’re ever delivered), they will underperform, and not deliver well against investment. It’s a wonder any of us have a job at all!”

Stephen went on to isolate four problems that DevOps is attempting to solve:

Fear of change (due to a well founded belief that the platform/software is brittle and vulnerable; mitigated by bureaucratic change-management systems with the evil side effect of lengthy cycle times and fix resolution times
Risky deployments (Will it work under load? So, we push it out at a quiet time and wait to see if it falls over)
It works on my machine! (the standard dev retort once sysadmins report an issue, after a very cursory investigation) – this is really masking an issue with configuration management
Siloization – the project team is split into devs, testers, release managers and sysadmins. This is tremendously wasteful as a process as work flows in bits and dribbles (with wait times) between each silo; it leads to a “lob it over the wall” philosophy as problems/blame are passed around between “team” members that aren’t even working in the same location. This “us versus them” mentality that often results leads to groups of people who are simultaneously suspicious of and afraid of each other.

These four problems seem to be consistent and seems to hit the mark of what the DevOps movement – however we define it – is trying to solve. DevOps is all about punching through barriers – large sized, manual deployments that break, firefighting bugs that appear in production due to a messy or inefficient testing suite or mismanaged configurations and ad hoc patches, and long wait times between different siloes in a shared services org.

So, the problem set is defined. Are there common binding principles we can point to that could be as useful as the Agile Manifesto was back in the 2000’s?

The Tolstoy Principle

We keep circling back to the famous opening lines Tolstoy wrote for his masterpiece “Anna Karenina”:

“All happy families are alike; every unhappy family is unhappy in its own way.”

Just as it’s a mistake to be overly prescriptive and recipe-driven with either Agile or DevOps – it would be even worse to repeat the “scrumterfall” antipatterns we’ve seen and throw the last ten years of hard-won lessons and principles out the window because “our company is unique and special / our business won’t allow this”. Tolstoy noted a fact that applies to organizations as well as families: Happy families tend to (even unconsciously) have certain common patterns and elements, well defined roles, and follow a structure that creates the environment for success. Unhappy families tend to have a lot of variance, little discipline (or too much), great inconsistency in how rules are followed, and no introspection or learning so that things iteratively improve.

Building on this definition, and thinking back to the Tolstoy Principle (all happy families are alike), we believe there are some common traits found in happy, successful DevOps families:

Fast release cycles and continuous delivery
Cross functional small teams responsible for product end-to-end
Continual learning and customer engagement
Discipline and a high level of automation

We could be grandiose and call this the DevOps Manifesto -but of course that’s neither possible or really necessary. Let’s just call this what it is – an observation of four key principles you’ll want to include in your vision. This attempts to define DevOps by how it behaves versus a prescriptive process, and we believe it adds on the foundation laid by the simple, neat definition of DevOps we lean towards: “the union of people, process, and products to enable continuous delivery of value to our end users”.

Tolstoy here put his finger on something that applies to organizations as well as families: Happy families tend to have certain common patterns and elements, well defined roles, and follow a structure that creates the environment for success. Unhappy families tend to have a lot of variance, either too much or too little discipline, great inconsistency in how rules are followed, and no introspection or learning so that things iteratively improve.

There’s an abundance of literature and material produced on DevOps and how it has addressed the three gaps above; for us that begins with the inspirational work produced by the “Big Three” of Jez Humble, Gene Kim, and Martin Fowler. Sifting through this mountain of research, we’re humbled by the quality of thought and vast amount of heroic effort that went into completing our Agile journey and eliminating waste in delivering business value faster across our industry. We also believe each presents different facets (or from different points of view) of the four core qualities or principles covered with our DevOps Manifesto above. All happy DevOps families truly are alike.

Let’s break each principle down in more detail:

Fast release cycles and continuous delivery

This is the one KPI that we feel is consistent and tells the most about the true health of a software delivery lifecycle: how long does it take for you to release a minor change to production? This tells you your cycle time; it’s not uncommon for customer requests to be tabled for months or years as development and IT teams are buried in firefighting or a lengthy list of aging stories.

A second question is, How many releases do you deliver, on average, per month? Increasing the frequency of your production releases is the best indicator we know of that a DevOps effort is actually gaining traction. In fact, if this is the only outcome of your DevOps adoption program – that release times are reduced by 50% or more – it’s likely that you can consider your effort an unqualified success.

If you have a fast cycle time, you are living the spirit of DevOps – your teams are delivering software at a fast clip and your customers do not have to wait for unacceptable lengths of time for new value to appear, be tested, and iteratively improved (or discarded if the new feature is not successful).

If you have frequent releases to production – your releases are small, incremental. This means you can quickly isolate and resolve problems without wading through tens of thousands of bundled changes; as the team has practiced releases thousands of times including rollbacks, everyone is comfortable with the release cycle and problems both with code, integration, and threats to your jugular – the release pipeline itself – will be fixed quickly. The old antipattern of the “war room” release with late hours frantically fixing bugs and issuing emergency hotfixes will become a thing of the past.

This goes without saying but just to be clear; by “fast release cycle” we mean fast, all the way, to production. That’s the finish line. A fast release cycle to QA – where it will sit aging on a shelf for weeks or months – gains us nothing. And by “continuous delivery” we mean “no long-running branches outside of mainline”. In the age of Git and distributed development there’s room for flexibility here, but one fact has remained constant since Jez Humble’s definitive book on the subject: lengthy integration periods are both evil and avoidable. It’s perfectly acceptable and perhaps necessary to have a release branch so issues can be reproduced; long running feature branches, almost inevitably, cause much more pain than they are trying to solve. This we will also explore later; suffice to say, we haven’t yet encountered an application that couldn’t be delivered continuously, with a significant amount of automation, direct from mainline. Your developers should be checking into mainline, frequently – multiple times a day – and your testing ecosystem should be robust enough to support that.

Teams that ignore this and build their release pipeline with complex branching strategies end up incurring the wrath of the Integration Gods, whose revenge is terrible; they are afflicted with lengthy and disruptive stabilization periods where the software is left in a nonworking or unreleasable state for long periods of time as long-lived branches are painfully merged with main. The Agile Manifesto focused on delivering working software quickly in contrast to lengthy documentation; the DevOps Manifesto extends this by calling on software delivery teams to deliver that working software, to production, continuously – from mainline.

We did mention multi-month milestones as an antipattern; this ties in with our Agile DNA of favoring responding to change and ongoing customer collaboration over following lengthy waterfall-type delivery plans and hundred-page requirements documentation that ages faster than milk. Still, it’s foolish for us to throw planning out the window and pretend that we are only living in the moment; software is developed tactically but should always adhere to a strategic plan that is flexible but makes sure we are hitting the target versus reactively shifting priorities sprint to sprint. We’ll cover the planning aspects more in a later chapter.

By “fast release cycles” we are very careful not to define what that means for you, exactly. Does it really matter to your customers or business if you can boast about releasing 1,000 or 10,000 releases a day? Of course not; a count of release frequency is a terrible goal by itself, and has nothing to do with DevOps. But as an indicator, it’s a great litmus test – are our environments and release process stable enough to handle more frequent releases? Teams that are on the right track in improving their maturity level usually show it by a slow and steady increase in their release frequency. We’ll point you to Rob England’s story earlier in this book of his public sector customer, whose CIO made an increased rate of release – say every 6 weeks instead of 6 months – a singular goal for their delivery teams. A steadier cadence meant pain, which in turn forced improvement. This worked for them because in their case deployments were their pain point – as Donovan Brown is fond of saying, “If it hurts, do it more often!”

Cross functional small teams

As Amazon CTO Werner Vogels says; “you build it, you run it”.

We’ll get more into team dynamics later. Suffice to say that over the past twenty years the ideal team size has been remarkably well defined – anywhere from 8 to 12 people. Less, and the teams are often too small to do the end-to-end support they’re going to be asked to do; more and team efficiency and nimbleness drops dramatically. Jeff Bezos of Amazon is famous for quipping, “Communication is terrible!” – in the sense that too much time is wasted in large teams. The “two pizza” rule begun in Amazon – where if a team grows larger than can be fed with two pizzas, its broken up – has been applied in many medium and large-sized organizations with close to universal success.

The sticking point here for most organizations is the implications of “cross functional”. Software development teams are offshoots of corporations after all; corporations and large industries were born from the Industrial Era. The DNA that we have inherited from that time of mass production, experimental science and creative innovation worked very well for its time – including grouping specialists together in functional groups. In software development teams however, that classical organizational structure works against us – each handoff from group to group assigned to a particular task lengthens the release cycle and strangles feedback. Again, we’ll cover this in greater detail later in the book – suffice to say, there is no substitute or workable alternative we know of to having a team responsible for its work in production. Efforts to form “virtual teams” where DBA’s, architects, testers, IT/Ops and developers resolves some problems around communication but the fact that each member has a different direct report or manager – and often different marching orders – creates the seeds for failure from the get-go.

We’re well aware that asking companies to change their structure wholesale from functional groups to a vertical structure supporting an app or service end-to-end is a mammoth undertaking. Some companies have made the painful but necessary leap in a mass effort; Amazon, Netflix, and Microsoft included. If your organization has massive problems – we’re talking an existential threat to survival, the kind that ensures enthusiastic buy-in from the topmost levels – and a strong, capable army of resources, this wartime-type approach may deliver for you. (See the Jody Mulkey interview in this book for a discussion on how this kind of a pivot can be structured and driven.) But a word of caution – speak to the survivors of these kinds of massive, wrenching transformations, and they’ll often mention the bumps and bruises they suffered that in retrospect could have been avoided. Often in most enterprises the successful approach is the slow and gradual one. More on this in a later chapter.

We hate how prescriptive Agile has become – and creating unnecessary or silly rules is one mistake that we don’t want to repeat. Over the past twenty years however, software teams in practice have finally caught up to the way cross functional units are built in the military, SWAT teams and elsewhere. It does appear to be a consistent guideline and a necessary component of DevOps – small teams are better than large ones, and efficient, nimble teams are usually 8 to 12 people in size.

Why is it important though that a team handles support in production? At an Agile conference in Orlando in 2011, one presenter made a very impactful statement – he said, “For DevOps to work, developers must care more about the end result – how the application is functioning in production. And Operations must care more about the beginning.” With siloed teams and separate areas of responsibility, there’s too much in the way of customer feedback, usage and monitoring data making their way back to the team producing features. Having the team be responsible for handling bug triage and end user support removes that barrier; this can be uncomfortable but in terms of keeping the team on point and delivering true business value – and adjusting as those priorities shift – there again is no substitute. It solves the problem mentioned at that conference – suddenly developers care, very much, about how happy their user base is and how features are running in production; Operations people, by being folded into the beginning and sharing the same focus and values as the project team, will be in a much better position to pass along valuable feedback so the team will stay on target.

In our experience, we’ve found very few companies where Agile transformations have not worked – in fact, we’ve never seen one fail. This is because the scope of Agile is limited to just the development portion; limiting the scope to just one group of people who often think and value the same things alike is a good recipe for a coherent mission and success. In contrast, DevOps efforts are fraught with peril. In the past five years, we’ve seen nearly a 50% failure rate; there is inevitably very strong resistance pockets even with strong executive and leadership support. DevOps has become both a controversial word over the past decade and a very disruptive and chancy – risky – organizational challenge.

Why the resistance? Part of it is due to the cross-cutting nature of development. For DevOps to work – really work – it requires a sea change of difference in how organizations are structured. Most medium to large sized orgs have teams organized horizontally in groups by function – a team of DBA’s, a team of devs, QA, Operations/IT, a project management and BSA layer, etc. This structure was thought to improve efficiency by grouping together specialists, and each is jealously guarded by loyalists and executives intent on protecting and expanding their turf. Most successful DevOps efforts we’ve seen require a vertical organization, where teams are autonomous and cross discipline. There are exceptions – some are mentioned in case studies in this book. But even with those exceptions, their adoption of DevOps has been slower than it could have been; eliminating these siloes appears to be a vital ingredient that can’t be eliminated from the recipe.

Another reason is that we are trying to “smoosh” together two groups with diametrically opposite goals and backgrounds. Operations teams are paid and rewarded based on stability; they are focused on high availability and reliability. Reliability and stability are on opposite ends of the spectrum from where development teams operate – change. Good development teams tend to focus on cool, bleeding-edge new technology its application in solving problems and delivering features for their customers – in short, change. This change is disruptive and puts at risk the stability and availability goals that IT organizations fixate on.

A culture of learning and customer engagement

This is another leaf off of the Agile tree, in this case the branches having to do with customer collaboration and responding to change. The signers of the original Agile Manifesto intended those last two principles to correct some known weaknesses of the old Gantt-dominated long-running projects; inflexible requirements that were hard-set as a contract at the beginning of a long-running project, leading to software not matching what their business partners were asking for. Perhaps the customers were really not sure what they wanted; perhaps their business objectives changed over the months code was in development.

Many so-called “scrum masters” get lost in the different ceremonies and artifacts around Agile and Scrum, and forget the key component of continual engagement with a stakeholder. Any true Agile team uses a sprint development cycle of a very short period of time – 1 to 4 weeks, the shorter the better – where at the end a team has a review with the business stakeholder to check on and correct their work. We knew we were going to be wrong at the end of the delivery cycle and the features we deliver wouldn’t meet the customer’s expectations. That’s OK – at least we could be wrong faster, after two weeks instead of 6 months. Checking in with the customer regularly is a must-have for Agile team; in retrospect, that continual engagement became the most powerful and uplifting component of Agile development.

Keeping a learning attitude leads to blame free postmortems – the single big common point. Is it safe to make changes? Do we learn?

Discipline and a high level of automation

One of the biggest antipatterns seen with scrum and Agile was the lack of a moat – anyone can (and did) fork over a thousand bucks for a quick course, call themselves “Scrum Certified”, and put up their shingle as an Agile SDLC consultant. Of the dozens of certified scrum masters I’ve met – shockingly few have ever actually written a line of code, or handled support in any form in a large enterprise.

Thankfully, I don’t see that happening with DevOps. You just can’t separate out coding, tools, and some level of programming and automation experience from running large-scale enterprise applications in production. So, tools are important.

Wrapping It Up

The four core DevOps principles we discuss above are – we believe – fundamental to any true DevOps culture. Removing or ignoring any one of these will limit your DevOps effectiveness, perhaps crippling it from inception.

For example, having an excessively large team will make the “turn radius” of your team too wide and cut down on efficiency. If the team is not responsible for the product end to end the feedback cycles will lengthen unacceptably, and too much time will be wasted fighting turf battles, trying to shove work between entrenched siloes with separate and competing priorities, and each team member comes into the project with a different perspective and operational goals. Any DevOps effort focused on “doing DevOps” and not on reducing release cycles and continuously delivering working software is fundamentally blinded in its vision. Not having the business engaged and prioritizing work creates waste as teams waste effort guessing on correct priorities and how to implement their features. Learning type organizations are friendlier to the amount of experimentation and risk required to weather the inevitable bumps and bruises as these changes are implemented and improved on. And without automation – a high level of automation both in building and deploying releases, executing tests, supporting services and applications in production and providing feedback from telemetry and usage back to the team – the wheels begin to fall off, very quickly.

In the case of DevOps, we believe there are certain common qualities that define a successful DevOps organization, which should be the end goal of any DevOps effort. There will be no DevOps Manifesto as we have with Agile –but success does seem to look very much the same, regardless of the enterprise. All DevOps families, it turns out, are very much alike.

Other Views of DevOps

There’s been many efforts to break DevOps down into a kind of taxonomy of species, and some stand out. For example, Seth Vargo of Google broke down DevOps into five foundational pillars:

Reduce organizational siloes (i.e. shared ownership)
Accept failure as normal
Implement gradual change (by reducing the costs of failure)
Leverage tooling and automation (minimizing toil)
Measure everything

…. Which we find quite nifty, and covers the sweet spots.

A book we quite admire and quote from quite a bit is Accelerate, which lists some key capabilities, broken into five broad categories: Continuous Delivery, Architecture, Product and Process, Lean Management and Monitoring, and Culture. This is another very solid perspective on how DevOps looks in practice.

Continuous Delivery

Use VC for all production artifacts

Automate your deployment process

Continuous integration

Trunk-based development methods (fewer than 3 active branches, branches and forks having short lifetimes (<1 day), no “code loc” periods where no one can do pull requests/check out due to merging conflicts, code freezes, stabilization phases)

Test automation (reliable, devs primarily responsible)

Support test data management

Shift left on security

Continuous delivery (software in deployable state throughout lifecycle, team prioritizes this over any new work)

Includes visibility on deployability / quality to all members

The system can be deployed to end users at any time on demand

Architecture

Loosely coupled architecture; a team can test and deploy on demand without requiring orchestration

Empowered tools (I can choose my own tools)

Product and Process

Gather and implement customer feedback

Make the flow of work visible (i.e. value stream)

Work in small batches. MVP, rapid dev and frequent releases – enable shorter lead times and faster feedback loops

Foster and enable team experimentation

Lean management and monitoring

Lightweight change approval process

Monitor across application and infrastructure to inform business decisions

Check system health proactively

Improve processes and manage work with work-in-process (WIP) limits

Visualize work to monitor quality and communicate throughout the team. Could be dashboards or internal websites

Cultural

Support a generative culture (Westrum) – meaning good information flow, high cooperation and trust, bridging between teams, and conscious inquiry.

Encourage and support learning: Is learning thought of as a cost or an investment?

Support and facilitate collaboration among teams

Provide resources and tools that make work meaningful

Support or embody transformational leadership (vision, intellectual stimulation, inspirational communication, supportive leadership, personal recognition)

References

See https://theagileadmin.com/2010/10/15/a-devops-manifesto/, a very good effort but in our opinion missing a few key pieces.

See https://devops.com/no-devops-manifesto/ – Christopher Little, writing in May 2016, feels very strongly that the very idea of a DevOps Manifesto is too rule-oriented and makes the curious argument that if one existed it would prevent any kind of meaningful dialogue. It’s an interesting position by a good writer, without much if any supporting evidence.

[debois] – http://www.jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/

[kim3] – The top 11 things you need to know about DevOps’, Gene Kim

[donovan] – http://donovanbrown.com/post/what-is-devops

[do2] – (http://dev2ops.org/2010/02/what-is-devops/)

[agileadmin] – (https://theagileadmin.com/what-is-devops/)

[garntner] – (https://www.gartner.com/it-glossary/devops)

[itskeptic] – Rob England, “Define DevOps: What is DevOps?” – 11/29/2014, http://www.itskeptic.org/content/define-devops

[mug2] – Ken Mugrage, “My Definition of DevOps”, https://kenmugrage.com/2017/05/05/my-new-definition-of-devops/#more-4 – Note, Ken seems to agree with us that a one-size-fits-all definition isn’t of value: “It’s not important that the “industry” agree on a definition. It would be awesome, but it’s not going to happen. It’s important that your organization agree (or at least accept) a shared definition.”

[sre] Change Management SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following: Implementing progressive rollouts Quickly and accurately detecting problems Rolling back changes safely when problems arise This trio of practices effectively minimizes the aggregate number of users and operations exposed to bad changes. By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/ contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.

	sdorsett on GitHub Copilot and App Mo…
	Roy K on DevOps Stories – an Interview…
	jweers on Thriving in a time of cha…
	jweers on Thriving in a time of cha…
	Jess’s Unfiltered on Thriving in a time of cha…