Nigel came to Puppet from Google HQ in Mountain View, where he was responsible for the design and implementation of one of the largest Puppet deployments in the world. At Puppet, Nigel was responsible for the development of the initial versions of Puppet Enterprise and has since served in a variety of roles, including head of product, CTO, and CIO. He’s currently the VP of Ecosystem Engineering at Puppet. He has been deeply involved in Puppet’s DevOps initiatives, and regularly speaks around the world about the adoption of DevOps in the enterprise and IT organizational transformation.
Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in mid-2019 and available now for pre-order!
The Deep End of the Pool
I grew up in Australia; I was lucky enough to be one of those kids that got a computer. It turns out that people would pay me to do stuff with computers! So I ended up doing just that – and found myself at a local college, managing large fleets of Macs and handling a lot of multimedia and audio needs there. Very early in my career, I found hundreds of people – students and staff – very dependent on me to be The Man, to fix their problems. And I loved being the hero – there’s such a dopamine hit, a real rush! The late nights, the miracle saves – I couldn’t get enough.
Then the strangest thing happened – I started realizing there was more to life than work. I started getting very serious about music, to the point where I was performing. And I was trying a startup with a friend on the side. So, for a year or two, work became – for the first time – just work. Suddenly I didn’t want to spend my life on call, 24 hours a day – I had better things to do! I started killing off all my manual work around infrastructure and operations, replacing it with automation and scripts.
That led me to Google, where I worked for about five years. I thought I was a scripting and infrastructure ninja – but I got torn to shreds by the Site Reliability Engineers there. It was a powerful learning experience for me – I grew in ways I couldn’t have anywhere else. For starters, it was the deep end of the pool. We had a team of four managing 80,000 machines. And these weren’t servers in a webfarm – these were roaming laptops, suddenly appearing on strange networks, getting infected with malware, suffering from unreliable network connections. So we had to automate – we had no choice about it. As an Ops person, this was a huge leap forward for me – it forced me to sink or swim, really learn under fire.
Then I left for Puppet – I think I was employee #13 there – now we’re at almost 500 and growing. I’m the Chief Technical Strategist, but that’s still very much a working title – I run engineering and product teams, and handle a lot of our community evangelism and architectural vision. Really though it all comes down to trying to set our customers up for success.
I don’t think our biggest challenge is ever technical – it’s much more fundamental than that, and it comes down to communication. There’s often a real disconnect between what executives think is true – what they are presenting at conferences and in papers – and what is actually happening on the ground. There’s a very famous paper from the Harvard Business Review back in the 1970s that said that communication is like water. Communication downwards is rarely a problem, and it works much better than most managers realize. However, open and honest communication up the chain is hard, like trying to pump water up a hill. It gets filtered or spun, as people report upwards what their manager wants to believe or what will reflect well on them – and next thing you know you have an upper management layer that thinks it is well informed but is really in an echo chamber. Just for example, take the Challenger shuttle disaster – technical data that clearly showed problems ahead of the explosion were filtered out, glossed over, made more optimistic for senior management consumption.
We see some enterprises out there struggling and it becomes this very negative mindset – “oh, the enterprise is slow, they make bad decisions, they’re not cutting edge.” And of course that’s just not true, in most cases. These are usually good people, very smart people, stuck in processes or environments where it’s difficult to do things the right way. Just for example, I was talking recently to some very bright engineers trying to implement change management, but they were completely stuck. This is a company of about 100,000 people – for every action, they had to go outside their department to get work done. So piecemeal work was killing them – death by a thousand cuts.
Where To Start
In most larger enterprises, aiming for complete automation, end to end, is something of a pipe dream – just because these companies have so many groups and silos and dependencies. But that’s not saying that DevOps is impossible, even in shared services type orgs. This isn’t nuclear science, it’s like learning to play the piano. It doesn’t require brilliance, it’s not art – it’s just hard work. It just takes discipline and practice, daily practice.
I have the strong impression that many companies out there SAY they are doing DevOps, whatever that means – but really it hasn’t even gotten off the ground. They’re still on square one, analyzing and trying to come up with the right recipe or roadmap that will fit every single use case they might encounter, past, present, and future. So what’s the best way forward if you’re stuck in that position?
Well, first off, how much control do you have over your infrastructure? Do you have the ability to provision your VMs, self-service? If so you’ve got some more cards to play with. Assuming you do – you start with version control. Just pick one – ideally a system you already have. Even if it’s something ancient like Subversion – if that’s what you have, use it as your one single source of truth. Don’t try to migrate to the latest and greatest hipster VC system. You just need to be able to programmatically create and revert commits. Put all your shell scripts in there and start managing your infrastructure from there, as code.
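To make "programmatically create and revert commits" concrete, here is a minimal sketch. It assumes Git purely for illustration (the same idea applies to Subversion or whatever system you already have), and the function names and the script path in the usage note are hypothetical.

```python
import subprocess


def commit_commands(paths, message):
    """Build the Git commands that stage the given files and record a commit."""
    return [
        ["git", "add", "--", *paths],
        ["git", "commit", "-m", message],
    ]


def revert_command(commit_id):
    """Build the Git command that undoes one commit by adding a new, inverse commit."""
    return ["git", "revert", "--no-edit", commit_id]


def run_all(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)
```

A deployment script could call `run_all(commit_commands(["scripts/ntp.sh"], "Add NTP setup"), repo_dir)` to record a change and `run_all([revert_command(commit_id)], repo_dir)` to back it out, so both directions go through the same single source of truth.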
Now you’ve got your artifacts in version control and you’re using it as a single repository, right? Great – then talk to the people running deployments on your team. What’s the most painful thing about releases? Make a list of these items, and pick one and try to automate it. And always prioritize building blocks that can be consumed elsewhere. For example, don’t attempt to start by picking a snowflake production webserver and trying to automate EVERYTHING about it – you’ll just end up with a monolith of infrastructure code you can’t reuse elsewhere, and your quality needle won’t budge. No, instead you’d want to take something simple and in common and create a building block out of it.
For example, time synchronization – it’s shocking, once you talk to Operations people, how something so simple and obvious as a timestamp difference between servers can cause major issues – forcing a rollback due to cascading issues or a troubleshooting crunch because the clocks on two servers drifted out of sync and it broke your database replication. That’s literally fixed in Linux by installing and configuring a single package. But think about the reward you’ll get in terms of quality and stability with this very unglamorous but fundamental little shift.
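As a rough illustration of how small such a building block can be, the sketch below flags servers whose clocks have drifted away from the rest of the fleet. It is an assumption-laden example, not a replacement for a real time synchronization package: the host names, the readings (epoch seconds collected at roughly the same moment), and the one-second tolerance are all hypothetical.

```python
from statistics import median


def find_drifted_hosts(clock_readings, tolerance_seconds=1):
    """Return {host: offset} for hosts whose clock differs from the
    fleet median by more than tolerance_seconds."""
    if not clock_readings:
        return {}
    reference = median(clock_readings.values())
    return {
        host: reading - reference
        for host, reading in clock_readings.items()
        if abs(reading - reference) > tolerance_seconds
    }


# Hypothetical readings: two database servers agree, one web server is ahead.
readings = {"db-primary": 1000, "db-replica": 1000, "web-1": 1005}
drifted = find_drifted_hosts(readings)
```

Here `drifted` contains only `web-1`, with an offset of 5 seconds. A check like this is reusable across every server you manage, which is the point of picking something simple and common first.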
Take that list and work on what’s causing pain for your on-call people, what’s causing your deployments to break. The more you can automate this, the better. And make it as self-service as possible – instead of having the devs fire off an email to you, where you create a ticket, then provision test environments – all those manual chokepoints – wouldn’t it be better to give the devs the ability to call an API or click on a website button and get a test environment spun up automatically that’s set up just like production? That’s a force multiplier in terms of improving your quality right at the get-go.
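The self-service idea can be sketched in a few lines. This is a toy, in-memory model under stated assumptions, not a real provisioning API: the class names, the naming scheme, and the package and setting values are all hypothetical. What it shows is the core contract: a developer asks for an environment and gets back one copied from the production definition, with no ticket in between.

```python
import copy
import itertools
from dataclasses import dataclass


@dataclass
class EnvironmentSpec:
    name: str
    packages: dict
    settings: dict


class SelfServiceProvisioner:
    """Hands out test environments that mirror the production definition."""

    def __init__(self, production_spec):
        self._production = production_spec
        self._counter = itertools.count(1)
        self.environments = {}

    def request_test_environment(self, requester):
        """Create and register a test environment copied from production."""
        name = f"test-{requester}-{next(self._counter)}"
        env = EnvironmentSpec(
            name=name,
            # Deep copies, so changes made in a test environment
            # never leak back into the production definition.
            packages=copy.deepcopy(self._production.packages),
            settings=copy.deepcopy(self._production.settings),
        )
        self.environments[name] = env
        return env


# Hypothetical production definition and a developer request.
production = EnvironmentSpec("production", {"nginx": "1.24"}, {"workers": 4})
provisioner = SelfServiceProvisioner(production)
test_env = provisioner.request_test_environment("alice")
```

In a real setup the method body would call whatever provisioning tooling you already have; the part worth keeping is that the production definition is the only input, so test and production cannot quietly diverge.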
Now you’ve got version control, you can provision from code, you can roll out changes and roll them back. Maybe you add in inventory and discoverability of what’s actually running in your infrastructure. It’s amazing how few organizations really have a handle on what’s actually running, holistically. But as you go, you identify some goals and work out the practices you want to implement – then choose the software tool that seems the best fit.
Continuous Delivery Is The Finish Line
The end goal though is always the same. Your target, your goal is to get as close as you can to Continuous Integration / Continuous Delivery. Aiming for continuous delivery is the most productive single thing an enterprise can do, pure and simple. There are tools around this – obviously working for Puppet I have my personal bias as to what’s best. But pick one, after some thought – and play with it. Start growing out your testing skills, so you can trust your release gates.
With COTS products you can’t always adopt all of these practices – but you can get pretty close, even with big-splash, multi-GB releases. For example, you can use deployment slots and script as much as you can. Yes, there are going to be some manual steps – but the more you can automate even this, the happier you’ll be.
Over time, kind of naturally, you’ll see a set of teams appear that are using CI/CD, and automation, and the company can point to these as success stories. That’s when an executive sponsor can step in and set this as a mandate, top down. But just about every DevOps success story we’ve seen goes through this pioneering phase where they’re trying things out squad by squad and experimenting – that’s a good thing. You can’t skip this, any more than a caterpillar can go right to being a butterfly.
At first I really hated the whole DevOps Team concept – and in the long term, it doesn’t make sense. It’s actually a common failure point – a senior manager starts holding this “A” team up as an example. This creates a whole legion of haters and enemies, people working with traditional systems who haven’t been given the opportunity to change like the cool kids – the guys always off at conferences, running stuff in the cloud, blah blah. But in the short term it totally has its place. You need to attach yourself to symbols that make it clear you’re trying to change. If you try to boil the ocean or spin it out with dozens of teams, it gets diluted and your risk rises, and it could lose credibility. Word of mouth needs to be in your favor, kind of like band t-shirts for teenagers. So you can start with a small group initially for your experiments – just don’t let it stay that way too long.
But what if you DON’T have that self-provisioning authority? Well, there are ways around that as well. You see departments doing things like capacity planning and reserving large pools of machines ahead of time. That’s obviously suboptimal and it’s disappearing now that more people are seeing what a powerful game-changer the cloud and self-provisioned environments are. The point is – very rarely are we completely shackled and constrained when it comes to infrastructure.
Automation and Paying Off Technical Debt
It’s all too easy to get bogged down in minutiae when it comes to automation. I said earlier that DevOps isn’t art, it’s just hard work – and that’s true. But focus that hard work on the things that really matter. Your responsibility is to make sure you guard your time and that of the people around you. If you’re not careful, you’ll end up replacing this infinite backlog of manual work you have to do with an infinite number of tasks you need to automate. That’s really demoralizing, and it really hasn’t made your life that much better!
Let’s take the example of a classic three-tier web app you have on-prem. And you’ve sunk a lot of time into it so that now it fails every 6 months versus every week – terrific! But for that next step – instead of trying to automate it completely end to end, which you could do – how could you change it so that it’s more service oriented, more loosely coupled, so your maintenance drops even more and changes are less risky? Maybe building part of it as a microservice, or putting up that classic Martin Fowler strangler fig, will give you this dramatic payoff you would never get with grinding out automation for the sake of automation and never asking if there’s a better way.
Paying off technical debt is a grind, just like paying off your credit card and paying off the mortgage. Of course you need to do that – but it shouldn’t be all you do! Maybe you’ll take some money and sink it into an investment somewhere, and get that big boost to your bottom line. So instead of mindlessly just paying off your technical debt, realize you have options – some great investment areas open to you – that you can invest part of your effort in.
Optimism Bias and Culture
This brings us right back to where we started, communication. There is a fundamental blind spot in a lot of books and presentations I see on DevOps, and it has to do with our optimism bias. DevOps started out as a grassroots, community driven movement – led and championed by passionate people that really care about what they’re doing, why they’re doing it. Pioneers like this are a small subset of the community though – but too often we assume ‘everyone is just like us’! What about the category a lot of people fall in – the ones who just want to show up, do their job, and then go home? If we come to them with this crusade for efficiency and productivity, it just won’t resonate with the 9 to 5 crowd. They like the job they have – they do a lot of manual changes, true, but they know how to do it, it guarantees a steady flow of work and therefore income, and any kind of change will not be viewed as an improvement – no matter how you try to sell it. You could call this “bad”, or just realize that not everyone is motivated by the same things or thinks the same way. In your approach, you may have to mix a little bit of pragmatism in with that starry-eyed DevOps idealism – think of different ways to reach them, work around them, or wait for a strong management drive to collapse this kind of resistance.