The following content is shared from an interview with Sam Guckenheimer, product owner for Visual Studio Team Services. When people ask us “how Microsoft did it” with its DevOps transformation, we often think of the lessons Sam shared with us during our talk. There is so much to learn here that can help other companies on their own journey to better, faster, and safer delivery of value!
These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!
One thing I want to start with – it really annoys me when I read grandiose claims that DevOps is broken in some way. We know that’s just not the case – Gartner tells us at least half of enterprises have something going on with DevOps and they all want to do more. If you look at Agile, which began with the Agile Manifesto back in 2001 – and compare it with where it was as a movement a decade later in 2011 – well, that would look very much like where we are today, about 10 years after DevOps first began as a concept back in 2009. The trends are really clear, and our success rate and the maturity of the tools and processes are only going to go up.
Avoid Massive Reorgs: It’s just not true when some say you have to “blow up the organization” to make DevOps work. Change is necessary – you have to get rid of all the handoffs and the waste, and really follow the Lean model of removing the layers that sit between developers and production, and between developers and customers. But that doesn’t mean you need to make drastic moves, and that’s not how we did it at Microsoft. It can be done in an evolutionary way.
Most companies don’t have the luxury of saying, “let’s blow it up” and just jettison decades of code with their legacy applications and start over. That’s your lifeblood! I know that was true with us on the Visual Studio team; we had to go about things in a very gradual way so we didn’t threaten the jugular of our company.
Find Your North Star: Six years ago we found our North Star – how we wanted to go about delivering value using the DevOps mindset – and we pointed to it, saying – “we want to be a world class engineering organization”. Everything we’ve done since then, every major decision we’ve made, has been built around measuring our progress towards that mission.
Jez Humble has joked a few times about some companies trying to “sprinkle magical microservices fairy dust” over things to magically get cloud services architecture. I have to say – there was no fairy dust for us. It required progressive change, some very conscious hard engineering changes, and walking the walk.
Just for example, overhauling our test portfolio and moving to Git took three years. We kept deprecating older, slow tests and replacing them with faster ones incrementally – sprint by sprint, test by test. Now it takes us about 7 minutes to run 70K unit tests before a developer commits to master. The value is incredible for us – before that, we had these long-running integration tests that had never run completely green, that always required manual intervention, and that were killing our release flow.
Everything – our refactoring from monolith to microservices, our safe deployment practices, building a lifecycle culture, even our datacenter standup automation – required a lot of work, a multi-year commitment, and persistence despite setbacks. We knew, though, where the “North Star” was and we were committed. Our approach was – set the goal, measure the progress, and keep going until we get there.
Production Support: Shifting to a production support mindset was a big change and of course not everyone was onboard, especially at first. We knew that would be our most important and critical win – making sure the delivery teams were onboard and happy with what was going on. We measured this as one of our first KPIs. We would do regular surveys of engineering satisfaction and go into depth about their jobs, how the tooling was supporting their jobs, how the process was supporting their jobs – and what we saw was a steady rise in satisfaction.
As an example, one of the things we measured was alerting frequency – are we getting to the right person the first time? That’s something we are always watching – if you’re waking people up at 2 in the morning, it had better be the right person. We need to make sure that we are paying attention to the things that matter to people’s lives and their satisfaction with their jobs.
When you’re genuine, you get a genuine response. This all helps build that high-trust culture that Gene Kim and others have emphasized as key.
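To make the "right person the first time" measure concrete, here is a minimal Python sketch of one way it could be computed. This is an illustration only, not the telemetry Microsoft uses: the `Page` record, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Page:
    """One on-call page. All field names here are hypothetical."""
    incident_id: str
    first_paged: str  # who the alert woke up first
    resolver: str     # who turned out to own the problem


def first_time_right_rate(pages: Sequence[Page]) -> Optional[float]:
    """Fraction of pages whose first recipient was also the resolver."""
    if not pages:
        return None  # no pages means there is nothing to measure
    hits = sum(1 for p in pages if p.first_paged == p.resolver)
    return hits / len(pages)


# Made-up sample data, for illustration only.
sample_pages = [
    Page("inc-1", first_paged="ana", resolver="ana"),
    Page("inc-2", first_paged="ben", resolver="chi"),  # had to be rerouted
    Page("inc-3", first_paged="chi", resolver="chi"),
    Page("inc-4", first_paged="ana", resolver="ana"),
]

rate = first_time_right_rate(sample_pages)
print(f"first-time-right rate: {rate:.0%}")  # prints: first-time-right rate: 75%
```

Tracking this ratio over time, rather than the raw alert count, is what tells you whether the 2 a.m. pages are landing on the person who can actually act on them.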
The concept of servant leadership has been a big part of our change; good managers care about their team and look for ways to make their jobs easier. That’s the Andon cord philosophy – anyone on the floor can pull that cord, stop the line if needed – and the manager comes over, the root cause is identified and rolled into the process so future incidents don’t happen. So in our case – we don’t close out livesite incidents for example until the fix is identified and in the backlog so it won’t happen again.
Setting Goals and Metrics: Our North Star remains fixed but we are always redefining how we want to get there. Every 6 months we select, epic by epic, 3 or 4 goals that define success for us over the next six months, and the specific metrics that will define them. We publish these and they’re flowed all the way up the management chain. Those goals and metrics on an epic level don’t change for those six months. Each person on the feature crews knows which epics they’re working on and can ask each sprint – what are the next few things we need to do to move the needle along these goals? They look ahead about 3 sprints in terms of what they’re trying to do – no more than that. That level of planning is key for us to make progress in an iterative way and minimize disruption.
In the beginning, we thought it was really not a big deal to figure out the metrics and focus on the right thing and so forth. It turns out that finding the right metrics is as complicated as designing the right feature. It’s really not obvious what to measure or what you’re striving for. Very frequently, you don’t have an out-of-the-box way of doing the telemetry – so you need to instrument for the business APIs you want.
A really clear example of this – one of the metrics that we’re interested in is, how many developers are working on projects that are doing continuous delivery to Azure? That’s a very hard thing to count. You have to make several leaps of instrumentation and joins in order to answer that. Asking the question clearly and getting a way of gathering data on it is a real engineering problem – and one that the web and books about lean customer analytics typically make sound much simpler, and less of an obstacle, than it really is.
This goes way, way beyond your standard # of site visitors or simple generic use cases for a website. Until you start getting down to brass tacks and define what the things are that we care about as a business and why – it’s difficult to appreciate how challenging it is to come up with the right measurables.
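The "developers on projects doing continuous delivery to Azure" question shows why such a count needs joins. The Python sketch below is only a toy version under stated assumptions: the two event streams, their tuple shapes, and the rule that "doing continuous delivery" means at least `min_deploys` deployments to the target are all hypothetical choices, and picking that rule is itself part of the hard work described above.

```python
from collections import Counter

# Hypothetical telemetry, for illustration only.
# (developer, project) pairs taken from commit events.
commits = [
    ("alice", "web"), ("bob", "web"), ("carol", "api"),
    ("alice", "api"), ("dave", "batch"),
]
# (project, deployment target) pairs taken from release events.
deployments = [
    ("web", "azure"), ("web", "azure"), ("web", "azure"),
    ("api", "on-prem"), ("batch", "azure"),
]


def developers_on_cd_projects(commits, deployments,
                              target="azure", min_deploys=3):
    """Join commit and deployment events to find the developers who work
    on projects that deploy to `target` at least `min_deploys` times."""
    # Step 1: count deployments per project for the chosen target.
    deploy_counts = Counter(
        project for project, tgt in deployments if tgt == target
    )
    # Step 2: apply the (assumed) definition of "doing continuous delivery".
    cd_projects = {p for p, n in deploy_counts.items() if n >= min_deploys}
    # Step 3: join back to commits to get distinct developers.
    return {dev for dev, project in commits if project in cd_projects}


cd_developers = developers_on_cd_projects(commits, deployments)
print(sorted(cd_developers))  # prints: ['alice', 'bob']
```

Even in this toy form, the answer depends on two separately instrumented event streams and on a threshold someone has to agree on, which is why the metric is harder to produce than a simple visitor count.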
Value Stream Mapping: I’m going to shock you a little here – we don’t do normal value stream mapping here. My observation is that value stream mapping is really effective when you want to get people on the same page and get some momentum going towards a DevOps movement. Once you show people – wow, it takes us 60 days to get something to production, and most of that is wait time – 5 days for approval here, 7 days for testing here – that’s great to get everyone to see the elephant in the room. It never fails to shock people once they see how huge that bucket of idle time is!
For us, we’re past that initial shock phase. We focus heavily on all the things that value stream mapping attacks in terms of handoffs, idle time versus process time, etc – but it is definitely not something you need to do on an ongoing basis, in my opinion.
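The arithmetic behind that "bucket of idle time" is simple enough to sketch in Python. In the example below, only the 60-day total and the 5-day approval and 7-day testing waits come from the discussion above; the step names and every other number are hypothetical values chosen to add up to 60 days.

```python
# Each step is (name, process_days, wait_days). Values other than the
# approval and testing waits are made up for illustration.
steps = [
    ("development", 4, 16),
    ("approval", 1, 5),
    ("testing", 3, 7),
    ("release", 1, 23),
]


def summarize(steps):
    """Total up process time and wait time across a value stream."""
    process = sum(p for _, p, _ in steps)
    wait = sum(w for _, _, w in steps)
    lead = process + wait
    if lead == 0:
        raise ValueError("value stream has no recorded time")
    return {
        "lead_days": lead,
        "process_days": process,
        "wait_days": wait,
        # Share of the lead time in which work is actually being done.
        "flow_efficiency": process / lead,
    }


summary = summarize(steps)
print(summary["lead_days"], summary["wait_days"])
print(f"flow efficiency: {summary['flow_efficiency']:.0%}")
```

With these made-up numbers, 51 of the 60 days are wait time, which is the kind of result that gets everyone to see the elephant in the room.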
Two Key Antipatterns: I see two key failings that sometimes trip organizations up. First, people often think in terms of formulas – you need to do X with the people, Y with the process, and Z with the tools – and think of each of these as independent pillars that you can tackle one at a time in phases. That ends up being counterproductive, making things more complicated and drawn out, because in reality all these things are interrelated and need to be thought of together.
My advice is to fight the tendency to take a single practice, however good, and try to implement it in isolation. Think in terms of all three columns as supporting a single building together; each improvement should touch on people, process, and tools in some way and make it a little better. Focus on the quick wins – try to stairstep your maturity, building something small that quickens that release cycle and delivers feedback faster.
The second antipattern is not getting the right balance of leadership and delegation. You need to have obvious skin in the game from leadership, and initiative from individual practitioners. Think back to that great book “Drive” by Dan Pink, which stressed the value of Autonomy, Mastery, and Purpose. You are going to need to spark people and get them enthusiastic, active, and feeling like they control their destiny – autonomy.
It’s really part art and part science, because that autonomy has to be balanced with purpose, which is driven consistently and forcefully by management. And if you look at most of the current execs at Microsoft, you will see that they both practice high empathy and engage deeply on the technical side.
Mission is key for us but it goes beyond just a few words or a slogan. We put up guardrails, very clear rules of the road that specify “here is what you need to do to check your code into master.” We have a very clear definition of done that is common to every team – “code delivered with tests and telemetry and deployed in production worldwide.”
This is the exact opposite of “it works on my machine” – and everyone knows it. If you’re doing new work, there’s a set of common services we provide, including sample code and documentation. So no one has to reinvent the wheel when it comes to telemetry, for example – you might improve on it, but you would never have to deliver it from scratch; it’s reused from a common set of services.