I was at a conference earlier this week and we got some outstanding questions about how Microsoft went about its transformation – especially with the Azure DevOps team. I want to build on this with a follow-up post that goes into more depth on our use of culture and automation, but here’s a good place to start, with some great links.
Question #1 – How do we handle planning on a strategic level with the more tactical focus of Agile?
- Using features and epics – I love this – https://docs.microsoft.com/en-us/azure/devops/boards/backlogs/define-features-epics?view=vsts&tabs=new-nav
- How we handle epic level planning at Microsoft – Channel 9 Donovan Brown interview with Aaron Bjork, 34 minute video. Excellent.
- Note there’s a relatively new feature in Azure DevOps to report across teams called Delivery Plans. More information on this and how to set up reporting is here. A detailed walkthru on implementation is here.
Question #2 – How did Microsoft go about their transformation to DevOps from a shared services model?
- See my interview with Aaron Bjork, and the outstanding YouTube video he did in the footer. One of the best 45 min videos on the topic I’ve ever seen – anywhere. https://driftboatdave.com/2018/05/30/devops-stories-aaron-bjork-microsoft/
- A great article on “What is Agile” from Aaron Bjork. https://docs.microsoft.com/en-us/azure/devops/learn/agile/what-is-agile
- Our best DevOps “how and why” articles are by Sam Guckenheimer – great guy – see this central hub for more. https://docs.microsoft.com/en-us/azure/devops/learn/what-is-devops
Question #3 – What about testing? (An unreliable, flaky test layer is usually one of our biggest blockers to improving release reliability and velocity.)
- Great interview page with Munil – https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/shift-left-make-testing-fast-reliable
- Also on Eliminating Flaky Tests (we want red to mean red) and our unabashed use of testing in production. This caused great angst for our developers but ended up being the single biggest contributor to our success.
- Our branching and release strategy. We did experiment with GitHub Flow but found it didn’t work in our specific case. We do work off of trunk – long-lived feature branches (>1 day) are verboten.
- We like rotating “F” and “L” teams so some of your people are handling direct (livesite) support while others are focused on development. Note that this doesn’t mean 100% of all support calls hit devs directly – it might just be 5% – but devs must share some operational support for DevOps to work. And you want to tune your alerts so if someone is getting woken up at 5 am, there’s a damn good reason – the Google SRE book and “Practical Monitoring” go into more detail. See Aaron Bjork’s overall presentation I mentioned before, or this page for some great videos on our move to a livesite culture.
- We also shifted left on security. Here’s a good walkthru.
- How we moved from a monolith to cloud-based microservices, by Buck Hodges. He says point-blank that moving to the cloud the way MSFT did – lifting and shifting – was a big risk, like jumping off a cliff. Looking back, it ended up being the best way forward.
- Using feature flags and release rings to control the blast radius.
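To make that last bullet a little more concrete, here is a minimal sketch of how a feature flag can be combined with release rings to limit the blast radius. This is an illustration only, not how Azure DevOps implements it: the ring names, the `FeatureFlag` class and the `is_enabled` function are all hypothetical names invented for this example.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical ring names, ordered from the smallest audience to the largest.
RINGS = ["ring0_team", "ring1_internal", "ring2_early_adopters", "ring3_everyone"]


@dataclass
class FeatureFlag:
    name: str
    max_ring: int = -1   # outermost ring index the feature has reached; -1 = off everywhere
    percent: int = 100   # share of users in that outermost ring who get the feature


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 so the same user always gets the same answer."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: FeatureFlag, user_id: str, user_ring: int) -> bool:
    """Decide whether a user in a given ring sees the feature."""
    if user_ring > flag.max_ring:
        return False   # the rollout has not reached this ring yet
    if user_ring < flag.max_ring:
        return True    # inner rings are fully on once the rollout has moved past them
    # In the outermost enabled ring, expose only a percentage of users.
    return bucket(flag.name, user_id) < flag.percent


# Example: rolled out to ring 1, currently at 10% of that ring.
flag = FeatureFlag(name="new-dashboard", max_ring=1, percent=10)
print(is_enabled(flag, "alice", user_ring=0))  # inner ring: on
print(is_enabled(flag, "bob", user_ring=3))    # outer ring: off
```

The point of the design is that widening a rollout (or pulling it back when something breaks) is a change to `max_ring` or `percent`, not a redeploy.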
Question #4 – Production Support. Let’s say we have an Agile team, 8-12 people. How the heck are we supposed to do global support across multiple regions, 24x7x365 in production?
- Short answer – the only way this will work is if 1) you’re only supporting a small sliver of functionality, 2) you gate the support demands on your devs so it’s <50% of their time, i.e. the SRE model, and 3) alerts are tuned so that only truly important things make it through. More than likely you’re going to have some operational support – even offshore or 3rd party – handled externally to the team. I talk about this extensively in my book; the books “The Art of Monitoring” and “Practical Monitoring” also elaborate on this.
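As a rough sketch of points 2 and 3, the snippet below routes alerts so that only actionable, customer-impacting ones page a human, and checks whether support work has hit the 50% cap. The `Alert` fields, the three routing outcomes and the `over_support_budget` helper are assumptions made up for this example; a real setup would live in your monitoring and on-call tooling.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    customer_impacting: bool  # is a user actually affected right now?
    actionable: bool          # can the person who receives it do something about it?


def route(alert: Alert) -> str:
    """Only wake someone up when the alert is both customer-impacting and actionable."""
    if alert.customer_impacting and alert.actionable:
        return "page"
    if alert.actionable:
        return "ticket"   # worth fixing, but during working hours
    return "log"          # keep it for diagnosis, notify no one


def over_support_budget(support_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True when support work has reached the cap (the post's rule is under 50% of dev time)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return support_hours / total_hours >= cap


print(route(Alert("checkout-errors", customer_impacting=True, actionable=True)))
print(route(Alert("disk-70-percent", customer_impacting=False, actionable=True)))
print(over_support_budget(support_hours=22, total_hours=40))
```

When the budget check trips, the usual response in the SRE model is to hand the overflow to the external support layer mentioned above, not to ask the team to absorb more.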