DevOps Stories – an Interview with Ryan Comingdeer of Five Talent Software


Ryan Comingdeer is the CTO of Five Talent Software, a thriving software consultancy firm based in Oregon with a strong focus on cloud development and architecture. Ryan has 20 years of experience in cloud solutions, enterprise applications, IoT development, website development and mobile apps.  

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

Obstacles in implementing Agile: Last week I was talking to a developer at a large enterprise who was boasting about their adoption of Agile. I asked him – OK, that’s terrific – but how often do these changes get out the door to production? It turns out that these little micro changes get dropped off at the QA department, and are then pushed out to staging once a month or so… where they sit until they’re deemed ready to release and the IT department is ready – once a quarter. So that little corner was Agile – but the entire process was stuck in the mud.

The first struggle we often face when we engage with companies is just getting these two very different communities to talk to one another. Often it’s been years and years of the operations department hating on the development team, and the devs not valuing or even knowing about business value and efficiency. This is hard work, but understanding that philosophy and seeing the other side of things is that vital first step.

I know I’ve won in these discussions – and this may be 12 meetings in – when I can hear the development team agreeing to the operations team’s goals, or an operations guy speaking about development requirements. You have to respect each other and view work as a collaborative effort.

For the development teams, often they’re onboard with change because they recognize the old way isn’t working. Oftentimes the business throws out a deadline – ‘get this done by April 1st’ – and when the developers try to drill into requirements, they get an empty chair. So they do the best they can – but there are no measurable goals, no iterative way of proving success over an 18-month project. So they love the idea of producing work often in sprints – but then we have to get them to understand the value of prototyping, setting interim deliverables and work sizing.

Then we get to the business stakeholders, and have to explain – this is no longer a case where we can hand off a 300-page binder of requirements and ask a team to ‘get it done’. The team is going to want us involved, see if we’re on the right track, get some specific feedback. Inevitably we get static over this – because this seems like so much more work. I mean, we had it easy in the old days – we could hand off work, and wait 12 months for the final result. Sure, the end result was a catastrophic failure and everybody got fired, but at least I wasn’t hassled with all these demos and retrospectives every two weeks! That instant feedback is really uncomfortable for many leaders – there’s no insulation, no avoidance of failure. It does require a commitment to show up and invest in the work being done as it’s being done.


Retrospectives for me are one of the best things about Agile. I wish they were done more often. We do two, one internally, then a separate one with the customer so we’re prepared – and we’re upfront, here’s where we failed, here’s the nonbillable time we invested to fix it. You would think that would be really damaging, but we find it’s the opposite. The best thing a consulting company can do is show growth, reviewing successes and failures directly and honestly to show progress. Our relationships are based on trust – the best trust building exercise I’ve seen yet is when we admit our failure and what we’re going to do to fix it. I guarantee you our relationship with the customer is tighter because of how we handled a crisis – versus attempting to hide, minimize, or shift blame.  

Implementing DevOps: It’s very common that the larger organizations we work with aren’t sure of where to start when it comes to continuous integration or CD. Where do I begin? How much do I automate? Often it comes down to changing a habit, like checking in a new feature only after, say, two weeks of work. That’s just not going to cut it – what can we deliver in 4 hours?

That being said, CI/CD is Step 1 to DevOps; it’s fundamental. Infrastructure as Code is further down the list – it takes a lot of work, and it’s sometimes hard to see the value of it. Then you start to see the impact with employee rotation, and especially when you have to roll back changes – you can see what was changed and when; without it, you might be stuck and have to fix a problem in place. The single biggest selling point for Infrastructure as Code is security; you can demonstrate what you’re doing to regulate environments, you can show up to an audit prepared with a list of changes, who made them and what they were, and a complete set of security controls.

A True MVP: Most of the companies we work with come to us because they’ve got a huge backlog of aging requests, these mile-long wish lists from sales and marketing teams. We explain the philosophy behind DevOps and the value of faster time to market, small iterations, more stable environments, and a reliable deployment process. Then we take those huge lists of wishes and break them down into very small pieces of work, and have the business prioritize them. There’s always one that stands out – and that’s our starting point.

The first sprint is typically just a proof of concept of the CI/CD tools and how they can work on that top #1 feature we’ve identified. The development team works on it for perhaps 2 days, then sysops takes over and uses our tooling to get this feature into the sandbox environment and then production. This isn’t even a beta product, it’s a true MVP – something for friends and family. But it’s an opportunity to show the business and get that feedback that we’re looking for – is the UI ok? How does the flow look? And once the people driving business goals sit down and start playing with the product on that first demo, two weeks later, they’re hooked. And we explain – if you give us your suggestions, we can get them to staging and then onto production with a single click. It sells itself – we don’t need long speeches.  

The typical reaction we get is – “great, you’ve delivered 5% of what I really want. Come back when it’s 100% done.” And the product is a little underwhelming. But that’s because we’re not always sticking to the true definition of a minimum viable product (MVP). I always say, “If an MVP is not something you’re ashamed of, it’s not an MVP!” Companies like Google and Amazon are past masters at this – they throw something crude out there and see if it sticks. It’s like they’re not one company but 1,000 little startups. You’ve got to understand when to stop, and get that feedback.

I’ve seen customers go way down in the weeds and waste a ton of money on something that ends up just not being viable. One customer I worked with spent almost $250K and a year polishing and refactoring this mobile app endlessly, when we could have delivered something for about $80K – a year earlier! Think of how the market shifted in that time, all the insights we missed out on. Agile is all about small, iterative changes – but most companies are still failing at this. They’ll make small changes, and then gate them so they sit there for months. 

When we start seeing real progress is when the product is released ahead of deadline. That really captures a lot of attention – whoa, we wanted this app written in 15 months, you delivered the first version in two weeks – nine months in we can see we’re going to be done 4 months early because of our cadence.

So here’s my advice – start small. Let me give you one example – we have one customer that’s a classic enterprise – they’ve been around for 60 years, and it’s a very political, hierarchical climate, very waterfall oriented. They have 16 different workloads. Well, we’re really starting to make progress now in their DevOps transformation – but we never would have made it if we’d tried this all-in massive crusade effort. Instead, we took half of one workload, as a collection of features, and said – we’re going to take this piece and try something new. We implemented Agile sprints and planning, set up automated infrastructure, and CI/CD. Yeah, it ruffled some feathers – but no one could argue with how fast we delivered these features, and how much more stable they were, and how happy the customers were because we involved them in the process.

The biggest problem we had was – believe it or not – getting around some bad habits on having meetings for the sake of having meetings. So we had to set some standards – what makes for a successful meeting? What does a client acceptance meeting look like?

Even if you’re ‘just a developer’, or ‘just an ops guy’, you can create a lot of change by the way you engage with the customer, by documenting the pieces you fill in, by setting a high standard when it comes to quality and automation.  

Documentation: I find it really key to write some things down before we even begin work. When a developer gets a two week project, we make sure expectations are set clearly in documentation. That helps us know what the standards of success are, gets QA on the same page – it guides everything that we do.  

I also find it helps us corral the chaos caused by runaway libraries. We have baseline documentation for each project that sets the expectation of the tools we will use. Here, I’ll just say – it’s harder to catch this when you’re using a microservice architecture, where you have 200 repos to monitor for the JavaScript libraries they’re choosing. Last week, we found this bizarre PDF writer that popped up – why would we have two different PDF generators for the same app? So we had to refactor so we’re using a consistent PDF framework. That exposed a gap in our documentation, so we patched that and moved on.
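A baseline like this can also be checked mechanically. The sketch below is only an illustration, not the tooling described in the interview: the repo names, library names, and the `APPROVED` and `PURPOSE` tables are all hypothetical, and it assumes each repo’s dependency list has already been extracted from its package manifest.

```python
from collections import defaultdict

# Hypothetical baseline: one approved library per purpose.
APPROVED = {"pdf": "pdf-lib-a", "logging": "log-lib-a"}

# Hypothetical map of known libraries to the purpose they serve.
PURPOSE = {"pdf-lib-a": "pdf", "pdf-lib-b": "pdf", "log-lib-a": "logging"}


def find_violations(repo_deps):
    """Return {repo: [library, ...]} for libraries that serve a purpose
    covered by the baseline but are not the approved choice."""
    violations = defaultdict(list)
    for repo, deps in repo_deps.items():
        for lib in deps:
            purpose = PURPOSE.get(lib)
            if purpose is not None and APPROVED.get(purpose) != lib:
                violations[repo].append(lib)
    return dict(violations)


# Hypothetical dependency lists for two repos.
repos = {
    "invoicing": ["pdf-lib-a", "log-lib-a"],
    "reports": ["pdf-lib-b", "log-lib-a"],
}
print(find_violations(repos))
```

Run across every repo on a schedule, a check along these lines would flag a second PDF generator as soon as it appears rather than weeks later.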

Documentation is also a lifesaver when it comes to onboarding a new engineer. We can show them the history of the project, and the frameworks we’ve chosen, and why. Here’s how to use our error logging engine, this is where to find Git repos, etc. It’s kept very up to date, and much of it is customer facing. We present the design pattern we’ll be using, here are the test plans and how we’re going to measure critical paths and handle automated testing. That’s all set and done before Day 1 with the customer so expectations are in line with reality.

We do use a launch checklist, which might cover 80% of what comes up – but it seems like there are always some weird gotchas that crop up. We break up our best practices by type – for our Microsoft apps, IoT, monoliths, or mobile – each one with a slightly different checklist.

It’s kind of an art – you want just the right amount, not too much, not too little. When we err, I think we tend to over-document. Like most engineers, I tend to overdo it as I’m detail-oriented. But for us documentation isn’t an afterthought; it’s a set of guardrails. It sets the rules of engagement, defines how we’re measuring success. It’s saved our bacon, many times!

Microservices: You can’t just say ‘microservices are only for Netflix or the other big companies’. It’s not the size of the team, but the type of the project. You can have a tiny one-developer project and implement it very successfully with microservices. It does add quite a bit of overhead, and there’s a point of diminishing returns. We still use monolith-type approaches when it comes to throwaway proofs of concept, where you can just crank it out.

And it’s a struggle to keep these services discrete and finite. Let’s say you have a small application, how do you separate out the domain modules for your community area and say an event directory so they’re truly standalone? In the end you tend to create a quasi-ORM, where your objects have a high dependency on each other; the microservices look terrific at the app or the UI layer, but there’s a shared data layer. Or you end up with duplicated data, where the interpretation of ‘customer’ data varies so much from service to service.  

Logging is also more of a challenge – you have to put more thought into capturing and aggregating errors with your framework. 

But in general, microservices are definitely a winner and our choice of architecture. Isolation of functionality is something we really value in our designs; we need to make sure that changes to invoicing won’t have any effect on inventory management or anything else. It pays off in so many ways when it comes to scalability and reliability.  

Testing: We have QA as a separate functional team; there’s a ratio of 25 devs to every QA person. We make it clear that writing automated unit tests, performance tests, security tests – that’s all in the hands of the developers. But manual smoke tests and enforcing that the test plan actually does what it’s supposed to are all done by the QA dept. We’re huge fans of behavior driven development, where we identify a test plan, lay it out, the developer writes unit tests and QA goes through and confirms that’s what the client wanted.

With our environments, we do have a testing environment set up with dummy data; then we have a sandbox environment, with a 1 week old set of actual production data where we do performance and acceptance testing. That’s the environment the customer has full access to. We don’t do performance testing against production directly. We’re big fans of using software to mimic production loads – anywhere from 10 users/sec to 10K users/sec, along with mocks and fakes with our test layer design. 

Continuous Learning: To me continuous learning is really the heart of things. It goes all the way back to the honest retrospectives artifact in scrum – avoiding the blame game, documenting the things that can be improved at the project or process level. It’s never the fault of Dave, that guy who wrote the horrible code – why did we miss that as a best practice in our code review? Did we miss something in how we look at maintainability, security, performance? Are lead developers setting expectations properly, how can we improve in our training?

Blame is the enemy of learning and communication. The challenge for us is setting the expectation that failure is an expected outcome, a good thing that we can learn from. Let’s count the number of failures we’re going to have, and see how good our retrospectives can get. We’re going to fail, that’s OK – how do we learn from these failures?

Usually our chances of winning come down to one thing – a humble leader. If the person at the top can swallow their pride, knows how to delegate, and recognizes that it will take the entire team to be engaged and solve the problem – then a DevOps culture change is possible. But if the leader has a lot of pride, usually there’s not much progress that can be made.

Monitoring: Monitoring is too important to leave to the end of the project; that’s our finish line. So we identify what the KPIs are to begin with. Right now it revolves around three areas – performance (latency of requests), security (breach attempts), and application logs (errors returned, availability and uptime). For us, we ended up using New Relic for performance indicators, DataDog for their app layer KPIs, and Amazon’s Inspector. OWASP has a set of tools they recommend for scanning; we use these quite often for our static scans.
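Agreeing on KPIs up front amounts to writing down a limit for each one before launch. Here is a minimal sketch of that idea, with one threshold per area named above; the metric names and limits are hypothetical, and this plain-Python check stands in for whatever alerting rules the monitoring tools themselves would hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    limit: float  # alert when the observed value exceeds this


# Hypothetical limits, one per KPI area.
THRESHOLDS = [
    Threshold("p95_latency_ms", 500.0),         # performance
    Threshold("breach_attempts_per_hr", 20.0),  # security
    Threshold("error_rate_pct", 1.0),           # application logs
]


def breached(metrics):
    """Return the names of KPIs that exceed their limit.
    A KPI with no reading is reported too: missing data is itself a warning."""
    alerts = []
    for threshold in THRESHOLDS:
        value = metrics.get(threshold.name)
        if value is None or value > threshold.limit:
            alerts.append(threshold.name)
    return alerts


# Hypothetical readings: latency is over its limit, the rest are healthy.
print(breached({
    "p95_latency_ms": 620.0,
    "breach_attempts_per_hr": 3.0,
    "error_rate_pct": 0.2,
}))
```

Treating a missing reading as an alert is a deliberate choice here: a KPI that silently stops reporting is exactly the kind of gap that minimal logging leaves open.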

Sometimes of course we have customers that want to go cheap on monitoring. So, quite often, we’ll just go to app level errors; but that’s our bare minimum. We always log, sometimes we don’t monitor. We had this crop up this morning with a customer – after a year or more, we went live, but all we had was that minimal logging. Guess what, that didn’t help us much when the server went down! Going bare-bones on monitoring is something customers typically regret, because of surprises like that. Real user monitoring, like you can get with any cloud provider, is another thing that’s incredibly valuable for checking things like latency across every region.

Production Support by Developers: Initial on-call support is handled in-house by a separate Sysops team; we actually have it in our agreement with the customer that application developers aren’t a part of that on-call rotation. If something has made it through our testing and staging environments, that knocks out a lot of potential errors. So 90% of the time a bug in production is not caused by a code change, it’s something environmental – a server reboot, a firewall config change, an SSL cert expiring. We don’t want to hassle our developers with this. But we do have them handle some bug triage – always during business hours though.

Let’s just be honest here – these are two entirely separate disciplines, specialties. Sysops teams love ops as code and wading through server error logs – developers hate doing that work! So we separate out these duties. Yes, we sometimes get problems when we move code from a dev environment to QA – if so, there’s usually some information missing that the dev needs to add to his documentation in the handoff to sysops.  

And we love feature flags and canary releases. Just last week we rolled out an IoT project to 2000 residential homes. One feature we rolled out to only the Las Vegas homes to see how it worked. It works great – the biggest difficulty we find is documenting and managing who’s getting new features and when, so you know if a bug is coming from a customer in Group A or B.
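One way to keep track of who is in Group A or B is to make the assignment deterministic, so it can be recomputed later instead of looked up. The following is a sketch under assumptions, not the rollout mechanism used on that project: the feature name, home IDs, and region labels are hypothetical.

```python
import hashlib


def rollout_group(home_id, feature, percent):
    """Deterministically place a home in group 'A' (gets the feature)
    or 'B' (does not). The same inputs always give the same answer."""
    digest = hashlib.sha256(f"{feature}:{home_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in 0..99
    return "A" if bucket < percent else "B"


def is_enabled(home, feature, regions=(), percent=0):
    """Enable the feature for every home in a canary region,
    otherwise fall back to the percentage-based group."""
    if home["region"] in regions:
        return True
    return rollout_group(home["id"], feature, percent) == "A"


# Hypothetical homes: only the canary region gets the feature at 0 percent.
vegas_home = {"id": "home-0001", "region": "las-vegas"}
other_home = {"id": "home-0002", "region": "portland"}
for home in (vegas_home, other_home):
    print(home["id"], is_enabled(home, "new-schedule-ui", regions=("las-vegas",)))
```

Because the group is a pure function of the home and the feature name, a bug report can be mapped back to Group A or B after the fact without consulting a separate assignment table.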

Automation: For us, automating everything is the #1 principle. It reduces security concerns, drops the human error factor, and increases our ability to experiment faster with infrastructure and our codebase. Being able to spin up environments and roll out POCs is so much easier with automation. It all comes down to speed. The more automation you have in place, the faster you can get things done. It does take effort to set up initially; the payoff is more than worth it. Getting your stuff out the door as fast as possible with small, iterative changes is the only really safe way; that’s only possible with automation.

You would think everyone would be onboard with the idea of automation over manually logging on and poking around on VMs when there’s trouble, but – believe it or not – that’s not always the case. And sometimes our strongest resistance to this comes from the director/CTO level!

Security: First, we review compliance with the customer – half the game is education. We ask them if they’re aware of what GDPR is – for 90% of our customers, that’s just not on their radar, and it’s not really clear at this point what compliance means specifically in how we store user information. So we give them papers to review, and drop tasks into our sprints to support compliance for the developers and the sysops team with the CI/CD pipeline.

Gamedays: Most of my clients aren’t brave enough to run something like Simian Army or Chaos Monkey on live production systems! But we do gamedays, and we love them. Here’s how that works: 

I don’t let the team know what the problem is going to be, but one week before launch – on our sandbox environments, we do something truly evil to test our readiness. And we check how things went – did alerts get fired correctly by our monitoring tools? Was the event logged properly? How did the escalation process work, and did the right people get the information they needed fast enough to respond? Did they have the access they needed to make the changes? Were we able to use our standard release process to get a fix out to production? Did we have the right amount of redundancy on the team? Was the runbook comprehensive enough, and were the responders able to use our knowledgebase to track down similar problems in the past to come up with a remedy?  

The whole team loves this, believe it or not. We learn so much when things go bump in the night. Maybe we find a problem with auto healing, or there’s an opportunity to change the design so the environments are more loosely coupled. Maybe we need to clear up our logging, or tune our escalation process, or spread some more knowledge about our release pipeline. There’s always something, and it usually takes us at least a week to fold these lessons learned into the product before we do a hard launch. Gamedays are huge for us – so much so, we make sure they’re a part of our statement of work with the customer.

For one recent product, we did three Gamedays on sandbox and we felt pretty dialed in. So, one week before go-live, we injected a regional issue on production – which forced the team to duplicate the entire environment into a completely separate region using cold backups. Our SLA was 2 hours; the whole team was able to duplicate the entire production set from Oregon to Virginia datacenters in less than 45 minutes! It was such a great team win, you should have seen the celebration.  

 
 


 


Do you think you have a book in you? I do too.

You can’t wait for inspiration. You have to go after it with a club. – Jack London

Almost done with my novel, I can sense it. I think there’s about a month left until the first draft is done, and – if all goes well – all the revisions will be done and it’ll be published, perhaps December or January. I’m very proud of it so far, and I think – I hope – it’ll leave a mark. I wanted to share with you what I learned, because I almost waited too long.

I used to think you needed to be brilliant, or wait for inspiration in some café. Turns out, that’s really not the case. What I found is, it’s just a grind. You show up at the café, and you start writing, 8 in the morning – and you don’t get up until at least 2 pm. If you do that, you’ll have at least 1,000 words down – and maybe more like 3,000.

They may not be good. Some days you’ll struggle cranking out 1,000 wimpy little words, and it’ll be hot garbage. Other days, you’ll fly through, and it’ll sing off the page. Regardless – you do it every day. Five days a week, as best you can.

Guess what? After 6-9 months, you’ve got yourself a first draft of a nice little book there.

Of course, we’re not DONE yet. Now you’ve got to rewrite, where you take that pile of bricks and try to make it into a house. But, my friend, you are ALMOST there. And all it took was sitting down and writing that first page.

Do everything they tell you to do. Tell your friends that you’re writing a book, so you’re committed. (That was huge for me. Telling people I was a writer gave me a little ego boost and I found, over time, it actually became true.) Make it a topic you really like – you don’t want to spend a year or more of your life with something you aren’t truly interested in. And find a publisher before you invest too much time in your book, and (hopefully) get a contract. A good editor will help with guiding you so what you write will be worth reading. Even if you end up self-publishing, going through the work of putting together a proposal and an outline is so worth it.

For me, I really enjoyed the research phase. But if you’re not careful, you’ll spend all your time studying and looking through other people’s work – and not doing any of your own. So, on the bad days, sometimes I’d do very little writing, just research. But usually I’d force myself to write those 1,000 words first – and THEN treat myself with a book or video for research.

Just don’t wait too long. One of my favorite authors is Norman Maclean, who wrote “A River Runs Through It” and – posthumously – “Young Men and Fire”, both incredible classics. The tragic thing was, he started so late – when he was 71 years old! It’s such a terrible waste.

Don’t wait for inspiration. What you’ve got to say is something that needs to be shared, that will add value to the world. Set a goal, tell your friends about it, and start plugging away.

So, do you think you have a book in you? Something amazing and creative, something you’ve never seen anywhere before? I do too. And I can’t wait to read your first book!

DORA 2018 State of DevOps report is out!

Hey guys, the 2018 State of DevOps report from Puppet/DORA is out! As always, those guys have done an amazing job. You owe it to yourself to download it and check it over, and pass it along.

Here are the points I found most powerful:

  1. DevOps isn’t a fad; it’s proven to make companies faster and less wasteful in producing new features.
  2. Slower is not safer. Companies releasing every 1-6 months had abysmally slow recovery times.
  3. We can’t eliminate toil or manual work completely – but in low performing companies, it’s basically all we do. High performers rarely have it make up more than 30% of the workday.
  4. Outsourcing an entire function – like QA, or production support – remains a terrible idea. It represents a dramatic cap on innovation and ends up costing far more in delays than you’ll ever see with saved operational costs.
  5. “Shift Left” on security continues to grow in popularity – because it works. The best examples are where implementing it early is made as easy and effortless as possible.

More below. Check it out for yourself, it’s such great work and very easy to read!

 

The difference between the greats and the not-so-great continues to widen: We’ve heard executives describe DevOps as being a “buzz word” or a “fad”. Ten years into this movement, this seems more and more out of touch with reality. Companies that take DevOps seriously as part of their DNA perform better. They deploy code 46x more frequently; they’re faster to innovate (2,555 times faster lead time). And they do it more safely. Elite performers have 7x lower change failure rate, and can recover 2,604x faster.
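The four measures behind those multipliers (deployment frequency, lead time, change failure rate, and time to restore) are simple to compute from a deployment log. The sketch below is my own illustration with made-up records, not the report’s methodology: the `Deployment` fields are hypothetical, and it uses plain means where the report groups survey answers into ranges.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed: datetime                  # when the change was committed
    deployed: datetime                   # when it reached production
    failed: bool = False                 # did it degrade service?
    restored: Optional[datetime] = None  # when service was restored


def summarize(deploys, period_days):
    """Compute the four delivery measures over a period of `period_days`."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed - d.committed for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "mean_lead_time": sum(lead_times, timedelta()) / len(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
    }


# Made-up log: four deployments in four days, one of which failed.
t0 = datetime(2018, 9, 3, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour,
               failed=True, restored=t0 + day + 5 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 3 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 3 * hour),
]
for name, value in summarize(deploys, period_days=4).items():
    print(name, value)
```

Even a rough version of these numbers, tracked over a few months, makes it much easier to argue from data instead of from opinion.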

DevOps has been proven to lead to faster innovation and change AND produce higher quality work. Honestly, does that sound like a fad to you? (I wonder sometimes if the GM and Chrysler execs in the 1970s were saying the same thing about Toyota…)

(above image and all others copyright Puppet/DORA 2018)

Releasing infrequently for “safety” is anything but. Many organizations gate releases so they’re spread out over weeks or months, in an attempt to prevent bugs or defects. This backfires terribly; while bug rates may drop, it means their time to recover is disastrously slow. For example, companies that release every 1-6 months have the exact same MTTR – 1-6 months. (!!!!)

“When failures occur, it can be difficult to understand what caused the problem and then restore service. Worse, deployments can cause cascading failures throughout the system. Those failures take a remarkably long time to fully recover from. While many organizations insist this common failure scenario won’t happen to them, when we look at the data, we see five percent of teams doing exactly this—and suffering the consequences. At first glance, taking one to six months to recover from a system failure seems preposterous. But consider scenarios where a system outage causes cascading failures and data corruption, or when multiple unknown systems are compromised by intruders. Several months suddenly seems like a plausible timeline for full recovery.”

Toil and manual work: Elite and high performing orgs do far less manual work. Just look at the percent of people’s time wasted in low performing orgs doing things like hacking out manual configs on a VM, or smoketesting, or trying to push a deployment out the door using Xcopy. Someone at an elite, high performing company might spend 20-30% of their time doing this type of shovel work; at lower performing companies, it’s basically 100% plus of their time.

 

Think Twice Before You Outsource: The powerful example of Maersk shows the cost of outsourcing entire functions (like testing, or Operations) to external groups. The 2018 study shows that outsourcing an entire function leads to delays as work is batched and high-priority items wait on lower-priority work in queue. This is the famous handoff waste, and it runs directly against the key DevOps principle of cross-functional teams:

“Analysis shows that low-performing teams are 3.9 times more likely to use functional outsourcing (overall) than elite performance teams, and 3.2 times more likely to use outsourcing of any of the following functions: application development, IT operations work, or testing and QA. This suggests that outsourcing by function is rarely adopted by elite performers. …Misguided performers also report the highest use of outsourcing, which likely contributes to their slower performance and significantly slower recovery from downtime. When working in an outsourcing context, it can take months to implement, test, and deploy fixes for incidents caused by code problems.”

In Maersk’s case, just the top three features represented a delay cost of $7 million per week. So while outsourcing may seem to represent a chance to cut costs, data shows that the delay costs and drag on your deployment rate may far outweigh any supposed savings.
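The arithmetic behind that claim is worth spelling out. In this sketch only the $7 million per week figure comes from the example above; the four-week delay and the $2 million in savings are hypothetical numbers chosen purely for illustration.

```python
def delay_cost(weekly_value, weeks_delayed):
    """Value lost while finished work waits in someone else's queue."""
    return weekly_value * weeks_delayed


def net_effect(weekly_value, weeks_delayed, savings):
    """Savings from outsourcing minus the cost of the delay it causes.
    A negative result means the delay outweighs the savings."""
    return savings - delay_cost(weekly_value, weeks_delayed)


WEEKLY_VALUE = 7_000_000   # from the Maersk example
WEEKS_DELAYED = 4          # hypothetical
SAVINGS = 2_000_000        # hypothetical

print(delay_cost(WEEKLY_VALUE, WEEKS_DELAYED))
print(net_effect(WEEKLY_VALUE, WEEKS_DELAYED, SAVINGS))
```

Under these assumed numbers, a single month of queueing wipes out the savings many times over.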

Lean product management: the survey went into some detail about the qualities of Lean Product Management that they found favorable. Here’s a snapshot:

Security by audit versus part of the lifecycle: Great thoughts on how shifting left on security is a key piece of delivery. They recommend making security easy, with frameworks of preapproved libraries, packages and toolchains, and reference examples of implementation, versus late-breaking audits and the disruption and delays they cause:

“Low performers take weeks to conduct security reviews and complete the changes identified. In contrast, elite performers build security in and can conduct security reviews and complete changes in just days. …Our research shows that infosec personnel should have input into the design of applications and work with teams (including performing security reviews for all major features) throughout the development process. In other words, we should adopt a continuous approach to delivering secure systems. In teams that do well, security reviews do not slow down the development process.”

 

So, that’s my book report. Loved it, as always, though I’m not onboard with everything there. For example, they’ve coined a new phrase – SDO, “Software Delivery and Operational Performance.” Sorry, but to me that’s reliability – the “R” in SRE, which has been around since 2003 in the software world. I don’t see the need for another acronym around that. And they’re splitting hairs a little when separating out automated testing from continuous testing, but I might be wrong on that.

As usual, it’s brilliant, data-driven, and really sets the pace for the entire growing movement of DevOps. LOVE, love the work that Puppet and DORA are producing – keep it up guys!

 

 

 

 

DevOps Stories – Interview with Anne Steiner

 


Anne Steiner is the Vice President of Product Agility for cPrime. In her role, Anne sets up cross-team discovery cadences, scales product thinking in large organizations, and teaches and mentors stakeholders in leadership and product roles. Anne and her team have helped companies of all shapes and sizes to transform from traditional project thinking to become product-driven organizations that emphasize continuous learning. She also actively promotes building communities of practitioners in the Minneapolis/St. Paul area and frequently speaks at national and regional events. She served in the United States Marine Corps as a logistics/embarkation non-commissioned officer in the early 2000s.


 

 

You know, people think of the military as hierarchical, rigid – but in my experience the military is incredibly flexible and dynamic. It has to be to survive in war, and war is becoming more dynamic. Decision making keeps getting pushed down to lower and lower levels.

Just for example, look at how we start with boot camp. It starts with dehumanization – with the goal of teaching people that we are all the same; nobody’s special. We take away your clothes, and if you’re a guy we shave off your hair. Then we teach the lesson – you do everything as a team. The USMC sets up tasks that are impossible to complete alone in the time allowed. For example, the beds are so close together that if you’re asked to make a bed, your rack mate has to help you with one side of the lower bunk, and then you help her with your side of her bunk. The lesson is, nobody succeeds alone – in boot camp, you can be perfectly right and still get screamed at. I remember once, I made my bed perfectly; the corners were good, and I still got screamed at because I had known what needed to be done and I didn’t help my teammate. The whole process is to drill into your head – this is your family now – you must succeed as a team.


The military’s approach to requirements: Besides shared values, the concept of how orders are delivered in the military has some application to DevOps. In the military, there’s a separation of concerns between the officers who give orders and the enlisted people who carry them out – similar to the division between team members and management. These two groups have very different points of view, and misunderstandings or conflicts could hamper an operation or cost lives. To address this, nothing significant happens without a written order describing the commander’s intent. It’s a standard five-paragraph order that follows the SMEAC format – Situation, Mission, Execution, Administration and Logistics, and Command and Signal.

Now, the military doesn’t expect its people to document every possible scenario or to follow the words in the order blindly – because we need our people to make independent decisions as the situation inevitably changes mid-operation. So we don’t fill in all the details but provide the high-level intent. The order describes what the commander wants to accomplish, the overall goals and the time frame – you are following orders as long as you’re following the intent and haven’t violated some other direction provided. At cPrime we do the same thing, where we teach product teams something called collaborative framing. That describes what we’re doing, why we’re doing it, and who we are doing it for. That’s pretty similar to the way orders are used in the Marines – the orders provide the high-level strategy and context, and people are allowed to fill in the implementation details later.

I wish this happened more often in the development world. We shouldn’t feel like we have to spoon-feed everything to dev teams with detailed requirements – what if we just gave them the intent? We could define the operating requirements and the business goals, and allow them to figure out how to solve the problem.

You want to be told why. A lot of times we aren’t told “why”, just “what” as developers. That’s what surprised me about the military – there was never a leader that I worked with that I couldn’t ask why, in a respectful manner, and be given context. That helps you understand the mission. It always surprised me how open leadership was to questions about orders.

Now I should say – orders aren’t open to question or debate all the time. Sometimes in a crunch we need orders to be followed without question; but that’s actually not the norm, contrary to what most people think.

 

Keys to Success: What separates out the successful orgs? I find three traits winning organizations have in common:

  • Bold leadership that’s willing to take risks
  • A culture of agility and learning
  • Starting with a small success story

DevOps culture changes obviously come easier with smaller companies; in larger orgs you have to find a pocket where it’s okay to experiment or where a bold leader can nurture and shelter an effort. Once you get to that point where you can start telling stories – we hit this obstacle, and here we hit some snags, but look at these results – that’s where you start to see culture change. You can’t just come in the door and say “We’re going to take risks and become a learning org!”, because you haven’t proven yourself yet. I’m always looking for that right kind of leadership protection, a willingness to experiment, and a group that wants to learn and try something different. That’s your beachhead!

A Single Mission: One of the key factors I see in many organizations that succeed with their DevOps transformations is a legitimate set of shared measures, a shared mission. In the USMC, we have a standard mission – to make Marines, and to win battles. That’s the single mission, and if something in the orders doesn’t relate to that directly – we throw it out. In the software world, it’s not that simple. Every product has a vision, every company has a mission statement. But how many can articulate that simply? Netflix, for example, does a great job with a shared mission – their shared goals are to retain subscriptions and to increase subscriptions. Whatever you do needs to be aligned against one of these. Can you prove that your project aligns against that? Otherwise you’ll see antipatterns like IT teams saying we have 100% uptime – yeah, that’s great, but you’ve got a crappy product and your customers are unhappy. That’s not product thinking. Product thinking means a clear common goal that everyone can rally around.

 

Flexibility and Innovation: There are a lot of people out there writing books on Agile, and quite a few are well written. But if you slam it on the table, it’s not going to work like it says in the book. Then what are you going to do? The teams that are successful are the ones that can implement this (or better yet, the parts of it that they think will add value), fail, modify it to their situation, and win anyway. That’s one of the things I love about the way the Agile Manifesto was written, because it is principle-based. We see a lot of organizations struggle because they bring in some “expert” who comes in with a checklist and says, no, you’re not doing scrum unless you’re doing these things. Well, who cares, as long as you’re delivering awesome products?

As a culture, the USMC takes pride in always being asked to do more with less – to us, that “Adapt, Improvise and Overcome” mantra is a real point of pride. I think it comes in part from how we were founded. The Marine Corps has the smallest budget of the branches. There’s not a lot of money flowing through the organization. So that helps us – we realize, no matter what happens, it’s probably not going to work the first time – we’ll adapt and change. Traditionally, we think in the software world that change is bad, that it’s a risk we have to limit. Well, change is inevitable, and we should expect it – and win even if we have to come up with a new solution on the fly.

DevOps Stories – Sam Guckenheimer of Microsoft

The following content is shared from an interview with Sam Guckenheimer, product owner for Visual Studio Team Services. When people ask us “how Microsoft did it” with our DevOps transformation, we often think of the lessons Sam shared with us during our talk. There is so much to learn from here that can help other companies in making their own journey to better, faster, and safer delivery of value!


 

 

One thing I want to start with – it really annoys me when I read grandiose claims that DevOps is broken in some way. We know that’s just not the case – Gartner tells us at least half of enterprises have something going on with DevOps and they all want to do more. If you look at Agile, which began with the Agile Manifesto back in 2001 – and compare it with where it was as a movement a decade later in 2011 – well, that would look very much like where we are at today, about 10 years after DevOps first began as a concept back in 2009. The trends are really clear, and our success rate and the maturity of the tools and processes are only going to go up.

Avoid Massive Reorgs: It’s just not true when some say you have to “blow up the organization” to make DevOps work. Change is necessary – you have to get rid of all the handoffs and the waste, and really follow the Lean model by removing the intermediaries between developers, production, and customers. But that doesn’t mean you need to make drastic moves, and that’s not how we did it at Microsoft. It can be done in an evolutionary way.

Most companies don’t have the luxury of saying, “let’s blow it up” and just jettison decades of code with their legacy applications and start over. That’s your lifeblood! I know that was true with us on the Visual Studio team; we had to go about things in a very gradual way so we didn’t threaten the jugular of our company.

Find Your North Star: Six years ago we found our North Star – how we wanted to go about delivering value using the DevOps mindset – and we pointed to it, saying – “we want to be a world class engineering organization”. Everything we’ve done since then, every major decision we’ve made, has been built around measuring our progress towards that mission.

Jez Humble has joked a few times about some companies trying to “sprinkle magical microservices fairy dust” over things to magically get cloud services architecture. I have to say – there was no fairy dust for us. It required progressive change, some very conscious hard engineering changes, and walking the walk.

Just for example, overhauling our test portfolio and moving to Git took three years. We kept deprecating and replacing older, slow tests with faster ones incrementally – sprint by sprint, test by test. Now it takes us about 7 minutes to run 70K unit tests before a developer commits to master. But the value is incredible for us – before that, we had these long-running integration tests that had never run completely green, that always required manual intervention and were killing our release flow.

Everything – our refactoring from monolith to microservices, our safe deployment practices, building a lifecycle culture, even our datacenter standup automation – required a lot of work and a multi-year commitment, persistence despite setbacks. We knew though where the “North Star” was and we were committed. Our approach was – set the goal, measure the progress, and keep going until we get there.

Production Support: Shifting to a production support mindset was a big change and of course not everyone was onboard, especially at first. We knew that would be our most important and critical win – making sure the delivery teams were onboard and happy with what was going on. We measured this as one of our first KPIs. We would do regular surveys of engineering satisfaction and go into depth about their jobs, how the tooling was supporting their jobs, and how the process was supporting their jobs – and what we saw was a steady rise in satisfaction.


Just as an example of this, one of the things we measured was alerting frequency – are we getting to the right person the first time? That’s something we are always watching – if you’re waking people up at 2 in the morning, it had better be the right person. We needed to make sure that we were paying attention to the things that matter to people’s lives and their satisfaction with their jobs.
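As a rough illustration of the “right person the first time” measure, here is a minimal Python sketch. The `Page` record, its field names, and the sample incidents are all hypothetical; the interview does not describe how the team actually computes this, so treat it as one possible way to frame the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One alert that woke somebody up (hypothetical record shape)."""
    incident_id: str
    paged_engineer: str      # who the alert reached first
    resolving_engineer: str  # who actually owned the fix


def first_page_accuracy(pages):
    """Fraction of incidents where the first person paged owned the fix."""
    if not pages:
        return 0.0
    hits = sum(1 for p in pages if p.paged_engineer == p.resolving_engineer)
    return hits / len(pages)


# Made-up sample data: three of the four pages reached the right person.
pages = [
    Page("inc-1", "alice", "alice"),
    Page("inc-2", "bob", "carol"),
    Page("inc-3", "carol", "carol"),
    Page("inc-4", "dave", "dave"),
]

accuracy = first_page_accuracy(pages)
print(f"{accuracy:.0%} of pages reached the right person first")
```

Tracked sprint over sprint, a number like this gives the team something concrete to discuss alongside the satisfaction surveys.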

When you’re genuine, you get a genuine response. This all helps build that high-trust culture that Gene Kim and others have emphasized as key.

The concept of servant leadership has been a big part of our change; good managers care about their team and look for ways to make their jobs easier. That’s the Andon cord philosophy – anyone on the floor can pull that cord, stop the line if needed – and the manager comes over, the root cause is identified and rolled into the process so future incidents don’t happen. So in our case – we don’t close out livesite incidents for example until the fix is identified and in the backlog so it won’t happen again.

Setting Goals and Metrics: Our North Star remains fixed but we are always redefining how we want to get there. Every 6 months we select, epic by epic, 3 or 4 goals that define success for us over the next six months, and the specific metrics that will define them. We publish these and they’re flowed all the way up the management chain. Those goals and metrics on an epic level don’t change for those six months. Each person on the feature crews knows which epics they’re working on and can ask each sprint – what are the next few things we need to do to move the needle along these goals? They look ahead about 3 sprints in terms of what they’re trying to do – no more than that. That level of planning is key for us to make progress in an iterative way and minimize disruption.

In the beginning, we thought it was really not a big deal to figure out the metrics and focus on the right thing and so forth. It turns out that finding the right metrics is as complicated as designing the right feature. It’s really not obvious what to measure or what you’re striving for. Very frequently, you don’t have an out-of-the-box way of doing the telemetry – so you need to instrument for the business APIs you want.

A really clear example on this – one of the metrics that we’re interested in is, how many developers are working on projects that are doing continuous delivery to Azure? That’s a very hard thing to count. You have to make several leaps of instrumentation and joins in order to answer that. Asking the question clearly and getting a way of gathering data on it is a real engineering problem – and one that typically is made to sound much simpler and less of an obstacle than it really is on the web or in books about lean customer analytics.
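To make the “leaps of instrumentation and joins” concrete, here is a small Python sketch of that kind of question. Everything in it is an assumption made for illustration: the two event streams, their field names, and especially the working definition of “continuous delivery” (at least `min_deploys` deployments to the target in the window). The interview does not say how the team actually defines or counts this.

```python
from collections import Counter
from datetime import date

# Hypothetical source-control telemetry: who committed to which project.
commits = [
    {"developer": "alice", "project": "web"},
    {"developer": "bob", "project": "web"},
    {"developer": "carol", "project": "batch"},
    {"developer": "alice", "project": "mobile"},
    {"developer": "dave", "project": "mobile"},
]

# Hypothetical release telemetry: one row per production deployment.
deployments = [
    {"project": "web", "target": "azure", "day": date(2018, 5, 1)},
    {"project": "web", "target": "azure", "day": date(2018, 5, 3)},
    {"project": "web", "target": "azure", "day": date(2018, 5, 8)},
    {"project": "web", "target": "azure", "day": date(2018, 5, 10)},
    {"project": "batch", "target": "azure", "day": date(2018, 5, 2)},
    {"project": "mobile", "target": "on-prem", "day": date(2018, 5, 4)},
]


def developers_doing_cd(commits, deployments, target="azure", min_deploys=4):
    """Join commits to deployments and return the developers on CD projects.

    A project counts as "doing continuous delivery" here only if it deployed
    to `target` at least `min_deploys` times. That threshold is an assumed
    definition, and choosing it is part of the measurement problem.
    """
    deploy_counts = Counter(
        d["project"] for d in deployments if d["target"] == target
    )
    cd_projects = {p for p, n in deploy_counts.items() if n >= min_deploys}
    return {c["developer"] for c in commits if c["project"] in cd_projects}


print(sorted(developers_doing_cd(commits, deployments)))
```

Even in this toy version the answer depends heavily on the definition: dropping `min_deploys` to 1 pulls in the `batch` project and changes the count, which is the kind of ambiguity that makes the real metric hard.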

This goes way, way beyond your standard # of site visitors or simple generic use cases for a website. Until you start getting down to brass tacks and define what the things are that we care about as a business and why – it’s difficult to appreciate how challenging it is to come up with the right measurables.

Value Stream Mapping: I’m going to shock you a little here – we don’t do normal value stream mapping here. My observation is that value stream mapping is really effective when you want to get people on the same page and get some momentum going towards a DevOps movement. Once you show people – wow, it takes us 60 days to get something to production, and most of that is wait time – 5 days for approval here, 7 days for testing here – that’s great to get everyone to see the elephant in the room. It never fails to shock people once they see how huge that bucket of idle time is!
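The arithmetic behind that shock is simple enough to sketch in a few lines of Python. The stage names and most of the numbers below are made up so that they add up to the 60-day example; only the 5-day approval wait and the 7-day testing wait come from the text above.

```python
# Each stage: (name, process_days, wait_days). Hypothetical figures chosen
# to total 60 days; only the approval and testing waits are from the text.
stages = [
    ("development", 6, 10),
    ("approval", 1, 5),
    ("testing", 3, 7),
    ("staging", 1, 14),
    ("release", 1, 12),
]


def summarize(stages):
    """Return (lead_time, wait_time, flow_efficiency) for a value stream."""
    process = sum(p for _, p, _ in stages)
    wait = sum(w for _, _, w in stages)
    lead = process + wait
    # Flow efficiency: share of the lead time spent actually working.
    return lead, wait, process / lead


lead, wait, efficiency = summarize(stages)
print(f"lead time: {lead} days, idle: {wait} days, "
      f"flow efficiency: {efficiency:.0%}")
```

With these assumed numbers, 48 of the 60 days are idle time, which is the “bucket” a value stream map makes visible.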

For us, we’re past that initial shock phase. We focus heavily on all the things that value stream mapping attacks in terms of handoffs, idle time versus process time, etc – but it is definitely not something you need to do on an ongoing basis, in my opinion.

Two Key Antipatterns: I see two key failings that sometimes trip organizations up. First, people often think in terms of formulas – you need to do X with the people, Y with the process, and Z with the tools – and think of each of these as independent pillars that you can tackle one at a time in phases. That ends up being counterproductive, making things more complicated and stretching out the effort, because in reality all these things are interrelated and need to be thought of together.

My advice is to fight the tendency to take a single practice, however good, and try to implement it in isolation. Think in terms of all three columns as supporting a single building together; each improvement should touch on people, process, and tools in some way and make it a little better. Focus on the quick wins – try to stairstep your maturity, building something small that quickens that release cycle and delivers feedback faster.

The second antipattern is not getting the right balance of leadership and delegation. You need to have obvious skin in the game from leadership, and initiative from individual practitioners. Think back to that great book “Drive” by Dan Pink, which stressed the value of Autonomy, Mastery, and Purpose. You are going to need to spark people and get them enthusiastic, active, and feeling like they control their destiny – autonomy.

It’s really part art and part science, because that autonomy has to be balanced with purpose, which is driven consistently and forcefully by management. And if you look at most of the current execs at Microsoft, you will see that they both practice high empathy and engage deeply on the technical side.

Mission is key for us but it goes beyond just a few words or a slogan. We put up guardrails, very clear rules of the road that specify “here is what you need to do to check your code into master.” We have a very clear definition of done that is common to every team – “code delivered with tests and telemetry and deployed in production worldwide.”

This is the exact opposite of “it works on my machine” – and everyone knows it. If you’re doing new work, there’s a set of common services we provide, including sample code and documentation. So no one has to reinvent the wheel when it comes to telemetry for example – you might improve on it, but you would never have to deliver this from scratch, it’s reused from a common set of services.