
DevOps Stories – Interview with Anne Steiner

 


Anne Steiner is the Vice President of Product Agility for cPrime. In her role, Anne sets up cross-team discovery cadences, scales product thinking in large organizations, and teaches and mentors stakeholders in leadership and product roles. Anne and her team have helped companies of all shapes and sizes transform from traditional project thinking into product-driven organizations that emphasize continuous learning. She also actively promotes building communities of practitioners in the Minneapolis/St. Paul area and frequently speaks at national and regional events. She served in the United States Marine Corps as a logistics/embarkation non-commissioned officer in the early 2000s.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

 

You know, people think of the military as hierarchical, rigid – but in my experience the military is incredibly flexible and dynamic. It has to be to survive in war, and war is becoming more dynamic. Decision making keeps getting pushed down to lower and lower levels.

Just for example, look how we start with boot camp. It starts with dehumanization – with the goal of teaching people that we are all the same; nobody’s special. We take away your clothes, if you’re a guy we shave off your hair. Then we teach the lesson – you do everything as a team. The USMC sets up tasks that are impossible to complete in the time allowed alone. For example, the beds are so close together that if you’re asked to make a bed – your rack mate has to help you with one side of the lower bunk, and then you help her with your side of her bunk. The lesson is, nobody succeeds alone – in boot camp, you can be perfectly right and still get screamed at. I remember once, I made my bed perfectly; the corners were good, and I still got screamed at because I had known what needed to be done and I didn’t help my teammate. The whole process is to drill into your head – this is your family now – you must succeed as a team.


The military’s approach to requirements: Besides shared values, the concept of how orders are delivered in the military has some application to DevOps. In the military, there’s a separation of concerns between the officers who give orders and the enlisted people who carry them out – similar to the division between team members and management. These two groups have very different points of view, and misunderstandings or conflicts could hamper an operation or cost lives. To address this – nothing significant happens without a written order describing the commander’s intent. It’s a standard five-paragraph order that follows the SMEAC format – Situation, Mission, Execution, Administration and Logistics, and Command and Signal.

Now, the military doesn’t expect its people to document every possible scenario or to follow the words in the order blindly – because we need our people to make decisions independently as the situation inevitably changes mid-operation. So we don’t fill in all the details; we provide the high-level intent. The order describes what the commander wants to accomplish, the overall goals and the time frame – you are following orders as long as you’re following the intent and haven’t violated some other direction provided. At cPrime we do the same thing, where we teach product teams something called collaborative framing. That describes what we’re doing, why we’re doing it, and who we are doing it for. That’s pretty similar to the way orders are used in the Marines – the orders provide the high-level strategy and context, and people are allowed to fill in the implementation details later.

I wish this happened more often in the development world. We shouldn’t feel like we have to spoon feed everything to dev teams with detailed requirements – what if we just gave them the intent? We could define the operating requirements, the business goals, and allow them to figure out how to solve the problem.

You want to be told why. As developers, a lot of times we’re told “what” but not “why.” That’s what surprised me about the military – there was never a leader I worked with that I couldn’t ask why, in a respectful manner, and be given context. That helps you understand the mission. It always surprised me how open leadership was to questions about orders.

Now I should say – orders aren’t open to question or debate all the time. Sometimes in a crunch we need orders to be followed without question; but that’s actually not the norm, contrary to what most people think.

 

Keys to Success: What separates out the successful orgs? I find three traits winning organizations have in common:

  • Bold leadership that’s willing to take risks
  • A culture of agility and learning
  • Starting with a small success story

DevOps culture changes obviously come easier with smaller companies; in larger orgs you have to find a pocket where it’s okay to experiment or where a bold leader can nurture and shelter an effort. Once you get to that point where you can start telling stories – we hit this obstacle, and here we hit some snags, but look at these results – that’s where you start to see culture change. You can’t just come in the door and say “We’re going to take risks and become a learning org!”, because you haven’t proven yourself yet. I’m always looking for that right kind of leadership protection, a willingness to experiment, and a group that wants to learn and try something different. That’s your beachhead!

A Single Mission: One of the key factors I see in organizations with successful DevOps transformations is having a legitimate set of shared measures, a shared mission. In the USMC, we have a standard mission – to make Marines, and to win battles. That’s the single mission, and if something in the orders doesn’t relate to that directly – we throw it out. In the software world, it’s not that simple. Every product has a vision, every company has a mission statement. But how many can articulate that simply? Netflix does a great job with a shared mission, for example – their shared goals are to retain subscriptions and to increase subscriptions. Whatever you do needs to be aligned against one of those. Can you prove that your project aligns against that? Otherwise you’ll see antipatterns like IT teams saying we have 100% uptime – yeah, that’s great, but you’ve got a crappy product and your customers are unhappy. That’s not product thinking – there’s no clear, common goal that everyone can rally around.

 

Flexibility and Innovation: There are a lot of people out there writing books on Agile, and quite a few are well written. But if you slam one on the table, it’s not going to work like it says in the book. Then what are you going to do? The teams that are successful are the ones that can implement it – or, better yet, the parts of it they think will add value – fail, modify it to their situation, and win anyway. That’s one of the things I love about the way the Agile Manifesto was written: it is principle based. We see a lot of organizations struggle because they bring in some “expert” with a checklist who says, no, you’re not doing Scrum unless you’re doing these things. Well, who cares, as long as you’re delivering awesome products?

As a culture, the USMC takes pride in always being asked to do more with less – to us, that “Adapt, Improvise and Overcome” mantra is very real. I think it comes in part from how we were founded. The Marine Corps has the smallest budget of the branches; there’s not a lot of money flowing through the organization. So that helps us – we realize, no matter what happens, it’s probably not going to work the first time – we’ll adapt and change. Traditionally, in the software world we think that change is bad, a risk we have to limit. Well, change is inevitable; we should expect it – and win even if we have to come up with a new solution on the fly.


DevOps Stories – Sam Guckenheimer of Microsoft

The following content is shared from an interview with Sam Guckenheimer, product owner for Visual Studio Team Services. When people ask us “how Microsoft did it” with our DevOps transformation, we often think of the lessons Sam shared with us during our talk. There is so much to learn from here that can help other companies in making their own journey to better, faster, and safer delivery of value!

These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

 

One thing I want to start with – it really annoys me when I read grandiose claims that DevOps is broken in some way. We know that’s just not the case – Gartner tells us at least half of enterprises have something going on with DevOps and they all want to do more. If you look at Agile, which began with the Agile Manifesto back in 2001 – and compare it with where it was as a movement a decade later in 2011 – well, that would look very much like where we are at today, about 10 years after DevOps first began as a concept back in 2009. The trends are really clear, and our success rate and the maturity of our tools and processes are only going to go up.

Avoid Massive Reorgs: It’s just not true, as some say, that you have to “blow up the organization” to make DevOps work. Change is necessary – you have to get rid of all the handoffs and the waste, and really follow the Lean model by removing the intermediaries between developers and production, and between developers and customers. But that doesn’t mean you need to make drastic moves, and that’s not how we did it at Microsoft. It can be done in an evolutionary way.

Most companies don’t have the luxury of saying, “let’s blow it up” and just jettison decades of code with their legacy applications and start over. That’s your lifeblood! I know that was true with us on the Visual Studio team; we had to go about things in a very gradual way so we didn’t threaten the jugular of our company.

Find Your North Star: Six years ago we found our North Star – how we wanted to go about delivering value using the DevOps mindset – and we pointed to it, saying – “we want to be a world class engineering organization”. Everything we’ve done since then, every major decision we’ve made, has been built around measuring our progress towards that mission.

Jez Humble has joked a few times about some companies trying to “sprinkle magical microservices fairy dust” over things to magically get cloud services architecture. I have to say – there was no fairy dust for us. It required progressive change, some very conscious hard engineering changes, and walking the walk.

Just for example, overhauling our test portfolio and moving to Git took three years. We kept deprecating and replacing older, slow tests with faster ones incrementally – sprint by sprint, test by test. Now it takes us about 7 minutes to run 70K unit tests before a developer commits to master. But the value is incredible for us – before that, we had these long-running integration tests that had never run completely green, that always required manual intervention, and that were killing our release flow.

Everything – our refactoring from monolith to microservices, our safe deployment practices, building a lifecycle culture, even our datacenter standup automation – required a lot of work and a multi-year commitment, persistence despite setbacks. We knew though where the “North Star” was and we were committed. Our approach was – set the goal, measure the progress, and keep going until we get there.

Production Support: Shifting to a production support mindset was a big change, and of course not everyone was onboard, especially at first. We knew that would be our most important and critical win – making sure the delivery teams were onboard and happy with what was going on. We measured this as one of our first KPIs. We would do regular surveys of engineering satisfaction and go into depth about their jobs and how the tooling and the process were supporting them – and what we saw was a steady rise in satisfaction.


Just for example on this, one of the things we measured was alerting frequency – are we getting to the right person the first time? That’s something we are always watching – if you’re waking people up at 2 in the morning, it had better be the right person. We needed to make sure that we are paying attention to the things that matter to people’s lives and their satisfaction with their jobs.

When you’re genuine, you get a genuine response. This all helps build that high-trust culture that Gene Kim and others have emphasized as key.

The concept of servant leadership has been a big part of our change; good managers care about their team and look for ways to make their jobs easier. That’s the Andon cord philosophy – anyone on the floor can pull that cord, stop the line if needed – and the manager comes over, the root cause is identified and rolled into the process so future incidents don’t happen. So in our case – we don’t close out livesite incidents for example until the fix is identified and in the backlog so it won’t happen again.

Setting Goals and Metrics: Our North Star remains fixed, but we are always redefining how we want to get there. Every six months we select, epic by epic, 3 or 4 goals that define success for us over the next six months, and the specific metrics that will measure them. We publish these and they flow all the way up the management chain. Those goals and metrics at the epic level don’t change for those six months. Each person on the feature crews knows which epics they’re working on and can ask each sprint – what are the next few things we need to do to move the needle along these goals? They look ahead about three sprints in terms of what they’re trying to do – no more than that. That level of planning is key for us to make progress in an iterative way and minimize disruption.

In the beginning, we thought it was really not a big deal to figure out the metrics and focus on the right thing and so forth. It turns out that finding the right metrics is as complicated as designing the right feature. It’s really not obvious what to measure or what you’re striving for. Very frequently, you don’t have an out-of-the-box way of doing the telemetry – so you need to instrument for the business APIs you want.

A really clear example on this – one of the metrics that we’re interested in is, how many developers are working on projects that are doing continuous delivery to Azure? That’s a very hard thing to count. You have to make several leaps of instrumentation and joins in order to answer that. Asking the question clearly and getting a way of gathering data on it is a real engineering problem – and one that typically is made to sound much simpler and less of an obstacle than it really is on the web or in books about lean customer analytics.

This goes way, way beyond your standard # of site visitors or simple generic use cases for a website. Until you start getting down to brass tacks and define what the things are that we care about as a business and why – it’s difficult to appreciate how challenging it is to come up with the right measurables.

Value Stream Mapping: I’m going to shock you a little here – we don’t do normal value stream mapping here. My observation is that value stream mapping is really effective when you want to get people on the same page and get some momentum going towards a DevOps movement. Once you show people – wow, it takes us 60 days to get something to production, and most of that is wait time – 5 days for approval here, 7 days for testing here – that’s great to get everyone to see the elephant in the room. It never fails to shock people once they see how huge that bucket of idle time is!
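To make the arithmetic behind that shock concrete, here is a minimal Python sketch of a value stream summary. The stage durations are illustrative (only the 5-day approval wait and 7-day testing wait echo the example above); a real map would use measured times.

```python
# Minimal sketch of the arithmetic behind a value stream map: how much of the
# end-to-end lead time is actual work versus idle waiting. Stage durations are
# illustrative; only the 5-day approval and 7-day testing waits echo the text.

stages = [
    {"name": "development",    "process_days": 10, "wait_days": 0},
    {"name": "approval queue", "process_days": 1,  "wait_days": 5},
    {"name": "test queue",     "process_days": 3,  "wait_days": 7},
    {"name": "release window", "process_days": 1,  "wait_days": 33},
]

process_time = sum(s["process_days"] for s in stages)
wait_time = sum(s["wait_days"] for s in stages)
total = process_time + wait_time

print(f"{total} days end to end; {wait_time} days idle ({wait_time / total:.0%} wait time)")
```

Run against numbers like these, the output is “60 days end to end; 45 days idle (75% wait time)” – the bucket of idle time is usually the headline.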

For us, we’re past that initial shock phase. We focus heavily on all the things that value stream mapping attacks in terms of handoffs, idle time versus process time, etc – but it is definitely not something you need to do on an ongoing basis, in my opinion.

Two Key Antipatterns: I see two key failings that sometimes trip organizations up. First, people often think in terms of formulas – you need to do X with the people, Y with the process, and Z with the tools – and treat each of these as an independent pillar that you can tackle one at a time, in phases. That ends up being counterproductive, making things more complicated and drawn out, because in reality all these things are interrelated and need to be thought of together.

My advice is to fight the tendency to take a single practice, however good, and try to implement it in isolation. Think in terms of all three columns as supporting a single building together; each improvement should touch on people, process, and tools in some way and make it a little better. Focus on the quick wins – try to stairstep your maturity, building something small that quickens that release cycle and delivers feedback faster.

The second antipattern is not getting the right balance of leadership and delegation. You need to have obvious skin in the game from leadership, and initiative from individual practitioners. Think back to that great book “Drive” by Dan Pink, which stressed the motivating power of Autonomy, Mastery, and Purpose. You are going to need to spark people and get them enthusiastic, active, and feeling like they control their destiny – that’s autonomy.

It’s really part art, part science, because that autonomy has to be balanced with purpose, which is driven consistently and forcefully by management. And if you look at most of the current execs at Microsoft, you will see that they practice high empathy and engage deeply on technical matters.

Mission is key for us, but it goes beyond just a few words or a slogan. We put up guardrails – very clear rules of the road that specify “here is what you need to do to check your code into master.” We have a very clear definition of done that is common to every team – “code delivered with tests and telemetry and deployed in production worldwide.”

This is the exact opposite of “it works on my machine” – and everyone knows it. If you’re doing new work, there’s a set of common services we provide, including sample code and documentation. So no one has to reinvent the wheel when it comes to telemetry for example – you might improve on it, but you would never have to deliver this from scratch, it’s reused from a common set of services.

DevOps and burnout – it’s a real thing

Today I woke up to the news that one of my favorite heroes in the food world, Anthony Bourdain, was found dead in his hotel room – an apparent suicide. He leaves behind an 11-year-old daughter, a longtime girlfriend, and anguishing questions that will likely never be resolved for those who love him. It’s the second suicide I’ve heard of this week. It seems like now is a good time to talk about burnout and job stress. If “DevOps is compassion”, as John Willis is fond of saying, we really need to do a better job in our industry of protecting our people from the stress that is claiming so many lives.

Note – these aren’t my words – they come from my cowriter and good friend Knox Lively. The stories he’s telling below are real, and they expose a problem that’s causing a hidden but very real emotional and physical health catastrophe in our field. Please give this some thought, and spend more time with the ones you love.

Burnout in our industry is common, and it often impacts the brightest, most positive contributors to the team. The symptoms include feeling exhausted, cynical, or ineffective; little or no sense of accomplishment in your work; and feelings about your work negatively affecting other aspects of your life. We’ve all seen the impacts on the lives of people around us; broken families, severe depressions, and even suicide.

I think it’s safe to say that most of us have at least one story related to the topic of burnout, illness, or even death as a result of misaligned work ethics and objectives from the individual to the organizational level. I have a couple of personal anecdotes I’d like to share surrounding this topic.

The first involves a brilliant architect at a startup I worked with in Austin, Texas. Some of you may have a similar person in your organization – they are always the last person you call. Whatever the issue, whatever the time, you can count on them to fix the problem, or at least know how to fix it. This particular person was involved in a very tedious, multi-month port from one application server to another – all at the recommendation of a C-level exec who had talked to another C-level exec on a plane who happened to mention that “application server x is n percent faster than application server y”. You’ve heard these kinds of stories too, I presume? Anyway, right in the middle of this port, this person was asked by management not to ride the brand-new Ducati bike he’d recently purchased. For those unfamiliar, Ducati has a reputation for building incredibly powerful performance motorcycles. He was asked this because – no stretch of the imagination needed – if something were to happen to him, g-d forbid, the company would be in dire straits. This was just one of the many ways his job impacted his life in an unhealthy way. He often got to wear the hat of the hero; this, coupled with his uncanny expertise and the fact that the organization had allowed unhealthy work habits to go unchecked, meant he had unknowingly begun to build himself a prison. Neither he nor the organization is at complete fault here, as we are dealing with something even larger – a cultural problem.

The second anecdote I’ll share with you highlights the real dangers of stress, even when it’s handled in accordance with the best current practices. This person had everything a software engineer could ever want. He was highly paid, found great satisfaction in his job, and worked for a company he believed in – the trifecta in terms of a career. In addition to career success, he had a very happy and fulfilled personal life. He had a wife and kids, was an avid lover of the outdoors, ate right, and exercised. He seemed to have all of his bases covered in terms of having a holistic and balanced life, and most would agree. None of this, however, prevented him from collapsing on his mountain bike and dying of a massive heart attack. It came as a shock to the whole company. How could this person have a heart attack? Everyone mentally checked off the boxes they’d read dozens of times for how to have a healthy heart. He met them all. The one thing that wasn’t accounted for was the great deal of stress this person had been under. Remember the earlier anecdote about the architect? This is the same person. He’d become indispensable to the company’s daily operations, which meant 60+ hour work weeks for years, because without him things often did not get done. That, on top of the side projects he had worked on over the years, meant he spent most of his time working and had relatively little time to unwind from his stress.


(image credit helpguide.org)

 

In the book we’re writing, we’re proposing DevOps in part as a solution to a life spun out of control. We believe – and experience shows – that moving more work from the “have-to” drudge side of the page to the more creative, automated, sparkly-techy side will, over time, help reduce our stress load and improve our quality of life. But there’s no denying it – sometimes teams get hooked on the endorphins and the rush that come from the long nights getting a release fixed and out the door.

Think of the unforgettable character of Brent in the Phoenix Project – the one person no one could live without, the single, irreplaceable point of failure – and bottleneck – for everything the team did. Brent and the people like him are locked in a state other authors have called “full catastrophe living”, a prison they’ve built themselves by failing to see the need to improve processes and share information with others. In this case, they’re getting a short-term payoff in several areas – the warm glow of feeling respected and irreplaceable, and a certain amount of job security. This heroism comes with a high price though in terms of the team’s overall capacity and ability to learn, and over time inevitably on the health of these enterprise Atlases. Convincing these heroes of the value of DevOps is often a hard and long road; often it takes a top-down commitment to change behavior to a more sustainable pattern to overcome resistance from this point of view.

This topic of burnout and stress in the tech industry is worthy of a whole other book, one whose surface we can barely begin to scratch here. In terms of reducing stress on a personal level, here are just a few key tips to begin your journey to a healthier work/life balance.

The first is to talk – whether to your partner, your boss, or a therapist. Don’t assume you’re the only one under such pressure, or that to talk about it is weak. We often assume that others are going through the same thing, so we should just “suck it up”. The truth is, no one knows what’s on your mind unless you tell them. Don’t let pride take you down a one-way street to loneliness and depression. Did you know loneliness, isolation, and depression are bigger killers than obesity? (Malito, 2017). This factor alone greatly increases your risk of premature death – a pretty sobering fact if you think about it.

Second, create barriers between your personal and work life. This can come in various forms, such as leaving your laptop at work on the weekend, or after every workday if you can manage it. If you must have your devices with you at all times, try not checking work email while at home – or, at a minimum, not checking it within an hour of going to bed or within an hour of waking in the morning. These are just a few of the many ways to decouple your work and personal life.

Third, establish barriers within your organization. For most of my career I’ve been at the mercy of what I call a “push” workload, meaning work has always come to me, whether I want it to or not. That’s expensive on many fronts: it causes task switching, induces stress, and is inefficient by design – or lack thereof. At the heart of DevOps is the idea of designing inefficient workflows out of your organization. The same can be done on a personal level. Find small ways to create tension – the good kind – between you and those who send work your way. Establish protocols and chains of command for who gets pinged, and when, for certain tasks. In the same vein, improve and use internal documentation tools to empower others to solve problems for themselves. These are simply a few of the many ways to “shield” yourself so that you can stay focused and knock out more work with less effort.

For more on burnout and the impacts of stress on our environment, please see the following.


 

DevOps Stories – Jon Cwiak, Humana

The following content is shared from an interview with Jon Cwiak, Enterprise Cloud Platform Architect at Humana. What we loved about talking with Jon was his candor – he’s very honest and upfront that the story of Humana’s adoption of DevOps has not always been smooth, and about the struggles and challenges they’re still facing. Along the way we learned some eye-opening insights:

  • Having a DevOps team isn’t necessarily a bad thing
  • How can you break down walls and change very traditional mindsets or siloed groups?
  • How two metrics alone can tell you about your organization’s health
  • The Humana story as a practical roadmap, from version control to config management to feature toggles and microservices
  • The power of laziness as a positive career trait!

We loved our talk with Jon and wanted to share his thoughts with the community. Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

My name is Jon Cwiak – I’m an enterprise software architect on our enterprise DevOps enablement team at Humana, a large health insurance company based out of Kentucky. We are in the midst of a transition from the traditional insurance business into what amounts to a software company specializing in wellness and population health.

Our main function is to promote the right practices among our engineering teams. So I spend a big part of each week reinforcing to groups the need for hygiene – that old cliché about going slow to go fast. Things like branching strategy, version control, configuration management, dependency management – those things aren’t sexy but we’ve got to get it right.

Some of our teams though have been doing work in a particular way for 15 years; it’s extraordinarily hard to change these indoctrinated patterns. What we are finding is, we succeed if we show we are adding value. Even with these long-standing teams, once they see how a stable release pipeline can eliminate so much repetitive work from their lives, we begin to make some progress.

We are a little different in that there was no trumpet call of “doing DevOps” from on high – instead it was crowdsourced. Over the past 5 years, different teams in the org have independently found a need to deliver products and services to the org at a faster cadence. It’s been said that software is about two things – building the right thing and building the thing right. My group’s mission is all about that second part – we provide the framework, all the tools, platforms, architectural patterns and guidance on how to deliver cheaper, faster, smarter.

The big picture that’s changed for us as a company is the realization that the big-bang, waterfall approach – shipping everything in mega-events nine months or more apart – just doesn’t cut it anymore. We used to do those vast releases – a huge flow of bits like water; we called it a tsunami release. Well, just like with a real tsunami, there’s a wave of devastation after delivering these large platforms all at once that can take months of cleanup. We’ve changed from tsunami thinking to ripples, with much faster, more frequent releases.

When the team first started up in 2012, the first thing we noticed was that everything was manual. And I mean everything – change requests, integration activity, testing. There were lots of handoffs, lots of Conway’s law at work.

So we started with the basics. For us, that was getting version control right – starting with basic hygiene practices, doing things in ways that decouple you from the way the software is being delivered. Just as an example, we used to label releases based on the year, quarter, and month a release was targeted for. So if suddenly a feature wasn’t needed – just complete integration hell. Lots of merges, lots of drama as we were backing things out. So we moved toward semantic versioning, where products are versioned regardless of when they’re delivered. Since this involved dozens of products and a lot of reorganization, getting version control right took the better part of six months for us. But it absolutely was the ground level for us being able to go fast.
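As a rough illustration of the difference Jon describes, here is a minimal Python sketch comparing date-targeted labels with semantic versions; the version numbers are hypothetical, not Humana’s actual scheme.

```python
# Minimal sketch: date-targeted release labels vs. semantic versions.
# The version numbers below are hypothetical, not Humana's actual scheme.

def parse_semver(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple that sorts correctly."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

# A date-based label like "2018-Q2-May" only says when a change was *targeted*;
# a semantic version says what changed, and sorts correctly no matter when
# (or whether) the release actually ships.
releases = ["1.4.0", "1.10.2", "2.0.0", "1.4.1"]
print(sorted(releases, key=parse_semver))   # ['1.4.0', '1.4.1', '1.10.2', '2.0.0']
```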

Next up was fixing the way the devs worked. We had absolutely no confidence in the build process because it was xcopy manual deployments – so there was no visibility, no accountability, and no traceability. This worked great for the developers, but was terrible for everyone else struggling with “it works on my machine!” So continuous integration was the next rung on the ladder, and we started with a real enterprise build server. Getting to a common build system was enormously painful for us; don’t kid yourself that it’s easy. It exposed, application by application, all the gaps in our version control – a lot of hidden work we had to race to keep ahead of. But once the smoke cleared, we’d eliminated an entire category of work. Now version control was the source of truth, and the build server artifacts were reliable and complete. Finally we had a repeatable build system that we could trust.

The third rung of the ladder was configuration management. It took some bold steps to get our infrastructure under control. Each application had its own unique and beautiful configuration, and no two environments were alike – dev, QA, test, production were all different. Trying to figure out where these artifacts were and what the proper source of truth was required a lot of weekends playing “Where’s Waldo”! Introducing practices like configuration transforms gave us confidence that we could deploy repeatedly and get the same behavior, and it really helped us enforce some consistency. The movement toward a standardized infrastructure – no snowflakes, everything the same, infrastructure as code – has been a key enabler for fighting the config drift monster.
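The idea behind a configuration transform can be sketched in a few lines: one shared base configuration, with each environment overriding only what must differ. This is a simplified Python analogy with made-up keys, not Humana’s actual tooling.

```python
# Minimal sketch of a configuration "transform": one shared base config, with
# small per-environment overrides so dev/QA/prod differ only where they must.
# Keys and values are hypothetical, not Humana's settings.

BASE = {
    "connection_timeout_seconds": 30,
    "feature_cache_enabled": True,
    "database_host": "localhost",
}

OVERRIDES = {
    "dev":  {"database_host": "dev-db.internal"},
    "qa":   {"database_host": "qa-db.internal"},
    "prod": {"database_host": "prod-db.internal", "connection_timeout_seconds": 10},
}

def build_config(environment):
    """Apply an environment's overrides on top of the shared base config."""
    config = dict(BASE)
    config.update(OVERRIDES.get(environment, {}))
    return config

print(build_config("prod"))  # same shape everywhere, only the deltas change
```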

The data layer has been one of the later pieces to the puzzle for us. With our move to the cloud, we can’t wait for the thumb of approval from a DBA working apart from the team. So teams are putting their database under version control, building and generating deployable packages through DACPACs or ReadyRoll, and the data layer just becomes another part of the release pipeline. I think over time that traditional role of the DBA will change and we’ll see each team having a data steward and possibly a database developer; it’s still a specialized need and we need to know when a data type change will cause performance issues for example, but the skillset itself will get federated out.

Using feature toggles changes the way we view change management. We’ve always viewed delivery as the release of something. Now we can say that the deployment and the release are two different activities. Just because I deploy something doesn’t mean it has to be turned on. We used to view releases as a change, which meant we needed to manage them as a risk. Feature toggles flip this on its head: deployments can happen early and often, and releases can happen at a different cadence that we control, safely. What a game-changer that is!
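A minimal sketch of the deploy-versus-release split a feature toggle gives you; the flag name and code paths here are hypothetical.

```python
# Minimal sketch of a feature toggle: the new code path is deployed,
# but it is only *released* when the flag is flipped. The flag name and
# functions are hypothetical.

FLAGS = {"new_checkout_flow": False}   # deployed dark; flip to True to release

def new_checkout(cart):
    return f"new flow: {len(cart)} items"

def legacy_checkout(cart):
    return f"legacy flow: {len(cart)} items"

def checkout(cart):
    # Deployment put new_checkout() in production; the toggle decides
    # whether any user actually sees it.
    if FLAGS.get("new_checkout_flow", False):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book", "pen"]))   # legacy flow until the toggle is turned on
```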

COTS products and DevOps totally go together. Think about it from an ERP perspective – where you need to deliver customizations to an ERP system, or Salesforce.com, or whatever BI platform you’re using. The problem is, these systems weren’t designed in most cases to be delivered in an agile fashion. These are all big-bang releases, with lots of drama, where any kind of meaningful customization is near taboo because it’ll break your next release. To bridge this gap, we tell people not to change but to add – add your capabilities and customizations as a service, and then invoke it through a middleware platform. So you don’t change something that exists; you add new capabilities and point to them.

Gartner’s concept of bimodal IT I struggle with, quite frankly. It’s true you can’t have a one size fits all risk management strategy – you don’t want a lightweight website going through the long review period you might need with a legacy mainframe system of record for example. But the whole concept that you have this bifurcated path of one team moving at this fast pace, and another core system at this glacial pace – that’s just a copout I think, an excuse to avoid the modern expectations of platform delivery.

We do struggle with long-lived feature branches. It’s a recurring pain point for us – we call it the integration credit card that teams charge to, and it inevitably leads to drama at release time and some really long weekends. In a lot of cases the team knows this is bad practice and definitely wants to avoid it, but because of cross dependencies we end up with these long-lived branches. The other issue is contention, which is usually an architecture issue. We’re moving towards a one-repo, one-build-pipeline model and decomposing software down to its constituent parts to try to reduce this, but decoupling these artifacts is not an overnight kind of thing.

The big blocker for most organizations seems to be testing. Developers want to move at speed, but the way we test – usually manually – and our lack of investment in automated unit tests create these long test cycles, which in turn spawn these long-lived release branches. The obvious antidote is feature toggles to decouple deployment from delivery.

I gave a talk a few years back called “King Tut Testing” where we used Mike Cohn’s testing pyramid to talk about where we should be investing in our testing. We are still in the process of inverting that pyramid – moving away from integration testing, lessening functional testing, and fattening up that unit testing layer. A big part of the journey for us is designing architectures so that they are inherently testable, mockable. I’m more interested in test-driven design than I am in test-driven development personally, because it forces me to think in terms of – how am I going to test this? What are my dependencies, and how can I fake or mock them so that the software is verifiable? The carrot I use in talking about this shift and convincing teams to invest in unit testing is: not only is this your safety net, it’s a living, breathing definition of what the software does. So for example, when you get a new person on the team, instead of weeks of manual onboarding, you use the working test harness to introduce them to how the software behaves and give them a comfort level in making modifications safely.
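A small sketch of what designing for testability can look like in practice, using Python’s built-in unittest.mock: the dependency is injected, so a unit test can substitute a fake instead of calling a real service. The names are illustrative, not from Humana’s codebase.

```python
# Small sketch of designing for testability: the rate lookup is injected, so a
# unit test can substitute a fake instead of calling a real service.
# Names are illustrative, not from Humana's codebase.

import unittest
from unittest.mock import Mock

def quote_annual_premium(member_age, rate_service):
    """Business logic under test; rate_service is an injected dependency."""
    monthly_rate = rate_service.base_rate_for(member_age)
    return round(monthly_rate * 12, 2)

class QuotePremiumTests(unittest.TestCase):
    def test_annual_premium_uses_injected_rate(self):
        fake_rates = Mock()
        fake_rates.base_rate_for.return_value = 100.50
        self.assertEqual(quote_annual_premium(40, fake_rates), 1206.00)
        fake_rates.base_rate_for.assert_called_once_with(40)

if __name__ == "__main__":
    unittest.main()
```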

The books don’t stress enough how difficult this is. There’s just not the ROI to support creating a fully functional set of tests with a brownfield software package in most cases. So you start with asking, where does this hurt most? – using telemetry or tools like SonarQube. And then you invest in slowing down, then stopping the bleeding.

Operations support in many organizations tends to be more about resource utilization and cost accounting – how do I best utilize this support person so he’s 100% busy? And we have ticketing systems that create a constant stream of work and activity for Operations. The problem with this siloed thinking is that the goal is no longer developing the best software possible and providing useful feedback – it’s closing a ticket as fast as possible. We’re shifting that model with our move to microservices, to teams that own the product and are responsible for maintaining and supporting it end to end.

Lots of vendors are trying to sell DevOps In A Box – buy this product, magic will happen. But they don’t like to talk about all the unsexy things that need to be done to make DevOps successful – four years to clean up version control, for example. It’s kind of a land grab right now with tooling – some of those tools are great in unicorn space but don’t work so well with teams that have been using long-lived feature branches.

Every year we do an internal DevOps Day, and that’s been so great for us in spreading enthusiasm. I highly recommend it. The subject of the definition of DevOps inevitably comes up. We like Donovan Brown’s definition and that’s our standard – one of the things I will add is, DevOps is an emergent characteristic. It’s not something you buy, not something you do. It’s something that emerges from a team when you are doing all the right things behind the scenes, and these practices all work together and support each other.

There are lots of metrics to choose from, but two stand out – and they’re not new or shocking: lead time and cycle time. Those two are the standard we always fall back on, and the only way we can tell if we’re really making progress. They won’t tell us where we have constraints, but they do tell us which parts of the org are having problems. We go after those with every fiber of our effort. There are other line-of-sight metrics, but those two are dominant in determining how things are going.
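A minimal sketch of the two measurements, using one common set of definitions (lead time from request to delivery, cycle time from work started to delivery); the timestamps are made up.

```python
# Minimal sketch of the two metrics, under one common set of definitions:
# lead time = request to delivery, cycle time = work started to delivery.
# Timestamps are made-up illustrative data.

from datetime import datetime

work_item = {
    "requested":    datetime(2018, 5, 1),
    "work_started": datetime(2018, 5, 14),
    "delivered":    datetime(2018, 5, 21),
}

lead_time = work_item["delivered"] - work_item["requested"]       # 20 days
cycle_time = work_item["delivered"] - work_item["work_started"]   # 7 days

print(f"lead time: {lead_time.days} days, cycle time: {cycle_time.days} days")
```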

We do value stream analysis and map out our cycle time, our wait time, and our handoffs. It’s an incredibly useful tool in terms of being a bucket of cold water right to the face – it exposes the ridiculous amount of effort being wasted in doing things manually. That exercise has been critical in helping prove why we need to change the way we do things. It’s specific and quantitative – people see the numbers and immediately get why waiting two weeks for someone to push a button is unacceptable. Until they see the numbers, it always seems to be emotional.

A consistent definition of done – well, we’re getting there. Giving people 300-page binders, or a checklist, or templated tasks so developers have to check boxes – we’ve tried them all, and they’re just not sustainable. The model that seems to work is where the team is self-policing, where a continuous review is happening of what other people on the team are doing. That kind of group accountability is so much better than any checklist. You have to be careful though – it’s successful if the culture supports these reviews as a learning opportunity, a public speaking opportunity, a chance to show and tell. In the wrong culture, peer reviews or code demos become a kind of group beat-down where we criticize and nitpick other people’s investment.

A DevOps team isn’t an antipattern like people say. Centralizing the work is not scalable, that is definitely an antipattern. But I love the mission our team has, enabling other groups to go faster. It’s kind of like being a consulting team – architectural guidance and consulting, practices. It’s incredibly rewarding to help foster this growing culture within our company, we are seeing this kind of organic center of excellence spring up.

What I like to tell people is, be like the best developers out there, and be incredibly selfish and lazy. If you’re selfish, you invest in yourself – improving your skillset, in the things that will give you a long-term advantage. If you’re lazy, you don’t want to work harder than you have to. So you automate things to save yourself time. Learning and automation are two very nice side effects of being lazy and selfish, and it’s a great survival trait!

 

DevOps Stories – Aaron Bjork, Microsoft

Many people ask us how Microsoft accomplished our transformation with DevOps. Our interview with Aaron Bjork, Principal Group Program Manager for VSTS (Visual Studio Team Services) at Microsoft, opened up some valuable lessons that could be applied to any large enterprise trying to transform the way they deliver value and get feedback faster. This interview has previously been posted on the Microsoft Premier official blog here.

These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Contact me if you’d like an advance copy!

 

 

I just want to stress that you can’t follow what we did on the Visual Studio Team Services (VSTS) team like a prescription. There’s not another product in the world like ours; it would be foolish for me to say, you should exactly do it our way.

That being said, I do see some common elements in teams that successfully make the jump in DevOps:

  1. Have a single cadence across all your teams. I haven’t seen a single place yet where that won’t apply. Your teams within that cadence can have significant freedom and autonomy, but we want everyone to be dancing to the same beat.
  2. Ship at the end of each sprint. The saying we live by goes – “You can’t cheat shipping.” If you deliver working software to your users at the end of every iteration, you’ll learn what it takes to do that and which pieces you’ll need to automate. If you don’t ship at the end of each iteration, human nature kicks in and we start to delay, to procrastinate. Shipping at the end of a sprint is comfy and righteous and produces the right behaviors.
  3. We same-size our teams. Every team has a consistent size and shape – about 8-12 people, working across the stack all the way to production support. This helps not just with delivering value faster in incremental sizes, but gives us a common taxonomy so we can work across teams at scale. Whenever we break that rule – teams that are smaller than that, or bloat out to 20 people for example – we start to see anti-patterns crop up; resource horse-trading and things like that. I love the “two pizza rule” at Amazon; there’s no reason not to use that approach, ever.
  4. Have each team own their features as a product. Our teams own their features in production. If you start having siloed support or operations teams running things in production, almost immediately you start to see disruption in continuity and other bad behaviors. It doesn’t motivate people to ship quality and deliver end to end capabilities to users; instead it becomes a “not it” game.

In handling support, each sprint our teams are broken up into an “F” team and an “L” team. The F team is focused on new features; the L team is focused on disruptions and lifecycle. We rotate these people, so every sprint a different pair of engineers handles bugfixes and interruptions while the other ten do new feature work. This helps people schedule their lives when they’re on call.

We’ve gone through a big movement in the past few years where we took our entire test bed – which was largely automated-UI focused, without a lot of unit testing – and flipped it on its head. Now we are running far fewer automated UI tests and a ton of what we call L1 and L2 tests, which are essentially unit tests at the lowest levels checking components and end-to-end capabilities. This allows us to run through our test cycle much faster – like every commit. I think you still have to do some level of acceptance testing; just determine what level works for your software base and helps drive quality.

We started to deploy at the end of every three weeks instead of twice a year. Another thing was, we moved everyone into the same building, reporting up to the same org structure. The folks that run our ops are a part of our leadership team just like our engineering and program management teams – all under the same umbrella. This started getting everyone bought into the shared goals we have. We have monthly business reviews where we talk about more than just the technical goals – financials, operations, bug health, not just code. This helps us align on the same goal, bringing people under the same umbrella so we are invested in the other side, if you will.

Our teams own features in production – we hire engineers who write code, test code, deploy code, and support code. In the end, that’s DevOps. Now our folks have a relationship with the people handling support – they have to. If you start with that setup, the rest falls into place. If you have separate groups, each responsible for a piece of the puzzle – that’s a recipe for not succeeding, in my view.

Branching is similar, in that we don’t have long-lived branches at all. We do have a release branch; our engineers check out their work from mainline, though, and they check their short-lived branches in directly to main. In general, I’d say people are checking their changes into their user branch every day; every other day they submit a pull request to integrate their user branch back to main. The team handles all merge issues internally; everything is validated to work before it’s checked in.

When I think about how we handle releases, a couple things come to mind. First, we want to minimize the time that any code that’s written sits in isolation. We used to have the mindset that at the beginning of each sprint, teams would check their code into a feature branch and then integrate back at the end of the sprint. The problem with this is that the longer you stay away from master, the harder it is to integrate, and you pay a massive tax in merge issues. We want to check into master continuously – that’s a very important construct for us. Second, we wanted to get into the mindset that when a feature is ready, it’s easy to put it into production. Instead of the idea that we will put a new feature into production when it’s 100% ready, move to where features are ALWAYS being put into prod. We were trying to get away from release mechanics being something we were constantly having to manage; I felt it should be a consistent, without-thinking kind of mechanical movement. Now our mechanics are the same whether something is a bug, a critsit incident, or a new feature – and we do it without thinking. Getting to that model and thinking that way required some change – but now we’re always writing code, always deploying code. Feature flags were a big help to us, because we can turn on access to a new feature when we’re ready – it’s safe and controlled.

Pair programming is accepted widely as a best practice; it’s also a culture that shapes how we write code. The interesting thing here is we don’t mandate pair programming. We do teach it; some of our teams have embraced pair programming and it works great for them, always writing in tandem. Other teams have tried it, and it just hasn’t fit. We do enforce consistency on some things across our 40 different teams; others we let the team decide. Pair programming and XP practices are one thing we leave up to the devs; we treat them as adults and don’t shove one way of thinking down their throats.

Another big help to us is a kind of team-of-teams meeting, which we have once every sprint. This is not a “get everybody in the room” type of meeting; it’s very focused – about 4-6 people in the room, each representing their team. We don’t talk about what we’re doing now, but about what we’re working on three sprints ahead. It always amazes me how many “A-Ha!” moments we have during these meetings. It really helps expose points of dependency that we weren’t aware of: “Hmm, we should probably sync up and make sure we have a shared point of view.” In my view this is very agile; it’s lightweight, just enough to accomplish the purpose.

We do track one metric that is very telling – the number of defects a team has. We call this the bug cap. You just take the number of engineers and multiply it by 4 – so if your team has 10 engineers, your bug cap is 40. We operate under a simple rule – if your bug count is above the bug cap, then in the next sprint you need to slow down and pay down that debt. This helps us fight the tendency to let technical debt pile up and become a boat anchor you’re dragging everywhere and having to fight against. With continuous delivery, you just can’t let that debt creep up on you like that. We have no dedicated time to work on debt – but we do monitor the bug cap and let each team manage it as they see best. I check this number all the time, and if we see that number go above the limit, we have a discussion and find out if there’s a valid reason for the debt pileup and what the plan is to remedy it. We don’t allow any team to accrue significant debt; we pay it off like you would a credit card – but instead of making the minimum payment, we’re paying off the majority of the balance every pay period. It’s often not realistic to say “zero bugs” – some defects may just not be that urgent, or shouldn’t come ahead of hot new feature work in priority. This allows us to keep technical debt to a reasonable number and still focus on delivering new capabilities.
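The bug cap rule is simple enough to express directly; here is a minimal sketch with hypothetical team numbers.

```python
# Minimal sketch of the bug cap rule described above: engineers x 4.
# Team sizes and bug counts are hypothetical.

BUG_CAP_MULTIPLIER = 4

def over_bug_cap(engineers, open_bugs):
    """True if the team should slow feature work next sprint and pay down bug debt."""
    return open_bugs > engineers * BUG_CAP_MULTIPLIER

print(over_bug_cap(engineers=10, open_bugs=47))  # True: cap is 40, pay down debt
print(over_bug_cap(engineers=10, open_bugs=32))  # False: under the cap, carry on
```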

We have an engineering scorecard that’s visible to everyone but we’re very careful about what we put on that. Our measurements are very carefully chosen and we don’t give teams 20 things to work on – that’s overwhelming. With every metric that you start to measure, you’re going to get a behavior – and maybe some bad ones you weren’t expecting. We see a lot of companies trying to track and improve everything, which seems to be overburdening teams – no one wants to see a scorecard with 20 red buttons on it!

Agile is a culture more than anything else but – I’m going to be frank – too many people have turned it into a religion, a stone tablet with a bunch of “thou shalts” on it. Some organizations we’ve worked with, for example, bring in multiple rounds of expensive consultants and agile trainers, and they’re given an audit. “Oh, you’re not doing DSUs, your sprint planning meeting doesn’t have the right amount of ceremony, blah blah.” This makes me laugh a little. Do I think daily standups are good practice? Yes, I do. But I’m not going to measure a team’s efficiency by these things. If the team is struggling to produce business value, then we might bring in some of these practices. But it is SO shortsighted to say that if you follow these practices, following this recipe, you’ll be successful. I don’t allow people to start telling me “we need to do things Agile.” There’s just no such thing. Talk to me about what you want to achieve, the business value you want to drive, and that’s our starting point.

Just because you have a DSU doesn’t mean you’re making the right decisions. Just because you’re using containers or have adopted microservices doesn’t mean you’re doing DevOps. Maybe you’re better set up to do Agile or DevOps because of these tools, but nothing has really changed. Agile is very simple and beautiful as a mindset – we are going to deploy as frequently as we can. Too often we turn it into a set of rules you have to follow.

 


(from Aaron’s presentation deck on “Agile at Microsoft”, https://www.youtube.com/watch?v=-LvCJpnNljU )

 

 

Note: if you want more on this story and how Microsoft went about their transformation, check out Aaron’s presentation here – 41 minutes that could very well change your whole view of how to go about your own transformation. It remains one of the best real-world encapsulations of DevOps that I’ve ever seen. Some more DevOps stories from the Visual Studio team are here, including our understanding of what DevOps is and Munil Shah’s excellent thoughts on “shifting Left” with our test infrastructure.