
DevOps Stories – Jon Cwiak, Humana

The following content is shared from an interview with Jon Cwiak, Enterprise Cloud Platform Architect at Humana. What we loved about talking with Jon was his candor – he’s very honest and upfront that Humana’s adoption of DevOps has not always been smooth, and about the struggles and challenges they’re still facing. Along the way we learned some eye-opening insights:

  • Having a DevOps team isn’t necessarily a bad thing
  • How to break down walls and change very traditional mindsets and siloed groups
  • How two metrics alone can tell you about your organization’s health
  • The Humana story as a practical roadmap, from version control to config management to feature toggles and microservices
  • The power of laziness as a positive career trait!

We loved our talk with Jon and wanted to share his thoughts with the community. Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

My name is Jon Cwiak – I’m an enterprise software architect on our enterprise DevOps enablement team at Humana, a large health insurance company based out of Kentucky. We are in the midst of a transition from the traditional insurance business into what amounts to a software company specializing in wellness and population health.

Our main function is to promote the right practices among our engineering teams. So I spend a big part of each week reinforcing to groups the need for hygiene – that old cliché about going slow to go fast. Things like branching strategy, version control, configuration management, dependency management – those things aren’t sexy, but we’ve got to get them right.

Some of our teams though have been doing work in a particular way for 15 years; it’s extraordinarily hard to change these indoctrinated patterns. What we are finding is, we succeed if we show we are adding value. Even with these long-standing teams, once they see how a stable release pipeline can eliminate so much repetitive work from their lives, we begin to make some progress.

We are a little different in that there was no trumpet call of “doing DevOps” from on high – instead it was crowdsourced. Over the past 5 years, different teams in the org have independently found a need to deliver products and services to the org at a faster cadence. It’s been said that software is about two things – building the right thing and building the thing right. My group’s mission is all about that second part – we provide the framework, all the tools, platforms, architectural patterns and guidance on how to deliver cheaper, faster, smarter.

The big picture that’s changed for us as a company is the realization that doing this big-bang, waterfall, shipping everything in 9 months or more mega-events just doesn’t cut it anymore. We used to do those vast releases – a huge flow of bits like water, we called it a tsunami release. Well just like with a real tsunami there’s a wave of devastation after the delivery of these large platforms all at once that can take months of cleanup. We’ve changed from tsunami thinking to ripples with much faster, more frequent releases.

When the team first started up in 2012, the first thing we noticed was that everything was manual. And I mean everything – change requests, integration activity, testing. There were lots of handoffs, lots of Conway’s Law at work.

So we started with the basics. For us, that was getting version control right – starting with basic hygiene practices, doing things in ways that decouple you from the way in which the software is being delivered. Just as an example, we used to label releases based on the year, quarter, and month a release was targeted for. So if suddenly a feature wasn’t needed – just complete integration hell. Lots of merges, lots of drama as we were backing things out. So we moved toward semantic versioning, where products are versioned regardless of when they’re delivered. Since this involved dozens of products and a lot of reorganization, getting version control right took the better part of 6 months for us. But it absolutely was the ground level for us being able to go fast.
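
To make the contrast concrete, here’s a minimal sketch (illustrative only, not Humana’s actual tooling) of what semantic versioning buys you: the version describes what changed, so it never has to be rewritten when a delivery date slips.

```python
# Minimal sketch: version a product by what changed, not by when it ships.
# A date-targeted label like "2018-Q2-May" has to be rewritten if a feature
# slips; a semantic version does not.
from typing import NamedTuple

class SemVer(NamedTuple):
    major: int  # breaking changes
    minor: int  # backwards-compatible features
    patch: int  # bug fixes only

    def bump(self, change: str) -> "SemVer":
        if change == "breaking":
            return SemVer(self.major + 1, 0, 0)
        if change == "feature":
            return SemVer(self.major, self.minor + 1, 0)
        return SemVer(self.major, self.minor, self.patch + 1)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"

current = SemVer(3, 4, 1)
print(current.bump("feature"))  # 3.5.0 -- the same tag whether it ships in May or September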

Next up was fixing the way the devs worked. We had absolutely no confidence in the build process because it was xcopy manual deployments – so there was no visibility, no accountability, and no traceability. This worked great for the developers, but terribly for everyone else having to struggle with “it works on my machine!” So, continuous integration was the next rung on the ladder, and we started with a real enterprise build server. Getting to a common build system was enormously painful for us; don’t kid yourself that it’s easy. It exposed, application by application, all the gaps in our version control – a lot of hidden work we had to race to keep ahead of. But once the smoke cleared, we’d eliminated an entire category of work. Now version control was the source of truth, and the build server artifacts were reliable and complete. Finally we had a repeatable build system that we could trust.
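
As a rough illustration of what “a repeatable build you can trust” means mechanically, here’s a hypothetical scripted build – the commands, version number, and file names are placeholders for whatever your stack actually uses – where every artifact comes from the same steps and is stamped with the commit that produced it.

```python
# Hypothetical sketch of a scripted, repeatable build: the same steps run on
# every machine, the build fails fast, and the artifact is stamped with the
# exact commit it came from -- no more "works on my machine" ambiguity.
# The dotnet commands are placeholders for whatever your stack actually uses.
import json
import subprocess
import sys

def run(cmd):
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(f"Build step failed: {' '.join(cmd)}\n{result.stderr}")
    return result.stdout.strip()

commit = run(["git", "rev-parse", "HEAD"])              # traceability: which commit built this?
run(["dotnet", "restore"])
run(["dotnet", "build", "--configuration", "Release"])
run(["dotnet", "test"])

# Record build metadata alongside the artifact so any deployed binary
# can be traced straight back to source control.
with open("artifact-manifest.json", "w") as manifest:
    json.dump({"commit": commit, "version": "3.5.0"}, manifest, indent=2)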

The third rung of the ladder was configuration management. It took some bold steps to get our infrastructure under control. Each application had its own unique and beautiful configuration, and no two environments were alike – dev, QA, test, production, they were all different. Trying to figure out where these artifacts were and what the proper source of truth was required a lot of weekends playing “Where’s Waldo”! Introducing practices like configuration transforms gave us confidence we could deploy repeatedly and get the same behavior and it really helped us enforce some consistency. The movement toward a standardized infrastructure – no snowflakes, everything the same, infrastructure as code – has been a key enabler for fighting the config drift monster.
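
Here’s an illustrative sketch of the idea behind configuration transforms (the settings and URLs are invented): one base configuration lives in version control, each environment applies only a small, reviewed override, and every environment’s config is generated the same way instead of being hand-edited into a snowflake.

```python
# Illustrative sketch of the configuration-transform idea: one base config in
# version control, small per-environment overrides, and every environment's
# file generated the same way -- no hand-edited snowflakes to drift apart.
BASE_CONFIG = {
    "connection_timeout_seconds": 30,
    "log_level": "Warning",
    "payment_service_url": "https://payments.internal/api",   # invented endpoint
}

TRANSFORMS = {
    "dev":  {"log_level": "Debug", "payment_service_url": "https://payments-dev.internal/api"},
    "qa":   {"payment_service_url": "https://payments-qa.internal/api"},
    "prod": {},  # production runs the base config untouched
}

def render(environment: str) -> dict:
    """Apply an environment's overrides on top of the shared base config."""
    config = dict(BASE_CONFIG)
    config.update(TRANSFORMS[environment])
    return config

for env in TRANSFORMS:
    print(env, render(env))
```

Because every environment is rendered from the same base, a setting that exists only in production is impossible by construction – which is exactly the drift this practice is meant to eliminate.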

The data layer has been one of the later pieces of the puzzle for us. With our move to the cloud, we can’t wait for a thumbs-up from a DBA working apart from the team. So teams are putting their database under version control, building and generating deployable packages through DACPACs or ReadyRoll, and the data layer just becomes another part of the release pipeline. I think over time that traditional role of the DBA will change and we’ll see each team having a data steward and possibly a database developer; it’s still a specialized need – we need to know when a data type change will cause performance issues, for example – but the skillset itself will get federated out.
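
DACPAC and ReadyRoll are the tools Jon names; as a generic, tool-agnostic sketch of the underlying idea, here’s what migrations-in-the-pipeline looks like – ordered scripts in version control, applied idempotently on every deploy, with a history table recording what has already run. The schema and table names are invented, and SQLite stands in for the real database engine.

```python
# Generic sketch of the "database as part of the release pipeline" idea:
# migrations live in version control as ordered scripts, the pipeline applies
# only the ones a database hasn't seen, and a tracking table records history.
import sqlite3  # stand-in engine; a real pipeline would target SQL Server, etc.

MIGRATIONS = [
    ("001_create_members", "CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_plan_column", "ALTER TABLE members ADD COLUMN plan TEXT"),
]

def migrate(connection: sqlite3.Connection) -> None:
    connection.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (migration_id TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in connection.execute("SELECT migration_id FROM schema_history")}
    for migration_id, sql in MIGRATIONS:
        if migration_id not in applied:          # idempotent: safe to run on every deploy
            connection.execute(sql)
            connection.execute("INSERT INTO schema_history VALUES (?)", (migration_id,))
    connection.commit()

migrate(sqlite3.connect(":memory:"))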

Using feature toggles changes the way we view change management. We’ve always viewed delivery as the release of something. Now we can say the deployment and the release are two different activities. Just because I deploy something doesn’t mean it has to be turned on. We used to view releases as a change, which meant we needed to manage them as a risk. Feature toggles flip that on its head: deployments can happen early and often, and releases can happen at a different cadence that we can control, safely. What a game-changer that is!
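
A minimal feature-toggle sketch, with invented flag names and a simple group-based rollout rule, shows how deployment and release become separate decisions: the code ships dark, and turning it on is just configuration.

```python
# Minimal feature-toggle sketch: the code for a new feature is deployed, but
# whether it is *released* is a configuration decision made later, per
# audience. Flag names and the rollout rule here are hypothetical.
FLAGS = {
    "new_claims_dashboard": {"enabled": True, "allowed_groups": {"internal_pilot"}},
}

def is_enabled(flag: str, user_groups: set) -> bool:
    config = FLAGS.get(flag)
    if not config or not config["enabled"]:
        return False
    # Released only to selected groups first; empty allowed_groups means everyone.
    return not config["allowed_groups"] or bool(config["allowed_groups"] & user_groups)

def render_new_dashboard() -> str: return "new dashboard"
def render_legacy_dashboard() -> str: return "legacy dashboard"

def claims_page(user_groups: set) -> str:
    if is_enabled("new_claims_dashboard", user_groups):
        return render_new_dashboard()   # deployed weeks ago, released when the toggle flips
    return render_legacy_dashboard()

print(claims_page({"internal_pilot"}))  # new dashboard
print(claims_page({"member"}))          # legacy dashboard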

COTS products and DevOps totally go together. Think about it from an ERP perspective – where you need to deliver customizations to an ERP system, or Salesforce.com, or whatever BI platform you’re using. The problem is, these systems weren’t designed in most cases to be delivered in an agile fashion. These are all big bang releases, with lots of drama, where any kind of meaningful customization is near taboo because it’ll break your next release. To bridge this gap, we tell people not to change but to add – add your capabilities and customizations as a service, and then invoke them through a middleware platform. So you don’t change something that exists; you add new capabilities and point to them.
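
As a hedged sketch of that “don’t change it, add to it” pattern – every name and endpoint here is hypothetical – the vendor system stays vanilla while the customization lives in a small service of your own, composed through middleware.

```python
# Hypothetical sketch of "don't change it, add to it": the COTS system stays
# vanilla and upgradeable, and the customization lives in a small service of
# our own that middleware routes to. All names and endpoints are invented.
import json
from urllib.request import Request, urlopen

ERP_BASE_URL = "https://erp.internal/api"                        # untouched vendor system
CUSTOM_RULES_URL = "https://middleware.internal/discount-rules"  # our added capability

def fetch_json(url: str) -> dict:
    with urlopen(Request(url, headers={"Accept": "application/json"})) as response:
        return json.load(response)

def quote_with_custom_discount(order_id: str) -> dict:
    """Compose vendor data with our own capability instead of patching the vendor."""
    order = fetch_json(f"{ERP_BASE_URL}/orders/{order_id}")        # vendor stays as shipped
    rules = fetch_json(f"{CUSTOM_RULES_URL}/{order['customer']}")  # customization added beside it
    order["total"] *= 1 - rules["discount"]
    return order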

Gartner’s concept of bimodal IT I struggle with, quite frankly. It’s true you can’t have a one size fits all risk management strategy – you don’t want a lightweight website going through the long review period you might need with a legacy mainframe system of record for example. But the whole concept that you have this bifurcated path of one team moving at this fast pace, and another core system at this glacial pace – that’s just a copout I think, an excuse to avoid the modern expectations of platform delivery.

We do struggle with long-lived feature branches. It’s a recurring pain point for us – we call it the integration credit card that teams charge to, and it inevitably leads to drama at release time and some really long weekends. In a lot of cases the team knows this is bad practice and they definitely want to avoid it, but because of cross dependencies we end up with these long-lived branches. The other issue is contention, which usually is an architecture issue. We’re moving towards one repo, one build pipeline, and decomposing software down to its constituent parts to try to reduce this, but decoupling these artifacts is not an overnight kind of thing.

The big blocker for most organizations seems to be testing. Developers want to move at speed, but the way we test – usually manually – and our lack of investment in automated unit tests create these long test cycles, which in turn spawn these long-lived release branches. The obvious antidote is feature toggles to decouple deployment from delivery.

I gave a talk a few years back called “King Tut Testing” where we used Mike Cohn’s testing pyramid to talk about where we should be investing in our testing. We are still in the process of inverting that pyramid – moving away from integration testing, lessening functional testing, and fattening up that unit testing layer. A big part of the journey for us is designing architectures so that they are inherently testable, mockable. I’m more interested in test driven design than I am in test driven development personally, because it forces me to think in terms of – how am I going to test this? What are my dependencies, and how can I fake or mock them so that the software is verifiable? The carrot I use in talking about this shift and convincing teams to invest in unit testing is: not only is this your safety net, it’s a living, breathing definition of what the software does. So for example, when you get a new person on the team, instead of weeks of manual onboarding, you use the working test harness to introduce them to how the software behaves and give them a comfort level in making modifications safely.
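
Here’s a small sketch of what “designing for testability” looks like in practice (the domain and names are invented): the dependency is passed in, so a test can substitute a fake, and the tests double as the living documentation of behavior Jon describes.

```python
# Small sketch of "design for testability": the eligibility check takes its
# dependency as a parameter, so a test can substitute a fake and document the
# expected behavior without any real claims system. Names are illustrative.
import unittest
from unittest.mock import Mock

def is_eligible_for_wellness_credit(member_id: str, claims_client) -> bool:
    """Member earns the credit if they had a preventive-care visit this year."""
    visits = claims_client.get_visits(member_id, visit_type="preventive")
    return len(visits) > 0

class WellnessCreditTests(unittest.TestCase):
    def test_member_with_preventive_visit_is_eligible(self):
        fake_claims = Mock()
        fake_claims.get_visits.return_value = [{"date": "2018-03-02"}]
        self.assertTrue(is_eligible_for_wellness_credit("M123", fake_claims))

    def test_member_with_no_visits_is_not_eligible(self):
        fake_claims = Mock()
        fake_claims.get_visits.return_value = []
        self.assertFalse(is_eligible_for_wellness_credit("M123", fake_claims))

if __name__ == "__main__":
    unittest.main()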

The books don’t stress enough how difficult this is. There’s just not the ROI to support creating a fully functional set of tests for a brownfield software package in most cases. So you start by asking, where does this hurt most? – using telemetry or tools like SonarQube. And then you invest in slowing down, then stopping the bleeding.

Operations support in many organizations tends to be more about resource utilization and cost accounting – how do I best utilize this support person so he’s 100% busy? And we have ticketing systems that create a constant source of work and activity for Operations. The problem with this siloed thinking is that the goal is no longer developing the best software possible and providing useful feedback – it’s now closing a ticket as fast as possible. We’re shifting that model with our move to microservices, to teams that own the product and are responsible for maintaining and supporting it end to end.

Lots of vendors are trying to sell DevOps In A Box – buy this product, magic will happen. But they don’t like to talk about all the unsexy things that need to be done to make DevOps successful – four years to clean up version control, for example. It’s kind of a land grab right now with tooling – some of those tools work great in unicorn space but not so well with teams that have been using long-lived feature branches.

Every year we do an internal DevOps Day, and that’s been so great for us in spreading enthusiasm. I highly recommend it. The subject of the definition of DevOps inevitably comes up. We like Donovan Brown’s definition and that’s our standard – one of the things I will add is, DevOps is an emergent characteristic. It’s not something you buy, not something you do. It’s something that emerges from a team when you are doing all the right things behind the scenes, and these practices all work together and support each other.

There are lots of metrics to choose from, but two metrics stand out – and they’re not new or shocking. Lead time and cycle time. Those two are the standard we always fall back on, and the only way we can tell if we’re really making progress. They won’t tell us where we have constraints, but they do tell us which parts of the org are having problems. We’re going after those with every fiber of our effort. There are other line-of-sight metrics, but those two are dominant in determining how things are going.
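
Using the common definitions – lead time as “requested until delivered” and cycle time as “work started until delivered” – here’s a quick sketch of computing both from work-item timestamps; the data and field names are made up.

```python
# Quick sketch of the two metrics, using the common definitions: lead time is
# "requested until delivered" (what the customer feels), cycle time is "work
# started until delivered" (what the team controls). Data is invented.
from datetime import date
from statistics import median

work_items = [
    {"requested": date(2018, 3, 1), "started": date(2018, 3, 20), "delivered": date(2018, 4, 2)},
    {"requested": date(2018, 3, 5), "started": date(2018, 4, 9),  "delivered": date(2018, 4, 16)},
    {"requested": date(2018, 3, 7), "started": date(2018, 3, 12), "delivered": date(2018, 3, 19)},
]

lead_times = [(item["delivered"] - item["requested"]).days for item in work_items]
cycle_times = [(item["delivered"] - item["started"]).days for item in work_items]

print(f"median lead time:  {median(lead_times)} days")   # long lead but short cycle => work sits in queues
print(f"median cycle time: {median(cycle_times)} days")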

We do value stream analysis and map out our cycle time, our wait time, and handoffs. It’s an incredibly useful tool in terms of being a bucket of cold water right to the face – it exposes the ridiculous amount of effort being wasted in doing things manually. That exercise has been critical in helping prove why we need to change the way we do things. It’s specific and quantitative – people see the numbers and immediately get why waiting two weeks for someone to push a button is unacceptable. Until they see the numbers, it always seems to be emotional.
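
The “bucket of cold water” usually comes down to one piece of arithmetic: comparing hands-on process time against the waiting between handoffs. Here’s an illustrative calculation with invented steps and durations.

```python
# Illustrative value-stream arithmetic (step names and durations invented):
# add up hands-on process time versus waiting between handoffs, and the
# "why is this taking six weeks?" conversation answers itself.
value_stream = [
    # (step, process_hours, wait_hours_before_next_step)
    ("code complete",      16, 80),   # waiting for someone to schedule a deploy
    ("manual regression",  24, 120),  # waiting for a shared QA environment
    ("change approval",     1, 160),  # waiting on the weekly CAB meeting
    ("production deploy",   2, 0),
]

process_time = sum(step[1] for step in value_stream)
wait_time = sum(step[2] for step in value_stream)
flow_efficiency = process_time / (process_time + wait_time)

print(f"hands-on work: {process_time} hours, waiting: {wait_time} hours")
print(f"flow efficiency: {flow_efficiency:.0%}")  # typically a startlingly small number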

A consistent definition of done – well, we’re getting there. Giving people 300-page binders, or a checklist, or templated tasks so developers have to check boxes – we’ve tried them all, and they’re just not sustainable. The model that seems to work is where the team is self-policing, where a continuous review is happening of what other people on the team are doing. That kind of group accountability is so much better than any checklist. You have to be careful though – it’s successful if the culture supports these reviews as a learning opportunity, a public speaking opportunity, a chance to show and tell. In the wrong culture, peer reviews or code demos become a kind of group beat-down where we criticize and nitpick other people’s investment.

A DevOps team isn’t an antipattern like people say. Centralizing the work is not scalable – that is definitely an antipattern. But I love the mission our team has, enabling other groups to go faster. It’s kind of like being a consulting team – providing architectural guidance, consulting, and practices. It’s incredibly rewarding to help foster this growing culture within our company; we are seeing this kind of organic center of excellence spring up.

What I like to tell people is, be like the best developers out there, and be incredibly selfish and lazy. If you’re selfish, you invest in yourself – improving your skillset, in the things that will give you a long-term advantage. If you’re lazy, you don’t want to work harder than you have to. So you automate things to save yourself time. Learning and automation are two very nice side effects of being lazy and selfish, and it’s a great survival trait!

 

DevOps Stories – Aaron Bjork, Microsoft

Many people ask us how Microsoft accomplished our transformation with DevOps. Our interview with Aaron Bjork, Principal Group Program Manager for VSTS (Visual Studio Team Services) at Microsoft, opened up some valuable lessons that could be applied to any large enterprise trying to transform the way they deliver value and get feedback faster. This interview has previously been posted on the Microsoft Premier official blog here.

These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Contact me if you’d like an advance copy!

 


I just want to stress that you can’t follow what we did on the Visual Studio Team Services (VSTS) team like a prescription. There’s not another product in the world like ours; it would be foolish for me to say, you should exactly do it our way.

That being said, I do see some common elements in teams that successfully make the jump in DevOps:

  1. Have a single cadence across all your teams. I haven’t seen a single place yet where that won’t apply. Your teams within that cadence can have significant freedom and autonomy, but we want everyone to be dancing to the same beat.
  2. Ship at the end of each sprint. The saying we live by goes – “You can’t cheat shipping.” If you deliver working software to your users at the end of every iteration, you’ll learn what it takes to do that and which pieces you’ll need to automate. If you don’t ship at the end of each iteration, human nature kicks in and we start to delay, to procrastinate. Shipping at the end of a sprint is comfy and righteous and produces the right behaviors.
  3. We same-size our teams. Every team has a consistent size and shape – about 8-12 people, working across the stack all the way to production support. This helps not just with delivering value faster in incremental sizes, but also gives us a common taxonomy so we can work across teams at scale. Whenever we break that rule – teams that are smaller than that, or that bloat out to 20 people, for example – we start to see anti-patterns crop up: resource horse-trading and things like that. I love the “two pizza rule” at Amazon; there’s no reason not to use that approach, ever.
  4. Have each team own their features as a product. Our teams own their features in production. If you start having siloed support or operations teams running things in production, almost immediately you start to see disruption in continuity and other bad behaviors. It doesn’t motivate people to ship quality and deliver end to end capabilities to users; instead it becomes a “not it” game.

To handle support, each sprint our teams are broken up into an “F” team and an “L” team. The F team is focused on new features; the L team is focused on disruptions and lifecycle. We rotate these people, so every sprint a different pair of engineers is handling bugfixes and interruptions, and the other ten are doing new feature work. This helps people schedule their lives when they’re on call.

We’ve gone through a big movement in the past few years where we took our entire test bed, which was largely automated-UI focused with not a lot of unit testing, and flipped it on its head. Now we are running far fewer automated UI tests and a ton of what we call L1 and L2 tests, which are essentially unit tests at the lowest levels checking components and end-to-end capabilities. This allows us to run through our test cycle much faster – on every commit. I think you still have to do some level of acceptance testing; just determine what level works for your software base and helps drive quality.

We started to deploy at the end of every three weeks instead of twice a year. Another thing was, we moved everyone into the same building, reporting up to the same structure and org. The folks that run our ops are a part of our leadership team just like our engineering and program management teams – all under the same umbrella. This started getting everyone bought into the shared goals we have. We have monthly business reviews, where we talk about more than just technical goals – financials, operations, bug health, not just code. This helps us align on the same goal, bringing people under the same umbrella so we are invested in the other side, if you will.

Our teams own features in production – we hire engineers who write code, test code, deploy code, and support code. In the end, that’s DevOps. Now our folks have a relationship with the people handling support – they have to. If you start with that setup, the rest falls into place. If you have separate groups, each responsible for a piece of the puzzle – that’s a recipe for not succeeding, in my view.

Branching is similar, in that we don’t have long-lived branches at all. We do have a release branch; our engineers branch their work from mainline, though, and they merge their short-lived branches directly back to main. In general, I’d say people are checking their changes into their user branch every day; every other day they submit a pull request to integrate their user branch back to main. The team handles all merge issues internally; everything is validated to work before it’s checked in.

When I think about how we handle releases, a couple of things come to mind. First, we want to minimize the time that any code sits in isolation. We used to have the mindset that at the beginning of each sprint, teams would check their code into a feature branch and then integrate back at the end of the sprint. The problem with this is, the longer you stay away from master, the harder it is to integrate, and you pay a massive tax in merge issues. We want to check into master continuously – that’s a very important construct for us. Second, we wanted to get into the mindset that when a feature is ready, it’s easy to put it into production. Instead of the idea that we will put a new feature into production when it’s 100% ready, move to where features are ALWAYS being put into prod. We were trying to get away from engineering mechanics being something we constantly had to manage; I felt it should be a consistent, almost unthinking mechanical movement. Now our mechanics are the same whether something is a bug, a critsit incident, or a new feature – and we do it without thinking. Getting to that model and thinking that way required some change – but now, we’re always writing code, always deploying code. Feature flags were a big help to us; we can turn on access to a new feature when we’re ready – it’s safe and controlled.

Pair programming is widely accepted as a best practice; it’s also a culture that shapes how we write code. The interesting thing here is we don’t mandate pair programming. We do teach it; some of our teams have embraced pair programming and it works great for them, always writing in tandem. Other teams have tried it, and it just hasn’t fit. We do enforce consistency on some things across our 40 different teams; others we let the team decide. Pair programming and XP practices are things we leave up to the devs; we treat them as adults and don’t shove one way of thinking down their throats.

Another big help to us is a kind of team-of-teams meeting, which we have once every sprint. This is not a “get everybody in the room” type of meeting; it’s very focused, about 4-6 people in the room, each representing their team. We don’t talk about what we’re doing now, but what we’re working on three sprints ahead. It always amazes me how many “A-Ha!” moments we have during these meetings. It really helps expose points of dependency that we weren’t aware of – “Hmm, we should probably sync up and make sure we have a shared point of view.” In my view this is very agile; it’s lightweight, just enough to accomplish the purpose.

We do track one metric that is very telling – the number of defects a team has. We call this the bug cap. You just take the number of engineers and multiply it by four – so if your team has 10 engineers, your bug cap is 40. We operate under a simple rule – if your bug count is above this bug cap, then in the next sprint you need to slow down and pay down that debt. This helps us fight the tendency to let technical debt pile up and become a boat anchor you’re dragging everywhere and having to fight against. With continuous delivery, you just can’t let that debt creep up on you like that. We have no dedicated time to work on debt – but we do monitor the bug cap and let each team manage it as they see best. I check this number all the time, and if we see that number go above the limit we have a discussion and find out if there’s a valid reason for that debt pileup and what the plan is to remedy it. Here we don’t allow any team to accrue significant debt; we pay it off like you would a credit card – instead of making the minimum payment, though, we’re paying off the majority of the balance every pay period. It’s often not realistic to say “zero bugs” – some defects may just not be that urgent, or shouldn’t come ahead of hot new feature work in priority. This allows us to keep technical debt to a reasonable number and still focus on delivering new capabilities.
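
The rule as Aaron describes it fits in a few lines; here’s a sketch with invented team data: cap = engineers × 4, and any team over its cap slows feature work next sprint.

```python
# The bug-cap rule from above, as a few lines of code (team data invented):
# cap = engineers * 4; any team over its cap slows feature work next sprint.
teams = [
    {"name": "Boards",    "engineers": 10, "open_bugs": 37},
    {"name": "Pipelines", "engineers": 9,  "open_bugs": 61},
]

for team in teams:
    bug_cap = team["engineers"] * 4
    over_cap = team["open_bugs"] > bug_cap
    action = "pay down debt next sprint" if over_cap else "keep shipping features"
    print(f"{team['name']}: {team['open_bugs']} bugs vs cap {bug_cap} -> {action}")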

We have an engineering scorecard that’s visible to everyone but we’re very careful about what we put on that. Our measurements are very carefully chosen and we don’t give teams 20 things to work on – that’s overwhelming. With every metric that you start to measure, you’re going to get a behavior – and maybe some bad ones you weren’t expecting. We see a lot of companies trying to track and improve everything, which seems to be overburdening teams – no one wants to see a scorecard with 20 red buttons on it!

Agile is a culture more than anything else, but – I’m going to be frank – too many people have turned it into a religion, a stone tablet with a bunch of “thou shalts” on it. Some organizations we’ve worked with, for example, bring in multiple rounds of expensive consultants and agile trainers, and they’re given an audit: “Oh, you’re not doing DSUs, your sprint planning meeting doesn’t have the right amount of ceremony, blah blah.” This makes me laugh a little. Do I think daily standups are good practice? Yes, I do. But I’m not going to measure a team’s efficiency by these things. If the team is struggling to produce business value, then we might bring in some of these practices. But it is SO shortsighted to say that if you follow these practices, this recipe, you’ll be successful. I don’t allow people to start telling me “we need to do things Agile.” There’s just no such thing. Talk to me about what you want to achieve, the business value you want to drive, and that’s our starting point.

Just because you have a DSU doesn’t mean you’re making the right decisions. Just because you’re using containers or have adopted microservices doesn’t mean you’re doing DevOps. Maybe you’re better set up to do Agile or DevOps because of these tools, but nothing really has changed. Agile is very simple and beautiful as a mindset – we are going to deploy as frequently as we can. Too often we turn it into a set of rules you have to follow.

 


(from Aaron’s presentation deck on “Agile at Microsoft”, https://www.youtube.com/watch?v=-LvCJpnNljU )

 


Note: if you want more on this story and how Microsoft went about their transformation, check out Aaron’s presentation here – 41 minutes that could very well change your whole view of how to go about your own transformation. It remains one of the best real-world encapsulations of DevOps that I’ve ever seen. Some more DevOps stories from the Visual Studio team are here, including our understanding of what DevOps is and Munil Shah’s excellent thoughts on “shifting Left” with our test infrastructure.