
DevOps Stories – Interview with Anne Steiner

 


Anne Steiner is the Vice President of Product Agility for cPrime. In her role, Anne sets up cross-team discovery cadences, scales product thinking in large organizations, and teaches and mentors stakeholders in leadership and product roles. Anne and her team have helped companies of all shapes and sizes transform from traditional project thinking into product-driven organizations that emphasize continuous learning. She also actively promotes building communities of practitioners in the Minneapolis/St. Paul area and frequently speaks at national and regional events. She served in the United States Marine Corps as a logistics/embarkation non-commissioned officer in the early 2000s.

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

 

You know, people think of the military as hierarchical, rigid – but in my experience the military is incredibly flexible and dynamic. It has to be to survive in war, and war is becoming more dynamic. Decision making keeps getting pushed down to lower and lower levels.

Just for example, look how we start with boot camp. It starts with dehumanization – with the goal of teaching people that we are all the same; nobody’s special. We take away your clothes, if you’re a guy we shave off your hair. Then we teach the lesson – you do everything as a team. The USMC sets up tasks that are impossible to complete in the time allowed alone. For example, the beds are so close together that if you’re asked to make a bed – your rack mate has to help you with one side of the lower bunk, and then you help her with your side of her bunk. The lesson is, nobody succeeds alone – in boot camp, you can be perfectly right and still get screamed at. I remember once, I made my bed perfectly; the corners were good, and I still got screamed at because I had known what needed to be done and I didn’t help my teammate. The whole process is to drill into your head – this is your family now – you must succeed as a team.


The military’s approach to requirements: Besides shared values, the concept of how orders are delivered in the military has some application to DevOps. In the military, there’s a separation of concerns between the officers who give orders and the enlisted people who carry them out – similar to the division between team members and management. These two groups have very different points of view, and misunderstandings or conflicts could hamper an operation or cost lives. To address this – nothing significant happens without a written order describing the commander’s intent. It’s a standard five-paragraph order that follows the SMEAC format – Situation, Mission, Execution, Administration and Logistics, and Command and Signal.

Now, the military doesn’t expect its people to document every possible scenario or to follow the words in the order blindly – because we need our people to make decisions independently as the situation inevitably changes mid-operation. So we don’t fill in all the details; we provide the high-level intent. The order describes what the commander wants to accomplish, the overall goals and the time frame – you are following orders as long as you’re following the intent and haven’t violated some other direction provided. At cPrime we do the same thing, where we teach product teams something called collaborative framing. That describes what we’re doing, why we’re doing it, and who we are doing it for. That’s pretty similar to the way orders are used in the Marines – the orders provide the high-level strategy and context, and people are allowed to fill in the implementation details later.

I wish this happened more often in the development world. We shouldn’t feel like we have to spoon feed everything to dev teams with detailed requirements – what if we just gave them the intent? We could define the operating requirements, the business goals, and allow them to figure out how to solve the problem.

You want to be told why. As developers, a lot of times we’re told “what” but not “why.” That’s what surprised me about the military – there was never a leader I worked with that I couldn’t ask why, in a respectful manner, and be given context. That helps you understand the mission. It always surprised me how open leadership was to questions about orders.

Now I should say – orders aren’t open to question or debate all the time. Sometimes in a crunch we need orders to be followed without question; but that’s actually not the norm, contrary to what most people think.

 

Keys to Success: What separates out the successful orgs? I find three traits winning organizations have in common:

  • Bold leadership that’s willing to take risks
  • A culture of agility and learning
  • Starting with a small success story

DevOps culture changes obviously come easier with smaller companies; in larger orgs you have to find a pocket where it’s okay to experiment or where a bold leader can nurture and shelter an effort. Once you get to that point where you can start telling stories – we hit this obstacle, and here we hit some snags, but look at these results – that’s where you start to see culture change. You can’t just come in the door and say “We’re going to take risks and become a learning org!”, because you haven’t proven yourself yet. I’m always looking for that right kind of leadership protection, a willingness to experiment, and a group that wants to learn and try something different. That’s your beachhead!

A Single Mission: One of the key factors I see in organizations with successful DevOps transformations is having a legitimate set of shared measures, a shared mission. In the USMC, we have a standard mission – to make Marines, and to win battles. That’s the single mission, and if something in the orders doesn’t relate to that directly – we throw it out. In the software world, it’s not that simple. Every product has a vision, every company has a mission statement. But how many can articulate that simply? Netflix does a great job with a shared mission, for example – their shared goals are to retain subscriptions and to increase subscriptions. Whatever you do needs to be aligned against one of those. Can you prove that your project aligns against that? Otherwise you’ll see antipatterns like IT teams saying we have 100% uptime – yeah, that’s great, but you’ve got a crappy product and your customers are unhappy. That’s not product thinking – there’s no clear, common goal that everyone can rally around.

 

Flexibility and Innovation: There are a lot of people out there writing books on Agile, and quite a few are well written. But if you slam one on the table, it’s not going to work like it says in the book. Then what are you going to do? The teams that are successful are the ones that can implement it – or, better yet, the parts of it they think will add value – fail, modify it to their situation, and win anyway. That’s one of the things I love about the way the Agile Manifesto was written: it is principle based. We see a lot of organizations struggle because they bring in some “expert” with a checklist who says, no, you’re not doing Scrum unless you’re doing these things. Well, who cares, as long as you’re delivering awesome products?

As a culture, the USMC takes pride in always being asked to do more with less – to us, that “Adapt, Improvise and Overcome” mantra is very real. I think it comes in part from how we were founded. The Marine Corps has the smallest budget of the branches; there’s not a lot of money flowing through the organization. So that helps us – we realize, no matter what happens, it’s probably not going to work the first time – we’ll adapt and change. Traditionally, in the software world we think that change is bad, a risk we have to limit. Well, change is inevitable; we should expect it – and win even if we have to come up with a new solution on the fly.


DevOps Stories – Sam Guckenheimer of Microsoft

The following content is shared from an interview with Sam Guckenheimer, product owner for Visual Studio Team Services. When people ask us “how Microsoft did it” with our DevOps transformation, we often think of the lessons Sam shared with us during our talk. There is so much to learn from here that can help other companies in making their own journey to better, faster, and safer delivery of value!

These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

 

One thing I want to start with – it really annoys me when I read grandiose claims that DevOps is broken in some way. We know that’s just not the case – Gartner tells us at least half of enterprises have something going on with DevOps and they all want to do more. If you look at Agile, which began with the Agile Manifesto back in 2001 – and compare it with where it was as a movement a decade later in 2011 – well, that would look very much like where we are at today, about 10 years after DevOps first began as a concept back in 2009. The trends are really clear, and our success rate and the maturity of our tools and processes are only going to go up.

Avoid Massive Reorgs: It’s just not true, as some say, that you have to “blow up the organization” to make DevOps work. Change is necessary – you have to get rid of all the handoffs and the waste, and really follow the Lean model by removing the intermediaries between developers and production, and between developers and customers. But that doesn’t mean you need to make drastic moves, and that’s not how we did it at Microsoft. It can be done in an evolutionary way.

Most companies don’t have the luxury of saying, “let’s blow it up” and just jettison decades of code with their legacy applications and start over. That’s your lifeblood! I know that was true with us on the Visual Studio team; we had to go about things in a very gradual way so we didn’t threaten the jugular of our company.

Find Your North Star: Six years ago we found our North Star – how we wanted to go about delivering value using the DevOps mindset – and we pointed to it, saying – “we want to be a world class engineering organization”. Everything we’ve done since then, every major decision we’ve made, has been built around measuring our progress towards that mission.

Jez Humble has joked a few times about some companies trying to “sprinkle magical microservices fairy dust” over things to magically get cloud services architecture. I have to say – there was no fairy dust for us. It required progressive change, some very conscious hard engineering changes, and walking the walk.

Just for example, overhauling our test portfolio and moving to Git took three years. We kept deprecating and replacing older, slow tests with faster ones incrementally – sprint by sprint, test by test. Now it takes us about 7 minutes to run 70K unit tests before a developer commits to master. But the value is incredible for us – before that, we had these long-running integration tests that had never run completely green, that always required manual intervention, and that were killing our release flow.

Everything – our refactoring from monolith to microservices, our safe deployment practices, building a lifecycle culture, even our datacenter standup automation – required a lot of work and a multi-year commitment, persistence despite setbacks. We knew though where the “North Star” was and we were committed. Our approach was – set the goal, measure the progress, and keep going until we get there.

Production Support: Shifting to a production support mindset was a big change, and of course not everyone was onboard, especially at first. We knew that would be our most important and critical win – making sure the delivery teams were onboard and happy with what was going on. We measured this as one of our first KPIs. We would do regular surveys of engineering satisfaction and go into depth about their jobs and how the tooling and the process were supporting them – and what we saw was a steady rise in satisfaction.


Just for example on this, one of the things we measured was alerting frequency – are we getting to the right person the first time? That’s something we are always watching – if you’re waking people up at 2 in the morning, it had better be the right person. We needed to make sure that we are paying attention to the things that matter to people’s lives and their satisfaction with their jobs.

When you’re genuine, you get a genuine response. This all helps build that high-trust culture that Gene Kim and others have emphasized as key.

The concept of servant leadership has been a big part of our change; good managers care about their team and look for ways to make their jobs easier. That’s the Andon cord philosophy – anyone on the floor can pull that cord, stop the line if needed – and the manager comes over, the root cause is identified and rolled into the process so future incidents don’t happen. So in our case – we don’t close out livesite incidents for example until the fix is identified and in the backlog so it won’t happen again.

Setting Goals and Metrics: Our North Star remains fixed, but we are always redefining how we want to get there. Every six months we select, epic by epic, 3 or 4 goals that define success for us over the next six months, and the specific metrics that will measure them. We publish these and they flow all the way up the management chain. Those goals and metrics at the epic level don’t change for those six months. Each person on the feature crews knows which epics they’re working on and can ask each sprint – what are the next few things we need to do to move the needle along these goals? They look ahead about three sprints in terms of what they’re trying to do – no more than that. That level of planning is key for us to make progress in an iterative way and minimize disruption.

In the beginning, we thought it was really not a big deal to figure out the metrics and focus on the right thing and so forth. It turns out that finding the right metrics is as complicated as designing the right feature. It’s really not obvious what to measure or what you’re striving for. Very frequently, you don’t have an out-of-the-box way of doing the telemetry – so you need to instrument for the business APIs you want.

A really clear example on this – one of the metrics that we’re interested in is, how many developers are working on projects that are doing continuous delivery to Azure? That’s a very hard thing to count. You have to make several leaps of instrumentation and joins in order to answer that. Asking the question clearly and getting a way of gathering data on it is a real engineering problem – and one that typically is made to sound much simpler and less of an obstacle than it really is on the web or in books about lean customer analytics.

This goes way, way beyond your standard # of site visitors or simple generic use cases for a website. Until you start getting down to brass tacks and define what the things are that we care about as a business and why – it’s difficult to appreciate how challenging it is to come up with the right measurables.

Value Stream Mapping: I’m going to shock you a little here – we don’t do normal value stream mapping here. My observation is that value stream mapping is really effective when you want to get people on the same page and get some momentum going towards a DevOps movement. Once you show people – wow, it takes us 60 days to get something to production, and most of that is wait time – 5 days for approval here, 7 days for testing here – that’s great to get everyone to see the elephant in the room. It never fails to shock people once they see how huge that bucket of idle time is!
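To make the arithmetic behind that shock concrete, here is a minimal Python sketch of a value stream summary. The stage durations are illustrative (only the 5-day approval wait and 7-day testing wait echo the example above); a real map would use measured times.

```python
# Minimal sketch of the arithmetic behind a value stream map: how much of the
# end-to-end lead time is actual work versus idle waiting. Stage durations are
# illustrative; only the 5-day approval and 7-day testing waits echo the text.

stages = [
    {"name": "development",    "process_days": 10, "wait_days": 0},
    {"name": "approval queue", "process_days": 1,  "wait_days": 5},
    {"name": "test queue",     "process_days": 3,  "wait_days": 7},
    {"name": "release window", "process_days": 1,  "wait_days": 33},
]

process_time = sum(s["process_days"] for s in stages)
wait_time = sum(s["wait_days"] for s in stages)
total = process_time + wait_time

print(f"{total} days end to end; {wait_time} days idle ({wait_time / total:.0%} wait time)")
```

Run against numbers like these, the output is “60 days end to end; 45 days idle (75% wait time)” – the bucket of idle time is usually the headline.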

For us, we’re past that initial shock phase. We focus heavily on all the things that value stream mapping attacks in terms of handoffs, idle time versus process time, etc – but it is definitely not something you need to do on an ongoing basis, in my opinion.

Two Key Antipatterns: I see two key failings that sometimes trip organizations up. First, people often think in terms of formulas – you need to do X with the people, Y with the process, and Z with the tools – and treat each of these as an independent pillar that you can tackle one at a time, in phases. That ends up being counterproductive, making things more complicated and drawn out, because in reality all these things are interrelated and need to be thought of together.

My advice is to fight the tendency to take a single practice, however good, and try to implement it in isolation. Think in terms of all three columns as supporting a single building together; each improvement should touch on people, process, and tools in some way and make it a little better. Focus on the quick wins – try to stairstep your maturity, building something small that quickens that release cycle and delivers feedback faster.

The second antipattern is not getting the right balance of leadership and delegation. You need to have obvious skin in the game from leadership, and initiative from individual practitioners. Think back to that great book “Drive” by Dan Pink, which stressed the motivating power of Autonomy, Mastery, and Purpose. You are going to need to spark people and get them enthusiastic, active, and feeling like they control their destiny – that’s autonomy.

It’s really part art, part science, because that autonomy has to be balanced with purpose, which is driven consistently and forcefully by management. And if you look at most of the current execs at Microsoft, you will see that they practice high empathy and engage deeply on technical matters.

Mission is key for us, but it goes beyond just a few words or a slogan. We put up guardrails – very clear rules of the road that specify “here is what you need to do to check your code into master.” We have a very clear definition of done that is common to every team – “code delivered with tests and telemetry and deployed in production worldwide.”

This is the exact opposite of “it works on my machine” – and everyone knows it. If you’re doing new work, there’s a set of common services we provide, including sample code and documentation. So no one has to reinvent the wheel when it comes to telemetry for example – you might improve on it, but you would never have to deliver this from scratch, it’s reused from a common set of services.

DevOps and burnout – it’s a real thing

Today I woke up to the news that one of my favorite heroes in the food world, Anthony Bourdain, was found dead in his hotel room – an apparent suicide. He leaves behind an 11-year-old daughter, a longtime girlfriend, and anguishing questions that will likely never be resolved for those who love him. It’s the second suicide I’ve heard of this week. It seems like now is a good time to talk about burnout and job stress. If “DevOps is compassion”, as John Willis is fond of saying, we really need to do a better job in our industry of protecting our people from the stress that is claiming so many lives.

Note – these aren’t my words – they come from my cowriter and good friend Knox Lively. The stories he’s telling below are real, and they expose a problem that’s causing a hidden but very real emotional and physical health catastrophe in our field. Please give this some thought, and spend more time with the ones you love.

Burnout in our industry is common, and it often impacts the brightest, most positive contributors to the team. The symptoms include feeling exhausted, cynical, or ineffective; little or no sense of accomplishment in your work; and feelings about your work negatively affecting other aspects of your life. We’ve all seen the impacts on the lives of people around us; broken families, severe depressions, and even suicide.

I think it’s safe to say that most of us have at least one story related to the topic of burnout, illness, or even death as a result of misaligned work ethics and objectives from the individual to the organizational level. I have a couple of personal anecdotes I’d like to share surrounding this topic.

The first involves a brilliant architect at a startup I worked with in Austin, Texas. Some of you may have a similar person in your organization – they are always the last person you call. Whatever the issue, whatever the time, you can count on them to fix the problem, or at least know how to fix it. This particular person was involved in a very tedious, multi-month port from one application server to another – all at the recommendation of a C-level exec who had talked to another C-level exec on a plane who happened to mention that “application server x is n percent faster than application server y”. You’ve heard these kinds of stories too, I presume? Anyway, right in the middle of this port, this person was asked by management not to ride the brand-new Ducati bike he’d recently purchased. For those unfamiliar, Ducati has a reputation for building incredibly powerful performance motorcycles. He was asked this because – no stretch of the imagination needed – if something were to happen to him, g-d forbid, the company would be in dire straits. This was just one of the many ways his job impacted his life in an unhealthy way. He often got to wear the hat of the hero; this, coupled with his uncanny expertise and the fact that the organization had allowed unhealthy work habits to go unchecked, meant he had unknowingly begun to build himself a prison. Neither he nor the organization is at complete fault here, as we are dealing with something even larger – a cultural problem.

The second anecdote I’ll share with you highlights the real dangers of stress, even when it’s handled in accordance with the best current practices. This person had everything a software engineer could ever want. He was highly paid, found great satisfaction in his job, and worked for a company he believed in – the trifecta in terms of a career. In addition to career success, he had a very happy and fulfilled personal life. He had a wife and kids, was an avid lover of the outdoors, ate right, and exercised. He seemed to have all of his bases covered in terms of having a holistic and balanced life, and most would agree. None of this, however, prevented him from collapsing on his mountain bike and dying of a massive heart attack. It came as a shock to the whole company. How could this person have a heart attack? Everyone mentally checked off the boxes they’d read dozens of times for how to have a healthy heart. He met them all. The one thing that wasn’t accounted for was the great deal of stress this person had been under. Remember the earlier anecdote about the architect? This is the same person. He’d become indispensable to the company’s daily operations, which meant 60+ hour work weeks for years, because without him things often did not get done. That, on top of the side projects he had worked on over the years, meant he spent most of his time working and had relatively little time to unwind from his stress.


(image credit helpguide.org)

 

In the book we’re writing, we’re proposing DevOps in part as a solution to a life spun out of control. We believe – and experience shows – that moving more work from the “have-to” drudge side of the page to the more creative, automated, sparkly-techy side will, over time, help reduce our stress load and improve our quality of life. But there’s no denying it – sometimes teams get hooked on the endorphins and the rush that come from the long nights getting a release fixed and out the door.

Think of the unforgettable character of Brent in the Phoenix Project – the one person no one could live without, the single, irreplaceable point of failure – and bottleneck – for everything the team did. Brent and the people like him are locked in a state other authors have called “full catastrophe living”, a prison they’ve built themselves by failing to see the need to improve processes and share information with others. In this case, they’re getting a short-term payoff in several areas – the warm glow of feeling respected and irreplaceable, and a certain amount of job security. This heroism comes with a high price though in terms of the team’s overall capacity and ability to learn, and over time inevitably on the health of these enterprise Atlases. Convincing these heroes of the value of DevOps is often a hard and long road; often it takes a top-down commitment to change behavior to a more sustainable pattern to overcome resistance from this point of view.

This topic of burnout and stress in the tech industry is worthy of a whole other book, one whose surface we can barely begin to scratch here. In terms of reducing stress on a personal level, here are just a few key tips to begin your journey to a healthier work/life balance.

The first is to talk – whether to your partner, your boss, or a therapist. Don’t assume you’re the only one under such pressure, or that to talk about it is weak. We often assume that others are going through the same thing, so we should just “suck it up”. The truth is, no one knows what’s on your mind unless you tell them. Don’t let pride take you down a one-way street to loneliness and depression. Did you know loneliness, isolation, and depression are bigger killers than obesity? (Malito, 2017). This factor alone greatly increases your risk of premature death – a pretty sobering fact if you think about it.

Second, create barriers between your personal and work life. This can come in various forms, such as leaving your laptop at work on the weekend, or after every workday if you can manage it. If you must have your devices with you at all times, try not checking work email while at home – or, at a minimum, not checking it within an hour of going to bed or within an hour of waking in the morning. These are just a few of the many ways to decouple your work and personal life.

Third, establish barriers within your organization. For most of my career I’ve been at the mercy of what I call a “push” workload, meaning work has always come to me, whether I want it to or not. That’s expensive on many fronts: it causes task switching, induces stress, and is inefficient by design – or lack thereof. At the heart of DevOps is the idea of designing inefficient workflows out of your organization. The same can be done on a personal level. Find small ways to create tension – the good kind – between you and those who send work your way. Establish protocols and chains of command for who gets pinged, and when, for certain tasks. In the same vein, improve and use internal documentation tools to empower others to solve problems for themselves. These are simply a few of the many ways to “shield” yourself so that you can stay focused and knock out more work with less effort.

For more on burnout and the impacts of stress on our environment, please see the following.


 

DevOps Stories – Jon Cwiak, Humana

The following content is shared from an interview with Jon Cwiak, Enterprise Cloud Platform Architect at Humana. What we loved about talking with Jon was his candor – he’s very honest and upfront that the story of Humana’s adoption of DevOps has not always been smooth, and about the struggles and challenges they’re still facing. Along the way we learned some eye-opening insights:

  • Having a DevOps team isn’t necessarily a bad thing
  • How can you break down walls and change very traditional mindsets or siloed groups?
  • How two metrics alone can tell you about your organization’s health
  • The Humana story as a practical roadmap, from version control to config management to feature toggles and microservices
  • The power of laziness as a positive career trait!

We loved our talk with Jon and wanted to share his thoughts with the community. Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

My name is Jon Cwiak – I’m an enterprise software architect on our enterprise DevOps enablement team at Humana, a large health insurance company based out of Kentucky. We are in the midst of a transition from the traditional insurance business into what amounts to a software company specializing in wellness and population health.

Our main function is to promote the right practices among our engineering teams. So I spend a big part of each week reinforcing to groups the need for hygiene – that old cliché about going slow to go fast. Things like branching strategy, version control, configuration management, dependency management – those things aren’t sexy but we’ve got to get it right.

Some of our teams though have been doing work in a particular way for 15 years; it’s extraordinarily hard to change these indoctrinated patterns. What we are finding is, we succeed if we show we are adding value. Even with these long-standing teams, once they see how a stable release pipeline can eliminate so much repetitive work from their lives, we begin to make some progress.

We are a little different in that there was no trumpet call of “doing DevOps” from on high – instead it was crowdsourced. Over the past 5 years, different teams in the org have independently found a need to deliver products and services to the org at a faster cadence. It’s been said that software is about two things – building the right thing and building the thing right. My group’s mission is all about that second part – we provide the framework, all the tools, platforms, architectural patterns and guidance on how to deliver cheaper, faster, smarter.

The big picture that’s changed for us as a company is the realization that the big-bang, waterfall approach – shipping everything in mega-events nine months or more apart – just doesn’t cut it anymore. We used to do those vast releases – a huge flow of bits like water; we called it a tsunami release. Well, just like with a real tsunami, there’s a wave of devastation after delivering these large platforms all at once that can take months of cleanup. We’ve changed from tsunami thinking to ripples, with much faster, more frequent releases.

When the team first started up in 2012, the first thing we noticed was that everything was manual. And I mean everything – change requests, integration activity, testing. There were lots of handoffs, lots of Conway’s law at work.

So we started with the basics. For us, that was getting version control right – starting with basic hygiene practices, doing things in ways that decouple you from the way the software is being delivered. Just as an example, we used to label releases based on the year, quarter, and month a release was targeted for. So if suddenly a feature wasn’t needed – just complete integration hell. Lots of merges, lots of drama as we were backing things out. So we moved toward semantic versioning, where products are versioned regardless of when they’re delivered. Since this involved dozens of products and a lot of reorganization, getting version control right took the better part of six months for us. But it absolutely was the ground level for us being able to go fast.
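As a rough illustration of the difference Jon describes, here is a minimal Python sketch comparing date-targeted labels with semantic versions; the version numbers are hypothetical, not Humana’s actual scheme.

```python
# Minimal sketch: date-targeted release labels vs. semantic versions.
# The version numbers below are hypothetical, not Humana's actual scheme.

def parse_semver(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple that sorts correctly."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

# A date-based label like "2018-Q2-May" only says when a change was *targeted*;
# a semantic version says what changed, and sorts correctly no matter when
# (or whether) the release actually ships.
releases = ["1.4.0", "1.10.2", "2.0.0", "1.4.1"]
print(sorted(releases, key=parse_semver))   # ['1.4.0', '1.4.1', '1.10.2', '2.0.0']
```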

Next up was fixing the way the devs worked. We had absolutely no confidence in the build process because it was xcopy manual deployments – so there was no visibility, no accountability, and no traceability. This worked great for the developers, but was terrible for everyone else struggling with “it works on my machine!” So continuous integration was the next rung on the ladder, and we started with a real enterprise build server. Getting to a common build system was enormously painful for us; don’t kid yourself that it’s easy. It exposed, application by application, all the gaps in our version control – a lot of hidden work we had to race to keep ahead of. But once the smoke cleared, we’d eliminated an entire category of work. Now version control was the source of truth, and the build server artifacts were reliable and complete. Finally we had a repeatable build system that we could trust.

The third rung of the ladder was configuration management. It took some bold steps to get our infrastructure under control. Each application had its own unique and beautiful configuration, and no two environments were alike – dev, QA, test, production were all different. Trying to figure out where these artifacts were and what the proper source of truth was required a lot of weekends playing “Where’s Waldo”! Introducing practices like configuration transforms gave us confidence that we could deploy repeatedly and get the same behavior, and it really helped us enforce some consistency. The movement toward a standardized infrastructure – no snowflakes, everything the same, infrastructure as code – has been a key enabler for fighting the config drift monster.
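The idea behind a configuration transform can be sketched in a few lines: one shared base configuration, with each environment overriding only what must differ. This is a simplified Python analogy with made-up keys, not Humana’s actual tooling.

```python
# Minimal sketch of a configuration "transform": one shared base config, with
# small per-environment overrides so dev/QA/prod differ only where they must.
# Keys and values are hypothetical, not Humana's settings.

BASE = {
    "connection_timeout_seconds": 30,
    "feature_cache_enabled": True,
    "database_host": "localhost",
}

OVERRIDES = {
    "dev":  {"database_host": "dev-db.internal"},
    "qa":   {"database_host": "qa-db.internal"},
    "prod": {"database_host": "prod-db.internal", "connection_timeout_seconds": 10},
}

def build_config(environment):
    """Apply an environment's overrides on top of the shared base config."""
    config = dict(BASE)
    config.update(OVERRIDES.get(environment, {}))
    return config

print(build_config("prod"))  # same shape everywhere, only the deltas change
```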

The data layer has been one of the later pieces to the puzzle for us. With our move to the cloud, we can’t wait for the thumb of approval from a DBA working apart from the team. So teams are putting their database under version control, building and generating deployable packages through DACPACs or ReadyRoll, and the data layer just becomes another part of the release pipeline. I think over time that traditional role of the DBA will change and we’ll see each team having a data steward and possibly a database developer; it’s still a specialized need and we need to know when a data type change will cause performance issues for example, but the skillset itself will get federated out.

Using feature toggles changes the way we view change management. We’ve always viewed delivery as the release of something. Now we can say that the deployment and the release are two different activities. Just because I deploy something doesn’t mean it has to be turned on. We used to view releases as a change, which meant we needed to manage them as a risk. Feature toggles flip this on its head: deployments can happen early and often, and releases can happen at a different cadence that we control, safely. What a game-changer that is!
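A minimal sketch of the deploy-versus-release split a feature toggle gives you; the flag name and code paths here are hypothetical.

```python
# Minimal sketch of a feature toggle: the new code path is deployed,
# but it is only *released* when the flag is flipped. The flag name and
# functions are hypothetical.

FLAGS = {"new_checkout_flow": False}   # deployed dark; flip to True to release

def new_checkout(cart):
    return f"new flow: {len(cart)} items"

def legacy_checkout(cart):
    return f"legacy flow: {len(cart)} items"

def checkout(cart):
    # Deployment put new_checkout() in production; the toggle decides
    # whether any user actually sees it.
    if FLAGS.get("new_checkout_flow", False):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book", "pen"]))   # legacy flow until the toggle is turned on
```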

COTS products and DevOps totally go together. Think about it from an ERP perspective – where you need to deliver customizations to an ERP system, or Salesforce.com, or whatever BI platform you’re using. The problem is, these systems weren’t designed in most cases to be delivered in an agile fashion. These are all big-bang releases, with lots of drama, where any kind of meaningful customization is near taboo because it’ll break your next release. To bridge this gap, we tell people not to change but to add – add your capabilities and customizations as a service, and then invoke it through a middleware platform. So you don’t change something that exists; you add new capabilities and point to them.

Gartner’s concept of bimodal IT I struggle with, quite frankly. It’s true you can’t have a one size fits all risk management strategy – you don’t want a lightweight website going through the long review period you might need with a legacy mainframe system of record for example. But the whole concept that you have this bifurcated path of one team moving at this fast pace, and another core system at this glacial pace – that’s just a copout I think, an excuse to avoid the modern expectations of platform delivery.

We do struggle with long-lived feature branches. It’s a recurring pain point for us – we call it the integration credit card that teams charge to, and it inevitably leads to drama at release time and some really long weekends. In a lot of cases the team knows this is bad practice and definitely wants to avoid it, but because of cross dependencies we end up with these long-lived branches. The other issue is contention, which is usually an architecture issue. We’re moving towards a one-repo, one-build-pipeline model and decomposing software down to its constituent parts to try to reduce this, but decoupling these artifacts is not an overnight kind of thing.

The big blocker for most organizations seems to be testing. Developers want to move at speed, but the way we test – usually manually – and our lack of investment in automated unit tests create these long test cycles, which in turn spawn these long-lived release branches. The obvious antidote is feature toggles to decouple deployment from delivery.

I gave a talk a few years back called “King Tut Testing” where we used Mike Cohn’s testing pyramid to talk about where we should be investing in our testing. We are still in the process of inverting that pyramid – moving away from integration testing, lessening functional testing, and fattening up that unit testing layer. A big part of the journey for us is designing architectures so that they are inherently testable, mockable. I’m more interested in test-driven design than I am in test-driven development personally, because it forces me to think in terms of – how am I going to test this? What are my dependencies, and how can I fake or mock them so that the software is verifiable? The carrot I use in talking about this shift and convincing teams to invest in unit testing is: not only is this your safety net, it’s a living, breathing definition of what the software does. So for example, when you get a new person on the team, instead of weeks of manual onboarding, you use the working test harness to introduce them to how the software behaves and give them a comfort level in making modifications safely.
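A small sketch of what designing for testability can look like in practice, using Python’s built-in unittest.mock: the dependency is injected, so a unit test can substitute a fake instead of calling a real service. The names are illustrative, not from Humana’s codebase.

```python
# Small sketch of designing for testability: the rate lookup is injected, so a
# unit test can substitute a fake instead of calling a real service.
# Names are illustrative, not from Humana's codebase.

import unittest
from unittest.mock import Mock

def quote_annual_premium(member_age, rate_service):
    """Business logic under test; rate_service is an injected dependency."""
    monthly_rate = rate_service.base_rate_for(member_age)
    return round(monthly_rate * 12, 2)

class QuotePremiumTests(unittest.TestCase):
    def test_annual_premium_uses_injected_rate(self):
        fake_rates = Mock()
        fake_rates.base_rate_for.return_value = 100.50
        self.assertEqual(quote_annual_premium(40, fake_rates), 1206.00)
        fake_rates.base_rate_for.assert_called_once_with(40)

if __name__ == "__main__":
    unittest.main()
```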

The books don’t stress enough how difficult this is. There’s just not the ROI to support creating a fully functional set of tests with a brownfield software package in most cases. So you start with asking, where does this hurt most? – using telemetry or tools like SonarQube. And then you invest in slowing down, then stopping the bleeding.

Operations support in many organizations tends to be more about resource utilization and cost accounting – how do I best utilize this support person so he’s 100% busy? And we have ticketing systems that create a constant stream of work and activity for Operations. The problem with this siloed thinking is that the goal is no longer developing the best software possible and providing useful feedback – it’s closing a ticket as fast as possible. We’re shifting that model with our move to microservices, to teams that own the product and are responsible for maintaining and supporting it end to end.

Lots of vendors are trying to sell DevOps In A Box – buy this product, magic will happen. But they don’t like to talk about all the unsexy things that need to be done to make DevOps successful – four years to clean up version control, for example. It’s kind of a land grab right now with tooling – some of those tools are great in unicorn space but don’t work so well with teams that have been using long-lived feature branches.

Every year we do an internal DevOps Day, and that’s been so great for us in spreading enthusiasm. I highly recommend it. The subject of the definition of DevOps inevitably comes up. We like Donovan Brown’s definition and that’s our standard – one of the things I will add is, DevOps is an emergent characteristic. It’s not something you buy, not something you do. It’s something that emerges from a team when you are doing all the right things behind the scenes, and these practices all work together and support each other.

There are lots of metrics to choose from, but two stand out – and they’re not new or shocking: lead time and cycle time. Those two are the standard we always fall back on, and the only way we can tell if we’re really making progress. They won’t tell us where we have constraints, but they do tell us which parts of the org are having problems. We go after those with every fiber of our effort. There are other line-of-sight metrics, but those two are dominant in determining how things are going.
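A minimal sketch of the two measurements, using one common set of definitions (lead time from request to delivery, cycle time from work started to delivery); the timestamps are made up.

```python
# Minimal sketch of the two metrics, under one common set of definitions:
# lead time = request to delivery, cycle time = work started to delivery.
# Timestamps are made-up illustrative data.

from datetime import datetime

work_item = {
    "requested":    datetime(2018, 5, 1),
    "work_started": datetime(2018, 5, 14),
    "delivered":    datetime(2018, 5, 21),
}

lead_time = work_item["delivered"] - work_item["requested"]       # 20 days
cycle_time = work_item["delivered"] - work_item["work_started"]   # 7 days

print(f"lead time: {lead_time.days} days, cycle time: {cycle_time.days} days")
```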

We do value stream analysis and map out our cycle time, our wait time, and our handoffs. It’s an incredibly useful tool in terms of being a bucket of cold water right to the face – it exposes the ridiculous amount of effort being wasted in doing things manually. That exercise has been critical in helping prove why we need to change the way we do things. It’s specific and quantitative – people see the numbers and immediately get why waiting two weeks for someone to push a button is unacceptable. Until they see the numbers, it always seems to be emotional.

A consistent definition of done – well, we’re getting there. Giving people 300-page binders, or a checklist, or templated tasks so developers have to check boxes – we’ve tried them all, and they’re just not sustainable. The model that seems to work is where the team is self-policing, where a continuous review is happening of what other people on the team are doing. That kind of group accountability is so much better than any checklist. You have to be careful though – it’s successful if the culture supports these reviews as a learning opportunity, a public speaking opportunity, a chance to show and tell. In the wrong culture, peer reviews or code demos become a kind of group beat-down where we criticize and nitpick other people’s investment.

A DevOps team isn’t an antipattern like people say. Centralizing the work is not scalable, that is definitely an antipattern. But I love the mission our team has, enabling other groups to go faster. It’s kind of like being a consulting team – architectural guidance and consulting, practices. It’s incredibly rewarding to help foster this growing culture within our company, we are seeing this kind of organic center of excellence spring up.

What I like to tell people is, be like the best developers out there, and be incredibly selfish and lazy. If you’re selfish, you invest in yourself – improving your skillset, in the things that will give you a long-term advantage. If you’re lazy, you don’t want to work harder than you have to. So you automate things to save yourself time. Learning and automation are two very nice side effects of being lazy and selfish, and it’s a great survival trait!

 

DevOps Stories – Aaron Bjork, Microsoft

Many people ask us how Microsoft accomplished our transformation with DevOps. Our interview with Aaron Bjork, Principal Group Program Manager for VSTS (Visual Studio Team Services) at Microsoft, opened up some valuable lessons that could be applied to any large enterprise trying to transform the way they deliver value and get feedback faster. This interview has previously been posted on the Microsoft Premier official blog here.

These and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Contact me if you’d like an advance copy!

 

 

I just want to stress that you can’t follow what we did on the Visual Studio Team Services (VSTS) team like a prescription. There’s not another product in the world like ours; it would be foolish for me to say, you should exactly do it our way.

That being said, I do see some common elements in teams that successfully make the jump in DevOps:

  1. Have a single cadence across all your teams. I haven’t seen a single place yet where that won’t apply. Your teams within that cadence can have significant freedom and autonomy, but we want everyone to be dancing to the same beat.
  2. Ship at the end of each sprint. The saying we live by goes – “You can’t cheat shipping.” If you deliver working software to your users at the end of every iteration, you’ll learn what it takes to do that and which pieces you’ll need to automate. If you don’t ship at the end of each iteration, human nature kicks in and we start to delay, to procrastinate. Shipping at the end of a sprint is comfy and righteous and produces the right behaviors.
  3. We same-size our teams. Every team has a consistent size and shape – about 8-12 people, working across the stack all the way to production support. This helps not just with delivering value faster in incremental sizes, but gives us a common taxonomy so we can work across teams at scale. Whenever we break that rule – teams that are smaller than that, or bloat out to 20 people for example – we start to see anti-patterns crop up; resource horse-trading and things like that. I love the “two pizza rule” at Amazon; there’s no reason not to use that approach, ever.
  4. Have each team own their features as a product. Our teams own their features in production. If you start having siloed support or operations teams running things in production, almost immediately you start to see disruption in continuity and other bad behaviors. It doesn’t motivate people to ship quality and deliver end to end capabilities to users; instead it becomes a “not it” game.

In handling support, each sprint our teams are broken up into an “F” team and an “L” team. The F team is focused on new features; the L team is focused on disruptions and lifecycle. We rotate these people, so every sprint a different pair of engineers handles bugfixes and interruptions while the other ten do new feature work. This helps people schedule their lives when they’re on call.

We’ve gone through a big movement in the past few years where we took our entire test bed – which was largely automated-UI focused, without a lot of unit testing – and flipped it on its head. Now we are running far fewer automated UI tests and a ton of what we call L1 and L2 tests, which are essentially unit tests at the lowest levels checking components and end-to-end capabilities. This allows us to run through our test cycle much faster – like every commit. I think you still have to do some level of acceptance testing; just determine what level works for your software base and helps drive quality.

We started to deploy at the end of every three weeks instead of twice a year. Another thing was, we moved everyone into the same building, reporting up to the same org structure. The folks that run our ops are a part of our leadership team just like our engineering and program management teams – all under the same umbrella. This started getting everyone bought into the shared goals we have. We have monthly business reviews where we talk about more than just the technical goals – financials, operations, bug health, not just code. This helps us align on the same goal, bringing people under the same umbrella so we are invested in the other side, if you will.

Our teams own features in production – we hire engineers who write code, test code, deploy code, and support code. In the end, that’s DevOps. Now our folks have a relationship with the people handling support – they have to. If you start with that setup, the rest falls into place. If you have separate groups, each responsible for a piece of the puzzle – that’s a recipe for not succeeding, in my view.

Branching is similar, in that we don’t have long-lived branches at all. We do have a release branch; our engineers check out their work from mainline, though, and they check their short-lived branches in directly to main. In general, I’d say people are checking their changes into their user branch every day; every other day they submit a pull request to integrate their user branch back to main. The team handles all merge issues internally; everything is validated to work before it’s checked in.

When I think about how we handle releases, a couple things come to mind. First, we want to minimize the time that any code that’s written sits in isolation. We used to have the mindset that at the beginning of each sprint, teams would check their code into a feature branch and then integrate back at the end of the sprint. The problem with this is that the longer you stay away from master, the harder it is to integrate, and you pay a massive tax in merge issues. We want to check into master continuously – that’s a very important construct for us. Second, we wanted to get into the mindset that when a feature is ready, it’s easy to put it into production. Instead of the idea that we will put a new feature into production when it’s 100% ready, move to where features are ALWAYS being put into prod. We were trying to get away from release mechanics being something we were constantly having to manage; I felt it should be a consistent, without-thinking kind of mechanical movement. Now our mechanics are the same whether something is a bug, a critsit incident, or a new feature – and we do it without thinking. Getting to that model and thinking that way required some change – but now we’re always writing code, always deploying code. Feature flags were a big help to us, because we can turn on access to a new feature when we’re ready – it’s safe and controlled.

Pair programming is accepted widely as a best practice; it’s also a culture that shapes how we write code. The interesting thing here is we don’t mandate pair programming. We do teach it; some of our teams have embraced pair programming and it works great for them, always writing in tandem. Other teams have tried it, and it just hasn’t fit. We do enforce consistency on some things across our 40 different teams; others we let the team decide. Pair programming and XP practices are one thing we leave up to the devs; we treat them as adults and don’t shove one way of thinking down their throats.

Another big help to us is a kind of team-of-teams meeting, which we have once every sprint. This is not a “get everybody in the room” type of meeting; it’s very focused – about 4-6 people in the room, each representing their team. We don’t talk about what we’re doing now, but about what we’re working on three sprints ahead. It always amazes me how many “A-Ha!” moments we have during these meetings. It really helps expose points of dependency that we weren’t aware of: “Hmm, we should probably sync up and make sure we have a shared point of view.” In my view this is very agile; it’s lightweight, just enough to accomplish the purpose.

We do track one metric that is very telling – the number of defects a team has. We call this the bug cap. You just take the number of engineers and multiply it by 4 – so if your team has 10 engineers, your bug cap is 40. We operate under a simple rule – if your bug count is above the bug cap, then in the next sprint you need to slow down and pay down that debt. This helps us fight the tendency to let technical debt pile up and become a boat anchor you’re dragging everywhere and having to fight against. With continuous delivery, you just can’t let that debt creep up on you like that. We have no dedicated time to work on debt – but we do monitor the bug cap and let each team manage it as they see best. I check this number all the time, and if we see that number go above the limit, we have a discussion and find out if there’s a valid reason for the debt pileup and what the plan is to remedy it. We don’t allow any team to accrue significant debt; we pay it off like you would a credit card – but instead of making the minimum payment, we’re paying off the majority of the balance every pay period. It’s often not realistic to say “zero bugs” – some defects may just not be that urgent, or shouldn’t come ahead of hot new feature work in priority. This allows us to keep technical debt to a reasonable number and still focus on delivering new capabilities.
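The bug cap rule is simple enough to express directly; here is a minimal sketch with hypothetical team numbers.

```python
# Minimal sketch of the bug cap rule described above: engineers x 4.
# Team sizes and bug counts are hypothetical.

BUG_CAP_MULTIPLIER = 4

def over_bug_cap(engineers, open_bugs):
    """True if the team should slow feature work next sprint and pay down bug debt."""
    return open_bugs > engineers * BUG_CAP_MULTIPLIER

print(over_bug_cap(engineers=10, open_bugs=47))  # True: cap is 40, pay down debt
print(over_bug_cap(engineers=10, open_bugs=32))  # False: under the cap, carry on
```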

We have an engineering scorecard that’s visible to everyone but we’re very careful about what we put on that. Our measurements are very carefully chosen and we don’t give teams 20 things to work on – that’s overwhelming. With every metric that you start to measure, you’re going to get a behavior – and maybe some bad ones you weren’t expecting. We see a lot of companies trying to track and improve everything, which seems to be overburdening teams – no one wants to see a scorecard with 20 red buttons on it!

Agile is a culture more than anything else but – I’m going to be frank – too many people have turned it into a religion, a stone tablet with a bunch of “thou shalts” on it. Some organizations we’ve worked with, for example, bring in multiple rounds of expensive consultants and agile trainers, and they’re given an audit. “Oh, you’re not doing DSUs, your sprint planning meeting doesn’t have the right amount of ceremony, blah blah.” This makes me laugh a little. Do I think daily standups are good practice? Yes, I do. But I’m not going to measure a team’s efficiency by these things. If the team is struggling to produce business value, then we might bring in some of these practices. But it is SO shortsighted to say that if you follow these practices, following this recipe, you’ll be successful. I don’t allow people to start telling me “we need to do things Agile.” There’s just no such thing. Talk to me about what you want to achieve, the business value you want to drive, and that’s our starting point.

Just because you have a DSU doesn’t mean you’re making the right decisions. Just because you’re using containers or have adopted microservices doesn’t mean you’re doing DevOps. Maybe you’re better set up to do Agile or DevOps because of these tools, but nothing has really changed. Agile is very simple and beautiful as a mindset – we are going to deploy as frequently as we can. Too often we turn it into a set of rules you have to follow.

 


(from Aaron’s presentation deck on “Agile at Microsoft”, https://www.youtube.com/watch?v=-LvCJpnNljU )

 

 

Note: if you want more on this story and how Microsoft went about their transformation, check out Aaron’s presentation here – 41 minutes that could very well change your whole view of how to go about your own transformation. It remains one of the best real-world encapsulations of DevOps that I’ve ever seen. Some more DevOps stories from the Visual Studio team are here, including our understanding of what DevOps is and Munil Shah’s excellent thoughts on “shifting Left” with our test infrastructure.