
Defining DevOps Is Impossible…

Defining DevOps – what it is, and whether it should even be defined – has become a surprising controversy over the past five years. The “godfather” of DevOps himself, Patrick Debois, famously resists any kind of formal definition. He thinks it comes down to culture:

The Devops movement is built around a group of people who believe that the application of a combination of appropriate technology and attitude can revolutionize the world of software development and delivery. The demographic seems to be experienced, talented 30-something sysadmin coders with a clear understanding that writing software is about making money and shipping product. More importantly, these people understand the key point – we’re all on the same side! All of us – developers, testers, managers, DBAs, network technicians, and sysadmins – are all trying to achieve the same thing: the delivery of great quality, reliable software that delivers business benefit to those who commissioned it. [debois]

This is great, but it’s hardly definitive. Look at the Agile Manifesto, for example. It gave us a definition of what Agile is (or more correctly, how it behaves) and the guiding principles behind it. Most of those principles have stood the test of time; more importantly, it’s a firm stake in the ground. We learn as much from the holes and the understressed points as we do from the things that have stuck over time. The Agile Manifesto and its underlying principles have been one of the most impactful and successful sets of concepts to hit organizations in the 21st century. DevOps is very much an extension of Agile; it’s incomprehensible to us that we should deviate from this successful model and pretend that a DevOps Manifesto is impossible or too formulaic.

For us personally, an exact definition of DevOps – in terms of what it is – has proven elusive. Not for lack of trying; many very experienced and brilliant people have taken stabs at it over the years. One of our most prominent thought leaders, Gene Kim, has defined it in the past as:

The emerging professional movement that advocates a collaborative working relationship between Development and IT Operations, resulting in the fast flow of planned work (i.e., high deploy rates), while simultaneously increasing the reliability, stability, resilience and security of the production environment. [kim3]

… a very good definition; it captures the elements of partnership between Dev and Ops, starts with people, and ends with the results – a fast flow of work and increased quality and stability.

 
 

Some Other DevOps Definitions

Wikipedia offers this definition:

DevOps (a clipped compound of “development” and “operations”) is a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops). The main characteristic of the DevOps movement is to strongly advocate automation and monitoring at all steps of software construction, from integration, testing, releasing to deployment and infrastructure management. DevOps aims at shorter development cycles, increased deployment frequency, more dependable releases, in close alignment with business objectives.

In “DevOps: A Software Architect’s Perspective” the authors define DevOps as:

DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality.

Gartner has offered this definition:

DevOps represents a change in IT culture, focusing on rapid IT service delivery through the adoption of agile, lean practices in the context of a system-oriented approach. DevOps emphasizes people (and culture), and seeks to improve collaboration between operations and development teams. DevOps implementations utilize technology — especially automation tools that can leverage an increasingly programmable and dynamic infrastructure from a life cycle perspective. [gartner]

From Damon Edwards:

DevOps is… an umbrella concept that refers to anything that smooths out the interaction between development and operations. [damon]

The Agile Admin:

DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support. DevOps is also characterized by operations staff making use of many of the same techniques as developers for their systems work. [agileadmin]

And from Rob England, aka the IT Skeptic – who we interviewed earlier in this book:

DevOps is agile IT delivery, a holistic system across all the value flow from business need to live software. DevOps is a philosophy, not a method, or framework, or body of knowledge, or *shudder* vendor’s tool. DevOps is the philosophy of unifying Development and Operations at the culture, system, practice, and tool levels, to achieve accelerated and more frequent delivery of value to the customer, by improving quality in order to increase velocity. [itskeptic]

 
 

Or Ken Mugrage of ThoughtWorks:

“DevOps: A culture where people, regardless of title or background, work together to imagine, develop, deploy and operate a system.” [mug2]

 
 

Putting all these definitions together, we’re starting to see a common thread around how important it is to prepare the ground and consider the role of culture. Perhaps Adam Jacob comes closest to our own position, saying that the exact definition may be best expressed through behavior:

“DevOps is a cultural and professional movement. The best way to describe devops is in terms of patterns and anti-patterns.”

… which is exactly what we’ve tried to do in our upcoming book.

 
 

 
 

Other definitions can be found in the sidebar “Some Other DevOps Definitions”. Suffice to say, there’s lots to choose from, and we won’t tell you what’s best. We can tell you our favorite though – from Donovan Brown:

“DevOps is the union of people, process, and products to enable continuous delivery of value to our end users.” [donovan]

This is exactly the right order of things and there’s not a wasted word; we can’t improve on it. This is what’s used at Microsoft as the single written definition, as it reflects what we want and value out of DevOps. Having that single definition of truth published and visible helps keep everyone on the same page and thinking more holistically.

It seems unlikely that community consensus on a single unified definition of DevOps will ever happen. The purist, engineer part of us hates this; but as time went on we realized from our research and interviews that this apparent gap was ultimately not important, and in fact was beneficial. At one conference, we remember the speakers asking the crowd how many of them were ‘doing Agile’ – about 300 hands went up, the entire audience. Then the speakers asked, a little condescendingly, “OK, now which of you are doing it right?” – and three people kept their hands up, who were then ridiculed for being bald-faced liars!

At the time we remember feeling a little shocked that so few were adhering to the stone tablets brought down from the mountain by Ken Schwaber and company. Now, we realize how shortsighted and rigid that point of view was. Agile should never have been thought of as a list of checkmarks or a single foolproof recipe. It’s likely that most of those 300 people in the audience were better off for adopting some parts of Scrum or Agile – transparency and smaller batches of work – and were building on that success. That’s far more important than ‘doing it right’.

The same holds true with DevOps and the principles behind continuous delivery. No single definition of ‘doing DevOps right’ exists, and it likely never will. What we realized in gathering information for this book was that this gap is fundamentally not important. A global definition of DevOps isn’t possible or helpful; your definition of DevOps, however, is VERY important. Put some thought into what DevOps means in specific terms for your specific situation; define it and make it explicit. Having that discussion as a group and coming up with your own definition – or piggybacking on one of the above thoughts – is time well worth spending. Over time you’ll find that the exact “what” shrinks as you focus more on the “why” and the “how” of continually improving your development processes to drive more business value and better-quality feedback.

 
 

…But We Do Know How It Behaves

A manifesto is a public declaration of policy or intent; here we’re standing on the shoulders of the giants who came up with the brilliant Agile Manifesto – easily the most groundbreaking and impactful set of principles in the software development field in the past thirty years.

Since the Agile Manifesto was written in 2001, we’ve learned some fault points and pitfalls with Agile implementations and its guiding principles[1]:

Strengths:

  • Processes and tools placed in second position to the makeup of the team, including direct communications and self-organization
  • Priorities are set by the business with regular checkpoints
  • Excessive documentation and lengthy requirements gathering shelved in favor of responding to change
  • Sprint retrospectives, daily scrums, and other artifacts showing accountability and transparency
  • Time-boxed development periods followed by releases – the shorter the better (1-4 weeks)
  • High-trust teams (“give motivated individuals the environment and support they need, and trust them to get the job done”)
  • Simplicity – the art of maximizing the amount of work not done – addresses the recurring shame of the software industry: most features as delivered are not used or do not meet business requirements
  • Reflection (tuning and adjusting) as the key to building a learning, iteratively improving culture

Weaknesses:

  • Continuous delivery is (rightly) stressed; most development shops ignore this with complex branching structures and rigid gates, causing long integration periods and infrequent releases
  • “Deliver working software frequently” is sabotaged by gaps in QA/test coverage, a lack of automation and maturity in Operations, and siloed traditional org structures
  • Exploding technical debt, caused by ignoring the principle of “continuous attention to technical excellence and good design” in favor of velocity
  • Agile practices work best with small units; they don’t address epic/strategic-level planning (beyond “responding to change”) or how to scale effectively in large organizations (SAFe is making great headway in addressing this)

 
 

For being (as of this writing) nearly twenty years old, this set of principles has weathered amazingly well. Most of the problems we’ve seen to date have been caused by misapplications, not by flaws in the thinking of the original architects.

You’ll notice though that we didn’t stop there. However far-seeing and visionary the original signers of the Agile Manifesto were, some gaps have been exposed over the past ten years that need to be addressed. For starters, the Agile Manifesto favored individual interactions over processes and tools (ironic, since to many “Agile” has become synonymous with a tool and a process – version control, daily scrums, and retrospectives!). Agile was wildly successful in creating tightly focused development teams with a good level of trust-based interactions; but the pendulum may have swung too far on the fourth value, “responding to change over following a plan”. Companies have had varied success in scaling Agile beyond small working teams; we’ve seen heinous practices like 20-person drum circles and endless daily scrums combined with a complete lack of strategy – doing the wrong thing sprint by sprint with massive amounts of thrash. It is completely possible to execute Agile with a strategic vision and a good level of planning; this is covered in more depth in a previous chapter.

This first point was more a flaw in application by companies that misunderstood (or took too far) the Agile principles. The second, more dangerous flaw in the manifesto, however, lay in what was mentioned only once and is most often overlooked – quality. Consistently, across almost all Agile implementations, we see teams struggling with the outcome of a tight focus on velocity, in the form of managing technical debt. Some have even proposed adding quality as a fourth point to the Project Management Triangle of functionality, time, and resources – something we and several others disagree with. (It tends to muddy the waters and imply that quality is a negotiating point with project managers that can be adjusted; successful teams, from the days of Lean Manufacturing on, build quality into the process as early as possible in the pipeline as their way of doing business; it’s built in as a design factor to all project plans and time estimates.) Scrapping excessive documentation and over-specced requirements was a masterstroke; but as we have seen, too many orgs have misinterpreted this as meaning “no documentation”, shortchanged QA teams and testing during project crunches, and left their software in a nonworking state for much of the development process.

The third flaw, that tight myopic focus on the development of code, is what DevOps is designed to resolve – which is why DevOps has been called “the second decade of Agile”. We’ve discussed this at length earlier, but we’ll say it again – if it isn’t running, in production, with feedback to the delivery team – it doesn’t count. Agile was meant to deliver working software to production, where continual engagement with stakeholders/product owners would fine-tune features and reduce waste due to misunderstandings. Yet it addressed only software development teams, not the critical last mile of the marathon where software releases are tested, delivered to production, and monitored.

And so, in an attempt to resolve this problem of the “last mile”, along comes DevOps, sprinting to the forefront about ten years after the Agile Manifesto was written. (What will we be calling this in ten more years, I wonder?) While the exact definition of DevOps remains in flux – and likely will remain so for some time – there’s a very clear vision of the evils DevOps is attempting to resolve. Stephen Nelson-Smith put it very frankly:

“Let’s face it – where we are right now sucks. In the IT industry, or perhaps to be more specific, in the software industry, particularly in the web-enabled sphere, there’s a tacit assumption that projects will run late, and when they’re delivered (if they’re ever delivered), they will underperform, and not deliver well against investment. It’s a wonder any of us have a job at all!”

Stephen went on to isolate four problems that DevOps is attempting to solve:

  1. Fear of change (due to a well-founded belief that the platform/software is brittle and vulnerable; mitigated by bureaucratic change-management systems with the evil side effect of lengthy cycle times and fix resolution times)
  2. Risky deployments (Will it work under load? So, we push it out at a quiet time and wait to see if it falls over)
  3. It works on my machine! (the standard dev retort once sysadmins report an issue, after a very cursory investigation) – this is really masking an issue with configuration management
  4. Siloization – the project team is split into devs, testers, release managers and sysadmins. This is tremendously wasteful as a process as work flows in bits and dribbles (with wait times) between each silo; it leads to a “lob it over the wall” philosophy as problems/blame are passed around between “team” members that aren’t even working in the same location. This “us versus them” mentality that often results leads to groups of people who are simultaneously suspicious of and afraid of each other.

 
 

These four problems seem consistent and hit the mark of what the DevOps movement – however we define it – is trying to solve. DevOps is all about punching through barriers: large, manual deployments that break; firefighting bugs that appear in production due to a messy or inefficient testing suite, mismanaged configurations, and ad hoc patches; and long wait times between different silos in a shared-services org.

So, the problem set is defined. Are there common binding principles we can point to that could be as useful as the Agile Manifesto was back in the 2000’s?

 
 

The Tolstoy Principle

We keep circling back to the famous opening lines Tolstoy wrote for his masterpiece “Anna Karenina”:

“All happy families are alike; every unhappy family is unhappy in its own way.”

It’s a mistake to be overly prescriptive and recipe-driven with either Agile or DevOps – but it would be even worse to repeat the “scrumterfall” antipatterns we’ve seen and throw the last ten years of hard-won lessons and principles out the window because “our company is unique and special / our business won’t allow this”. Tolstoy noted a fact that applies to organizations as well as families: happy families tend to (even unconsciously) share certain common patterns and elements, have well defined roles, and follow a structure that creates the environment for success. Unhappy families tend to have a lot of variance, little discipline (or too much), great inconsistency in how rules are followed, and no introspection or learning so that things iteratively improve.

Building on this definition, and thinking back to the Tolstoy Principle (all happy families are alike), we believe there are some common traits found in happy, successful DevOps families:

  • Fast release cycles and continuous delivery
  • Cross functional small teams responsible for product end-to-end
  • Continual learning and customer engagement
  • Discipline and a high level of automation

 
 

We could be grandiose and call this the DevOps Manifesto – but of course that’s neither possible nor really necessary. Let’s just call this what it is – an observation of four key principles you’ll want to include in your vision. This attempts to define DevOps by how it behaves versus a prescriptive process, and we believe it builds on the foundation laid by the simple, neat definition of DevOps we lean towards: “the union of people, process, and products to enable continuous delivery of value to our end users”.


There’s an abundance of literature and material produced on DevOps and how it has addressed the three gaps above; for us that begins with the inspirational work produced by the “Big Three” of Jez Humble, Gene Kim, and Martin Fowler. Sifting through this mountain of research, we’re humbled by the quality of thought and vast amount of heroic effort that went into completing our Agile journey and eliminating waste in delivering business value faster across our industry. We also believe each presents different facets (or from different points of view) of the four core qualities or principles covered with our DevOps Manifesto above. All happy DevOps families truly are alike.

Let’s break each principle down in more detail: 

Fast release cycles and continuous delivery

This is the one KPI that we feel is consistent and tells the most about the true health of a software delivery lifecycle: how long does it take for you to release a minor change to production? This tells you your cycle time; it’s not uncommon for customer requests to be tabled for months or years as development and IT teams are buried in firefighting or a lengthy list of aging stories.

A second question is, How many releases do you deliver, on average, per month? Increasing the frequency of your production releases is the best indicator we know of that a DevOps effort is actually gaining traction. In fact, if this is the only outcome of your DevOps adoption program – that release times are reduced by 50% or more – it’s likely that you can consider your effort an unqualified success.
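
To make these two indicators concrete, here’s a minimal sketch (in Python, with made-up field names and sample data – adapt it to whatever your build or ticketing system actually records) that computes median cycle time and release frequency from a list of deployment records:

```python
from datetime import datetime
from statistics import median

# Each record: when a change was committed and when it reached production.
# Field names and sample data are hypothetical.
deployments = [
    {"committed": datetime(2019, 3, 1, 9, 30), "released": datetime(2019, 3, 4, 15, 0)},
    {"committed": datetime(2019, 3, 5, 11, 0), "released": datetime(2019, 3, 6, 10, 0)},
    {"committed": datetime(2019, 3, 12, 14, 0), "released": datetime(2019, 3, 20, 9, 0)},
]

# Cycle time: how long a minor change takes to reach production.
cycle_times = [d["released"] - d["committed"] for d in deployments]
median_cycle_days = median(ct.total_seconds() for ct in cycle_times) / 86400

# Release frequency: deployments per month over the observed window.
window_days = (max(d["released"] for d in deployments)
               - min(d["released"] for d in deployments)).days or 1
releases_per_month = len(deployments) / (window_days / 30.0)

print(f"Median cycle time: {median_cycle_days:.1f} days")
print(f"Release frequency: {releases_per_month:.1f} per month")
```

Tracked week over week, those two numbers tell you more about the health of your delivery pipeline than almost anything else we know of.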

If you have a fast cycle time, you are living the spirit of DevOps – your teams are delivering software at a fast clip and your customers do not have to wait for unacceptable lengths of time for new value to appear, be tested, and iteratively improved (or discarded if the new feature is not successful).

If you have frequent releases to production, your releases are small and incremental. This means you can quickly isolate and resolve problems without wading through tens of thousands of bundled changes; and because the team has practiced releases thousands of times, including rollbacks, everyone is comfortable with the release cycle, and problems with code, integration, and threats to your jugular – the release pipeline itself – will be fixed quickly. The old antipattern of the “war room” release, with late hours frantically fixing bugs and issuing emergency hotfixes, will become a thing of the past.

This goes without saying, but just to be clear: by “fast release cycle” we mean fast all the way to production. That’s the finish line. A fast release cycle to QA – where the code will sit aging on a shelf for weeks or months – gains us nothing. And by “continuous delivery” we mean “no long-running branches outside of mainline”. In the age of Git and distributed development there’s room for flexibility here, but one fact has remained constant since Jez Humble’s definitive book on the subject: lengthy integration periods are both evil and avoidable. It’s perfectly acceptable, and perhaps necessary, to have a release branch so issues can be reproduced; long-running feature branches almost inevitably cause much more pain than they’re trying to solve. This we will also explore later; suffice to say, we haven’t yet encountered an application that couldn’t be delivered continuously, with a significant amount of automation, direct from mainline. Your developers should be checking into mainline frequently – multiple times a day – and your testing ecosystem should be robust enough to support that.

Teams that ignore this and build their release pipeline with complex branching strategies end up incurring the wrath of the Integration Gods, whose revenge is terrible; they are afflicted with lengthy and disruptive stabilization periods where the software is left in a nonworking or unreleasable state for long periods of time as long-lived branches are painfully merged with main. The Agile Manifesto focused on delivering working software quickly in contrast to lengthy documentation; the DevOps Manifesto extends this by calling on software delivery teams to deliver that working software, to production, continuously – from mainline.
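
One lightweight way to keep yourselves honest here – a rough sketch only, assuming a two-day threshold and the usual ‘origin’ remote – is to scan for remote branches whose tips haven’t moved recently and flag them for merging or deletion:

```python
import subprocess
from datetime import datetime, timezone

# Rough guardrail: flag remote branches whose last commit is older than a threshold.
# The two-day limit and the 'origin' remote are assumptions - tune them for your team.
MAX_AGE_DAYS = 2

refs = subprocess.run(
    ["git", "for-each-ref",
     "--format=%(refname:short) %(committerdate:iso8601)",
     "refs/remotes/origin"],
    capture_output=True, text=True, check=True).stdout

now = datetime.now(timezone.utc)
for line in refs.strip().splitlines():
    ref, date_str = line.split(" ", 1)
    last_commit = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S %z")
    age_days = (now - last_commit).days
    if age_days > MAX_AGE_DAYS and not ref.endswith(("main", "master", "HEAD")):
        print(f"{ref}: last commit {age_days} days ago - merge it or delete it")
```

It’s a blunt instrument – committer date isn’t quite the same thing as branch age – but it surfaces the long-lived branches that tend to turn into painful merges.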

We did mention multi-month milestones as an antipattern; this ties in with our Agile DNA of favoring responding to change and ongoing customer collaboration over following lengthy waterfall-type delivery plans and hundred-page requirements documentation that ages faster than milk. Still, it’s foolish for us to throw planning out the window and pretend that we are only living in the moment; software is developed tactically but should always adhere to a strategic plan that is flexible but makes sure we are hitting the target versus reactively shifting priorities sprint to sprint. We’ll cover the planning aspects more in a later chapter.

By “fast release cycles” we are very careful not to define what that means for you, exactly. Does it really matter to your customers or business if you can boast about releasing 1,000 or 10,000 releases a day? Of course not; a count of release frequency is a terrible goal by itself, and has nothing to do with DevOps. But as an indicator, it’s a great litmus test – are our environments and release process stable enough to handle more frequent releases? Teams that are on the right track in improving their maturity level usually show it by a slow and steady increase in their release frequency. We’ll point you to Rob England’s story earlier in this book of his public sector customer, whose CIO made an increased rate of release – say every 6 weeks instead of 6 months – a singular goal for their delivery teams. A steadier cadence meant pain, which in turn forced improvement. This worked for them because in their case deployments were their pain point – as Donovan Brown is fond of saying, “If it hurts, do it more often!”

 
 

Cross functional small teams

As Amazon CTO Werner Vogels says: “you build it, you run it”.

We’ll get more into team dynamics later. Suffice to say that over the past twenty years the ideal team size has been remarkably well defined – anywhere from 8 to 12 people. Fewer, and the team is often too small to handle the end-to-end support it’s going to be asked to do; more, and team efficiency and nimbleness drop dramatically. Jeff Bezos of Amazon is famous for quipping, “Communication is terrible!” – in the sense that too much time is wasted in large teams. The “two pizza” rule that began at Amazon – where if a team grows larger than can be fed with two pizzas, it’s broken up – has been applied in many medium and large-sized organizations with close to universal success.

The sticking point here for most organizations is the implications of “cross functional”. Software development teams are offshoots of corporations, after all; corporations and large industries were born from the Industrial Era. The DNA we have inherited from that time of mass production, experimental science, and creative innovation worked very well for its time – including grouping specialists together in functional groups. In software development teams, however, that classical organizational structure works against us – each handoff from group to group assigned to a particular task lengthens the release cycle and strangles feedback. Again, we’ll cover this in greater detail later in the book – suffice to say, there is no substitute or workable alternative we know of to having a team responsible for its work in production. Efforts to form “virtual teams” of DBAs, architects, testers, IT/Ops, and developers resolve some problems around communication, but the fact that each member has a different direct report or manager – and often different marching orders – plants the seeds of failure from the get-go.

We’re well aware that asking companies to change their structure wholesale from functional groups to a vertical structure supporting an app or service end-to-end is a mammoth undertaking. Some companies have made the painful but necessary leap in a mass effort – Amazon, Netflix, and Microsoft included. If your organization has massive problems – we’re talking an existential threat to survival, the kind that ensures enthusiastic buy-in from the topmost levels – and a strong, capable army of resources, this wartime-type approach may deliver for you. (See the Jody Mulkey interview in this book for a discussion on how this kind of a pivot can be structured and driven.) But a word of caution – speak to the survivors of these kinds of massive, wrenching transformations, and they’ll often mention the bumps and bruises they suffered that in retrospect could have been avoided. In most enterprises the successful approach is the slow and gradual one. More on this in a later chapter.

We hate how prescriptive Agile has become – and creating unnecessary or silly rules is one mistake that we don’t want to repeat. Over the past twenty years however, software teams in practice have finally caught up to the way cross functional units are built in the military, SWAT teams and elsewhere. It does appear to be a consistent guideline and a necessary component of DevOps – small teams are better than large ones, and efficient, nimble teams are usually 8 to 12 people in size.

Why is it important though that a team handles support in production? At an Agile conference in Orlando in 2011, one presenter made a very impactful statement – he said, “For DevOps to work, developers must care more about the end result – how the application is functioning in production. And Operations must care more about the beginning.” With siloed teams and separate areas of responsibility, there’s too much standing in the way of customer feedback, usage, and monitoring data making their way back to the team producing features. Having the team be responsible for handling bug triage and end user support removes that barrier; this can be uncomfortable, but in terms of keeping the team on point, delivering true business value, and adjusting as those priorities shift, there again is no substitute. It solves the problem mentioned at that conference – suddenly developers care, very much, about how happy their user base is and how features are running in production; Operations people, by being folded into the beginning and sharing the same focus and values as the project team, are in a much better position to pass along valuable feedback so the team stays on target.

In our experience, we’ve found very few companies where Agile transformations have not worked – in fact, we can’t recall ever seeing one fail outright. This is because the scope of Agile is limited to just the development portion; limiting the scope to one group of people who often think alike and value the same things is a good recipe for a coherent mission and success. In contrast, DevOps efforts are fraught with peril. In the past five years, we’ve seen nearly a 50% failure rate; there are inevitably very strong pockets of resistance, even with strong executive and leadership support. DevOps has become both a controversial word over the past decade and a very disruptive and chancy – risky – organizational challenge.

Why the resistance? Part of it is due to the cross-cutting nature of DevOps. For DevOps to work – really work – it requires a sea change in how organizations are structured. Most medium to large sized orgs have teams organized horizontally in groups by function – a team of DBAs, a team of devs, QA, Operations/IT, a project management and BSA layer, etc. This structure was thought to improve efficiency by grouping together specialists, and each group is jealously guarded by loyalists and executives intent on protecting and expanding their turf. Most successful DevOps efforts we’ve seen require a vertical organization, where teams are autonomous and cross discipline. There are exceptions – some are mentioned in case studies in this book. But even with those exceptions, their adoption of DevOps has been slower than it could have been; eliminating these silos appears to be a vital ingredient that can’t be left out of the recipe.

Another reason is that we are trying to “smoosh” together two groups with diametrically opposite goals and backgrounds. Operations teams are paid and rewarded based on stability; they are focused on high availability and reliability. Reliability and stability are on the opposite end of the spectrum from where development teams operate – change. Good development teams tend to focus on cool, bleeding-edge new technology and its application in solving problems and delivering features for their customers – in short, change. This change is disruptive and puts at risk the stability and availability goals that IT organizations fixate on.

A culture of learning and customer engagement

This is another leaf off the Agile tree, in this case the branches having to do with customer collaboration and responding to change. The signers of the original Agile Manifesto intended those last two principles to correct some known weaknesses of the old Gantt-dominated, long-running projects: inflexible requirements that were hard-set as a contract at the beginning of the project, leading to software that didn’t match what the business partners were asking for. Perhaps the customers were really not sure what they wanted; perhaps their business objectives changed over the months the code was in development.

Many so-called “scrum masters” get lost in the different ceremonies and artifacts around Agile and Scrum, and forget the key component of continual engagement with a stakeholder. Any true Agile team uses a sprint development cycle of a very short period of time – 1 to 4 weeks, the shorter the better – where at the end the team has a review with the business stakeholder to check on and correct their work. We knew we were going to be wrong at the end of the delivery cycle, and that the features we delivered wouldn’t fully meet the customer’s expectations. That’s OK – at least we could be wrong faster, after two weeks instead of 6 months. Checking in with the customer regularly is a must-have for any Agile team; in retrospect, that continual engagement became the most powerful and uplifting component of Agile development.

Keeping a learning attitude leads to blame-free postmortems – the single biggest common point we see across successful DevOps organizations. Is it safe to make changes? Do we learn from them?

Discipline and a high level of automation

One of the biggest antipatterns seen with Scrum and Agile was the lack of a moat – anyone could (and did) fork over a thousand bucks for a quick course, call themselves “Scrum Certified”, and put up their shingle as an Agile SDLC consultant. Of the dozens of certified scrum masters I’ve met, shockingly few have ever actually written a line of code, or handled support in any form in a large enterprise.

Thankfully, I don’t see that happening with DevOps. You just can’t separate out coding, tools, and some level of programming and automation experience from running large-scale enterprise applications in production. So, tools are important.

 

Wrapping It Up

The four core DevOps principles we discuss above are – we believe – fundamental to any true DevOps culture. Removing or ignoring any one of these will limit your DevOps effectiveness, perhaps crippling it from inception.

For example, having an excessively large team will make the “turn radius” of your team too wide and cut down on efficiency. If the team is not responsible for the product end to end, the feedback cycles will lengthen unacceptably, and too much time will be wasted fighting turf battles and trying to shove work between entrenched silos with separate and competing priorities, with each team member coming into the project with a different perspective and different operational goals. Any DevOps effort focused on “doing DevOps” and not on reducing release cycles and continuously delivering working software is fundamentally blinded in its vision. Not having the business engaged and prioritizing work creates waste, as teams are left guessing at the correct priorities and how to implement their features. Learning-type organizations are friendlier to the amount of experimentation and risk required to weather the inevitable bumps and bruises as these changes are implemented and improved on. And without automation – a high level of automation in building and deploying releases, executing tests, supporting services and applications in production, and feeding telemetry and usage data back to the team – the wheels begin to fall off, very quickly.

In the case of DevOps, we believe there are certain common qualities that define a successful DevOps organization, and these should be the end goal of any DevOps effort. There will be no DevOps Manifesto as we have with Agile – but success does seem to look very much the same, regardless of the enterprise. All DevOps families, it turns out, are very much alike.

 
 

Other Views of DevOps

There have been many efforts to break DevOps down into a kind of taxonomy of species, and some stand out. For example, Seth Vargo of Google broke DevOps down into five foundational pillars:

  • Reduce organizational siloes (i.e. shared ownership)
  • Accept failure as normal
  • Implement gradual change (by reducing the costs of failure)
  • Leverage tooling and automation (minimizing toil)
  • Measure everything

… which we find quite nifty, and which covers the sweet spots.

 
 

A book we admire and quote from quite a bit is Accelerate, which lists some key capabilities broken into five broad categories: Continuous Delivery, Architecture, Product and Process, Lean Management and Monitoring, and Culture. This is another very solid perspective on how DevOps looks in practice.

Continuous Delivery

  • Use version control for all production artifacts
  • Automate your deployment process
  • Continuous integration
  • Trunk-based development (fewer than three active branches; branches and forks have short lifetimes (<1 day); no “code lock” periods where no one can check in code or submit pull requests due to merge conflicts, code freezes, or stabilization phases)
  • Test automation (reliable, with developers primarily responsible)
  • Support test data management
  • Shift left on security
  • Continuous delivery (software is kept in a deployable state throughout its lifecycle, and the team prioritizes this over any new work; deployability and quality are visible to all members; the system can be deployed to end users at any time, on demand)

Architecture

  • Loosely coupled architecture; a team can test and deploy on demand without requiring orchestration
  • Empowered tool choice (teams can choose their own tools)

Product and Process

  • Gather and implement customer feedback
  • Make the flow of work visible (i.e. the value stream)
  • Work in small batches: MVPs, rapid development, and frequent releases enable shorter lead times and faster feedback loops
  • Foster and enable team experimentation

Lean Management and Monitoring

  • Lightweight change approval process
  • Monitor across application and infrastructure to inform business decisions
  • Check system health proactively
  • Improve processes and manage work with work-in-process (WIP) limits
  • Visualize work to monitor quality and communicate throughout the team (dashboards or internal websites)

Culture

  • Support a generative culture (Westrum) – meaning good information flow, high cooperation and trust, bridging between teams, and conscious inquiry
  • Encourage and support learning: is learning thought of as a cost or an investment?
  • Support and facilitate collaboration among teams
  • Provide resources and tools that make work meaningful
  • Support or embody transformational leadership (vision, intellectual stimulation, inspirational communication, supportive leadership, personal recognition)

References

See https://theagileadmin.com/2010/10/15/a-devops-manifesto/, a very good effort but in our opinion missing a few key pieces.

See https://devops.com/no-devops-manifesto/ – Christopher Little, writing in May 2016, feels very strongly that the very idea of a DevOps Manifesto is too rule-oriented and makes the curious argument that if one existed it would prevent any kind of meaningful dialogue. It’s an interesting position by a good writer, without much if any supporting evidence.

[debois] – http://www.jedi.be/blog/2010/02/12/what-is-this-devops-thing-anyway/

[kim3] – “The Top 11 Things You Need to Know About DevOps”, Gene Kim

[donovan] – http://donovanbrown.com/post/what-is-devops

[damon] – http://dev2ops.org/2010/02/what-is-devops/

[agileadmin] – https://theagileadmin.com/what-is-devops/

[gartner] – https://www.gartner.com/it-glossary/devops

[itskeptic] – Rob England, “Define DevOps: What is DevOps?” – 11/29/2014, http://www.itskeptic.org/content/define-devops

[mug2] – Ken Mugrage, “My Definition of DevOps”, https://kenmugrage.com/2017/05/05/my-new-definition-of-devops/#more-4 – Note, Ken seems to agree with us that a one-size-fits-all definition isn’t of value: “It’s not important that the “industry” agree on a definition. It would be awesome, but it’s not going to happen. It’s important that your organization agree (or at least accept) a shared definition.”

[sre] – Change Management: “SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following:

  • Implementing progressive rollouts
  • Quickly and accurately detecting problems
  • Rolling back changes safely when problems arise

This trio of practices effectively minimizes the aggregate number of users and operations exposed to bad changes. By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.”

Interview with Betsy Beyer, Stephen Thorne of Google

Betsy is a Technical Writer for Google in NYC specializing in Site Reliability Engineering. She co-authored the books Site Reliability Engineering: How Google Runs Production Systems and The Site Reliability Workbook: Practical Ways to Implement SRE. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally-distributed data centers.


Stephen is a Site Reliability Engineer in Google’s London office. His book The Site Reliability Workbook: Practical Ways to Implement SRE drew from his work introducing SRE practices to Google customers on the Customer Reliability Engineering team. He has been an SRE at Google since 2011, and has previously worked on Google Ads and Google App Engine.


Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

————————————

Do we see Site Reliability Engineering (SRE) as the future of DevOps? Definitely not. DevOps is really a broad set of principles and practices; SRE is a very specific implementation of that. The two are not mutually exclusive. You can look at all these DevOps principles and achieve them by applying SRE.

In the book Accelerate, they separated out four key metrics that make for a successful team: lead time, MTTR, batch size, and change success rate. All of these metrics boil down to continuous delivery – how quickly can we make changes, and how fast can we recover when things go awry?

But we look too much at this desired outcome – we’re releasing very often, with a low rate of failure – and sometimes lose sight of how we get there. It’s that implementation space where SRE fills a gap. Applying SRE principles is like having a brake on the car; we can say, hey, there’s a corner coming up, slow down and let’s handle this safely. We’ll go slow and steady for a bit, then speed up once we’re on better ground.

We commonly hear people say, “Oh, SRE works for you, because you’re a Google – a massive, web-native company.” But there are SRE things you can do at any size of company, even without a dedicated SRE team. There are some patterns and principles we wanted to stress, which is why we wrote both SRE books – particularly around how to manage and embrace risk, and how to establish a balanced SLO and an error budget policy. These things are fundamental to a well running SRE team, and they’re something your customers need.

Two Modes of Development: SREs don’t take direct responsibility for releases, and our job isn’t just to be a brake on development. In fact, we operate in two modes. The first is when we’re consistently within that SLO and not consuming enough of our error budget – that’s actually hampering our innovation, so SRE should be advocating for increasing the speed of the pipeline. Congratulations, we’re in that sweet spot that DevOps is aiming for, low friction releases – we are a well performing team.

But the second mode is often the default mode – and that’s not stepping on the gas, it’s the ability to slow down. If we’re constantly running out of the error budget, then we have to slow things down – our rate of failure is simply too high, it’s not sustainable. We have to do whatever it takes to make it more reliable, and not defer it as debt. That’s the fourth attribute we want with DevOps – a low rate of failures in our production releases.

Error Budgets: One of the most frequent questions we got after publishing our first book had to do with forming an error budget policy. It’s actually a concept that’s pretty easy to apply at other organizations.

You can’t get away from the fact that when it comes to instability, releases are one of the primary causes. If we stop or gate releases, the chance of a release causing a problem goes way down. If things are going fine, it’s the SRE’s job to call it out that we’re being TOO reliable – let’s take more risks. And then, when we’ve run out of error budget, we want to have a policy agreed upon in advance so we can slow down the train.
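
To make the error budget idea concrete, here’s a minimal sketch of the arithmetic, assuming a hypothetical 99.9% availability SLO measured over a 30-day request window; the sample numbers and gating thresholds are ours, not Google’s, and the real policy is whatever your stakeholders agreed to in advance:

```python
# Error budget arithmetic for an availability SLO - illustrative numbers only.
SLO = 0.999                      # 99.9% of requests should succeed
WINDOW_REQUESTS = 50_000_000     # total requests served in the 30-day window
FAILED_REQUESTS = 38_000         # failed requests observed so far

error_budget = (1 - SLO) * WINDOW_REQUESTS      # failures we are allowed
budget_consumed = FAILED_REQUESTS / error_budget

print(f"Error budget: {error_budget:,.0f} failed requests")
print(f"Budget consumed: {budget_consumed:.0%}")

# The policy agreed upon in advance decides what happens at each level.
if budget_consumed >= 1.0:
    print("Budget exhausted: freeze feature releases, work only on reliability")
elif budget_consumed < 0.5:
    print("Plenty of budget left: consider shipping faster and taking more risk")
else:
    print("Within budget: continue the normal release cadence")
```

The point isn’t the script – it’s that the budget, and what happens when it runs out, are written down and agreed to before the incident.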

We’ve seen this error policy take a number of different shapes. At one company we engaged with that’s very Agile-focused, when they know a system isn’t meeting customer expectations, developers can only pull items off the backlog if they’re marked as a postmortem action item. Another company uses pair programming a lot. So during that error budget overage period – the second mode – they mandate that one pair must be devoted purely to reliability improvements.

Now that’s not how we do it at Google – but it’s totally effective. We see companies like Atlassian, IBM, and VMWare all using error budgets and talking about it in public forums. One thing is for sure though – this policy needs to be explicit, in writing, agreed upon in advance – and supported by management. It’s so, SO important to have this discussion before you have the incident.

Business stakeholders and execs sometimes fight for zero downtime, 100% availability. So let’s say you’re a mobile gaming platform. Any downtime for you means money lost and perhaps customers out the door. So, how are you going to get to 100% reliability? Your customers aren’t getting 99.9% reliability out of their phones. How are you going to fix that? Once you point out that people won’t even notice a small amount of downtime in all likelihood – you end up with a financial argument, which has an obvious answer. I can spend millions of dollars for nearly no noticeable result, or accept a more reasonable availability SLO and save that money and stress in trying to attain perfection.
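
The arithmetic behind that financial argument is simple enough to sketch out – here’s the allowed downtime per 30-day month at a few common availability targets (illustrative only):

```python
# Allowed downtime per 30-day month at various availability targets.
minutes_per_month = 30 * 24 * 60
for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = minutes_per_month * (1 - target)
    print(f"{target:.3%} availability allows {allowed:6.1f} minutes of downtime per month")
```

Each additional nine buys a shrinking window of downtime that most users will never notice, which is why the conversation usually ends once the numbers are on the table.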

A competent SRE embraces risk. Our goal is not to slow down or stop releases. It’s really about safety, not stability just for its own sake. Going back to that car analogy – If your goal is 100%, then the only thing we can do is jam on the brakes, immediately. That’s a terrible approach to driving if you think about it – it’s not getting you where you need to be. Even pacemaker companies aren’t 100% defect free; they have a documented, acknowledged failure rate. It might be one in 100M pacemakers that fail, but it still happens and that 99.9999% success rate is still the number they strive for.

Blameless postmortems: It’s counterproductive to blame other people, as it ends up hiding the truth. And it’s not just blaming people – it’s too easy sometimes to blame systems. This is an emotional thing – it gives us a warm fuzzy feeling to come up with one pat answer. But at that point we stop listing all the other factors that could have contributed to the problem.

We don’t perform postmortems for each and every incident at Google – only when we’re sure it has a root cause that’s accurate, and it could be applicable to other outages – something we can actually learn from. We’re very careful not to make never-ending lists of everything that can go wrong, and to pick out the really important things that need to be addressed. And your action items need to be balanced. Some should be comprehensive, some should be structural, some should be short-term fixes, and they can’t all be red hot in priority. Let’s say you have a lower priority action item that would need to be done by another team, for example. You might legitimately want to defer on that, instead of wasting political capital trying to drop work on other teams outside your direct control.

It’s vitally important to keep postmortems on the radar of your senior leadership. We have a daily standup meeting in one area here at Google, where management looks over anything that’s happened in the past few days. We go through each postmortem, people present on the issue and the followup items they’ve identified, and management weighs in and provides feedback. This makes sure that the really important fixes are tracked through to completion.

SRE Antipatterns: The magical title change is something that crops up quite often. For example, you know how sometimes developers are given a book or sent to a training class, and then a month later they’re labeled as “Agile”? That same thing happens in the SRE world. Sometimes we see companies taking sysadmins, changing one or two superficial things, and labeling them “DevOps Engineers” or some other shiny new title. But nothing around them has really changed – incentives haven’t changed, and executives have not bought in to making changes that are truly going to be lasting and effective.

Another antipattern is charging ahead without getting that signoff from management. Executive level engagement on the SRE model, especially the part that has teeth – SLOs and error budgets – is a critical success/failure indicator. This is how we gauge whether we’re in the first working model – we’re reliable enough, let’s go faster – or in our second working model – customers are suffering, give us the resources we need. A numerical error budget, a target that is agreed upon, and very specific consequences that happen when that budget gets violated – that needs to be consistently enforced, from the top.

A lot of times we find that it doesn’t take a lot of convincing to get executives onboard, but you do have to have that conversation. We talk to the leadership, who have an emotional need to see their company have a reliable product, and we help them understand it with numbers and measurements instead of gut feel. We help them see that once a system becomes unreliable, it can be months or even years until we can bring it back to a reliable state – these are complex systems, and it takes constant attention to keep them running smoothly.

Another antipattern that thankfully we don’t see too often is where SRE becomes yet another silo, a gatekeeper. It’s really important to crosspollinate knowledge, so production knowledge is shared. If it’s just one group that controls any and all production or release ownership and jealously guards that privilege, we’ve failed. So at Google, we do something called “Mission Control”, where developers can join an SRE team for 1-2 quarters. It’s a great way of disseminating knowledge and getting coders to see what it’s like on the other side of the fence.

DiRT and GameDays: We find that it’s absolutely vital to practice for failure. Netflix and others obviously have had a lot of success with the Simian Army and Chaos Monkey, where SREs are whacking production systems at random to test availability. We use this approach somewhat at Google with our annual DiRT (disaster recovery testing) exercises, which are company-wide. But locally, we use something less intimidating for newbies, entirely theoretical and very low-key – something we call a Wheel of Misfortune exercise.

It works almost like a D&D board game. It’s held once a week, and lots of SREs show up at the arena as onlookers, because it’s fun. There’s a gamemaster present, who throws the scenario – something that actually happened, not too long ago – up on a whiteboard. An SRE takes on the role of a “player”, someone who’s handling incident response. As they walk through how they’d handle troubleshooting and debugging the incident, the gamemaster responds with what happens. A lot of times the gaps come down to documentation – where’s the playbook for this system? What information would have helped that support team get to a root cause faster? It’s great for the audience, because it’s very engaging and collaborative – a great group socialization exercise. We always end up with lots of creative action items that help us down the road.

Livesite Support: We do feel that it’s vital that development teams do some kind of production support. Now we throw around that “at least 5%” number a lot, but that’s really just a generic goal. The real aim here is to break down that palace wall, that silo between developers and operations. Many people assume that at Google every team has SRE’s, but that’s not the case. In fact our default model is 100% developer supported services end to end – SRE’s are really used more for high profile public facing or mission critical systems. The more dev ownership of production you can get, the more you’ll be able to sustainably support production.

Reducing toil is always top of mind for us. Any team tasked with operational work will have some degree of toil – manual, repetitive work. While toil can never be completely eliminated, it can and should be watched carefully. Left unchecked, toil naturally grows over time to consume 100% of a team’s resources. So it’s vital to the health of the team to focus relentlessly on tracking and reducing toil – that’s not a luxury, it’s actually necessary for us to survive.

At Google, we keep a constant eye on that toil/project work dichotomy, and we make sure that it’s capped. We’ve found that a well run SRE team should be spending no more than 50% of its time on toil. A poorly run SRE team might end up with 95% toil. That leaves just 5% of time for project work. At that point you’re bankrupt – you don’t have enough time to drive down your toil or eliminate the things that are causing reliability issues, you’re just trying to survive, completely overwhelmed. So part of that policy you agree upon with your stakeholders must be enforcing that cap on toil, leaving at least half of your capacity for improving the quality of your services. Failing to do that is definitely an antipattern, because it leads to becoming overwhelmed by technical debt.
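
As a rough illustration of what enforcing that cap might look like – the categories, hours, and thresholds below are invented for the example – a team could track its toil/project split from a simple time log and flag any week that breaches the agreed limit:

```python
# Sketch of tracking the toil/project split against an agreed cap.
# The 50% cap comes from the discussion above; the log entries are made up.
TOIL_CAP = 0.50

week_log = {
    "pager response": ("toil", 6),
    "manual deploys": ("toil", 5),
    "ticket triage": ("toil", 4),
    "deploy automation": ("project", 10),
    "capacity planning": ("project", 5),
}

toil_hours = sum(hours for kind, hours in week_log.values() if kind == "toil")
total_hours = sum(hours for _, hours in week_log.values())
toil_ratio = toil_hours / total_hours

print(f"Toil this week: {toil_ratio:.0%} of team capacity")
if toil_ratio > TOIL_CAP:
    print("Over the agreed cap - invoke the policy: push back on operational work or staff up")
```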

 

References:

[sre] – Site Reliability Engineering: How Google Runs Production Systems, Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy, O’Reilly Media, April 2016, ISBN-10: 9781491929124, ISBN-13: 978-1491929124

[ghbsre] – The Site Reliability Workbook: Practical Ways to Implement SRE, By Niall Murphy, David Rensin, Betsy Beyer, Kent Kawahara, Stephen Thorne, O’Reilly Media, August 2018, ISBN-10: 1492029505, ISBN-13: 978-1492029502

[gsre] – https://landing.google.com/sre/book.html – free PDF versions of the revised [sre] text and the followup handbook.

[kieran] – “Managing Misfortune for Best Results”, 8/30/2018, Kieran Barry, SREcon EMEA, https://www.usenix.org/node/218852. This is a great overview of the Wheel of Misfortune exercises in simulating outages for training, and some antipatterns to avoid.

DevOps Stories – an Interview with Ryan Comingdeer of Five Talent Software


Ryan Comingdeer is the CTO of Five Talent Software, a thriving software consultancy firm based in Oregon with a strong focus on cloud development and architecture. Ryan has 20 years of experience in cloud solutions, enterprise applications, IoT development, website development and mobile apps.  

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

Obstacles in implementing Agile: Last week I was talking to a developer at a large enterprise who was boasting about their adoption of Agile. I asked him – OK, that’s terrific – but how often do these changes get out the door to production? It turns out that these little micro changes get dropped off at the QA department, and then are pushed out to staging once a month or so… where they sit, until they’re deemed ready to release and the IT department is ready – once a quarter. So that little corner was Agile – but the entire process was stuck in the mud.

The first struggle we often face when we engage with companies is just getting these two very different communities to talk to one another. Often it’s been years and years of the operations department hating on the development team, and the devs not valuing or even knowing about business value and efficiency. This is hard work, but understanding that philosophy and seeing the other side of things is that vital first step.

I know I’ve won in these discussions – and this may be 12 meetings in – when I can hear the development team agreeing to the operations team’s goals, or an Operations guy speaking about development requirements. You have to respect each other and view work as a collaborative effort.

For the development teams, often they’re onboard with change because they recognize the old way isn’t working. Oftentimes the business throws out a deadline – ‘get this done by April 1st’ – and when they try to drill into requirements, they get an empty chair. So they do the best they can – but there are no measurable goals, no iterative way of proving success over an 18-month project. So they love the idea of producing work often in sprints – but then we have to get them to understand the value of prototyping, setting interim deliverables, and work sizing.

Then we get to the business stakeholders, and have to explain – this is no longer a case where we can hand off a 300-page binder of requirements and ask a team to ‘get it done’. The team is going to want us involved, to see if we’re on the right track, to get some specific feedback. Inevitably we get static over this – because it seems like so much more work. I mean, we had it easy in the old days – we could hand off work and wait 12 months for the final result. Sure, the end result was a catastrophic failure and everybody got fired, but at least I wasn’t hassled with all these demos and retrospectives every two weeks! That instant feedback is really uncomfortable for many leaders – there’s no insulation, no avoidance of failure. It does require a commitment to show up and invest in the work being done as it’s being done.


Retrospectives for me are one of the best things about Agile. I wish they were done more often. We do two – one internally, then a separate one with the customer so we’re prepared – and we’re upfront: here’s where we failed, here’s the nonbillable time we invested to fix it. You would think that would be really damaging, but we find it’s the opposite. The best thing a consulting company can do is show growth, reviewing successes and failures directly and honestly to show progress. Our relationships are based on trust – and the best trust-building exercise I’ve seen yet is when we admit our failure and what we’re going to do to fix it. I guarantee you our relationship with the customer is tighter because of how we handled a crisis – versus attempting to hide, minimize, or shift blame.

Implementing DevOps: It’s very common that the larger organizations we work with aren’t sure where to start when it comes to continuous integration or continuous delivery. Where do I begin? How much do I automate? Often it comes down to changing habits like checking in a new feature only after two weeks of work. That’s just not going to cut it – what can we deliver in four hours?

That being said, CI/CD is step 1 of DevOps; it’s fundamental. Infrastructure as Code is further down the list – it takes a lot of work, and it’s sometimes hard to see the value of it up front. Then you start to see the impact with employee rotation, and especially when you have to roll back changes – you can see what was changed and when; without it, you might be stuck having to fix a problem in place. The single biggest selling point for Infrastructure as Code is security: you can demonstrate what you’re doing to regulate environments, and you can show up to an audit prepared with a list of changes – who made them and what they were – and a complete set of security controls.
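
To make that audit-and-rollback point concrete, here’s a minimal sketch of Infrastructure as Code using the AWS CDK in Python – the stack and bucket names are hypothetical, and any IaC tool (Terraform, CloudFormation templates, and so on) gives you the same property: the environment is declared in version-controlled code, so every change is a reviewable commit with an author and a timestamp.

```python
# iac_sketch.py - illustrative only; the stack and bucket names are invented.
# Because this definition lives in source control, "who changed what, and when"
# is answered by the commit history, and a rollback is a revert plus a redeploy.
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class AuditableStorageStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Security controls are stated explicitly - which is what you hand an auditor.
        s3.Bucket(
            self,
            "CustomerUploads",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
AuditableStorageStack(app, "AuditableStorage")
app.synth()
```

Rolling back a bad change becomes “revert the commit and deploy again” rather than hand-editing a console – which is exactly the property Ryan is describing.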

A True MVP: Most of the companies we work with come to us because they’ve got a huge backlog of aging requests – these mile-long wish lists from sales and marketing teams. We explain the philosophy behind DevOps and the value of faster time to market, small iterations, more stable environments, and a reliable deployment process. Then we take those huge wish lists and break them down into very small pieces of work, and have the business prioritize them. There’s always one that stands out – and that’s our starting point.

The first sprint is typically just a proof of concept of the CI/CD tools and how they can work on that #1 feature we’ve identified. The development team works on it for perhaps two days, then sysops takes over and uses our tooling to get the feature into the sandbox environment and then into production. This isn’t even a beta product, it’s a true MVP – something for friends and family. But it’s an opportunity to show the business and get the feedback we’re looking for – is the UI OK? How does the flow look? And once the people driving business goals sit down and start playing with the product at that first demo, two weeks later, they’re hooked. And we explain – if you give us your suggestions, we can get them to staging and then on to production with a single click. It sells itself – we don’t need long speeches.

The typical reaction we get is – “Great, you’ve delivered 5% of what I really want. Come back when it’s 100% done.” And the product is a little underwhelming. But that’s because we’re not always sticking to the true definition of a minimum viable product (MVP). I always say, “If an MVP is not something you’re ashamed of, it’s not an MVP!” Companies like Google and Amazon are past masters at this – they throw something crude out there and see if it sticks. It’s like they’re not one company but 1,000 little startups. You’ve got to understand when to stop and get that feedback.

I’ve seen customers go way down into the weeds and waste a ton of money on something that ends up just not being viable. One customer I worked with spent almost $250K and a year polishing and refactoring a mobile app endlessly, when we could have delivered something for about $80K – a year earlier! Think of how the market shifted in that time, all the insights we missed out on. Agile is all about small, iterative changes – but most companies are still failing at this. They’ll make small changes, and then gate them so they sit there for months.

When we start seeing real progress is when the product is released ahead of deadline. That really captures a lot of attention – whoa, we wanted this app written in 15 months, you delivered the first version in two weeks, and nine months in we can see we’re going to be done four months early because of our cadence.

So here’s my advice – start small. Let me give you one example: we have one customer that’s a classic enterprise – they’ve been around for 60 years, and it’s a very political, hierarchical, very waterfall-oriented climate. They have 16 different workloads. Well, we’re really starting to make progress now in their DevOps transformation – but we never would have made it if we’d tried an all-in, massive crusade. Instead, we took half of one workload, as a collection of features, and said – we’re going to take this piece and try something new. We implemented Agile sprints and planning, set up automated infrastructure, and put CI/CD in place. Yeah, it ruffled some feathers – but no one could argue with how fast we delivered those features, how much more stable they were, and how happy the customers were because we involved them in the process.

The biggest problem we had was – believe it or not – getting past some bad habits of having meetings for the sake of having meetings. So we had to set some standards – what makes for a successful meeting? What does a client acceptance meeting look like?

Even if you’re ‘just a developer’, or ‘just an ops guy’, you can create a lot of change by the way you engage with the customer, by documenting the pieces you fill in, by setting a high standard when it comes to quality and automation.  

Documentation: I find it really key to write some things down before we even begin work. When a developer gets a two-week project, we make sure expectations are set clearly in documentation. That helps us know what the standards of success are and gets QA on the same page – it guides everything we do.

I also find it helps us corral the chaos caused by runaway libraries. We have baseline documentation for each project that sets the expectation of the tools we will use. I’ll just say – it’s harder to catch this when you’re using a microservice architecture, where you have 200 repos to monitor for the JavaScript libraries they’re choosing. Last week, we found this bizarre PDF writer that popped up – why would we have two different PDF generators for the same app? So we had to refactor so we’re using a consistent PDF framework. That exposed a gap in our documentation, so we patched it and moved on.
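
One lightweight way to catch that kind of drift across a couple hundred repos is a script that compares each service’s declared dependencies against the approved list from the baseline documentation. This is only a sketch – the approved set, directory layout, and package names below are invented for illustration:

```python
#!/usr/bin/env python3
"""Flag JavaScript dependencies that aren't on the project's approved list.

Illustrative sketch only: the approved set and repo layout are hypothetical.
Assumes each service is checked out under ./services/<name>/package.json.
"""
import json
from pathlib import Path

# The "blessed" libraries from the project's baseline documentation (example values).
APPROVED = {"react", "express", "lodash", "pdfkit"}

def audit(repos_root: str) -> None:
    for pkg in sorted(Path(repos_root).glob("*/package.json")):
        deps = json.loads(pkg.read_text()).get("dependencies", {})
        strays = sorted(set(deps) - APPROVED)
        if strays:
            print(f"{pkg.parent.name}: unapproved libraries -> {', '.join(strays)}")

if __name__ == "__main__":
    audit("./services")
```

Run from a nightly job, something like this would have flagged that second PDF generator the day it appeared, instead of weeks later.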

Documentation is also a lifesaver when it comes to onboarding a new engineer. We can show them the history of the project, and the frameworks we’ve chosen, and why. Here’s how to use our error logging engine, this is where to find Git repos, etc. It’s kept very up to date, and much of it is customer facing. We present the design pattern we’ll be using, here’s the test plans and how we’re going to measure critical paths and handle automated testing. That’s all set and done before Day 1 with the customer so expectations are in line with reality. 

We do use a launch checklist, which might cover 80% of what comes up – but it seems like there are always some weird gotchas that crop up. We break up our best practices by type – for our Microsoft apps, IoT, monoliths, or mobile – each one with a slightly different checklist.

It’s kind of an art – you want just the right amount, not too much, not too little. When we err, I think we tend to over-document. Like most engineers, I tend to overdo it because I’m detail-oriented. But for us documentation isn’t an afterthought – it’s guardrails. It sets the rules of engagement and defines how we’re measuring success. It’s saved our bacon many times!

Microservices: You can’t just say ‘microservices are only for Netflix or the other big companies’. It’s not the size of the team that matters, but the type of project. You can have a tiny one-developer project and implement it very successfully with microservices. It does add quite a bit of overhead, though, and there’s a point of diminishing returns. We still use monolith-type approaches for throwaway proofs of concept – you can just crank them out.

And it’s a struggle to keep these services discrete and finite. Let’s say you have a small application – how do you separate out the domain modules for your community area and, say, an event directory so they’re truly standalone? In the end you tend to create a quasi-ORM, where your objects have a high dependency on each other; the microservices look terrific at the app or UI layer, but there’s a shared data layer underneath. Or you end up with duplicated data, where the interpretation of ‘customer’ data varies wildly from service to service.

Logging is also more of a challenge – you have to put more thought into capturing and aggregating errors with your framework. 

But in general, microservices are definitely a winner and our choice of architecture. Isolation of functionality is something we really value in our designs; we need to make sure that changes to invoicing won’t have any effect on inventory management or anything else. It pays off in so many ways when it comes to scalability and reliability.  

Testing: We have QA as a separate functional team; there’s a ratio of 25 devs to every QA person. We make it clear that writing automated unit tests, performance tests, and security tests is all in the hands of the developers. But manual smoke tests, and enforcing that the test plan actually does what it’s supposed to, are handled by the QA department. We’re huge fans of behavior-driven development, where we identify a test plan, lay it out, the developer writes unit tests, and QA goes through and confirms that’s what the client wanted.
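
To show the shape of that workflow, here’s a hedged sketch of a behavior-driven test in Python – the Cart class and the business rule are invented stand-ins for a real test plan, and teams often use Gherkin tooling such as behave or pytest-bdd for this. The Given/When/Then structure is the part that matters, because it’s what QA and the client can read and confirm:

```python
# test_cart_total.py - illustrative sketch; the Cart class and rule are hypothetical.
# The test mirrors a line in the agreed test plan so QA can verify intent, not code.

class Cart:
    """Minimal stand-in for the system under test."""
    def __init__(self):
        self.items = []

    def add(self, sku: str, price: float) -> None:
        self.items.append((sku, price))

    def total(self) -> float:
        return sum(price for _, price in self.items)


def test_customer_sees_running_total():
    # Given a customer with an empty cart
    cart = Cart()
    # When they add two items
    cart.add("mug", 8.00)
    cart.add("poster", 12.50)
    # Then the cart shows the combined total
    assert cart.total() == 20.50
```

The developer owns making this pass; QA owns confirming that “customer sees a running total” is actually what the client asked for.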

With our environments, we do have a testing environment set up with dummy data; then we have a sandbox environment, with a one-week-old copy of actual production data, where we do performance and acceptance testing. That’s the environment the customer has full access to. We don’t do performance testing against production directly. We’re big fans of using software to mimic production loads – anywhere from 10 users/sec to 10K users/sec – along with mocks and fakes in our test layer design.
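
The interview doesn’t name a particular load-testing tool, so purely as an example of what “software that mimics production loads” can look like, here’s a minimal Locust script; the host and endpoints are hypothetical, and the same file scales from a handful of simulated users to thousands just by changing the numbers on the command line:

```python
# loadtest.py - hedged example using Locust; the API endpoints are invented.
from locust import HttpUser, task, between

class SandboxShopper(HttpUser):
    # Simulated "think time" between requests, so traffic looks like real users.
    wait_time = between(1, 3)

    @task(3)
    def browse_catalog(self):
        self.client.get("/api/products")

    @task(1)
    def view_cart(self):
        self.client.get("/api/cart")
```

Pointed at the sandbox (for example, locust -f loadtest.py --host https://sandbox.example.com --users 500 --spawn-rate 50 --headless), it hammers the week-old copy of production data without ever touching production itself.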

Continuous Learning: To me continuous learning is really the heart of things. It goes all the way back to the honest retrospective in Scrum – avoiding the blame game, documenting the things that can be improved at the project or process level. It’s never the fault of Dave, that guy who wrote the horrible code – why did we miss that as a best practice in our code review? Did we miss something in how we look at maintainability, security, performance? Are lead developers setting expectations properly? How can we improve our training?

Blame is the enemy of learning and communication. The challenge for us is setting the expectation that failure is an expected outcome – a good thing we can learn from. Let’s count the number of failures we’re going to have, and see how good our retrospectives can get. We’re going to fail, and that’s OK – how do we learn from those failures?

Usually our chances of winning come down to one thing – a humble leader. If the person at the top can swallow their pride, knows how to delegate, and recognizes that it will take the entire team being engaged to solve the problem – then a DevOps culture change is possible. But if the leader has a lot of pride, there’s usually not much progress to be made.

Monitoring: Monitoring is too important to leave to the end of the project – it’s our finish line, so we identify what the KPIs are to begin with. Right now it revolves around three areas – performance (latency of requests), security (breach attempts), and application logs (errors returned, availability, and uptime). We ended up using New Relic for performance indicators, DataDog for the app-layer KPIs, and Amazon Inspector. OWASP has a set of tools they recommend for scanning; we use those quite often for our static scans.

Sometimes, of course, we have customers that want to go cheap on monitoring. So, quite often, we’ll drop back to app-level error logging only; that’s our bare minimum. We always log; sometimes we don’t monitor. We had this crop up this morning with a customer – after a year or more we went live, but all we had was that minimal logging. Guess what, that didn’t help us much when the server went down! Going bare-bones on monitoring is something customers typically regret, because of surprises like that. Real user monitoring, which you can get with any cloud provider, is another thing that’s incredibly valuable for checking things like latency across every region.
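
As a concrete picture of that “we always log” floor, here’s a hedged sketch of structured, app-level error logging using Python’s standard logging module – the service name and field names are invented, and the JSON output is just so whatever agent gets added later (DataDog, CloudWatch, New Relic, and so on) has something parseable to ingest:

```python
# app_logging.py - minimal structured-logging sketch; service and fields are illustrative.
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so a log agent can parse it later."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "service": "order-api",          # hypothetical service name
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
log = logging.getLogger("order-api")

try:
    raise ValueError("upstream returned a malformed invoice")
except ValueError:
    log.exception("failed to process invoice")   # error is captured, not lost
```

It won’t page anyone at 2 a.m. – that’s the monitoring half the customer skipped – but at least the evidence is there when the server goes down.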

Production Support by Developers: Initial on-call support is handled in-house by a separate sysops team; we actually have it in our agreement with the customer that application developers aren’t part of that on-call rotation. If something has made it through our testing and staging environments, that knocks out a lot of potential errors. So 90% of the time a bug in production is not caused by a code change – it’s something environmental: a server reboot, a firewall config change, an SSL cert expiring. We don’t want to hassle our developers with this. But we do have them handle some bug triage – always during business hours, though.

Let’s just be honest here – these are two entirely separate disciplines, two specialties. Sysops teams love ops as code and wading through server error logs – developers hate doing that work! So we separate out these duties. Yes, we sometimes get problems when we move code from a dev environment to QA – if so, there’s usually some information missing that the dev needs to add to the documentation in the handoff to sysops.

And we love feature flags and canary releases. Just last week we rolled out an IoT project to 2,000 residential homes. One feature we rolled out only to the Las Vegas homes to see how it worked. It worked great – the biggest difficulty we find is documenting and managing who’s getting new features and when, so you know whether a bug is coming from a customer in group A or group B.
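
Here’s a hedged sketch of the kind of flag logic he’s describing – a region-scoped rollout where every decision is logged, so support can tell which cohort a bug report came from. The flag names, regions, and hard-coded table are invented for illustration; in practice this mapping usually lives in a flag service or config store rather than in source:

```python
# feature_flags.py - illustrative only; flags, regions, and storage are hypothetical.
import logging

log = logging.getLogger("flags")

# Which cohorts see which features. In a real system this comes from a flag
# service or config store so it can change without a deploy.
ROLLOUTS = {
    "smart-thermostat-schedule": {"las-vegas"},                    # canary: one region first
    "energy-usage-dashboard": {"las-vegas", "reno", "portland"},   # wider rollout
}

def is_enabled(feature: str, home_region: str) -> bool:
    enabled = home_region in ROLLOUTS.get(feature, set())
    # Record every decision so a bug report can be traced back to cohort A or B.
    log.info("flag=%s region=%s enabled=%s", feature, home_region, enabled)
    return enabled

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(is_enabled("smart-thermostat-schedule", "las-vegas"))  # True
    print(is_enabled("smart-thermostat-schedule", "portland"))   # False
```

The logging line is the part that answers his “who’s getting what, and when” problem; the rollout table itself is the easy bit.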

Automation: For us, automating everything is the #1 principle. It reduces security concerns, drops the human error factor, and increases our ability to experiment faster with our infrastructure and our codebase. Being able to spin up environments and roll out POCs is so much easier with automation. It all comes down to speed: the more automation you have in place, the faster you can get things done. It does take effort to set up initially, but the payoff is more than worth it. Getting your stuff out the door as fast as possible with small, iterative changes is the only really safe way to work – and that’s only possible with automation.

You would think everyone would be on board with the idea of automation over manually logging on and poking around on VMs when there’s trouble, but – believe it or not – that’s not always the case. And sometimes our strongest resistance comes from the director/CTO level!

Security: First, we review compliance with the customer – half the game is education. We ask them if they’re aware of what GDPR is – for 90% of our customers, that’s just not on their radar, and it’s not really clear at this point what compliance means specifically for how we store user information. So we give them papers to review, and we drop tasks into our sprints to support compliance for the developers and the sysops team within the CI/CD pipeline.

Gamedays: Most of my clients aren’t brave enough to run something like Simian Army or Chaos Monkey on live production systems! But we do gamedays, and we love them. Here’s how that works: 

I don’t let the team know what the problem is going to be, but one week before launch – on our sandbox environment – we do something truly evil to test our readiness. Then we check how things went: did alerts get fired correctly by our monitoring tools? Was the event logged properly? How did the escalation process work, and did the right people get the information they needed fast enough to respond? Did they have the access they needed to make the changes? Were we able to use our standard release process to get a fix out to production? Did we have the right amount of redundancy on the team? Was the runbook comprehensive enough, and were the responders able to use our knowledge base to track down similar problems from the past and come up with a remedy?
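
The interview doesn’t describe the tooling behind that “something truly evil,” so purely as an illustration, here’s a hedged sketch of one such injection – stopping a random instance in the sandbox environment with boto3. The tag names and region are invented; the point is that the fault is scripted and repeatable, so what’s actually being tested is the team’s alerting, escalation, and runbooks:

```python
# gameday_inject.py - hedged sketch; the tags, region, and environment are hypothetical.
# Stops one random sandbox instance so the team can exercise alerting and recovery.
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["sandbox"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
instances = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Gameday: stopping {victim}")
    ec2.stop_instances(InstanceIds=[victim])
else:
    print("No running sandbox instances found - nothing to break today.")
```

Everything that happens after this script runs – the alert, the page, the runbook, the fix – is the real exercise.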

The whole team loves this, believe it or not. We learn so much when things go bump in the night. Maybe we find a problem with auto-healing, or there’s an opportunity to change the design so the environments are more loosely coupled. Maybe we need to clean up our logging, or tune our escalation process, or spread some more knowledge about our release pipeline. There’s always something, and it usually takes us at least a week to fold those lessons learned back into the product before we do a hard launch. Gamedays are huge for us – so much so that we make sure they’re part of our statement of work with the customer.

For one recent product, we did three gamedays on sandbox and felt pretty dialed in. So, one week before go-live, we injected a regional issue in production – which forced the team to duplicate the entire environment into a completely separate region using cold backups. Our SLA was two hours; the whole team was able to duplicate the entire production set from the Oregon to the Virginia datacenter in less than 45 minutes! It was such a great team win – you should have seen the celebration.

 
 


 

Do you think you have a book in you? I do too.

You can’t wait for inspiration. You have to go after it with a club. – Jack London

Almost done with my novel, I can sense it. I think there’s about a month left until the first draft is done, and – if all goes well – all the revisions will be done and it’ll be published, perhaps December or January. I’m very proud of it so far, and I think – I hope – it’ll leave a mark. I wanted to share with you what I learned, because I almost waited too long.

I used to think you needed to be brilliant, or wait for inspiration in some café. Turns out, that’s really not the case. What I found is, it’s just a grind. You show up at the café, and you start writing, 8 in the morning – and you don’t get up until at least 2 pm. If you do that, you’ll have at least 1,000 words down – and maybe more like 3,000.

They may not be good. Some days you’ll struggle cranking out 1,000 wimpy little words, and it’ll be hot garbage. Other days, you’ll fly through, and it’ll sing off the page. Regardless – you do it every day. Five days a week, as best you can.

Guess what? After 6-9 months, you’ve got yourself a first draft of a nice little book there.

Of course, we’re not DONE yet. Now you’ve got to rewrite, where you take that pile of bricks and try to make it into a house. But, my friend, you are ALMOST there. And all it took was sitting down and writing that first page.

Do everything they tell you to do. Tell your friends that you’re writing a book, so you’re committed. (That was huge for me. Telling people I was a writer gave me a little ego boost and I found, over time, it actually became true.) Make it a topic you really like – you don’t want to spend a year or more of your life on something you aren’t truly interested in. And find a publisher before you invest too much time in your book, and (hopefully) get a contract. A good editor will help guide you so that what you write will be worth reading. Even if you end up self-publishing, going through the work of putting together a proposal and an outline is so worth it.

For me, I really enjoyed the research phase. But if you’re not careful, you’ll spend all your time studying and looking through other people’s work – and not doing any of your own. So, on the bad days, sometimes I’d do very little writing, just research. But usually I’d force myself to write those 1,000 words first – and THEN treat myself with a book or video for research.

Just don’t wait too long. One of my favorite authors is Norman Maclean, who wrote “A River Runs Through It” and – posthumously – “Young Men and Fire”, both incredible classics. The tragic thing was, he started so late – when he was 71 years old! It’s such a terrible waste.

Don’t wait for inspiration. What you’ve got to say is something that needs to be shared, that will add value to the world. Set a goal, tell your friends about it, and start plugging away.

So, do you think you have a book in you? Something amazing and creative, something you’ve never seen anywhere before? I do too. And I can’t wait to read your first book!

DORA 2018 State of DevOps report is out!

Hey guys, the 2018 State of DevOps report from Puppet/DORA is out! As always, those guys have done an amazing job. You owe it to yourself to download it, check it over, and pass it along.

Here are the points I found most powerful:

  1. DevOps isn’t a fad; it’s proven to make companies faster and less wasteful in producing new features.
  2. Slower is not safer. Companies releasing every 1-6 months had abysmally slow recovery times.
  3. We can’t eliminate toil or manual work completely – but in low-performing companies, it’s basically all we do. High performers rarely have it make up more than 30% of the workday.
  4. Outsourcing an entire function – like QA, or production support – remains a terrible idea. It puts a dramatic cap on innovation and ends up costing far more in delays than you’ll ever see in saved operational costs.
  5. “Shift left” on security continues to grow in popularity – because it works. The best examples are where implementing it early is made as easy and effortless as possible.

More below. Check it out for yourself, it’s such great work and very easy to read!

 

The difference between the greats and the not-so-great continues to widen: We’ve heard executives describe DevOps as a “buzzword” or a “fad”. Ten years into this movement, that seems more and more out of touch with reality. Companies that take DevOps seriously as part of their DNA perform better. They deploy code 46x more frequently; they’re faster to innovate (2,555x faster lead time). And they do it more safely: elite performers have a 7x lower change failure rate, and can recover 2,604x faster.

DevOps has been proven to lead to faster innovation and change AND produce higher-quality work. Honestly, does that sound like a fad to you? (I wonder sometimes if the GM and Chrysler execs in the 1970s were saying the same thing about Toyota…)

(above image and all others copyright Puppet/DORA 2018)

Releasing infrequently for “safety” is anything but. Many organizations gate releases so they’re spread out over weeks or months, in an attempt to prevent bugs or defects. This backfires terribly; while bug rates may drop, it means their time to recover is disastrously slow. For example, companies that release every 1-6 months have the exact same MTTR – 1-6 months. (!!!!)

“When failures occur, it can be difficult to understand what caused the problem and then restore service. Worse, deployments can cause cascading failures throughout the system. Those failures take a remarkably long time to fully recover from. While many organizations insist this common failure scenario won’t happen to them, when we look at the data, we see five percent of teams doing exactly this—and suffering the consequences. At first glance, taking one to six months to recover from a system failure seems preposterous. But consider scenarios where a system outage causes cascading failures and data corruption, or when multiple unknown systems are compromised by intruders. Several months suddenly seems like a plausible timeline for full recovery.”

Toil and manual work: Elite and high-performing orgs do far less manual work. Just look at the percentage of people’s time wasted in low-performing orgs on things like hacking out manual configs on a VM, smoke-testing, or trying to push a deployment out the door using Xcopy. Someone at an elite, high-performing company might spend 20-30% of their time on this type of shovel work; at lower-performing companies, it’s basically 100%-plus of their time.

 

Think Twice Before You Outsource: The powerful example of Maersk shows the cost of outsourcing entire functions (like testing, or Operations) to external groups. The 2018 study shows that outsourcing an entire function leads to delays as work is batched and high-priority items wait on lower-priority work in the queue. This is the classic handoff waste, and it runs directly against key DevOps principles like cross-functional teams:

“Analysis shows that low-performing teams are 3.9 times more likely to use functional outsourcing (overall) than elite performance teams, and 3.2 times more likely to use outsourcing of any of the following functions: application development, IT operations work, or testing and QA. This suggests that outsourcing by function is rarely adopted by elite performers. …Misguided performers also report the highest use of outsourcing, which likely contributes to their slower performance and significantly slower recovery from downtime. When working in an outsourcing context, it can take months to implement, test, and deploy fixes for incidents caused by code problems.”

In Maersk’s case, just the top three features represented a delay cost of $7 million per week. So while outsourcing may seem to represent a chance to cut costs, data shows that the delay costs and drag on your deployment rate may far outweigh any supposed savings.

Lean product management: The survey went into some detail about the qualities of Lean Product Management that they found favorable.

Security by audit versus security as part of the lifecycle: Great thoughts on how shifting left on security is a key piece of delivery. They recommend making security easy – with frameworks of preapproved libraries, packages, and toolchains, and reference examples of implementation – versus late-breaking audits and the disruption and delays they cause:

“Low performers take weeks to conduct security reviews and complete the changes identified. In contrast, elite performers build security in and can conduct security reviews and complete changes in just days. …Our research shows that infosec personnel should have input into the design of applications and work with teams (including performing security reviews for all major features) throughout the development process. In other words, we should adopt a continuous approach to delivering secure systems. In teams that do well, security reviews do not slow down the development process.”

 

So, that’s my book report. Loved it, as always, though I’m not on board with everything there. For example, they’ve coined a new phrase – SDO, “Software Delivery and Operational Performance.” Sorry, but to me that’s reliability – the “R” in SRE, which has been around in the software world since 2003. I don’t see the need for another acronym for that. And they’re splitting hairs a little when separating automated testing from continuous testing, but I might be wrong on that.

As usual, it’s brilliant, data-driven, and really sets the pace for the entire growing movement of DevOps. LOVE, love the work that Puppet and DORA are producing – keep it up guys!