A few weeks back I finally broke the cover on a book I’ve been meaning to read for some time. Books by / from Amazon architects are actually quite hard to come by, unlike with Google or Microsoft. This seems to be the best written outline I’ve seen yet of Amazon’s cloud adoption framework and some of the lessons they’ve learned in doing large-scale enterprise migrations to the cloud. I was quite shocked by some of the conclusions Stephen presented, as they challenged some of my preconceptions around lift-and-shift. I thought I’d write a little about what I learned as a kind of Cliff Notes to encourage you to check out the book.
It may seem a little odd that I, a lifelong Microsoftie, am writing a review about a book so slanted towards AWS. (And it is, sorry! If all you ever read on the cloud was this book, you’d think Amazon has the only cloud platform in existence.) But I believe the principles in this book – especially around the “halo effect” and the importance of training, and the different migration strategies that Amazon has found viable – are not specific to Amazon/AWS. If you’ve ever read Amazon’s leadership principles, they preach about being customer-focused – not competitor focused. One point of view is sideways – and leads to stagnation / me-too type thinking. The second aims to learn, and is forward thinking. Count me in on that second group. I’m an admirer of Amazon – especially with the two-pizza rule and microservices – and I think Stephen Orban has a lot of experience that everyone can learn from.
Book: “Ahead in the Cloud: Best Practices for Navigating the Future of Enterprise IT” by Stephen Orban
Sayings to Live By
- Everyone you need to move forward with the cloud is already there, you just have to enable them.
- Practice makes permanent.
- All of your assumed constraints are debatable.
- “Reform the environment and not man; being absolutely confident that if you give man the right environment, he will behave favorably.”—BUCKMINSTER FULLER
- Use your migration as a forcing function to adopt a DevOps culture
- “You get the culture you pay for.” – Adrian Cockcroft
- “There’s no compression algorithm for experience.” – Bryan Landerman, Chief Technology Officer, Cox Automotive
Lift and Shift Is Not an Antipattern: Four Different Migration Paths
This is in the book, and it’s by far the best part – Stephen outlines four different paths for a migration, from a lift-and-shift approach to a full-on rearchitecture:
(Figure: the four migration paths, courtesy of “Ahead in the Cloud”)
Is Lift and Shift a Cop-out? This was the biggest learning point for me from the book. Before this, I’d always assumed that lift-and-shift was little more than a cop-out. Stephen makes the point that it is often the best first choice: Most of our enterprise customers segment their applications into tranches: those that are easy to move to the cloud, those that are medium hard, and those that are hardest. They also segment applications into those that can easily be lifted and shifted into the cloud and those that need to be re-architected before moving to the cloud. … I’ve heard a lot of executives—including myself, before I learned better—suggest that they’re only moving to the cloud if they “do it right,” which usually means migrating to a cloud-native architecture. … I’ll hear from senior executives who don’t want to take any of their legacy workloads to the cloud; instead, they want to focus on developing net new architectures using serverless services like AWS Lambda.
When I was the CIO at Dow Jones several years ago, we initially subscribed to the ivory tower attitude that everything we migrated needed to be re-architected, and we had a relentless focus on automation and cloud-native features. That worked fine until we had to vacate one of our data centers in less than two months. [Stephen points out that you gain a quick budget win this way, “which tends to be in the neighborhood of 30 percent when you’re honest about your on-premises TCO.”] …GE Oil & Gas rehosted hundreds of applications to the cloud as part of a major digital overhaul. In the process, they reduced their TCO by 52 percent. Ben Cabanas, then one of GE’s most forward-thinking technology executives, told me a story that was similar to mine—they initially thought they’d re-architect everything, but soon realized that would take too long, and that they could learn and save a lot by rehosting first. …One customer we worked with rehosted one of its primary customer-facing applications in a few months to achieve a 30 percent TCO reduction, then re-architected to a serverless architecture to gain another 80 percent TCO reduction!
He makes the following arguments around lift-and-shift versus a full-on cloud native approach from the get-go:
- Time: rehosting takes a lot less time.
- Rearchitecture is easier in the cloud: it becomes easier to re-architect and constantly reinvent your applications once they’re running in the cloud. “I believe the ability of these applications to perform and evolve is just as much dependent on their environment as the code or DNA that governs their behavior. The argument I’d like to make here is that the AWS cloud provides a better environment—in terms of size and diversity of services—that is well beyond what most on-premises data centers can provide.” The book gives another example: applying Elasticsearch to cheaply add full-text search capabilities without an expensive and risky move to NoSQL clusters.
- Performance and Cost Savings: You realize some immediate benefits. Besides budget/TCO (see above), this also means better performing apps. SSDs are 2-5x faster than spinning disks, for example, so moving a database to SSD-backed instances can yield amazing results for little to nothing. “One customer I know had an application that was in a critical period and realized there were some bad queries causing performance bottlenecks. Changing the code was too risky, so the database server was upped to an X1 instance and then ramped back down to a more reasonable instance size once the critical period was over.”
Pilots and experimentation
Give your teams a hands-on, time-constrained opportunity to do something meaningful to your business with the cloud, and see what happens. Ask them to build a website, create an API for some of your data, host a wiki, or build something else tangible that fits into what your teams already do. I’m always surprised by how quickly the right motivation mixed with a bit of time pressure can lead to results. Scarcity breeds invention.
Innovation comes from experimentation, and because the cloud takes away the need for large up-front investments to try new things, there is nothing holding your team back from creating the next disruptive product in your industry. Give your team some freedom to implement existing projects in new ways.
Generally speaking, I like to see organizations start with a project that they can get results from in a few weeks. …What I’ve found most important is that organizations pick something that will deliver value to the business, but something that isn’t so important that there’s no appetite for learning. The first engineering team you put together should consist of a thorough mix of core skills—Network, Database, Linux Server, Application, Automation, Storage, and Security. The team will make some progress. It will probably look at tools like Terraform and others. It will also write some AWS CloudFormation code. The team will make mistakes. All of this is perfectly natural.
You don’t need access to capital to experiment. Throughout my career, I’ve spent countless hours trying to justify the ROI on a capital investment for resources I thought were needed for a potential product. I was rarely, if ever, able to get capacity planning right, and almost always overbuilt my infrastructure. In a few cases, it took my team longer to justify an investment than it took to build the first version of the product.
…don’t experiment too early in your journey with a project where your stakeholders demand a specific outcome. I wouldn’t advise that you start experimenting with your end-of-year billing run, for instance. A CEO I once worked for told me that it’s okay to fail, except when it isn’t. Be satisfied with incremental progress and slowly increase the number of experiments you run, but don’t outpace the organization.
Make Sure It’s Measurable: DON’T pursue an experiment until you know how to measure it. You want to spend time on the right experiments and ensure the lessons… Mature DevOps organizations also develop A/B testing frameworks that allow them to experiment on slightly different user experiences with different user cohorts in parallel to see what works best. In my brief tenure so far at Amazon, I’ve found that anyone able to think through and articulate an experiment in writing typically gets the opportunity to try it. This is a special part of our culture and a great tool for attracting and retaining innovators and builders.
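To make the “different user cohorts in parallel” idea a little more concrete, here is a minimal sketch of how cohort assignment is often done. This is my own illustration, not something from the book: the function names, the experiment name, and the choice of SHA-256 hashing are all assumptions. The point is only that assignment should be deterministic (the same user always sees the same variant) and that the metric is defined before the experiment starts.

```python
import hashlib
from typing import Sequence


def assign_cohort(user_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a user to one variant of an experiment.

    Hypothetical helper for illustration; not taken from the book.
    """
    # Hash the experiment name together with the user ID so the same user
    # can land in different cohorts for different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and bucket it by variant count.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


def conversion_rate(conversions: int, visitors: int) -> float:
    """The metric agreed on up front; returns 0.0 when there is no traffic."""
    return conversions / visitors if visitors else 0.0


if __name__ == "__main__":
    variants = ["control", "new_checkout"]
    for user in ["alice", "bob", "carol"]:
        print(user, assign_cohort(user, "checkout-test", variants))
```

Because the assignment depends only on the user ID and the experiment name, no lookup table is needed, and rerunning the analysis later puts every user back in the same cohort.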
Audits, ITSM and Security: Your friend, not the enemy?
Stephen points out that our old friends ITIL, ITSM, etc. are truly “old” friends – developed in a previous era to standardize the way IT operates in large enterprises. They made sense at the time, but haven’t aged well in the era of scalable resources. (For example, they may be good at controlling costs, but is that worth it if it takes weeks to get a firewall port opened for a resource that can be spun up on demand in seconds?)
This is somewhat a repeat of the DevSecOps movement / “Shift Left” on security, but he makes a good point:
“Audits are your friend, not your enemy. Use them to educate everyone that you’re better off with the new rules that you’re making and get feedback. Collaborate with your auditors early and often, and explain what you’re trying to accomplish. Get their input and I’m sure they’ll improve your thinking and your results… Once we illustrated that our controls were greatly improved because of the new rules we were employing around automation, our auditors became more comfortable with our future direction. By showing them early that we no longer had ownership spread across siloed teams sitting next to one another but communicating through tickets, and that the opportunity for human mistakes was much less”, resistance dropped.
One of the key points that Stephen makes is that automation can be applied to these (formerly late-stage) audit steps as well. If compliance rules are applied to infrastructure as code, “the compliance team can validate legal and security requirements every time the system is changed, rather than relying on a periodic system review”.
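Here is a rough sketch of what “compliance rules applied to infrastructure as code” can look like in practice. It is my own simplified illustration, not the book’s or AWS’s tooling: the `SecurityGroupRule` class, the `find_violations` function, and the policy itself (no SSH or RDP open to the whole internet) are all hypothetical. The idea is that a check like this runs on every change, for instance in a build pipeline, instead of waiting for a periodic review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SecurityGroupRule:
    """A simplified, hypothetical firewall rule as it might appear in IaC."""
    name: str
    port: int
    cidr: str


def find_violations(rules: List[SecurityGroupRule]) -> List[str]:
    """Return a human-readable message for each rule that breaks the policy."""
    violations = []
    for rule in rules:
        # Illustrative policy: SSH (22) and RDP (3389) must never be
        # reachable from every address on the internet.
        if rule.cidr == "0.0.0.0/0" and rule.port in (22, 3389):
            violations.append(
                f"{rule.name}: port {rule.port} is open to the internet"
            )
    return violations


if __name__ == "__main__":
    proposed_change = [
        SecurityGroupRule("web", 443, "0.0.0.0/0"),
        SecurityGroupRule("admin-ssh", 22, "0.0.0.0/0"),
    ]
    for message in find_violations(proposed_change):
        print("VIOLATION:", message)
```

A failing check blocks the change before it is deployed, which is the kind of evidence auditors tend to like: the rule is written down, versioned, and enforced the same way every time.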
Do You Have a Cloud Center of Excellence?
Stephen wrote that one of the best decisions he made at Dow Jones was creating a CCoE to codify how their cloud strategy would work and be executed across the org. Here are some points around creating a CCoE and making sure it doesn’t become more of a hindrance than a help:
- Makeup and where to start: I recommend putting together a team of three to five people from a diverse set of professional backgrounds. Try to find developers, system administrators, network engineers, IT operations, and database administrators. These people should ideally be open-minded and eager about how they can leverage modern technology and cloud services to do their jobs differently and take your organization into the future. …Start with the basics: roles and permissions, cost governance, monitoring, incident management, a hybrid architecture, and a security model. Over time, these responsibilities will evolve to include things like multi-account management, managing “golden” images, asset management, business unit chargebacks, and reusable reference.
- And make it metrics-oriented:
Organizations that do this well set metrics or KPIs for the CCoE and measure progress against them. I’ve seen metrics range from IT resource utilization, to the number of releases each day/week/month as a sign of increasing agility, to the number of projects the CCoE is influencing. Couple these with a customer-service centric approach, and other business units will want to work with your CCoE because they find value and because the CCoE is a pleasure to work with.
- Reference architecture: How can you build security and governance into your environment from the very beginning, and rely on automation to keep it up to date? If you can find and define commonalities in the tools and approaches you use across your applications you can begin to automate the installation, patching, and governance of them. You may want one reference architecture across the whole enterprise that still gives business units flexibility to add in what they need in an automated way. Alternatively, you might want multiple reference architectures for different classes or tiers of applications.
- Start small:
I encourage companies wanting to shift to a DevOps culture to do so in a DevOps fashion—start with small projects, iterate, learn, and improve. I encourage them to consider implementing strategies that produce commonly accepted practices across the organization, and to begin embracing the idea that, when automated, ongoing operations can be decentralized and trusted in the hands of many teams that will run what they build.
- Don’t make the CCoE another stage gate:
Since developers will be the ones most intimately familiar with the nuances of the system, they will likely be able to address issues the fastest. And by using automation, it is easy to methodically propagate changes and roll back or address issues before they impact customers. I encourage centralized DevOps teams to do what they can to make development teams increasingly independent, and not be in the critical path for ongoing operations/releases. …Instead of saying, “You can’t use that to do your job,” ask “What are you trying to accomplish and how can I help you be more effective?” Every time an app team implements a workaround for something the DevOps team can’t deliver, there’s an opportunity for the organization to learn how and why that happened, and decide if they should do things differently moving forward.
- Product ownership is the end game: Ownership simply means that any individual responsible for a product or service should treat that product or service as his or her own business. Products and services can take any number of forms: a website, a mobile application, the company’s e-mail service, desktop support, a security tool, a CMS, or anything that you deliver to your customer. …I try to encourage executives to make run what you build a crucial tenet.
- Why clearly defined roles are important: Our programs and teams have a culture that establishes tenets to help guide decisions and provide focus and priorities specific to their area. My recommendation is to define a set of cloud tenets to help guide you to the decisions that make the most sense for your organization. As one of my colleagues at AWS says, “Tenets get everyone in agreement about critical questions that can’t be verified factually.” For example— Do you want application teams to have full reign and control over all the services available in AWS, or should you enforce service standards or provide additional control planes on top of AWS? … First, we broke down the silos by defining a clear IT purpose. Then we thought about the main functions needed to reach our purpose. From there we turned each function into a group by defining the group’s purpose, the group’s domains (what the group owns), and the group’s accountabilities. The next step was to break each group into sub-groups and roles which are needed to reach that group’s purpose. For every sub-group and role, we defined their purpose, domains and accountabilities, and so on.
The Disruptive Power of the Cloud, and Avoiding Lock-in
Stephen talks at length about the power of the cloud as a disruptive force. He defines the cloud as “the on-demand delivery of information technology (IT) resources via the internet with pay-as-you-go pricing”. He mentions that since the inception of the Fortune 500 in 1955, between 20 and 50 companies have fallen off the list each year. Advances in technology are largely behind this steady rate of turnover, with the cloud being the most recent cause of large-scale disruption.
I particularly enjoyed the disclosure one competitor shared around FUD/ vendor lock-in:
“The only way we can salvage our market share for now is to fuel [fear] because the hard truth is that we simply do not have the arsenal to counter AWS’s dominance. More importantly, we constantly bombard these messages (vendor lock-in, security, et al) with the operational executives that are still (a vast majority in large enterprises) stuck in the traditional IT thinking and their existence threatened by the cloud wave.”
Having worked for many years at organizations that would take months to implement new infrastructure (badly), I can only agree; faced with nimble cloud competitors, we constantly got static about security and lock-in. In the age of serverless and IaC, it seems like such an anachronism.
That being said, I share some of the concerns around vendor lock-in. Stephen tries to dismiss this, somewhat glibly, countering that well-automated systems are ultimately portable:
What scares me is when companies fall into the trap of trying to architect a single application to work across multiple different cloud providers. I understand why engineers are attracted to this—it is quite an accomplishment to engineer the glue that is required to make different clouds work together. Unfortunately, this effort eats into the productivity gains that compelled the organization to the cloud in the first place… Companies that architect their applications using known automation techniques will be able to reliably reproduce their environments. This best practice is what enables them to take advantage of the elastic properties of the cloud, and will decouple the application from the infrastructure. If done well, it becomes less of a burden to move to a different cloud provider if there is a compelling reason to do so.
Using the Cloud To Fuel Innovation: (from Mark Schwartz): Most enterprises have not optimized for agility. If anything, they have optimized for efficiency – for doing what they do at the lowest cost. … I came to realize that a private cloud is not really a cloud at all, and it certainly is not a good use of company resources. One customer we work with, for example, has developed a business case around developer productivity. The customer (rightfully) believes that by migrating its data centers to AWS, and training its developers in the process, each of its 2,000 developers will be 50 percent more productive than they are today. Driven by the elimination of wait time for infrastructure provisioning—and access to more than 80 services they’d otherwise have to build/procure individually—this productivity boost will lead to an additional 1,000 years of developer capacity…each year. The customer intends to use this additional productivity to fund 100 new projects of 10 people each in an effort to find net new growth opportunities. …We’ve found that as much as 10 percent (I’ve seen 20 percent) of an enterprise IT portfolio is no longer useful, and can simply be turned off.
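The arithmetic behind that business case is worth spelling out, since it is easy to skim past. The numbers below come straight from the quote; the function names are just mine, and the model (every developer gains the same flat percentage) is obviously a simplification.

```python
def extra_developer_years(developers: int, productivity_gain: float) -> float:
    """Extra capacity per year if each developer gains the same fraction."""
    return developers * productivity_gain


def fundable_projects(extra_years: float, people_per_project: int) -> int:
    """How many full-year projects of a given team size that capacity covers."""
    return int(extra_years // people_per_project)


if __name__ == "__main__":
    # Figures from the quoted customer example.
    extra = extra_developer_years(developers=2_000, productivity_gain=0.50)
    print("Extra developer-years per year:", extra)
    print("Ten-person projects funded:", fundable_projects(extra, 10))
```

So 2,000 developers at 50 percent more productivity gives 1,000 additional developer-years each year, which is exactly the 100 projects of 10 people the customer plans to fund.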
The real goal of your “digital transformation” – which he says is total BS! – “is not about a transformation that has a finite end state. It’s about becoming an organization that is capable of quickly deploying technology to meet business needs, regardless of where the technology comes from.”
Don’t make a common tenet-writing mistake—creating a tenet that applies to many projects and communicates virtually no information, such as, “We will have world-class cloud capabilities.” Instead, think specific – if the pain point is the ability to provision and manage cloud services as fast as consuming the “native” platform directly—“Provision as fast as with a credit card.” Give your app teams the control / ability to consume cloud services without artificial barriers.
Leadership Is the Differentiator
From Andy Jassy (the CEO of AWS): “the single biggest differentiator between those who talk a lot about the cloud and those who have actual success is the senior leadership team’s conviction that they want to take the organization to the cloud.” He mentions the example of Jamie Miller, who in her cloud migration kickoff announced that GE was going to move 50 apps to AWS over 30 days. This was disruptive, and GE didn’t meet that aggressive goal initially – but it ended up working.
“… In my experience, it can be fatal if you don’t have the support of a single-threaded executive leader during the transition. This leadership function simply can’t be delegated. The CIO, or, at the very least, a direct report of the CIO has to lead this effort and be visible each and every day to provide direction and remove roadblocks.”
Leadership Beyond Memos: Early in my executive career, I was somewhat naive in thinking that, just because I issued a department-wide directive, everyone’s actions would follow. It wasn’t until I identified the things that were really important and communicated them over and over and over again that they started to stick. … I learned the hard way that this is, of course, not how leadership works. It wasn’t until I started to clearly articulate what was important about our strategy that the behavior of my team started to change. Before presenting a new idea or goal to my team, I had to consider how everyone fit into this strategy and how it tied back to the business and everyone’s careers. Then, I had to capitalize on every opportunity to reinforce these points. This meant talking strategy at quarterly town halls, on internal blogs, during sprint planning sessions, and using every meeting as an opportunity to relate the work being discussed back to our strategy. Sometimes it felt redundant, but the bigger your team is, the less likely each individual regularly hears from you. Remaining determined and being consistent with your communication is key.
Take Into Consideration the Audience: Stephen talks about the different motives/backgrounds of the roles you’ll be engaging with:
- CFOs are typically attracted to lower up-front costs and the ability to pay only for what you use.
- CMOs are typically looking to keep the company’s brand fresh and respond to changing market conditions.
- VPs of HR will want to see that you’re looking after your staff properly and how you’re hiring for new skills.
…most of the hard work at the executive level revolved around understanding each executive’s pain points, what they wanted to get out of IT, and aligning technology to help them meet their goals. After a few months of using the cloud to deliver better results faster, we spent several months retraining the executive team and their departments to refer to us as technology instead of IT. …I set a goal to take a few executives out for a meal each month. During the time we spent together, I did nothing but listen to their frustrations. I used what I learned to adjust our strategy, and made sure that I communicated back to them how their influence altered our direction.
Technology is Not A Cost Center: I’d argue that today’s IT executive needs to play the role of the Chief Change Management Officer (which I’ll refer to as a CCMO). Technology can no longer be viewed as something that simply supports the business. … a great way for leaders to address this friction is to give everyone on the team clarity around what will happen with their roles. …the role of the CIO and central IT is moving away from command and control, and toward line-of-business enablement. I’m also seeing some organizations … which have taken this one step further in a move toward complete decentralization, where culture and best practice serve as the forcing function that allows teams to operate independently. This trend—trading off consistency for time-to-market—is an important one.
Mainframe and Legacy Systems
The mainframe is often cited as a central point of gravity that stalls or elongates a large cloud migration. Many enterprises feel that they have few people who have the domain and technology expertise to execute a mainframe migration, and those who are still working on the mainframe can be harder to motivate for a cloud migration (though I do believe you already have the people you need). …There are three main approaches to mainframe migrations that we see customers exploring—re-hosting of workloads, batch-job migration and full re-engineering.
Metrics and KPIs
A self-motivated, self-grading team: We gave a fixed amount of resources to each line of business and held them accountable to key performance indicators (KPIs) that they set for themselves. Each technology and business owner overseeing a line of business had the ability to move resources around as their customer demand shifted, and we (the leadership team) reviewed KPIs and allocations quarterly to make any necessary changes. …These changes were hard, and there were times when I questioned our approach, thought I’d be fired, or otherwise just thought it would be easier to give up. We were constantly faced with judgment calls that we had to make with incomplete information and unknown risks.
The True Benefits Behind Microservices
…distilled down, it’s clear that their primary benefits are independent deployment and scalability. …An essential area of microservices that’s generally misunderstood is the independence aspect, or how solid the boundaries should actually be between other microservices. … the ramifications can be quite expensive from an infrastructure perspective. He would commonly ask: “If I whack your database, how many other teams or microservices will be impacted?”
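That “whack your database” question can be answered mechanically if you keep even a rough map of what depends on what. The sketch below is my own illustration, not from the book: the `blast_radius` function and the service names are made up, and it assumes you can write down each service’s dependencies as a simple dictionary.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(dependencies: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return every service that directly or indirectly depends on `failed`.

    `dependencies` maps each service to the set of things it depends on.
    """
    # Invert the map so we can walk from a component to its dependents.
    dependents: Dict[str, Set[str]] = {}
    for service, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)

    # The failed component is not counted as its own victim, even in a cycle.
    impacted.discard(failed)
    return impacted


if __name__ == "__main__":
    # Hypothetical system where two services share one database.
    deps = {
        "orders": {"orders-db"},
        "billing": {"orders-db", "billing-db"},
        "reports": {"billing"},
        "search": {"search-index"},
    }
    print(sorted(blast_radius(deps, "orders-db")))
```

In this made-up system, losing `orders-db` takes out `orders`, `billing`, and `reports`, which is a hint that the boundary between orders and billing is not as solid as the diagram might suggest.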
Why Education Shouldn’t Come Last: The Halo Effect
…I’ve found that finding the ones who aren’t afraid to lead the way (attitude is just as important as aptitude, in many cases) and investing in training and enablement for everyone can be among the most effective ways to get people over their fears. …having access to a seemingly infinite amount of on-demand IT resources can change the game for any organization, provided the culture promotes the use of it. Failure is a lot less expensive when you can just spin down what didn’t work. Educating your staff can turn skeptical employees into true believers, and will make a huge difference in how quickly you’re able to leverage the cloud to deliver results to your business.
Mental health specialists say that acceptance is the first step toward recovery; and that’s totally applicable here, too. Your engineers must accept the fact that they have the ability to learn AWS cloud skills and become experts. It’s also incredibly important for technology leaders within your organization to accept this. As Stephen Orban explains, and as my tenure at Capital One shows, the talent you already have is the talent you need. These are the people who have many years of critical experience developing and running your existing systems.
Reaching Critical Mass: Experience at Capital One and with many of our customers—plus scientific study—has shown that you need to reach a critical mass of 10 percent of engineers advocating a platform before the network effect takes hold. So, scaling this learning and certification to 10 percent of your engineers is a major milestone in your journey. From here onward you get a compelling Halo Effect which starts to influence how your company is seen externally and not just internally. Those engineers externally to your organization who only want to work with Cloud Native companies, will start seriously considering working for you.
Adrian Cockcroft: An executive once told me “We can’t copy Netflix because we don’t have the people.” My response was “Where do you think they used to work? We hired them from you and got out of their way…”
Resources
- Automating Governance on AWS: https://d0.awsstatic.com/whitepapers/compliance/Automating_Governance_on_AWS.pdf
- The Amazon Cloud Adoption Framework: https://d1.awsstatic.com/whitepapers/aws_cloud_adoption_framework.pdf
- Mainframe migration to AWS whitepaper by Sanjeet Sahay, Tom Laszewski, and Stephen – http://www.experienceinfosys.com/Mainframe-Modernization-Aug16-13
- http://amzn.to/cloud-native-vs-lift-and-shift