DevOps Stories – an Interview with Ryan Comingdeer of Five Talent Software


Ryan Comingdeer is the CTO of Five Talent Software, a thriving software consultancy firm based in Oregon with a strong focus on cloud development and architecture. Ryan has 20 years of experience in cloud solutions, enterprise applications, IoT development, website development and mobile apps.  

Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!

 

Obstacles in implementing Agile: Last week I was talking to a developer at a large enterprise who was boasting about their adoption of Agile. I asked him – OK, that’s terrific – but how often do these changes get out the door to production? It turns out that these little micro changes get dropped off at the QA department, and then are pushed out to staging once a month or so… where they sit, until they’re deemed ready to release and the IT department is ready – once a quarter. So that little corner was Agile – but the entire process was stuck in the mud.

The first struggle we often face when we engage with companies is just getting these two very different communities to talk to one another. Often it’s been years and years of the operations department hating on the development team, and the devs not valuing or even knowing about business value and efficiency. This is hard work, but understanding that philosophy and seeing the other side of things is that vital first step.

I know I’ve won in these discussions – and this may be 12 meetings in – when I can hear the development team agreeing to the operations team’s goals, or an operations guy speaking about development requirements. You have to respect each other and view work as a collaborative effort.

For the development teams, often they’re onboard with change because they recognize the old way isn’t working. Oftentimes the business throws out a deadline – ‘get this done by April 1st’ – and when they try to drill into requirements, they get an empty chair. So they do the best they can – but there are no measurable goals, no iterative way of proving success over an 18-month project. So they love the idea of producing work often in sprints – but then we have to get them to understand the value of prototyping, setting interim deliverables and work sizing.

Then we get to the business stakeholders, and have to explain – this is no longer a case where we can hand off a 300-page binder of requirements and ask a team to ‘get it done’. The team is going to want us involved, see if we’re on the right track, get some specific feedback. Inevitably we get static over this – because this seems like so much more work. I mean, we had it easy in the old days – we could hand off work, and wait 12 months for the final result. Sure, the end result was a catastrophic failure and everybody got fired, but at least I wasn’t hassled with all these demos and retrospectives every two weeks! That instant feedback is really uncomfortable for many leaders – there’s no insulation, no avoidance of failure. It does require a commitment to show up and invest in the work being done as it’s being done.


Retrospectives for me are one of the best things about Agile. I wish they were done more often. We do two, one internally, then a separate one with the customer so we’re prepared – and we’re upfront, here’s where we failed, here’s the nonbillable time we invested to fix it. You would think that would be really damaging, but we find it’s the opposite. The best thing a consulting company can do is show growth, reviewing successes and failures directly and honestly to show progress. Our relationships are based on trust – the best trust building exercise I’ve seen yet is when we admit our failure and what we’re going to do to fix it. I guarantee you our relationship with the customer is tighter because of how we handled a crisis – versus attempting to hide, minimize, or shift blame.  

Implementing DevOps: It’s very common that the larger organizations we work with aren’t sure of where to start when it comes to continuous integration or CD. Where do I begin? How much do I automate? Often it comes down to changing a habit, like checking in a new feature only after, say, two weeks of work. That’s just not going to cut it – what can we deliver in 4 hours?

That being said, CI/CD is Step 1 to DevOps; it’s fundamental. Infrastructure as Code is further down the list – it takes a lot of work, and it’s sometimes hard to see the value of it. Then you start to see the impact with employee rotation, and especially when you have to roll back. Think about how much easier a rollback is when you can see what was changed and when; without that, you might be stuck and have to fix a problem in place. The single biggest selling point for Infrastructure as Code is security; you can demonstrate what you’re doing to regulate environments, you can show up to an audit prepared with a list of changes, who made them and what they were, and a complete set of security controls.

A True MVP: Most of the companies we work with come to us because they’ve got a huge backlog of aging requests, these mile-long wish lists from sales and marketing teams. We explain the philosophy behind DevOps and the value of faster time to market, small iterations, more stable environments, and a reliable deployment process. Then we take those huge lists of wishes and break them down into very small pieces of work, and have the business prioritize them. There’s always one that stands out – and that’s our starting point.

The first sprint is typically just a proof of concept of the CI/CD tools and how they can work on that top #1 feature we’ve identified. The development team works on it for perhaps 2 days, then sysops takes over and uses our tooling to get this feature into the sandbox environment and then production. This isn’t even a beta product, it’s a true MVP – something for friends and family. But it’s an opportunity to show the business and get that feedback that we’re looking for – is the UI ok? How does the flow look? And once the people driving business goals sit down and start playing with the product on that first demo, two weeks later, they’re hooked. And we explain – if you give us your suggestions, we can get them to staging and then onto production with a single click. It sells itself – we don’t need long speeches.  

The typical reaction we get is – “great, you’ve delivered 5% of what I really want. Come back when it’s 100% done.” And the product is a little underwhelming. But that’s because we’re not always sticking to the true definition of a minimum viable product (MVP). I always say, “If an MVP is not something you’re ashamed of, it’s not an MVP!” Companies like Google and Amazon are past masters at this – they throw something crude out there and see if it sticks. It’s like they’re not one company but 1,000 little startups. You’ve got to understand when to stop, and get that feedback.

I’ve seen customers go way down in the weeds and waste a ton of money on something that ends up just not being viable. One customer I worked with spent almost $250K and a year polishing and refactoring this mobile app endlessly, when we could have delivered something for about $80K – a year earlier! Think of how the market shifted in that time, all the insights we missed out on. Agile is all about small, iterative changes – but most companies are still failing at this. They’ll make small changes, and then gate them so they sit there for months. 

When we start seeing real progress is when the product is released ahead of deadline. That really captures a lot of attention – whoa, we wanted this app written in 15 months, you delivered the first version in two weeks – nine months in we can see we’re going to be done 4 months early because of our cadence.

So here’s my advice – start small. Let me give you one example – we have one customer that’s a classic enterprise – they’ve been around for 60 years, and it’s a very political, hierarchical climate, very waterfall oriented. They have 16 different workloads. Well, we’re really starting to make progress now in their DevOps transformation – but we never would have made it if we’d tried this all-in massive crusade effort. Instead, we took half of one workload, as a collection of features, and said – we’re going to take this piece and try something new. We implemented Agile sprints and planning, set up automated infrastructure, and CI/CD. Yeah, it ruffled some feathers – but no one could argue with how fast we delivered these features, and how much more stable they were, and how happy the customers were because we involved them in the process.

The biggest problem we had was – believe it or not – getting around some bad habits on having meetings for the sake of having meetings. So we had to set some standards – what makes for a successful meeting? What does a client acceptance meeting look like?

Even if you’re ‘just a developer’, or ‘just an ops guy’, you can create a lot of change by the way you engage with the customer, by documenting the pieces you fill in, by setting a high standard when it comes to quality and automation.  

Documentation: I find it really key to write some things down before we even begin work. When a developer gets a two-week project, we make sure expectations are set clearly in documentation. That helps us know what the standards of success are, gets QA on the same page – it guides everything that we do.

I also find it helps us corral the chaos caused by runaway libraries. We have baseline documentation for each project that sets the expectation of the tools we will use. Here, I’ll just say – it’s harder to catch this when you’re using a microservice architecture, where you have 200 repos to monitor for the JavaScript libraries they’re choosing. Last week, we found this bizarre PDF writer that popped up – why would we have two different PDF generators for the same app? So we had to refactor so we’re using a consistent PDF framework. That exposed a gap in our documentation, so we patched that and moved on.
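One way to catch a stray library like that second PDF generator is to compare each repo's declared dependencies against the project's documented baseline. The following is a minimal Python sketch under stated assumptions, not the tooling described in the interview: the repo names, the package lists, the `APPROVED` set, and the `find_unapproved` helper are all hypothetical, and it assumes each repo's dependencies have already been read out of its package.json.

```python
from typing import Dict, List, Set

# Hypothetical baseline: the libraries the project documentation approves.
APPROVED: Set[str] = {"express", "pdfkit", "winston"}


def find_unapproved(repos: Dict[str, List[str]], approved: Set[str]) -> Dict[str, List[str]]:
    """Return, per repo, the sorted dependencies that are not in the baseline."""
    report: Dict[str, List[str]] = {}
    for repo, deps in repos.items():
        extras = sorted(set(deps) - approved)
        if extras:
            report[repo] = extras
    return report


if __name__ == "__main__":
    # Hypothetical dependency lists, as if read from each repo's package.json.
    repos = {
        "invoicing-service": ["express", "pdfkit"],
        "reports-service": ["express", "jspdf", "winston"],
    }
    for repo, extras in find_unapproved(repos, APPROVED).items():
        print(f"{repo}: not in baseline -> {', '.join(extras)}")
```

Run on a schedule across all repos, a report like this would flag a second PDF library as soon as it is added, rather than when someone happens to notice it.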

Documentation is also a lifesaver when it comes to onboarding a new engineer. We can show them the history of the project, and the frameworks we’ve chosen, and why. Here’s how to use our error logging engine, this is where to find Git repos, etc. It’s kept very up to date, and much of it is customer facing. We present the design pattern we’ll be using, here are the test plans and how we’re going to measure critical paths and handle automated testing. That’s all set and done before Day 1 with the customer so expectations are in line with reality.

We do use a launch checklist, which might cover 80% of what comes up – but it seems like there are always some weird gotchas that crop up. We break up our best practices by type – for our Microsoft apps, IoT, monoliths, or mobile – each one with a little different checklist.

It’s kind of an art – you want just the right amount, not too much, not too little. When we err, I think we tend to over-document. Like most engineers, I tend to overdo it as I’m detail-oriented. But for us documentation isn’t an afterthought, it’s a set of guardrails. It sets the rules of engagement, defines how we’re measuring success. It’s saved our bacon, many times!

Microservices: You can’t just say ‘microservices are only for Netflix or the other big companies’. It’s not the size of the team, but the type of the project. You can have a tiny one-developer project and implement it very successfully with microservices. It does add quite a bit of overhead, and there’s a point of diminishing returns. We still use monolith-type approaches when it comes to throwaway proofs of concept, where you can just crank it out.

And it’s a struggle to keep these services discrete and finite. Let’s say you have a small application, how do you separate out the domain modules for your community area and say an event directory so they’re truly standalone? In the end you tend to create a quasi-ORM, where your objects have a high dependency on each other; the microservices look terrific at the app or the UI layer, but there’s a shared data layer. Or you end up with duplicated data, where the interpretation of ‘customer’ data varies so much from service to service.  

Logging is also more of a challenge – you have to put more thought into capturing and aggregating errors with your framework. 

But in general, microservices are definitely a winner and our choice of architecture. Isolation of functionality is something we really value in our designs; we need to make sure that changes to invoicing won’t have any effect on inventory management or anything else. It pays off in so many ways when it comes to scalability and reliability.  

Testing: We have QA as a separate functional team; there’s a ratio of 25 devs to every QA person. We make it clear that writing automated unit tests, performance tests, security tests – that’s all in the hands of the developers. But manual smoke tests and enforcing that the test plan actually does what it’s supposed to is all done by the QA dept. We’re huge fans of behavior driven development, where we identify a test plan, lay it out, the developer writes unit tests and QA goes through and confirms that’s what the client wanted.

With our environments, we do have a testing environment set up with dummy data; then we have a sandbox environment, with a one-week-old set of actual production data, where we do performance and acceptance testing. That’s the environment the customer has full access to. We don’t do performance testing against production directly. We’re big fans of using software to mimic production loads – anywhere from 10 users/sec to 10K users/sec, along with mocks and fakes with our test layer design.
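A load range that wide is usually stepped through in stages rather than hit all at once. As a rough illustration only, the Python sketch below plans geometrically spaced stages between two rates; the `ramp_steps` helper and the choice of four stages are assumptions for the example, not something the interview specifies, and the actual load would still be generated by whatever load-testing tool the team uses.

```python
from typing import List


def ramp_steps(start: float, end: float, steps: int) -> List[int]:
    """Return `steps` target rates (users/sec) spaced geometrically
    from `start` to `end`, rounded to whole users."""
    if steps < 2 or start <= 0 or end < start:
        raise ValueError("need steps >= 2 and 0 < start <= end")
    ratio = (end / start) ** (1 / (steps - 1))
    return [round(start * ratio ** i) for i in range(steps)]


if __name__ == "__main__":
    # Four stages spanning the range mentioned above.
    for rate in ramp_steps(10, 10_000, 4):
        print(f"hold at {rate} users/sec")
```

Geometric spacing gives each stage the same multiplicative jump, which makes it easier to see at which order of magnitude the sandbox starts to degrade.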

Continuous Learning: To me continuous learning is really the heart of things. It goes all the way back to the honest retrospectives artifact in scrum – avoiding the blame game, documenting the things that can be improved at the project or process level. It’s never the fault of Dave, that guy who wrote the horrible code – why did we miss that as a best practice in our code review? Did we miss something in how we look at maintainability, security, performance? Are lead developers setting expectations properly? How can we improve in our training?

Blame is the enemy of learning and communication. The challenge for us is setting the expectation that failure is an expected outcome, a good thing that we can learn from. Let’s count the number of failures we’re going to have, and see how good our retrospectives can get. We’re going to fail, that’s OK – how do we learn from these failures?

Usually our chances of winning come down to one thing – a humble leader. If the person at the top can swallow their pride, knows how to delegate, and recognizes that it will take the entire team to be engaged and solve the problem – then a DevOps culture change is possible. But if the leader has a lot of pride, usually there’s not much progress that can be made.

Monitoring: Monitoring is too important to leave to the end of the project; that’s our finish line. So we identify what the KPIs are to begin with. Right now it revolves around three areas – performance (latency of requests), security (breach attempts), and application logs (errors returned, availability and uptime). For us, we ended up using New Relic for performance indicators, Datadog for their app layer KPIs, and Amazon Inspector for security. OWASP has a set of tools they recommend for scanning; we use these quite often for our static scans.
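Deciding the KPIs up front can be as simple as writing them down as data with a limit for each. The Python sketch below is illustrative only: the KPI names, the thresholds, and the `breached` helper are hypothetical, and it does not use the New Relic, Datadog, or Amazon Inspector APIs. It only shows the shape of a check across the three areas named above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Kpi:
    name: str
    area: str        # "performance", "security", or "application"
    limit: float     # hypothetical threshold; set per project
    higher_is_better: bool = False


# Hypothetical KPIs covering the three areas named above.
KPIS = [
    Kpi("p95_latency_ms", "performance", 500.0),
    Kpi("breach_attempts_per_hour", "security", 20.0),
    Kpi("errors_per_minute", "application", 5.0),
    Kpi("availability_pct", "application", 99.9, higher_is_better=True),
]


def breached(kpis: List[Kpi], readings: Dict[str, float]) -> List[str]:
    """Return names of KPIs whose reading is on the wrong side of its limit.

    A KPI with no reading counts as breached, so gaps in monitoring surface.
    """
    alerts = []
    for kpi in kpis:
        value = readings.get(kpi.name)
        if value is None:
            alerts.append(kpi.name)
        elif kpi.higher_is_better and value < kpi.limit:
            alerts.append(kpi.name)
        elif not kpi.higher_is_better and value > kpi.limit:
            alerts.append(kpi.name)
    return alerts
```

Treating a missing reading as a breach is a deliberate choice in this sketch: it makes an unmonitored KPI as visible as a failing one.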

Sometimes of course we have customers that want to go cheap on monitoring. So, quite often, we’ll just go to app level errors; but that’s our bare minimum. We always log, sometimes we don’t monitor. We had this crop up this morning with a customer – after a year or more, we went live, but all we had was that minimal logging. Guess what, that didn’t help us much when the server went down! Going bare-bones on monitoring is something customers typically regret, because of surprises like that. Real user monitoring, like you can get with any cloud provider, is another thing that’s incredibly valuable for checking things like latency across every region.

Production Support by Developers: Initial on-call support is handled in-house by a separate Sysops team; we actually have it in our agreement with the customer that application developers aren’t a part of that on-call rotation. If something has made it through our testing and staging environments, that knocks out a lot of potential errors. So 90% of the time a bug in production is not caused by a code change, it’s something environmental – a server reboot, a firewall config change, an expired SSL cert. We don’t want to hassle our developers with this. But we do have them handle some bug triage – always during business hours though.

Let’s just be honest here – these are two entirely separate disciplines, specialties. Sysops teams love ops as code and wading through server error logs – developers hate doing that work! So we separate out these duties. Yes, we sometimes get problems when we move code from a dev environment to QA – if so, there’s usually some information missing that the dev needs to add to his documentation in the handoff to sysops.  

And we love feature flags and canary releases. Just last week we rolled out an IoT project to 2000 residential homes. One feature we rolled out to only the Las Vegas homes to see how it worked. It works great – the biggest difficulty we find is documenting and managing who’s getting new features and when, so you know if a bug is coming from a customer in Group A or B.
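The bookkeeping problem described above, knowing which group a customer was in when a bug appeared, can be handled by deriving the group from the same flag table that drives the rollout. This is a minimal Python sketch under assumptions: the flag name, the `Home` record, and both helper functions are hypothetical and are not taken from the project discussed in the interview.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Home:
    home_id: str
    city: str


# Hypothetical flag table: flag name -> cities in the canary group.
CANARY_CITIES: Dict[str, Set[str]] = {
    "new_thermostat_schedule": {"Las Vegas"},
}


def rollout_group(flag: str, home: Home) -> str:
    """Return "canary" if the home gets the flagged feature, else "control"."""
    return "canary" if home.city in CANARY_CITIES.get(flag, set()) else "control"


def tag_bug_report(flag: str, home: Home, message: str) -> Dict[str, str]:
    """Attach the rollout group to a bug report so triage can tell which
    version of the feature the home was running."""
    return {
        "home_id": home.home_id,
        "flag": flag,
        "group": rollout_group(flag, home),
        "message": message,
    }
```

Because the group is computed from the flag table rather than recorded by hand, the bug report and the rollout cannot drift apart; an unknown flag falls back to "control".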

Automation: For us, automating everything is the #1 principle. It reduces security concerns, drops the human error factor, and increases our ability to experiment faster with infrastructure and our codebase. Being able to spin up environments and roll out POCs is so much easier with automation. It all comes down to speed. The more automation you have in place, the faster you can get things done. It does take effort to set up initially, but the payoff is more than worth it. Getting your stuff out the door as fast as possible with small, iterative changes is the only really safe way; that’s only possible with automation.

You would think everyone would be onboard with the idea of automation over manually logging on and poking around on VMs when there’s trouble, but – believe it or not – that’s not always the case. And sometimes our strongest resistance to this comes from the director/CTO level!

Security: First, we review compliance with the customer – half the game is education. We ask them if they’re aware of what GDPR is – for 90% of our customers, that’s just not on their radar, and it’s not really clear at this point what compliance means specifically in how we store user information. So we give them papers to review, and drop tasks into our sprints to support compliance for the developers and the sysops team with the CI/CD pipeline.

Gamedays: Most of my clients aren’t brave enough to run something like Simian Army or Chaos Monkey on live production systems! But we do gamedays, and we love them. Here’s how that works: 

I don’t let the team know what the problem is going to be, but one week before launch – on our sandbox environments, we do something truly evil to test our readiness. And we check how things went – did alerts get fired correctly by our monitoring tools? Was the event logged properly? How did the escalation process work, and did the right people get the information they needed fast enough to respond? Did they have the access they needed to make the changes? Were we able to use our standard release process to get a fix out to production? Did we have the right amount of redundancy on the team? Was the runbook comprehensive enough, and were the responders able to use our knowledgebase to track down similar problems in the past to come up with a remedy?  
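The review questions in the paragraph above lend themselves to a repeatable scorecard, so each gameday is graded against the same list. The Python sketch below is one possible shape for that, assuming nothing about the team's real tooling: the check keys are shorthand for the questions above, and the `GamedayResult` class is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The review questions from the paragraph above, as short keys.
CHECKS = [
    "alerts_fired",
    "event_logged",
    "escalation_worked",
    "responders_had_access",
    "fix_used_standard_release",
    "team_redundancy_ok",
    "runbook_sufficient",
]


@dataclass
class GamedayResult:
    scenario: str
    outcomes: Dict[str, bool] = field(default_factory=dict)

    def record(self, check: str, passed: bool) -> None:
        """Store the outcome of one review question."""
        if check not in CHECKS:
            raise ValueError(f"unknown check: {check}")
        self.outcomes[check] = passed

    def follow_ups(self) -> List[str]:
        """Checks that failed or were never evaluated, in checklist order."""
        return [c for c in CHECKS if not self.outcomes.get(c, False)]
```

Anything returned by `follow_ups()` becomes a work item to close out before the hard launch; a question nobody answered is treated the same as one that failed.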

The whole team loves this, believe it or not. We learn so much when things go bump in the night. Maybe we find a problem with auto healing, or there’s an opportunity to change the design so the environments are more loosely coupled. Maybe we need to clear up our logging, or tune our escalation process, or spread some more knowledge about our release pipeline. There’s always something, and it usually takes us at least a week to fold these lessons learned into the product before we do a hard launch. Gamedays are huge for us – so much so, we make sure it’s a part of our statement of work with the customer.

For one recent product, we did three Gamedays on sandbox and we felt pretty dialed in. So, one week before go-live, we injected a regional issue on production – which forced the team to duplicate the entire environment into a completely separate region using cold backups. Our SLA was 2 hours; the whole team was able to duplicate the entire production set from Oregon to Virginia datacenters in less than 45 minutes! It was such a great team win, you should have seen the celebration.  

 
 


 
