The following content is shared from an interview with John-Daniel Trask, co-founder and CEO of Raygun, a New Zealand-based company that specializes in error, crash, and performance monitoring. John-Daniel (or JD) went from repairing PCs out of college, to working as a developer, to starting several very successful businesses, including what became Mindscape and its monitoring product, Raygun.
We covered a lot of ground here, and we think you’ll love the following thoughts:
- Is a DevOps team really such a bad thing?
- Why forcing your devs to go to an event booth might be a very good thing
- When is a “requirement” not really a requirement?
- Starting from scratch, with nothing – where would you start?
- What’s the golden ticket to get funding and support for your requests and projects?
And last but not least – “it’s not the big that eat the small, it’s the fast that eat the slow!”
Note – these and other interviews and case studies will form the backbone of our upcoming book “Achieving DevOps” from Apress, due out in late 2018. Please contact me if you’d like an advance copy!
Is DevOps culture first? Well I definitely run into a lot of zealots who swing one side or another. Some people pound the table and say that DevOps is nothing about tools, that it’s all culture and fluffy stuff. These are usually the same people who think a DevOps team is an absolute abomination. Others say it’s all about automation and tooling.
Personally, I’m not black and white on it. I don’t think you can go and buy DevOps in a box; I also don’t think that “as long as we share the same psychology, we’ve solved DevOps.” Let’s take the whole idea of a DevOps team being an antipattern for example. For us it’s not that simple – it’s very easy, at a 16-person startup, to say that a DevOps team is a horrible idea. Well, of COURSE you’d think that – for you, cross-team communication is as easy as turning around in your chair! But let’s take a larger enterprise, 50,000 people or so, with hundreds of engineering teams. You can’t just hand down “we’re doing DevOps” as an edict and it’s solved. In that case, I have seen a DevOps team be very successful as a template, something that helps spread the good word by example and train up individual engineering teams to adopt DevOps (this is an actual situation we saw with a top 10 sized software company and it worked very well for them).
What’s a common blind spot you see with many programmers? It’s quite shocking how little empathy there is by most software engineers for their actual end users. You would think the stereotypical heads-down programmer would be a dinosaur, last of a dying breed, but it’s still a very entrenched mindset. I sometimes joke that for most software engineers, you can measure their entire world as being the distance from the back of their head to the front of their monitor. There’s a lack of awareness and even care about things like software breaking for your users, or a slow loading site. No, what we care about is – how beautiful is this code that I’ve written, look how cool this algorithm is that I wrote.
We sometimes forget that it all comes down to human beings. If you don’t think about that first and foremost, you’re really starting off on the wrong leg.
One of the things I like about Amazon is the mechanisms they have to put their people closer to the customer experience. We try to drive that at Raygun too. We often have to drag developers to events where we have a booth. Once they’re there, the most amazing thing happens – we have a handful of customers come by and they start sharing about how amazing they think the product is. You start to see them puff out their chests a little – life is good! And the customers start sharing a few things they’d like to see – and you see the engineers start nodding their heads and thinking a little. We find those engineers come back with a completely different way of solving problems, where they’re thinking holistically about the product, about the long-term impact of the changes they’re making. Unfortunately, the default behavior is still to avoid that kind of engagement, it’s still out of our comfort zone.
Using Personas to Weed Out Red Herrings: I don’t know if we talk enough in our industry about weeding out bad feedback. We often get requests from our customers to do things like dropping a data grid with RegEx on a page. That’s the kind of request that comes from the nerdiest of the nerds – and if we were to take that seriously, think of the opportunity cost and what it would do to our own UX!
We weed out requests like this by using personas. For our application, we think in terms of either a CEO, a tech lead, or an operator. Each has their own persona and backstory, and we’ve thought out their story end to end and how they want to work with our software.
For the CXO level, the VP’s, the directors – these are people who understand their whole business hinges on the quality of their software. They need to keep this top of mind at the very top levels of decision making. For this person, there are graphs and charts showing this strategic level fault and UX information, all ready to drop into reports to the executive board. Then there’s the mid-tier – these are your tech leads, the Director of Engineering – they need to know both high level strategic 30K foot information, and a summary of key issues. The cutting edge though is that third tier, your developer or operator. This person needs to have information when something goes bump in the night. For them, you have stack traces, profiling raw data, user request waterfalls. Without that information, troubleshooting becomes totally a stab in the dark.
Lots of companies use personas, I know. They’re really critical to filter out noise and focus on a clear story that will thrill your true user base.
How can error and crash reporting make for a better performing business? Most of the DevOps literature and thinking I see focuses entirely on build pipelines, platform automation, the deployment story, and that’s the end of it. Monitoring and checking your application’s real-world performance and correcting faults usually just gets a token mention, very late in the game. But after you deploy, the story is just beginning!
I hate to say this – but I think we’re still way behind the times when it comes to having true empathy with our end users. It’s surprising how entrenched that mindset of monitoring being an afterthought or a bolt-on can be. Sometimes we’ll meet with customers and they’ll say that they just aren’t using any kind of monitoring, that it’s not useful for them. And we show them that they’re having almost 200,000 errors a day – impacting 25,000 users each day with a bad experience. It’s always a much, much larger number than they were expecting – by a factor of 10 sometimes. Yet somehow, they’ve decided that this isn’t something they should care about. A lot of these companies have great ideas that their customers love – but because the app crashes nonstop, or is flaky, it strangles them. You almost get the thinking that a lot of people would really rather not know how many problems there really are with what they’re building.
Yet time and again, we see companies that really care about their customers excel. Let’s say I take you back in time to 2008, and I give you $10,000 to invest in any company you want. Are you going to put that into Microsoft, Apple, Google, or Domino’s Pizza? Well guess what – Domino’s has kicked the butt of all those big tech companies with their market cap growth rate! The answer is in their DNA – they devote all their attention to ensuring their customers have a great experience. Their online ordering experience is second to none. And that all comes from them being customer obsessed, paying attention to finding where that experience is subpar and fixing it. It’s never a coincidence that customer-centric companies consistently outperform and dominate.
What’s forced us as an industry to change and driven a better user experience is Google, believe it or not. They started publishing a lot of research and data around application errors, performance, and prioritizing well performing sites. This democratized things that data scientists were just starting to figure out themselves. And it seemed like overnight, a lot of people cared very much that their website not be dog slow – because otherwise, it wouldn’t be on the first page results of a web search, and their sales would tank. But folks often didn’t care about performance or the end user experience – until Google forced us to.
What would you say to the company that is starting from ground zero when it comes to DevOps? I’m picturing here a shop where they take ZIP files and remote desktop onto VM’s and copy-paste their deployments. If that’s the case – I like to talk about what are the small things you could put into place that would dramatically improve the quality of life on the team. These are big impact, low cost type improvements. So where would I start?
- First would come automating the deployments. Just in reliability alone, that’s a huge win. Suddenly I have real peace of mind. I can roll out releases and roll them back with a single button push, and it’s totally repeatable as a process. If I’m an oncall engineer, being able to roll out a patch through a deployment process that runs automatically at 3 a.m. is a world of difference from manually pushing assets.
- The second thing I would do is set up some basic metrics with a tool like StatsD. You don’t need to allocate a person to spend several days – it’s a Friday afternoon kind of thing to start with. When you start tracking something – anything! – and put it up on the wall, that’s when people start to get religion. We saw this ourselves with our product, once we put up some monitors showing some of the things coming from StatsD, like the number of times users were logging in and the number of login failures. It was like watching an ultrasound monitor of your child. People started gathering around, big smiles on their faces – things were happening, and they felt this connection between what they were doing and their baby, out there in the big bad old world. Right away some of that empathy gap started to close.
- Third would come crash reporting. There’s just no excuse not to put this into place – it takes like ten minutes, and it cuts out all that waste and thrash in troubleshooting and fuels an improvement culture.
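To make the "Friday afternoon" StatsD step concrete, here is a minimal sketch of counting logins and login failures. It talks to StatsD directly over UDP using the plain-text counter format (`name:value|c`) rather than any particular client library. The metric names, the `record_login` helper, and the local host and port are assumptions for illustration, not Raygun's actual setup.

```python
import socket

STATSD_HOST = "127.0.0.1"  # assumption: a StatsD daemon is listening locally
STATSD_PORT = 8125         # the port StatsD conventionally listens on


def format_counter(name: str, value: int = 1) -> bytes:
    """Build a StatsD counter datagram of the form '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1) -> None:
    """Fire-and-forget a counter increment over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (STATSD_HOST, STATSD_PORT))


def record_login(succeeded: bool) -> None:
    """Hypothetical hook to call from the login handler."""
    send_counter("app.login.attempt")      # hypothetical metric name
    if not succeeded:
        send_counter("app.login.failure")  # hypothetical metric name
```

Because the datagrams are sent over UDP without waiting for a reply, a missing or slow metrics daemon should not hold up the login path, which is part of why this is a low-cost place to start.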
How do we communicate in the language of business? What I wish more engineering teams understood is how to communicate in the language of business. I’m not asking developers to get an MBA in their off hours – but please TRY to frame things in terms of dollars, economic impact, or cost to the customer. Instead we say, “this shiny new thing looks like it could be helpful”. It’s no wonder engineering talent often feel like the business won’t allow them to get the tools they want – it’s like you’re speaking another language to the folks with the check book.
There’s a reason why we often have to beg to get our priorities on the table from the business. We haven’t earned the trust yet to get “a seat at the table”, plain and simple. We tend to be very maxed out, overwhelmed, and we’re pretty cavalier with our estimates around development. This reflects technology – which is fast moving, there’s so much to learn, and it’s not in a stable state. But when engineers hem and haw about their estimates or argue for prioritizing pet projects that are solely tech-driven, it makes us look unreliable as a partner in the business. We haven’t learned yet to use facts and tie our decisions into saving money or getting an advantage in the market.
Always keep this in mind – any business person can make the leap to dollars. But if you’re making an argument and you are talking about code – that’s a bridge too far. It’s too much to expect them to make that jump from code to customer to dollars. If you tell me you need React 16, that won’t sell. But if you say 10% of your customers will have a better experience because of this new feature – any business person can look at that and make the connection, that could be 5,000 customers that are now going to have a better experience. You don’t have to be Bill Gates to figure out that’s a good move!
Let’s get down to brass tacks – how do I make this monitoring data actionable? We wouldn’t think about putting planes in the air without a black box – some way of finding out after something goes wrong what happened, and why. That’s what crash monitoring is, and it’s incredibly actionable. You know the health of your deployment cycle, you can respond faster when changes are introduced that degrade that customer experience.
Let’s say you are seeing 100,000 errors a month. Once you group them by root cause, that overwhelming blizzard of problems gets cut down to a size that is smaller than you’d think. You may have 1,000 distinct errors, but only 10 actual, honest-to-goodness bugs. Then you break it down by user, and that’s when things really settle out. You might find that one user is using a crappy browser extension that’s blocking half your scripts – that isn’t an issue really, and not one you can fix for them. But then there’s that one error that’s happened only 500 times – but it’s hitting 250 of your customers. That’s a different story! So you’re shifting your conversation already from how many errors you’re seeing to the actual number of customers you’re impacting – that’s a more critical number, and one that everyone from your CEO down understands. And it’s actionable. You can – and you should – take those top 2 or 3 bugs and drop them right into your dev queue for the next sprint.
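The shift from "how many errors" to "how many customers" can be sketched in a few lines. This is an illustration only, not how Raygun groups errors: the `ErrorEvent` shape, the `fingerprint` field standing in for a root cause, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    fingerprint: str  # hypothetical root-cause key, e.g. exception type plus top stack frame
    user_id: str


def rank_by_users_affected(events):
    """Return (fingerprint, occurrences, distinct_users) rows, most users first."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for event in events:
        counts[event.fingerprint] += 1
        users[event.fingerprint].add(event.user_id)
    rows = [(fp, counts[fp], len(users[fp])) for fp in counts]
    # Sort by users affected, then by raw occurrences, then by name for stability.
    return sorted(rows, key=lambda row: (-row[2], -row[1], row[0]))


# Made-up data: one noisy user with a script-blocking extension,
# and one quieter bug that touches 250 different customers.
events = (
    [ErrorEvent("ScriptBlocked@loader.js", "user-1")] * 2000
    + [ErrorEvent("NullRef@checkout", f"user-{i % 250}") for i in range(500)]
)

for fp, occurrences, affected in rank_by_users_affected(events):
    print(f"{fp}: {occurrences} occurrences, {affected} users affected")
```

Ranked by raw count, the extension noise would come first. Ranked by distinct users, the checkout bug rises to the top of the list, which is the one you would pull into the next sprint.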
This isn’t rocket science, and it isn’t hard. Reducing technical debt and improving speed is just a matter of listening to what your own application is telling you. By nibbling away on the stuff that impacts your customers the most, you end up with a hyper reliable system and a fantastic experience, the kind that can change the entire game. One company we worked with started to just take the top bug or two off their list every sprint and it was dramatic – in 8 weeks, they reduced the number of impacted customers by 96%!
Think about that – a 96% reduction in two months. Real user monitoring, APM, error and crash reporting – this stuff isn’t rocket science. But think about how powerful a motivator those kinds of gains are for behavioral change in your company. Data like that is the golden ticket you need to get support from the very top levels of your company.
One of my early mentors was Rod Drury, who founded Xero right here in Wellington, New Zealand. He says all the time: “It’s not the big that eat the small, it’s the fast that eat the slow”. That’s what DevOps is about – making your engineering team as reliably fast as possible. To get fast, you have to have a viable monitoring system that you pay close attention to. Monitoring is as close as you can get in this field to scratching your own itch.
What about building versus buying a monitoring system? I’ll admit that I’m biased on the subject, running a SaaS-based monitoring business. But I do find it head-scratching when I talk to people that are trying to build their own. I ask them, “how many people are you putting on this?” And they tell me – oh, 4 people, say a six-month project. And then I say, “what are their names?” They look at me funny, and ask why – I tell them, “I’ve had 40 people working on this for 5 years – apparently now I could fire them and hire your people!” Back in 2005, it made total sense to roll your own, since so much of the stuff we use nowadays didn’t exist. But the times have changed. Even self-hosting has its issues. Let’s say you decide to go down the ELK stack route. Well, that means running a fairly large elastic instance, which is not a set-and-forget type system. It’s a pain in the ass to manage, and it’s not a trivial effort.
To me it also is answering the wrong question. There’s one question that should be the foundation for any decision an engineering team makes: does this create value for our customer? Is our customer magically better off because we made the decision to build our own? I think – for most companies – probably building a robust monitoring system has little or nothing to do with answering that question. It ends up being a distraction, and they spend far more to get less viable information.
Etsy says “if it moves, track it.” Do you agree – should customers track everything? I’m pragmatic on this – if you’re small, tracking everything makes sense. Where it goes wrong is where the sheer amount of data clogs our decision making.
Then folks start to think about sampling data. However, what I often see is someone sitting in a chair, looking off into the distance, and saying – “yeah, I think about 10% of the data would give us enough”. Rarely do we see them breaking out Excel and talking about what would be statistically significant – people tend to make gut calls. Many of us have forgotten statistics, but there is a lot of really great mathematics that helps you make better decisions – like calculating what a statistically significant sampling rate might be.
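As one example of replacing the gut call with arithmetic, the sketch below uses the standard sample-size formula for estimating a proportion (such as an error rate), with a finite population correction. It assumes simple random sampling, and the function name and parameters are illustrative rather than taken from any tool mentioned here.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error, confidence=0.95, population=None, proportion=0.5):
    """Events to sample to estimate a proportion within +/- margin_of_error.

    proportion=0.5 is the most conservative guess (largest sample).
    If population is given, a finite population correction is applied.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# Example: one million events a day, +/- 1 percentage point, 95% confidence.
needed = sample_size(0.01, confidence=0.95, population=1_000_000)
print(f"sample {needed} events, about {needed / 1_000_000:.2%} of traffic")
```

Under these assumptions, a million events a day needs a sample of a little over 9,500 events, which is under 1% of traffic rather than 10%. Rare errors are a different matter: a uniform sample can miss them entirely, so the formula is a starting point, not a full sampling strategy.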
If you’re tracking everything you possibly could with real user monitoring for example, it can be a real thicket – a nightmare, there are so many metric streams. You trip over your own shoelaces when something goes wrong – there’s just so much detail, you can’t find that needle in the haystack quickly. This is where you need both aggregate and raw data – to see high level aggregates and spot trends, but then be able to drill in and find out why something happened at the subatomic level. We still see too many tools out there that offer that great strategic view and it’s a dead end – you know something happened, but you can’t find out exactly what’s wrong.
Any closing thoughts? I never get tired of tying everything back to the customer, to the end user experience. It’s so imperative to everything you’re doing. There is literally no software written today for any reason other than providing value to humans. Even machine-to-machine IoT systems are still supporting a human being ultimately.
Human beings are the center of the universe. But you wouldn’t know that by the way we’re treated by most of the software written for us. Great engineers and great executives grasp that. They know that to humans, the interface is the system – everything else simply does not matter in the end. So they never let anything get in the way of improving the end user experience.
- https://raygun.com/ – official Raygun site
- https://hanselminutes.com/421/managing-errors-across-platforms-with-raygunio – May 22, 2014 podcast interview with Scott Hanselman and John-Daniel Trask on Raygun
- https://channel9.msdn.com/Events/dotnetConf/2015/Handling-billions-of-exceptions-with-NET–Raygunio – March 5, 2015 Channel9 Interview with John-Daniel Trask on how Raygun handles billions of exceptions
- https://channel9.msdn.com/Events/TechEd/NewZealand/2013/DEV302 – TechEd New Zealand 2013, “DevOps at LightSpeed, lessons we learned from building a Raygun”, 9/6/2013, by Jeremy Boyd, John-Daniel Trask
- Dominos Pizza story and Raygun – https://qz.com/938620/dominos-dpz-stock-has-outperformed-google-goog-facebook-fb-apple-aapl-and-amazon-amzn-this-decade/