Episode #119: Scaling your Startup with Brian Scanlan

November 15, 2021 • 59 minutes

On this episode, Jeremy and Rebecca chat with Brian Scanlan about the technical strategies you should avoid (and embrace) when scaling your startup, why you probably shouldn't go multi-region, how fixing your on-call processes can improve company culture and reduce developer burnout, and so much more.

Brian Scanlan is the Principal Systems Engineer at Intercom, where he leads their developer infrastructure efforts, helping teams make products resilient to failure, scalable to customers' needs, and able to run with little to no human intervention. Based out of Dublin, Brian has previously held posts with HEAnet and Amazon, and has experience helping teams build their technical strategies, as well as designing and implementing solutions. Brian is a frequent contributor to Intercom's engineering blog, and has presented at LeadDev Con in London, Turing Fest, and Dash by Datadog.

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly.

Rebecca: And I'm Rebecca Marshburn.

Jeremy: And this is Serverless Chats. Hey, Rebecca. How are you doing?

Rebecca: Hey, Jeremy. I'm doing good. You could probably guess what I did this last weekend. Based on what I tell you I do every weekend.

Jeremy: You were at another wedding. There's something wrong with your voice. Were you screaming a lot or what was...?

Rebecca: No. Actually, I have a little bit of a cold. So working from home is nice because that means you don't have to worry about getting other people sick. I was actually at an art gallery where you're like-

Jeremy: [crosstalk 00:00:32] Which also makes sense.

Rebecca: "That's what you do [inaudible 00:00:32]." How about you, Jeremy? How are you doing?

Jeremy: The most exciting person I know, Rebecca. It's amazing.

Rebecca: That's definitely not true.

Jeremy: Well, the good news is when you do have a cold and you have to work from home, you're self isolating. So it's responsible to do that.

Jeremy: So anyways... Well, one of the things, though, with working from home and being disconnected from people, is you don't get to communicate as much with customers and things like that. And I think our guest today probably has done quite a bit of engineering to help communicate with customers. Do you want to introduce them?

Rebecca: Right you are. And I'm really impressed with that transition. I'm just going to say that out loud. So our guest today is principal systems engineer at Intercom and leads their developer infrastructure efforts. He helps teams make products resilient to failure and scalable to customers' needs. And it is Brian Scanlan. Hey, Brian. Thanks for joining us.

Brian: Hey. Thank you so much for letting me on your show.

Rebecca: We are happy to let you on our show. If that's the words you want to use.

Brian: Sure.

Rebecca: Anytime. Before we dive too far in, can you tell us first, what is Intercom? Even if people are unfamiliar with the brand, I think they've likely, or almost certainly, used it. You have over 25,000 customers. So maybe help people understand this thing that they don't know they've probably used multiple times over.

Brian: So Intercom, these days we call ourselves a conversational relationship platform. And we help businesses build relationships and talk with their customers. In the real world, everybody knows us as this little thing that pops up at the bottom right-hand corner of most popular, trendy SaaS companies.

Brian: And it's grown pretty well. I've been working with Intercom now for seven of the 10 years Intercom has been around. These little chat popups have turned into a decent business, and we think we're pretty good at it. We provide conversational marketing and customer engagement, but mostly support. People really like to use messengers, to use this way of talking to their customers. And so a lot of what our customers use it for is really to help their customers do everything from onboarding to problem solving, troubleshooting and engaging, and making sure that they're using the product well.

Jeremy: And I'll tell you that I do love it. And it's funny, my wife and I share this same sort of attitude. We don't mind talking to people, we just don't want to talk to people on the phone. You know what I mean? If you ever have to pick up the phone for customer service or something like that and wait on hold and whatever... I love just this asynchronous, and sometimes synchronous, model; if somebody's on with Intercom, they can chat back and forth with you. And then I think just all the tools. We've used it at other companies I've worked for, behind the scenes, and just being able to sort of segment those users, and then you've got knowledge bases and all kinds of stuff like that.

Jeremy: So if anybody is running a SaaS company or wants to provide customer service, definitely check out Intercom because it is a very cool service. So the other thing about that is we're talking about not only the asynchronous chat messaging, the JavaScript that has to run on somebody's site to pop up, but then also, I would assume, quite a bit of infrastructure that runs behind the scenes in order to make all of this stuff work. So could you just give us sort of an overview of the infrastructure itself and what kind of a massive system that is?

Brian: So the most interesting part, or way to start talking about it, is that we pretty much have a Ruby on Rails monolith in the backend. You mostly hear about monoliths when people are breaking them up into microservices and stuff, but we're actually pretty proud of our monolith. We have invested a lot in making it work well for us. And a lot of the techniques are actually similar enough in practice to the things we do to make it sustainable to work in the monolith. So we've got basically internal services for important functions. We've got clear ownership of code. We use techniques like marking up every code file with which team owns it, and then we automatically surface those tags into metrics generated by anything that invokes the functions in that code.
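
As a rough illustration of the ownership tagging Brian describes, here's a minimal Python sketch. The `# @team:` comment convention, the `emit` callback, and the file names are hypothetical; Intercom's actual tooling is in Ruby and isn't detailed in the episode.

```python
import re
from pathlib import Path

# Hypothetical convention: each code file declares its owner near the top.
TEAM_TAG = re.compile(r"#\s*@team:\s*(?P<team>[\w-]+)")

def owning_team(source_file: str) -> str:
    """Return the team declared at the top of a source file, or 'unowned'."""
    for line in Path(source_file).read_text().splitlines()[:10]:
        match = TEAM_TAG.search(line)
        if match:
            return match.group("team")
    return "unowned"

def record_invocation(source_file: str, function_name: str, emit) -> None:
    """Surface the ownership tag on every metric emitted for a function call.

    `emit` stands in for whatever metrics client is in use (StatsD, CloudWatch, ...).
    """
    emit(
        "function.invoked",
        tags={"team": owning_team(source_file), "function": function_name},
    )

if __name__ == "__main__":
    # Demo with a throwaway file and a stub emitter that just prints the metric.
    demo = Path("demo_owned_file.py")
    demo.write_text("# @team: messenger\ndef assign(conversation):\n    ...\n")
    record_invocation(str(demo), "assign", lambda name, tags: print(name, tags))
```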

Brian: We use AWS auto scaling groups in EC2 as our unit of scalability. And we build one of these for every single asynchronous worker and every different type of API we have. And so we've ended up with over 300 scaling groups serving Intercom's needs. That gives us units of deployability for each of the different pieces of work that happen either in real-time or in the background to build the product and to serve Intercom. It gives us a good blast radius, it gives teams internally clear ownership over different things, and it makes sure that different parts of Intercom can break without breaking other things.
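
To make the one-auto-scaling-group-per-workload pattern concrete, here's a hedged boto3 sketch. The fleet names, sizes, launch template, and subnets are invented placeholders, not Intercom's real configuration.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical fleets: one auto scaling group per async worker or API, so each
# piece of work scales and deploys independently and failures stay contained.
FLEETS = {
    "api-messenger": {"min": 4, "max": 40},
    "worker-webhooks": {"min": 2, "max": 20},
    "worker-email-delivery": {"min": 2, "max": 30},
}

def ensure_fleet(name: str, sizes: dict) -> None:
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=name,
        LaunchTemplate={"LaunchTemplateName": "monolith-host", "Version": "$Latest"},
        MinSize=sizes["min"],
        MaxSize=sizes["max"],
        VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # placeholder subnets
        Tags=[{"Key": "fleet", "Value": name, "PropagateAtLaunch": True}],
    )

for fleet_name, fleet_sizes in FLEETS.items():
    ensure_fleet(fleet_name, fleet_sizes)
```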

Brian: We've also invested a lot in making sure that deployments are fast. And when you're deploying to well over a thousand EC2 hosts across 300 auto scaling groups, and you've got this giant Rails monolith with tens of thousands of tests to run, it's taken a lot of work to make sure that we can get this done reliably in about 12 minutes these days. And we've invested a lot in making sure that things don't degrade to the point where we have to throw it in the bin and rewrite the whole thing in some trendy microservice framework.

Jeremy: Awesome. Well, I think that if the listeners have been paying attention, they've realized you're quite the expert on building these massively scalable systems, running them on AWS and doing all this sort of stuff. And again, this is a serverless podcast and I know a lot of what you're doing at Intercom isn't necessarily serverless. Although, I'm sure you probably use a lot of those services yourselves.

Jeremy: But I think over the years you've come up with probably some good advice or some good lessons learned. And you wrote an article... I don't remember how long ago it was. But it was titled 10 Technical Strategies to Avoid and then Five to Embrace. And I'd love to talk about those a little bit because I think this ties into serverless. And one of the ones that really stood out to me... And we can get into a couple other ones, and Rebecca, feel free to ask some questions around this as well.

Jeremy: But the one that you talked about was sort of containers versus serverless for your environments. Because five years ago, 10 years ago, maybe when Intercom was getting started, Ruby on Rails was sort of the way to go. Again, I'm still a big fan of monoliths too. So hug your monolith if you need to because I love a good monolith when it works. But this idea that you kind of put out in the article was, if you're day one starting your startup, investing in Kubernetes might not be the best thing to do. Can you explain why that's the case?

Brian: So I think one thing we're good at at Intercom, and what has been part of its success, is keeping the focus on delivery and keeping a momentum of delivering value for your customers. Internally, we've written up our values and gone through how this actually works out in our technical strategies. And there's a few different ways this comes out, but some of it is that we are technically conservative. And what this means is we don't go about looking for the best technology for something. If something is good enough, we'd rather iterate and just use something, get started and go. And validate the whole reason why you're even building out this stuff in the first place, quickly, rather than spending a lot of time.

Brian: And I think setting up Kubernetes on day one, that doesn't sound like you're doing much work really to validate your startup in many cases. You can probably get there faster with something a lot simpler. And for sure, maybe there are unique spaces, and especially if you are selling to people who use Kubernetes, then of course you want to be using those kinds of technologies. But I think the investment in these kinds of platforms pays off when you've actually got something that's worthwhile to migrate to them, or something that's big enough to be worth the investment.

Brian: I think biasing for simplicity at all stages while your startup is fighting for life really is the approach to keep things viable and to keep things focused on trying to get your startup to any kind of success. And so if you're really fast at Kubernetes and you do this all day, then sure, go for it if that's the fastest way you can deliver value. But I think for most people it's going to slow you down, even though it looks like progress while you're building out this stuff.

Rebecca: So you mentioned the idea of being technically conservative. And I think another way that you've put this before is run less software. And that might also extend to this idea of run less services, run fewer services. And when it comes to monolith versus microservices, I'm wondering, to microservice or not to microservice? And you say microservices can cause undifferentiated heavy lifting. And so I'm wondering if you could talk a little bit more about that.

Brian: It's interesting. When I joined Intercom, we kind of had the expectation that as we grew up and became a real company, we were going to build out loads more services. And this was the way we started building out some features. So one of them was our webhooks. The initial implementation of webhooks in Intercom was this Java service that talked back and forth to the main Intercom monolith. And it worked and it scaled and it did its job. But over time, Intercom changed, people moved on, and ownership of the webhook service moved from team to team. And it was one of the only things we had written in Java for a long time; there was only a handful of services.

Brian: And then people, like engineers who were doing most of their work in Ruby or in JavaScript, would kind of avoid doing anything with this Java stuff. And it became a real pain operationally. So zooming out a bit, what we saw in practice for us in our environment, and it's definitely not universal, was that teams were more effective working in the monolith, and thus doing stuff outside of the default way, the way that's known to be fast and that we're investing in, just slowed them down. And I think it held us back in terms of product development. Because we were less capable on the Java code base, or weren't working on it a lot, we just didn't build new features in that area. It wasn't as easy to. I think what we learned from this, and we're not dogmatic about it, is that investing in the monolith works for us. We did it, we've invested further in the monolith, and indeed we've since folded the webhook service back into the monolith.

Brian: And that's because we've observed that teams are more effective and enjoy working in the monolith, where they get a lot of stuff for free. As opposed to services, where they have to start worrying about a lot of scaling and instance choices and observability and hooking all these things up, designing APIs, and then figuring out what data is authoritative for what and where to do your [cache index 00:12:41]. There's all this stuff that we give for free, effectively, in the monolith. And so what running less software looks like for us in practice, and being technically conservative, is that we try and just reuse the same things over and over again. And a lot of our services just look the same, or a lot of our features just look the same.

Brian: It's a bunch of Rails that talks to a few Memcached and MySQL databases, then it'll send off an SQS message somewhere, and then something will asynchronously process this stuff. It's all the same kind of building blocks, and keeping the number of building blocks low, and keeping the novel bits down to as small as possible, we've found to result in strong outcomes for us in terms of how easy it is to operate these things and how fast teams can move and build with them.
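
Those building blocks map onto a very plain producer/worker shape. A hedged Python/boto3 sketch of that shape using SQS follows; the queue name and event payload are made up, and Intercom's real pipeline is Ruby rather than Python.

```python
import json
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue for illustration.
QUEUE_URL = sqs.create_queue(QueueName="conversation-events")["QueueUrl"]

def enqueue_event(event: dict) -> None:
    """Web tier: handle the request, write to the database, then hand slow work off."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))

def worker_loop_once() -> None:
    """Background worker: pull a batch, process it, delete what succeeded."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
    for message in resp.get("Messages", []):
        event = json.loads(message["Body"])
        print("processing", event)  # stand-in for the real asynchronous work
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])

enqueue_event({"type": "conversation.created", "conversation_id": 123})
worker_loop_once()
```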

Brian: So in terms of microservices and that, I think there's overhead. There's overhead in everything. But there's overhead in just having to design the coupling between them and designing those kind of interfaces. And for sure, you can end up with kind of clean interfaces and get to a very clean architecture and be able to move individual parts pretty fast with them, but it's not the only way to get to those kind of areas. I think you can apply the same discipline to different types of code bases and different shapes of architectures and have similar outcomes.

Jeremy: Yeah, no. It's funny to hear so many people say this... Again, I love microservices. I've done a lot of microservice design, but I've also done far more monolithic applications. And if I had to weigh which ones were easier to deal with in the long run, it's always been the monoliths. And that's one of the things that's interesting about serverless, to bring that back in a little bit: you can still take a monolithic approach, have a lot of that sort of shared code all in the same stack and so forth, but then you still have some knobs where you can turn up scalability on certain functions or whatever services you're using there.

Jeremy: And there are a couple of other things you talk about in this article, like making sure you use infrastructure as code and not configuring things through the console, so that you have these repeatable patterns. I think these are just good, smart things to do. But one of the things... And maybe we move to this, and we can certainly go back to the things you shouldn't do, but let's give people some positives. In terms of the things that you should do, one of them is you say bias towards the higher level services. And I always look at this as the build versus buy argument. If there's something that gets you 80%, 90% of the way there, then that might be good enough. Especially if it saves you six months in development time. For example, it might be smarter to start with serverless than Kubernetes because it could save you six months of development time just getting your environment set up.
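
On the infrastructure-as-code point, the idea is simply that resources come from a reviewed, repeatable definition instead of console clicks. Here's a minimal sketch with the AWS CDK in Python, where the stack and bucket names are arbitrary examples rather than anything Intercom uses.

```python
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class AssetsStack(Stack):
    """Everything here is declared in code, reviewed in a PR, and redeployable."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "UploadsBucket",
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,  # don't lose data on stack deletion
        )

app = App()
AssetsStack(app, "assets")
app.synth()
```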

Brian: Big time. I used to enjoy running database servers myself, running MySQL and making sure backups worked and getting to a high availability setup and all that. That's good fun and you can learn a huge amount from it, but I think going for a higher level service, using the likes of RDS, or even higher level than that... Some of the AWS services try to take on entire workflows, so that you're not even just using the lower level building blocks to get that functionality. I'm thinking in terms of things like workflow engines and stuff like that.

Brian: You can build a workflow engine yourself on top of a database, but plugging into a pre-made workflow engine, you have to get used to it and you have to learn it and map your workflow onto it. But using these things, again, gets you faster to validating the reason why you're building the thing in the first place. Like you said, you can get to 90%, and then once you know what it is, what's holding you back, what kind of features are missing or what important parts you need to build yourself, you can do that after you've saved a lot of time and made progress by using a managed service. And so we do this in practice at Intercom even today.
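
Brian doesn't name a specific product here, but AWS Step Functions is one example of the kind of managed workflow engine he's describing. A hedged boto3 sketch of leaning on it rather than hand-rolling state tracking on a database; the state machine ARN and input are placeholders.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder ARN: the state machine would be defined once, up front, instead of
# building retries, timeouts, and state tracking yourself on top of a database.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:OnboardingFlow"

execution = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps({"customer_id": "cus_123", "plan": "starter"}),
)

# The engine tracks workflow state for us; we just check progress when we need it.
status = sfn.describe_execution(executionArn=execution["executionArn"])["status"]
print(status)
```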

Brian: I think this is a good thing. Not just for startups, but anywhere that's really trying to grow fast. I helped a team recently where they weren't sure whether... It was pretty much a toss up between writing their own piece of code to do this validation in our data warehouse, or picking an open source project which looked pretty good, or using a service which does this. I said, "Let's just use the service," and we'll learn more from doing that, faster, than by doing all the grunt work with the other pieces of code. Even though ultimately what using the service might tell us is that we don't want to use that service, because the valuable bits are something else. But we wouldn't have gotten to that knowledge as fast if we hadn't tried using something which fits most use cases already. So I think this is good advice. Not just for startups, but anywhere that's really moving fast and needs to experiment.

Jeremy: And the other thing that I always find is super interesting, whether it's a SaaS product you buy or a managed service or even an open source project that has some good maintainers: somebody already made a lot of decisions for you and they brought a lot of their own domain expertise into that. So especially if you're getting into something... The database thing is a no brainer. Because again, I remember... I can't remember. Well, I probably can, but I shouldn't. Remember how many databases I had to set up and manage over the years, and do the backups, and be like, "Oh, something's corrupted. Let's dig through transaction logs and try to figure out where that went wrong."

Jeremy: So those are the kind of things where handing that stuff off to somebody and not even having to think about it, and really, you're just worried more about your data integrity and making sure that things are being written correctly and that you're following whatever rules you want to follow for your own business domain. But just bringing in that domain knowledge from other people and having all those years of learning just baked in immediately, that goes beyond just having to build and manage the service. It gives you experience and knowledge that could take you years to learn.

Brian: Absolutely. Building on top of the shoulders of other giants, you get that for free effectively by using these-

Jeremy: Did you call me a giant? I'm building on top of other giant... Or was it...? Okay.

Brian: We're all giants building on top of other giants. Yeah.

Rebecca: So I think a lot of this comes back to optimization. It feels so good when things are optimized, when things are efficient, and when you're working in the way that you feel most confident in, that teams can run the fastest in, that they feel like they're actually making an impact and they can see that. Jeremy hopped to some of the positive stuff. I want to take us back to the negative. So if we could just go back there for a second. You talk about optimization in terms of cost, and building for scale, and optimizing costs.

Rebecca: And I love how you put it. You put it in terms of snacking. And we always want to snack, and say like, "Oh, I'm just going to take this little potato chip and crunch this thing," but then that gets in the way of actually looking at the big picture business outcome of your work. And so I'm wondering if you can talk a little bit about the pitfalls around optimization when people start to zero in on snacking, let's say, versus remembering the big picture of optimizing for teams moving forward quickly.

Brian: I've spent a lot of time in the AWS cost optimization space. And a lot of the advice that you'll get is like 10 things to do: turn off unused instances, maybe optimize your EBS volume types, delete elastic IP addresses that aren't in use, and stuff like that. These things do cost money and it's fun to turn them off or whatever. But they rarely make significant differences to the bottom line. They can save a few dollars here and there, and it's kind of nice to have a cleaner environment maybe, but ultimately what affects your margins, what affects your cost growth, is your architecture and how your systems are built. And for sure, there are insights to get from cost tooling, from understanding where your costs are coming from and what's driving them, and then making changes based on how you understand these things.
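
As a concrete example of the "few dollars here and there" class of clean-up, here's a hedged boto3 sketch that lists unattached Elastic IPs. It tidies things up, but as Brian says, it won't move the margins the way architecture-level changes do.

```python
import boto3

ec2 = boto3.client("ec2")

# An Elastic IP with no AssociationId isn't attached to anything and still bills hourly.
unattached = [
    address["PublicIp"]
    for address in ec2.describe_addresses()["Addresses"]
    if "AssociationId" not in address
]

print(f"{len(unattached)} unattached Elastic IPs:", unattached)
# Releasing one is a single call, e.g. ec2.release_address(AllocationId=...), but the
# savings are small compared with changing how the system is architected.
```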

Brian: And there are well known ways, on the billing level, in terms of committing to AWS, using savings plans and reservations and enterprise discount plans. Stuff like that where you can also save a huge amount of money. But the real stuff that makes your business viable and that matters in the long term is understanding the relationship between costs and your business and what's driving the costs. And the big things that make a change are architecture level changes or the implementation of your service. And what we found works well for Intercom is being pretty reactive. Not even trying to optimize these things as we're building them, just building as fast as we can, and then seeing what the impact on cost is.

Brian: And then tracing it back to the change, to the feature, to the thing that we're doing as part of our product, and seeing what kind of impact it had on costs. And so it doesn't make it worth my while, or it's not useful for me, to go in and start digging into... Let's say our Lambda costs. It's a very, very, very, very small percentage. It's not even a percentage of our overall bill. I know it's there, but it's not even worth optimizing. It's just a very, very small thing. Whereas looking at the design and implementation of our JavaScript messenger and how that's sending data into our APIs at scale, that's a big thing that requires software changes. That can make really, really significant differences.

Brian: So I think in the cost space, there's lots of stuff to go after. You can easily find a few things which look like good things to turn off. You can save what look like good individual pieces of money at times, but it's the overall architecture which is going to make or break your business. It's going to change the margins that will make your business profitable or not.

Jeremy: And it's all about total cost of ownership too. This idea of, do you take a $125,000-a-year engineer and have them go spend a week trying to figure out how to save $100 a month on your Lambda bill? These are the kind of things where it may sound interesting to be like, "Hey, I saved us some money. But oh, by the way, I just cost the company 10 grand in order to do that." So you'll never make that up in the long run. I don't want to talk about this article the whole time. There's a bunch of other things I want to get to. But just to touch on maybe some of those other things that you should embrace, especially as a startup: baking in security right from the beginning.

Jeremy: That's one of those things where... I have so many scars from security issues. I've been doing this for a very long time. So it's like, just build in security right from the start. I love this concept, or the idea you had in there, of hiring for potential. It's great to hire specialists, but generalists, especially in a startup, are just so much more flexible and can go down different roads, different paths. You just get different ideas. I love that. Obviously, focusing on the customer is a huge part of building a startup. If you can't get customers to use it, what's the point of building it?

Jeremy: But the continuous deployment... And we just had Charity Majors on and we were talking about this: constantly be merging, just small changes, just keep getting code in there. That's super important. We had a whole conversation about that, but you focus on this a lot as well. You mentioned deployment speed, getting CI/CD down to, I think, 12 minutes to deploy to all those instances and stuff. So tell us a little bit more about your philosophy, and maybe Intercom's philosophy, behind CI/CD and how quickly you get changes out.

Brian: One of the early posts on Intercom's blog was titled Shipping is Your Heartbeat. And I think Charity herself has used it a lot of times as a way of describing these things. There's a good few ways of looking at it, but the way we think about it is in terms of what shipping constantly brings to the quality of the product you're building. Coupling your engineers who are building the product with how it's being used by your users, and keeping that as close as possible. So that you're not just building and walking away and not understanding it. You're building, you're shipping, you're seeing the feedback in real-time, you're seeing the usage in real-time. And indeed, you could be talking to your users in real-time using Intercom.

Brian: I think it's really consistent with what Intercom has built around allowing this kind of conversation between businesses and their customers. This way of building, we think, results in the highest quality product. We're not doing huge amounts of upfront design, with designers handing massive designs over to engineers to go and implement, and lots of structure and process.

Brian: We found that the best projects, the best features we get out the most productive way of building really, really high quality product, is to get people into the same team, with a strong mission, with ownership of an area, and for them to iterate and iterate and iterate and iterate, and really build the smallest sliver of a feature possible, get it out, see how it's used and then keep on iterating. And so deployments, they're our heartbeat, they're the part of this that results in the highest quality product. There are loads of other nice things as well.

Brian: Operationally, it's easier to troubleshoot if you're deploying all the time, pushing small diffs all day. Security wise, the confidence of knowing, "If we've got some sort of problem, we can fix it in 10 minutes." That's a big deal. But as a product focused company, we see just huge benefits in connecting the developer to what they're building. And that speed element just allows them to iterate, get something out, measure, see how it's used, talk to the users and keep going. And that process results in high quality products, like Intercom, we think.

Jeremy: Right. And you mentioned, again, the small diffs and getting changes out quickly. We talked about this with Charity too. Just the joy, the satisfaction level of a developer goes up as the time from commit to deployment goes down. They're linked together. But in terms of the speed of deployment, you have a very complex system with lots of EC2 instances, scaling groups, all that kind of stuff. You said you got it down to 12 minutes. The speed of that piece of it, though... I think you said before it was 30 minutes or something like that. So if you have an issue, you said maybe you roll it back, maybe you fix it, whatever it is, but what are the absolutely necessary checks that you need to have in there? Or just maybe a little bit more on why the speed of it is so important. Because, again, waiting 30 minutes for a deployment is kind of painful.

Brian: 30 minutes is definitely in the territory where you push code and you kind of forget that you've done it. You almost have to set an alarm to check a dashboard to see when your stuff actually hits production. And that in itself makes engineers less engaged, or it even distracts them from watching how things actually work in production or seeing where their stuff gets to. So shrinking that down keeps them closer to what they've built, gives them more information, and also results in better outcomes. Where maybe something bad does get out, they're more likely to just be watching dashboards or looking at the alarm channel or just be around, rather than us trying to see who pushed something 30 minutes ago.

Brian: Maybe they're offline by then or whatever. So getting that down as short as possible. And honestly, I'd love to get it a lot shorter than 12 minutes. I think it's okay. The interesting thing was... When I joined, we had full CI/CD. And we've had some rocky times. We've invested a lot in making it good, but we had a bit of a frog-in-boiling-water situation over time, where the deployment times just kind of crept up. And it wasn't that we didn't want to do fast deployments, it's just we kept on adding on more stuff. More safety checks, the ability for teams to control how fast they were deploying to individual fleets. We have some fleets which are very sensitive to deployments and just need time to process jobs and can only take out certain amounts of capacity at a time to install new software.

Brian: And so we built all this kind of safety stuff to make the environment work well for a lot of different use cases. And all of these features, all these safety things, just slowed us down. And we saw them all as individual successes. We were happy that our environment was safer or whatever, but we realized this had gone too far. This was too long. We needed to dig in. And the interesting thing that came out of it was, we dug in and we thought we might have to do the kind of work that you might see in a conference talk, where we take this system and replace it with something brand new, or move to microservices, whatever. But what we ended up doing was just real boring analysis of where the time was being spent and what was driving that time.

Brian: Some of it was down to a bunch of AWS features that we were kind of abusing. We make heavy use of AWS Systems Manager Run Command to orchestrate the software deployment across all of the hosts. We were pushing it hard and we had to get better at doing that, and we ended up talking to the AWS team. But we also analyzed the safety features we'd added and realized we didn't have to do them all in sequence. We could do a bunch of them in parallel. And then we just changed the environment so that we didn't have to have as many staggered deployments, so we could do a big bang type thing and get away with it by changing the way we were doing deployments on workers. So there was no one big thing here that suddenly shrunk our deployment times down.
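
The specific safety checks Intercom runs aren't described, but the "run them in parallel instead of in sequence" change is easy to sketch in Python with a thread pool; the check functions below are stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in pre-deploy safety checks; previously these would run one after another.
def check_error_rate() -> bool:
    time.sleep(1)  # pretend to query a metrics API
    return True

def check_queue_depth() -> bool:
    time.sleep(1)
    return True

def check_capacity_headroom() -> bool:
    time.sleep(1)
    return True

CHECKS = [check_error_rate, check_queue_depth, check_capacity_headroom]

# Sequential: ~3s. Parallel: ~1s, because the checks are independent of each other.
with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
    results = list(pool.map(lambda check: check(), CHECKS))

print("safe to deploy" if all(results) else "blocking deploy")
```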

Brian: It was just analysis of where all the time was being spent and a bunch of small changes. Just a 30 second win here, a one minute win there, but all that work got us back down. And we've been trying to hold it pretty tight since then. We've still been making changes to the environment. We're trying to add more security into how we build our artifacts and different things like that. But now that we've got the line down, we want to keep it there, and we're doing a better job of reviewing every week what we're seeing in terms of deployment times and where delays are coming from. I think the moral of the story is that we've decided it's important to us. We always knew, but we've decided, "This is important. We've got to keep it here." And it takes constant effort to keep something like that. Because I think the default is what we ended up in, where things just drift over time and you get to the point where, "Oh no, it's now taking 30 minutes."

Jeremy: That's the point that I was just going to make, is that you did the work. And you realized the importance of it. And I think a lot of people, or a lot of companies, let that drift instead of tying those metrics together and being able to optimize that. Now again, had you switched to microservices, you could have had independent deployability and it would've been faster, and then it wouldn't have created a whole bunch of other headaches.

Jeremy: But anyways, I do think that is the important point though. Is that understanding the fact that there is a relationship between the quality of your code and how quickly you can get that into a production environment and be able to see those things. Because you want to see that before your customers do. And again, faster deployment times also means faster rollback times and all kinds of things like that. So super important.

Rebecca: So much of this stuff is this idea of building for scale and optimizing when you need to. Maybe not optimizing prematurely, but obviously in this case, you're like, "All right, now it's no longer premature. We need to react to this because we have either overoptimized or underoptimized and we didn't optimize at the right time. And now it's time to address it," ultimately, so you can build for scale and you can build toward a global solution for a company that is going global.

Rebecca: And so I want to ask you a little bit about what it means to go global. And that's going to probably mean also going multi-region. And you have a little bit of experience with this at Intercom. You've been thinking about it a lot. Especially... I'm going to say "recently" with air quotes, because recently can mean a lot of different weeks, months, probably even years at some point, in your own mind. So could you tell us more about what that means at Intercom and this idea of working towards multi-region and how you're thinking about that? Both in terms of optimization, getting yourself ready, being able to do this at scale, having the right things in place, but not too many right things in place.

Brian: So over the last eight, nine months or so, I've been working on building Intercom out into a European region. And this has been something we've talked about for a long time, and customers have asked us about for a long time. Intercom's customer base is global. We've had great success selling into Europe and other places. But there are certain customers, or certain types of customers, who are sensitive to data transfers to the United States. The legal situation has kind of been up in the air. There's certainly a good deal of risk and uncertainty in the area. And so for perfectly reasonable reasons, a bunch of customers just don't want to take on that risk, and they want their data to stay in the EU or maybe other places. So for a long time we've said to customers, "Okay, we understand your concerns."

Brian: And we've kind of given guidance. We ourselves are happy to store data in the United States, but we're also very conscious that a lot of our customers simply aren't. And a lot of prospective customers as well just won't consider using services that store their customer data in the United States. And so we worked for a while to try and understand this. And we've done numerous exercises working with our business development team, trying to figure out the opportunity, looking at what we think the work involved would be. We've had ideas around what an architecture could look like, and we positioned that against the return we'd get.

Brian: It didn't really make sense, though. The numbers never really worked out. Because we strongly believed that a build out of this scale, where we've got this huge infrastructure in us-east-1 in AWS, and trying to mimic any of that, or even just a small amount of that, in a new region, is going to be substantial. Due to just the sheer scale of the infrastructure, but also the amount of work involved in getting every single feature over, every single part of Intercom over.

Brian: So we always put a pretty high premium on the amount of work there, and it's going to do nothing for our existing customers. It's not going to help them out at all. So we kind of kept it on the back burner and kept talking about it, but not doing too much about it. And then late last year, we decided, "Let's just do it." But let's not do it in a way that needs to guarantee success. We're going to effectively spike as small as possible an installation of Intercom, build the smallest possible thing to prove whether or not this can work, and apply really technically conservative principles to what we build out.

Brian: At the same time, we were shrinking down the amount of infrastructure that we use overall to build out Intercom. So we built out a few capabilities for sharing workers and a few different things to allow us to do this. And then rapidly, over the course of a few months, we built out what was, in effect, a prototype of Intercom in the EU, but a greatly shrunken down version of Intercom, running on as much shared stuff as possible.

Brian: Whereas in us-east-1 we might have literally 10 different Aurora MySQL clusters, here we've got it all down to one. And we're being very, very aggressive in running the minimum amount of infrastructure to support all these things. We also found that the best way to get the work done... And there were lots of changes we had to make across our code base, not just in the architecture, but in how the large code base was ported over to this environment... was to just do all the work ourselves.

Brian: So we didn't spread the work across Intercom, and we didn't really engage with too many teams or try to get stuff onto their roadmaps or anything. We just assembled a small team of high judgment, experienced engineers who could make a lot of progress getting any part of Intercom working in Europe, and do it fast and do it independently, without blocking others or having to be blocked on getting permission to do things or looking for advice.

Brian: Like I said earlier, a lot of Intercom looks the same. It's the same kind of stuff that builds our features. And we quickly established patterns for identifying what parts of Intercom needed to be fixed up to work in the environment. So we validated relatively quickly that, "Hey, we can actually get this working. It does work, it boots. It does Intercom-type stuff." And then over the last few months, we've been doing more work around QA and hooking things up to sales systems and making it a professional setup that we can actually sell, as opposed to just a standalone instance of Intercom that largely works. So I think the stuff that worked for us was being really aggressive, but unambitious, on the infrastructure and technology side.

Brian: We just reused what we knew already, but we also didn't want to just copy and paste. We knew that wouldn't work. We knew the setup time would be too much. And so we shrunk it down as small as possible and were as aggressive as possible about getting it down to that. We didn't want to have to bring over all of the scaling decisions made over the previous 10 years of Intercom into this new environment, where we'll probably get away with not having to make the same decisions for years and years as customers slowly move over to it or get spun up on it. And I think another thing that gave us certainty, or allowed us to make progress, was being certain around what types of data need to stay in the EU and what types can leave.

Brian: Because we want to bill our customers, and we want to do that in one place. We don't want to have multiple billing systems, multiple Salesforce installs, all this kind of stuff. But we know our customers' data cannot leave the EU, and so we were able to make clear demarcations around those things. And we've still got a bit to go. We've got beta customers. It's been good fun. And we've learned a lot about QAing an application which has stayed in one place for 10 years when suddenly we're trying to get it working elsewhere. That's definitely needed a few different approaches. We had a top down approach where we would pick a feature and see, does it work? And then we had an almost infrastructure-up approach where we'd pick a bit of infrastructure and go, "What's this used for?" Or an SQS queue: "Let's figure out where this is used."

Jeremy: [crosstalk 00:41:52] "Do we even need it?"

Brian: Almost like tracing in two different directions. And between that and the beta usage, we've got something which largely works and which we're willing to put in front of customers. So I think there are probably some interesting blog posts and future talks about this kind of stuff. I think for us, we were never able to get full certainty that it was worth doing, but what really unlocked it was doing it as small as possible, and also being willing to fail. We were happy to just try this for three months; if it doesn't work, we'll walk away and come back to it again when maybe we've got a better business case. But we were able to build something, validate that it works, and then just start to do the rest of the work in the meantime. And so we're hopefully at the end of that cycle at the moment.

Jeremy: So I think it's funny that you're talking about multi-region infrastructure, which some people sort of talk about... Cavalierly, is that a word?

Rebecca: [crosstalk 00:42:44] Yeah, you nailed it.

Jeremy: That they're like, "Oh, multi-region. Not a problem." And you're talking about having some of your best engineers, who can make those decisions, running an experiment to try to see if it would even work, and then actually having to make all of these changes and do all these different experiments just to get it up and running. And I was trying to come up with some joke about Ireland and old castles and dragons. I can't formulate it in my head. But basically, I think you found there are a lot of dragons that you hit up against with this. And I guess the question is, if you have somebody else, especially a smaller company, thinking about going multi-region, can you just tell them, "Don't," or, "Really, really think about it"?

Brian: I would tell people don't. When I worked at Amazon.com, it was a pretty large business. I worked there before Intercom. And it was reasonably successful as an online retailer. It was single region. It worked well. I think multi-region stuff is hard. If you haven't been baking it in from very early on and you keep building features and keep storing data in lots of places, it just greatly increases the difficulty of ensuring that your data is portable, or that your application is portable into a new environment and that it'll boot and just work. Intercom's 10 years old. We just have 10 years' worth of features to try and figure out: do they work at all? And they all have kind of interesting bits of configuration.

Brian: And our code base is so large that there are different styles, and just a large amount of work to get a lot of it working. So I think understanding whether multi-region is important to your business early on can help you shape how easy it is, by doing upfront work. Making stuff portable, such that you could move data for a customer from one place to another. I think that's very important, just deciding what data needs to be portable. And then deciding which building blocks are essential to your infrastructure. We were able to strip off large amounts of kind of ancillary things...

Brian: For example, we run query killers against our MySQL databases that look out for very naughty queries that take up a lot of resources, but we decided there's no way we're going to deploy those out into an environment which just has a small number of customers. If a query goes bad, we'll probably find out about it and fix it or something like that. There's a bunch of things that we were doing at high scale that we probably won't have to do for years and years in the new environment. So we were able to really minimize that and just not deploy a lot of stuff. And knowing what that is can make your life a lot easier.
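
A query killer of the kind Brian mentions can be as simple as polling the process list and killing anything over a time budget. Here's a hedged sketch using PyMySQL; the connection details and the 30-second threshold are invented, and purpose-built tools such as pt-kill do this more carefully.

```python
import pymysql

MAX_SECONDS = 30  # arbitrary budget for illustration

conn = pymysql.connect(host="db.internal", user="ops", password="placeholder", database="mysql")
try:
    with conn.cursor() as cursor:
        cursor.execute("SHOW FULL PROCESSLIST")
        for row in cursor.fetchall():
            thread_id, user, host, db, command, seconds, state, info = row[:8]
            # Only kill long-running SELECTs; leave writes and replication threads alone.
            if (
                command == "Query"
                and seconds > MAX_SECONDS
                and info
                and info.lstrip().upper().startswith("SELECT")
            ):
                print(f"killing thread {thread_id}: {seconds}s - {info[:80]}")
                cursor.execute(f"KILL {int(thread_id)}")
finally:
    conn.close()
```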

Rebecca: I'm going to shift gears a little bit because I want to make sure we can cover this. I spent a little time doing one of my favorite things, which was learning about you through what you tweet. A tweet that I really enjoyed was a retweet that you did of Patrick Collison, who's the CEO at Stripe. And he talks about how he ruined ducks for himself. And it's because he assigned a custom ring tone to his PagerDuty, and he chose the duck sound from the iPhone, and a lot of problems and pages followed. And then years later he's walking through a park and he hears a soft quack, which, how cute would that be? How nice is that? And instead he-

Jeremy: [crosstalk 00:46:23] It becomes a trigger.

Rebecca: He shivered involuntarily and his pulse quickened. And that's how he ruined ducks for himself. And I thought that was the perfect tweet for this thing that you talk about a lot, Brian, which is that you were able to turn a point of pain into a point of pride for your engineering team, and really build this engineering culture at Intercom around out-of-hours, on-call work.

Rebecca: And so I'm wondering if you could talk a little bit about what the pain... I think we all know... Or especially those people who have ever been assigned to pager duty and on-call hours know what that pain is. But if you could walk through some of the steps that you applied to say, "Wow, this is actually burning us out. And here are the ways that we approached this problem. And here are ways that I think other people might be able to..." A: figure out this is where we start, and this is how we turn that pain into pride.

Brian: Two of the biggest influences actually came from my time at Amazon. So one was, I was paged a lot at Amazon. So I was determined not to put other people through this. When I joined Intercom I was like, "I want this to be a great place to work. And I don't want to build out an infrastructure or build an environment where we have a really high on-call load." So that was definitely a big motivation. But another one was an interesting one. For a long time S3 had one person on-call, and S3 was relatively large, even back then. I can't imagine the size of it now. And then I looked at Intercom back then, say, four or five years ago, and we had six, seven people on-call, and we were nowhere near S3 scale. I was like, "This is kind of silly."

Brian: And the load of on-call was really different depending on which team you were in. And one of the things that we like to do in Intercom is we change our team layout a lot. We are pretty responsive to what we're building and we change ownership of things a lot. There are some small problems that come out of that, but one of the good things about it is it allows people to get involved in different things and grow in different directions and just try out different teams. But then suddenly we had this barrier where people were avoiding joining some teams, because it happened that their area of work involved them owning a load of Elasticsearch clusters. We had other teams which basically had no on-call and were never getting paged in the middle of the night.

Brian: So this unevenness of on-call being applied across different parts of Intercom just seemed like a barrier to flexibility in Intercom. And I wanted to do a good job here and get to S3's level of having one person on-call. And so we figured we could also make it volunteer-based. And this is a big element of why I think it was successful. As a systems engineer, at different times of my career I've had things like newborn children in my house and stuff, and it's not always... Just because you're working in an operations team or you've got a background in systems engineering doesn't mean you necessarily want to be doing 24/7 on-call. It can be fun. In some places it's kind of part of the job. But I think being able to opt in and opt out depending on where your career's at, depending on your personal preferences or what's going on in your life, that seems pretty attractive, and a nicer way of making this an opportunity rather than a burden that people get just from being on certain types of teams.

Brian: So we built a volunteer-led on-call setup where we have one person on-call for all of Intercom. I'll refer back to it again, but Intercom tends to be built out of the same stuff over and over. That actually adds a lot to the possibility of being able to do this, because you tend to get pages for the same kinds of things. Certain patterns come out. Like, "This SQS queue is full," or, "We're getting 500s on this load balancer," or, "This database has slowed down." It tends to be a lot of those things quite frequently. And we also made on-call something where you, not just [inaudible 00:50:44], but you were there for a while. The idea would be you do, say, six months in a six-person on-call rotation. We would emphasize the learning involved and give support and help. And make it something that we would celebrate, and that shows up in people's promotion documents and annual reviews and things like that. And really celebrate the benefits.

Brian: I guess one of the other things we did was get really ruthless on alarms. We turned off so many alarms, and we also review every alarm that fires. And there seems to be this social pressure where... If you get somebody on your own team out of bed to deal with something, you kind of tolerate it. But if you get somebody who you've never met out of bed at 3:00 in the morning, you're more likely to go and fix it, because you don't want to do that again. It's kind of embarrassing.

Brian: So we've found that by paying attention to every single alarm that fires out of hours, and opening an issue with the teams and giving them feedback on it, we've gotten really good at keeping the list of paging alarms down to roughly what's necessary. And that has then fed into making the on-call really healthy in itself and sustainable, where people are growing and getting paid for on-call and all that stuff is nice, but they're not getting paged that often either. And we've had good streaks where I've been on-call for a week and just gotten zero pages, which is kind of cool.
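
The weekly "review every alarm that fired" habit can be bootstrapped from CloudWatch's alarm history. A hedged boto3 sketch follows; how Intercom actually files the follow-up issues isn't described, so this just builds the list to review.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
history = cloudwatch.describe_alarm_history(
    HistoryItemType="StateUpdate",
    StartDate=now - timedelta(days=7),
    EndDate=now,
    MaxRecords=100,
)

# Count how often each alarm fired this week; noisy ones are candidates to fix or delete.
fired = Counter(
    item["AlarmName"]
    for item in history["AlarmHistoryItems"]
    if "to ALARM" in item["HistorySummary"]
)

for alarm_name, count in fired.most_common():
    print(f"{alarm_name}: fired {count} time(s) this week - open an issue with the owning team")
```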

Jeremy: And that's the big thing, which I think some companies don't understand. Like, "Hey, we signed up for PagerDuty," or, "We have this other thing." And then suddenly, on-call becomes on-duty because you're going to get 15 pages or whatever while you're on-call one night. And there's a big difference between being on-call and on-duty. And if you have so many problems that you need an engineer just watching stuff 24 hours a day, you've got to pay an engineer to watch stuff 24 hours a day. But if you're going to go this on-call route and keep people sane, then it's got to be an emergency when that pager goes off. It's got to be something that is significantly affecting things. And I know you mentioned getting aggressive about turning some of those pages off.

Jeremy: Another thing that is a big topic here is resiliency. Sometimes a service can go down and might be able to recover on its own. You might not need to page somebody when a service goes down if you've built in that resiliency and it has some time to recover. Now, if it doesn't recover, then okay, maybe you have to page somebody. But those are the kinds of strategies which, by the way, are really hard for small teams to adopt, because there's an investment in all of that stuff to make it work. I think that's great advice. And I think certainly, if you are working for a company right now and you're getting paged all the time when you're on-call, bring that up with your manager and say, "Hey, this is not the way it should be. And maybe we should make some investment towards this."
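
One way to express "only page if it doesn't recover" is in the alarm definition itself: require several bad data points before paging. A hedged boto3 sketch, with placeholder metric, dimensions, threshold, and SNS topic.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page only if the error rate stays high for 10 of 15 minutes, so a brief blip
# that self-recovers never wakes anyone up.
cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-sustained",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/main-api/placeholder"}],
    Statistic="Sum",
    Period=60,                 # one-minute buckets
    EvaluationPeriods=15,
    DatapointsToAlarm=10,      # 10 bad minutes out of 15 before this pages
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-pager"],  # placeholder topic
)
```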

Brian: I think it's really great for things like retention and for helping people grow. I try to think of on-call as a positive thing, even though I've got plenty of scars and not everything goes well.

Jeremy: It's from those dragons.

Brian: We have bad pages, we have stuff that breaks, and breaks in ways that it shouldn't, and all that. But just trying to keep that under control, and making sure that people feel in control of the situation when they're on-call, that they're respected, that when they are paged, it's probably valuable. That means that we're applying a good quality bar to what's going off. And I think that helps a lot with maintaining good quality and overall good customer service, as a result of knowing what's important when it breaks or not.

Rebecca: Even though you can't see our listeners, I think every single one who is a software engineer is vigorously head nodding. They're like, "Yes, yes."

Rebecca: Well, on behalf of engineers everywhere, thank you for sharing. Not only all of your knowledge, but I think especially this type of cultural impact that kind of can really change people's lives and the way they approach their work and the way they feel about their work. So I know that Intercom is hiring. Just saying, engineers out there who are listening. And thank you so much for joining us and sharing everything with the community. I'm wondering if you could tell us a little bit more about how our listeners can find out more about you and find some of your work, which I think is really cool, the Intercom engineering blog as well. So give us a few of those and we'll drop those links into our show notes as well.

Brian: I'm on Twitter at Brian_Scanlan. And I'm somewhere on LinkedIn. I can't remember how to search me on LinkedIn. I don't do that very-

Rebecca: You're [Scan Limby 00:55:18].

Brian: Okay.

Rebecca: Just so everyone knows.

Brian: And Intercom.engineering or just Intercom.com/blog. Intercom.engineering will bring you to a place where we showcase a bunch of our engineering stuff, but there's loads of stuff on the Intercom blog. I'm up there every so often. And we've got a pretty decent podcast where also I show up every so often as well.

Jeremy: Awesome.

Rebecca: And that article that we talked about with you so much, is also on the front page of that engineering blog. So if you all want to read it for yourselves, you can find it right there. It's great.

Brian: Awesome.

Jeremy: Thanks, Brian.

Rebecca: Thanks so much, Brian.

Brian: It's been great. Thank you so much.