Episode #58: Observing Serverless Observability with Erica Windisch

July 20, 2020 • 67 minutes

In this episode, Jeremy chats with Erica Windisch about the challenges with monitoring and troubleshooting serverless applications, why observability is so important with serverless, what advancements have been made over the last year, and so much more.

About Erica Windisch

Erica Windisch was the co-founder and CTO of IOpipe where she helped organizations maximize the benefits of serverless applications by minimizing debug time and providing correlation between business intelligence and application metrics. She is now a Principal Engineer at New Relic.


As an advocate and pioneer of cloud computing since 2001, Erica is always pushing forward as technology and the industry adapt. She was an early contributor to OpenStack and maintainer of the Docker project where she worked on hardening Linux containers and establishing corporate and community security policies.


Erica is a champion of AWS Lambda and serverless technologies, and she speaks frequently at conferences about AWS Lambda and other AWS solutions. She's passionate about systems architecture, security, and the future she sees for machine-automated, low-code development.



Watch this episode on YouTube: https://youtu.be/T1t_P_zqOiE


Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and this is Serverless Chats. Today I'm joined by Erica Windisch. Hey, Erica, thanks for joining me.

Erica: Hello. Hi. Thanks for having me. Or thank you for having me.

Jeremy: So you are a principal engineer and architect at New Relic. And you're also an AWS Serverless Hero. So I'd love it if you could tell the listeners a bit about your background and what you've been doing at New Relic.

Erica: Oh, gosh. Okay, well, my background is pretty deep. So, I'm at New Relic now. Before New Relic, I was the founder and CTO of IOpipe, which was an observability product for serverless applications. Now, I am working as an architect and principal engineer for New Relic. And if we're going to rewind history a little bit, I previously was a security engineer working at Docker, where I founded their security team and their security processes. I was involved in OpenStack from very early, since its founding. And then before that, I actually had my first company and we had built a cloud. We actually had our own cloud services. From 2003, we were building out horizontally scalable cloud services. We bought really early into that pets versus cattle idea.

Jeremy: Nice, nice. Well, so obviously you're doing a lot with observability. And you're doing that at New Relic, that's sort of what New Relic does. IOpipe was all about that. I know a lot of the team has gone over from IOpipe to New Relic to continue to work and expand their services. And I'd love to talk to you about that today. We've done a number of shows where we've talked about observability, but that was probably almost a year ago at this point. And I'd love to get a sense of where things have gone, where things are going, you know, maybe what the future is going to look like. I've got a bunch of other things I want to talk to you about. But maybe you could just start, just in case listeners don't know: what do we mean by observability?

Erica: Oh, gosh. The way I see it is being able to really see what's happening in your applications and in your infrastructure. Early monitoring things like Nagios, I would not consider observability; that was monitoring. It was very much, very reactive. There was zero proactivity... it was not proactive at all using something like Nagios. Logging products give you some ability to start getting into being able to be proactive. And I think that observability kind of ties in some of the concepts from logging, and ties it in with your metrics, and ties in being highly correlated. And also deeper into your application: having traces in your application, having context for your applications. For instance, just having a trace and knowing that, say, an API gateway triggered a Lambda is one piece of information that you can have, but knowing, say, the resource path, the HTTP method, things like that, that's a deeper set of insight that I think is necessary. And it definitely fits within an observability picture that is very much different and distinct from something like Nagios, or even just plain text logs.

Jeremy: Right. Yeah. And we've talked about on the show the three pillars, right? You've got monitoring, tracing and logging. And so monitoring, like you said, is that sort of general like just something goes wrong, maybe you get an alert, something like that. The logging bit is obviously logging data. But let's get more into tracing a little bit. What do we mean by tracing?

Erica: Sure. The way I look at tracing is as being able to see the relationship between various components, and not just the components. And I think this is also where maybe tracing generally, in our industry historically, has been: this service talks to this or that service, and that service talks to another service, etc. I think of it as this function communicates to this other function. And that is true even outside of serverless, where functions are the primitive. Serverless was a really great place for us to start because it's already segmented into functions. But if you're looking at a microservice, there's no reason that you can't think about your code the same way, about how, say, this function or this component or this resource path is communicating to this other function, and also contextually. So, for instance, maybe this service only calls DynamoDB when it's inserting data. Or when on the API gateway there's a put request, right? That triggers a put into DynamoDB.

You don't get a put into DynamoDB when you do a delete on an API gateway. So that's the kind of context that I think is really interesting for things like tracing. That is, I think, a little bit beyond what traditional tracing solutions have been doing.
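To make that concrete, here's a minimal sketch of recording that kind of request context on a trace with the AWS X-Ray Python SDK (aws-xray-sdk). The event fields assume an API Gateway proxy integration, and the annotation keys are just illustrative, not a prescribed schema:

```python
from aws_xray_sdk.core import xray_recorder

def handler(event, context):
    # Inside Lambda, annotations can't go on the root (facade) segment,
    # so open a subsegment and attach the request context there.
    subsegment = xray_recorder.begin_subsegment("handle_request")
    try:
        # Annotations are indexed, so traces become filterable by
        # method and path, not just "API Gateway triggered a Lambda."
        subsegment.put_annotation("http_method", event["httpMethod"])
        subsegment.put_annotation("resource_path", event["resource"])
        return {"statusCode": 200, "body": "ok"}
    finally:
        xray_recorder.end_subsegment()
```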

Jeremy: Right. Because I mean, it's a lot different in these distributed systems. I mean, even if you're just talking to one microservice, usually you talk to one microservice and maybe you want to see that continuity there, or service X called service Y. But now with serverless specifically, we have function X calling service Y, which generates an event that then gets picked up by EventBridge, and then another service picks it up and so forth. So we can get into why it's important in serverless applications, but is there anything else where observability is different in the sense of monitoring? I mean, you mentioned the idea of being a little bit proactive. What do you mean by being more proactive?

Erica: Well, by being proactive, I guess I'm referring to the fact that, again, rewinding history a little bit and going to something very distinctly not observability, like Nagios. Again, that's very reactive. Something went down, we asked it if it's up, we couldn't reach it, so then we determined that it's down. I think that kind of step two of that journey towards observability would say, "Okay. Well, we have logs, we have a logging product. And the logs told me that when this service tried writing to the Syslog server, it got an error." Well, when I get that error, I know that at least this system cannot talk to the Syslog server. In fact, maybe I know that a hundred systems cannot talk to the Syslog server. And I think two things come out of this. One is that it eliminates the kind of, maybe, the falsehood of a binary status for uptime, right?

Because maybe that Syslog server is up from the perspective of, say, Nagios, or from the perspective of machines on a segment or on a subnet. But maybe there are machines in another AZ or another subnet, and those are the machines that cannot talk to it. And that's contextual information that is really critically important. I guess you can argue that it is still somewhat reactive, because you're still basing it off of, say, something like logs. But you're not polling for that kind of information directly, necessarily, in order to have a basic fundamental understanding of things that your applications should already be knowing about themselves.

Jeremy: Yeah. And Sheen Brisals wrote an article the other day that I thought was really interesting, where, with all the observability in place on the serverless applications that they have at LEGO, they basically said the system reports when everything's healthy. The alerts are, "Hey, everything's working," right? I mean, you can see that everything is going through that pathway. I think that's really interesting too, because it's not only about maybe this service not being available. It's very much about this service not being available if you try to put data this way, right? So you can see that with tracing, and you get a much better understanding of, "Well, yeah, the system's not down." If I was just getting alerts saying, "System's fine. System's fine," but then you're seeing a consistent pattern of certain messages failing, then it's really great to have that tracing and the ability to go in and dive into it and say, "Oh, okay. When it's shaped this way or when it comes from this component, then there's the error."

Erica: Yeah, exactly. I think that being able to know when and where is a vital component. Nagios, for instance, would tell you what, and it would tell you when, but it wouldn't really tell you where, necessarily, right? Which applications are having problems communicating? And I think context is really the important key for me here. Being able to facet that data and tell you exactly where it's happening and for whom tells you a lot about why it's happening. Because going back to the subnet example, if you can easily look in your observability tool and see that all of the services that are in this subnet are having a problem communicating, you start to really flesh out the why of a problem much more quickly than if you just know that that service cannot be communicated to, and you don't have any other additional context.

Jeremy: Right, right. Yeah. No, context, I think, is super important. All right. So why is observability so much more, or I won't say so much more, but extremely important in serverless?

Erica: Well, I think one of the things about serverless is the fact that it is broken up by default into these many pieces, right? So, by default, you have much more sprawl, you have many more services. Instead of a monolith, which contains many functions or many endpoints or resource paths or whatever, you potentially get many functions that serve that application. And I think that two things come out of this. One is the capability to pinpoint things more accurately, because that context is kind of baked into serverless. When you know there's a problem with that function, and that function has a very narrow scope, right? That gives you a really strong context into what is happening, versus "it's this application," right? Because once you narrow it down to a function, you have that built-in context. So I think that serverless actually enables you, more so than the fact that it just needs it.

And, yes, I think that's, for me, the biggest thing. In terms of other reasons why maybe it's important for serverless, it's just because there is maybe a lack of other traditional tools. So, you wouldn't run maybe some of the more traditional tools in the traditional way with a serverless application. So, you're not necessarily getting that broad picture. In some ways, where serverless kind of forces you into this deeper contextual awareness of your application, it also kind of requires deeper contextual observability for these applications. They kind of go hand in hand.

Jeremy: Right. Yeah. No, and I actually find with a lot of the tools we use in serverless, whether it's SQS or EventBridge or some of these other ones, you don't really see what happens. It's very black box for some of these transport mechanisms that are in there. So, being able to connect that stuff together matters. You can't go and look at your RabbitMQ logs, for example, and see what happened if messages got lost or if they didn't get delivered or something like that. Whereas, if you put it into SQS, that's just not available to you, right? You have to see whether or not it actually happened. And without recording that and being able to trace that all the way through, there's obviously a lot of data that's missing there, if you don't have the right observability tool in place. So what about the challenges that developers see when they're trying to monitor these applications and troubleshoot those applications?

Erica: Sure. I think that's a really good point with SQS, and I think this also exists for services like Kinesis: you don't necessarily have traditional logging for these services. Sometimes they can act like a black box. So knowing the context in which your application is consuming from those services, what kind of messages are coming in, at what rate, how it's partitioned, a lot of that information is contextually provided to the Lambda. So I think that observability of the Lambda itself, for instance, can give you some insight into those services that you don't otherwise get. I think another challenge is, again, relating back to the sprawl. There could be many components of a serverless application. And, first of all, these are distributed applications. And not everybody's familiar with and comfortable with shipping and observing and operating a distributed application, in the way they are with maybe monolithic, non-scalable applications.

And I think that a lot of users do really need tools to help them bridge that experience gap as well. And even with the experience, it can be a really valuable tool to help you visualize what is happening in your application.

Jeremy: Right. Now, what about the fact that it's so ephemeral? I mean, we see containers being very ephemeral now as well, but the fact that functions disappear after a few seconds, is that creating problems?

Erica: I think it's a different way of writing applications. Something I saw a lot early on when we were doing IOpipe was that users wouldn't necessarily always account appropriately for how the Lambda environment worked. So they would assume that things could be long-lived where they couldn't. There were certain libraries which made that model very difficult. Some of the database libraries in particular were a really frustrating challenge for a lot of users. AWS has made some progress on building proxies and data APIs and things like that to kind of bridge that gap. Because some of those libraries are kind of fundamentally, maybe not incompatible, but less compatible with the serverless, with the AWS Lambda model, at least.

Jeremy: Yeah. So when you say incompatible though, I think you'd mentioned to me before that there was a W3C standard, that is sort of standard now but not necessarily standard in X-ray?

Erica: Yeah. Well, there's a W3C standard for... well, trace context. W3C Trace Context. And New Relic was actually involved in creating that. We have some engineers, I think Justin Foot and Erica Arnold in particular, were involved in that, and maybe some others. And the idea is that it defines it for HTTP headers in particular. Although the actual encapsulated data could theoretically be put over other protocols, the spec defines it over HTTP; it is the W3C, after all, right? But the idea is that this is a standard set of headers that can be passed along, vendor agnostic, throughout services. So, if all of your applications support W3C trace context, even if they're using different libraries by different vendors, as long as they all support W3C trace context, you can actually have complete traces through all these applications. Now, the AWS services do not currently, as of the time I'm speaking, support this trace context. They do support X-Ray headers, so they can pass those along.
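For reference, a W3C Trace Context traceparent header is four dash-separated hex fields: version, a 128-bit trace ID, a 64-bit parent span ID, and flags. Here's a minimal, vendor-neutral sketch of continuing a trace from incoming headers, based only on the published header format:

```python
import re
import secrets

# version-traceid-parentid-flags, e.g.
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-(?P<trace_id>[0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$"
)

def continue_trace(headers: dict) -> str:
    """Return a child traceparent, preserving the caller's trace ID
    so the end-to-end trace stays intact across vendors."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    # No (or malformed) incoming context: start a brand-new trace.
    trace_id = match.group("trace_id") if match else secrets.token_hex(16)
    return f"00-{trace_id}-{secrets.token_hex(8)}-01"
```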

Jeremy: All right. So what about the advancements that have been made over the last year or so? Because again, a year ago it was pretty cool, right? There was a lot of great tracing software, there have been more vendors jumping into the space. So has there been some maturity, do you think, with these tools over the last year?

Erica: I think so. I think one of the big things we had out of New Relic is the recent launch of Infinite Tracing on New Relic Edge. I was actually involved in creating the Edge element of this, which is, primarily in the first cut, a provisioning solution for various services that New Relic will provide on the edge. The first application for that is Infinite Tracing. And Infinite Tracing allows you to throw millions and millions of spans at a service that lives on the edge. So, say if you're in AWS, it lives in your AWS region, and it receives those traces at high data rates, right? We can ingest at tens of gigabits per second, per trace observer. And then once we consume those, we can apply machine learning and other filtering mechanisms to help you sample appropriately. So rather than... traditionally with tracing, what would happen is your agent, whether it's OpenTracing or it's a New Relic agent or one of our competitors' agents, has to make sampling decisions in a vacuum, right? In a fairly stateless way.

What we're doing here instead is we're receiving all the traces, but then batching them together and filtering them on the back end, right? So we're only storing so many, but we're actually able to do back-end filtering of larger batches. So there's much more context as to which traces are important and which ones we should keep, and which ones are unimportant and we should throw away. I think that's a really big change in how tracing works at New Relic, and for the industry, potentially. And something else we're doing is we're releasing, I don't know the final product name, because we're doing this call a little bit in advance of the launch, but it will be an X-Ray integration where we're able to ingest X-Ray traces and correlate that with data that we have in New Relic. So when you have a Lambda, you'll be able to see not just all the traces that are within the application and the traces out, and context for that trace, but also be able to see the AWS services and see through those services.

So, for instance, if you're triggered by an API gateway, now you're going to have context for those traces, in the same way you would get from X-Ray, but you have that now pulled into New Relic. So, in those places where we don't have deep observability, because there are components that we cannot instrument, because of third-party services or because these are third-party tracing products like X-Ray, we can actually pull those in and tell you a more complete story.
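The shift Erica describes, from head-based to tail-based sampling, is easier to see in code. This toy sketch (not New Relic's actual logic) makes the keep/drop decision with whole traces in hand, something a lone agent sampling span by span can't do:

```python
from collections import defaultdict

def tail_sample(spans, keep_every=100):
    """Toy tail-based sampler: group spans by trace, always keep traces
    that contain an error, and keep only every Nth healthy trace."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    kept = []
    for i, trace in enumerate(traces.values()):
        has_error = any(s.get("error") for s in trace)
        # The whole trace is visible here, so the decision is informed,
        # unlike an agent deciding in a vacuum at the edge.
        if has_error or i % keep_every == 0:
            kept.extend(trace)
    return kept
```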

Jeremy: That's awesome. Yeah. Let me ask you some questions about how some of that third-party stuff works, or the X-Ray stuff. Because again, I know AWS has added some capabilities where they pass trace headers through SQS and things like that. But that's not available on all their products, obviously. And I probably think of it this way because I mostly build web applications. I'm mostly thinking about HTTP, right? But there are a lot of other things that are happening. So where are we with that kind of stuff? With sort of the non-HTTP messages being passed around?

Erica: I think it's interesting, because it's something I've been thinking a lot about recently. In particular, because AWS does have that for SQS and, I think, SNS. And that's not a place where I think a lot of vendors are necessarily looking for trace headers. It's a place where the W3C does not define standards for how to pass along data in these non-HTTP ways. But I do think that W3C trace context, like the traceparent headers, the values of those headers could be passed along in places that are not HTTP. And I think it's going to be really compelling once all these different services are able to support these. When we actually look at what it means to have... for instance, do we get to a future where, when you write data into DynamoDB, you can actually pass along a trace header? And then the trigger that comes out of that, right? The Lambda trigger off the DynamoDB stream can actually have context for some of those traces? I don't know. It'd be really interesting to see that feature. I think we're just kind of at the beginning of that a little bit.
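One place this can already work in practice, sketched under the assumption that you control both producer and consumer: SQS message attributes give you a non-HTTP slot where a W3C traceparent value can ride alongside the payload. The queue URL and attribute name here are illustrative:

```python
import boto3

sqs = boto3.client("sqs")

def send_with_trace(queue_url: str, body: str, traceparent: str):
    # Carry the W3C value in a message attribute rather than the body,
    # so the payload itself stays untouched.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageAttributes={
            "traceparent": {"DataType": "String", "StringValue": traceparent}
        },
    )

def handler(event, context):
    # Consumer Lambda: recover the context from each record's attributes.
    for record in event["Records"]:
        attrs = record.get("messageAttributes", {})
        parent = attrs.get("traceparent", {}).get("stringValue")
        # ...resume the trace from `parent` with your tracing library
```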

Jeremy: Right. Yeah. Because I mean, if you're doing something with DynamoDB now and you want to read off the streams, or even I think Kinesis, right? You still need to pass in your own correlation IDs, in order to trace those back, right?

Erica: Yes. And there are lots of questions about how that would work. I think in the case of Kinesis, which I know a little bit better than Dynamo, to be honest, you have individual records. So I think that in this case, it would be a trace for where that record came from, not necessarily... because you don't want it from the batch, right? Because if you have it from a batch, it's not really very useful. You want to have it down to the individual record.

Jeremy: Interesting.

Erica: But yeah, you're right, you can kind of encapsulate that yourself right now. But none of that is built in by default. And there's no way that New Relic could just, say, modify people's Kinesis records, because that's arbitrary base64 data.
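Which is why, today, the producer has to put the correlation ID inside the record itself. A hedged sketch with boto3, with illustrative field names; note the consumer recovers it per record, not per batch, as Erica suggests:

```python
import base64
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def put_event(stream: str, payload: dict, partition_key: str):
    # Kinesis data is an opaque blob to everyone downstream, so the
    # correlation ID must be embedded in the payload by the producer.
    payload.setdefault("correlation_id", str(uuid.uuid4()))
    kinesis.put_record(
        StreamName=stream,
        Data=json.dumps(payload).encode(),
        PartitionKey=partition_key,
    )

def handler(event, context):
    # Consumer Lambda: records arrive base64-encoded; decode each one so
    # the trace ties back to the individual record, not the batch.
    for record in event["Records"]:
        data = json.loads(base64.b64decode(record["kinesis"]["data"]))
        correlation_id = data.get("correlation_id")
        # ...log and trace with correlation_id
```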

Jeremy: Right. Yeah, exactly. Exactly. No, I mean, that's what I'm thinking. It's like it'd be really great to have that extra stuff. Now, does X-Ray have any of that data that you're able to import now?

Erica: Oh, gosh.

Jeremy: Right. Because you can't trace DynamoDB all the way through with X-ray. I don't think you can.

Erica: Yeah. So the SQS example, I'm sorry, the SQS example is something where we could potentially do that. We're ingesting that X-Ray data. So if X-Ray has that data, if it's passing along those values and X-Ray collects them, then we can ingest it and we can give you that context. Our agents are not directly getting that data, which just means that it's going to be harder for us to correlate it. But honestly, until, say, SQS and AWS have W3C trace context support, we're probably going to have a little bit of a gap period before we get the kind of correlation that we want between native New Relic traces, whatever that is, and native X-Ray traces. Because once you have W3C trace context, you don't really have this concept of native anymore. There's something called a trace state, which is a vendor-specific field. For the large part, we might ignore those, or we might decide to support some of those. But for the large part, we're currently working with the traceparent, which is a highly vendor-agnostic field.

Jeremy: Right. All right. Well, AWS, if you're listening, let's get moving on that stuff, because it would be nice to have-

Erica: Yeah. And I definitely give them my feedback.

Jeremy: All right. So take off your New Relic hat for a second. I'd love to just get your insights into the overall landscape, where we are with serverless observability. So obviously, the landscape, the number of vendors that are getting into this space: we had IOpipe and now New Relic, Thundra, Datadog, Epsagon, Dashbird, Lumigo, Honeycomb, and then AWS recently launched their ServiceLens. So we've just got all these different tools. So my question is less about which is the best one to choose. It's more about this idea that I think they're all doing something slightly differently; they're trying to add a little bit of, I guess, what's the right word for it? Distinction between them. But I guess it's a good thing, right? That we're getting all this competition. What does that mean for serverless adoption? Does it mean anything for serverless adoption?

Erica: Oh gosh. I think that, the successes and failures of serverless observability over the last couple years, and the larger socioeconomic landscape, or economic landscape rather, especially around COVID and everything else... We have a world right now where, I'm not really sure. I think that I definitely would have preferred that IOpipe could have stayed independent for longer, to be quite 100% honest. And, gosh, I think there is a distinction between these products. One of the things that we determined at IOpipe towards the end was that we wanted to start going broader. We had gone very deep on serverless, and we wanted to start going broader, for a couple of reasons. One was because we found that almost nobody is running just serverless applications, right? They're running serverless applications as part of a bunch of applications, right? They have business needs, and those business needs are not entirely serverless business needs.

IOpipe was a company where we were running everything on serverless, but that was not the case with the vast majority of our customers. So we wanted something that was broader. And I think that New Relic was a really great way for us to look at saying, "Here's a way that we can go broader without having to become a full competitor to New Relic. Without having to build out everything for every service." Because that's really important. Users need to have observability for their whole application, and IOpipe was doing serverless observability. I see some of the other competitors now also going a little bit wider, a little bit broader with their missions. But I think it's challenging because there are a lot of pieces to build, and you have to decide which ones you're going to build first. Are you going to build out really fantastic tracing? Are you going to build out fantastic logs? Are you going to build out fantastic monitoring? Right?

How much of these pieces are you individually building out? How are you connecting them? Because you want to make sure it's actual observability, and it's not just a piece of it. And for what it's worth, I think that IOpipe was observability. I believe it was. I believe that it encompassed all of these things, but it did it very narrowly, just for serverless. And that was an intentional thing that we did, because we couldn't build the entire world. There was only so much we could do, with so much money and so much time. So we focused very narrowly. Trying to do a broad set of things for a narrow market segment is easier than doing a broad set of features for a broad market segment.

Jeremy: Right.

Erica: But that's what we're doing now at New Relic, right? Going for the broader market segment, not just the serverless part. Yes, we're doing serverless, but it's not just serverless, because realistically, you have things that are not just serverless. And it's very hard, I think, for really any vendor to do all the things and to do all of them right. I know you told me to take off the New Relic hat, but this is a question that was really hard to take that hat off for. Because I do think that we're doing a lot of those pieces, and we're doing a lot of those pieces right. I think that it's very possible. I think that companies like Honeycomb do really fantastically at doing their market segment very, very well, and maybe better than we do at that particular market segment. But that is a segment of the market.

And broadly speaking, we have customers who have mobile apps. We have browser apps. I want to get to the future where I can look into my dev tools, authenticated and logged in of course, and if I am trying to debug my application and I'm having a failure in my browser, I want to be able to click into dev tools and then jump straight into the line on GitHub that is giving me that problem, for the back end service that generated the problem that went all the way to the front end. That's the future I want to get to. And I don't think that's possible just doing narrowly focused market segments.

Jeremy: Right. Yeah. And I think that what I like about companies that are established, like a New Relic and like a Datadog, these companies that are covering this wider swath of broader market segments, what I like about them getting into the serverless piece of it is, I think for a lot of people, having a good serverless observability tool is an absolute necessity. You can't not have one of those. And if you are trying to build an application and you're all on containers, maybe you still have some EC2 instances running, maybe you still have some on-prem, but you want to get into serverless. If you have to go out and buy a different tool and try to integrate that into what you're doing, I mean, that just becomes a really hard problem and a really hard sell.

And if you've got these bigger, more established companies that can do all these different things, and you start mixing all of those things together, then I actually think the adoption of serverless becomes easier, because now you have those standard tools in place that are just a natural extension of your cloud infrastructure.

Erica: I think that's 100% true. And I think that was one of the biggest challenges we had at IOpipe. Users didn't want two tools, and the fact that the serverless tool was separate made it very hard to migrate the users, right? Because they had to migrate not just the applications and the way that they built their applications, but those developers had to also learn a new tool and use two tools. It is significantly better now that we have a unified platform.

Jeremy: Totally agree. Totally agree. All right. So speaking of hybrid applications, because I think that's what we're talking about. Some people may be running their main workloads on containers, maybe they're using Kubernetes or something like that, and then they've got these peripheral things that they might be doing with serverless. Maybe their ETL, maybe their DevOps tasks, whatever. But clearly, you do have a lot of hybrid apps, and that's great. That's fine. Do what you need for your workload. But one of the things that I thought was interesting, sort of relatively new with Fargate and with Cloud Run, is this idea of trying to take containers and make them more serverless. So how do you feel about serverless containers?

Erica: Oh, gosh. There's so many things here I cannot talk about.

Jeremy: Do your best.

Erica: So I think that it's interesting. I think that Cloud Run in particular is pretty interesting. I think that it's important to meet users where they are. And building out serverless container runtimes is a really fantastic way of meeting users where they are. That said, I think there are reasons why serverless... So artificial constraints, I think, are one of the most powerful tools that we have as builders of infrastructure products, right? I come from a history of building infrastructure products, things like Docker and OpenStack. And one of the things I wanted to do with Docker, and I advocated strongly for, was actually fewer features. I wanted Docker containers to be able to do less. And that was because of a number of reasons. I wanted to have more immutability for the services. I wanted to have more immutability for the logs. One of the things that I found with Docker was that if you stopped a container and you restarted it, you would get a new set of logs. So if you did docker logs, you didn't have any of the logs from the previous run.

Why would we throw those away? Those should be immutable. I was like, this should be an immutable record. Logs should never be erased. And I lost that battle. I lost the battle of saying that we should not be able to ping out of containers. Because to ping out of a container requires net_raw. And if you have net_raw, you can do things like spoof the IP addresses of other containers on the same host. So these are the kinds of things that you can do in Docker that I thought you shouldn't be able to do in Docker. I thought that we would enable users by taking away features, because the problem is that enabling users to ping also enables them to compromise adjacent containers on that host, right? And those are things that we don't want to enable our users to do. We don't want to enable users to lose their log files, right? We want to enable them to have immutable logs. And I think that's serverless, Lambda at least, right? Because I don't think you should say it's a serverless thing. I think it's a Lambda thing.

Lambda has done a really good job of having really tight constraints on the workloads. And allowing arbitrary containers, arbitrary Docker containers for instance, or OCI images, to run would mean that your applications can do a lot of things that they really probably shouldn't ever be doing. You should never have an application that can write to arbitrary... if your application were to escalate to a root user, that root user should never be able to write to the /etc/passwd file or the /etc/shadow file. That should be impossible. Your root user should not be able to do those things. You should have an environment where you are contained in a way where you cannot escalate in that fashion. And I think that enabling containers, right? Arbitrary containers, does take a step back from that.

On the other hand, we do want to meet users where they are, and enable them to build applications in a way that actually accelerates their development. I'm thinking back to the CGI days. We had so many users, and I mean not just users, but tutorials and blog posts... This was one of the things when I had a web hosting operation in 2002, 2003, 2004. I mean, it was really the whole 2000s. But that was when, for me, we had users shipping CGI applications and PHP applications. And then we started moving, we tried to force users to go into virtual machines and containers in the mid-2000s, because we wanted to stop having users doing bad things on our infrastructure. They kept doing bad things, but they did it inside their own sandboxes.

Jeremy: Right. Exactly.

Erica: This is something else that we learned, right? Was that enabling users to do things in a secure way, did not actually get them to start doing things in a secure way. It just isolated them from the other users, so they did it-

Jeremy: From the rest of the system. Yeah.

Erica: Right. But like, you would find blog posts where they tell you to make your directories mode 777.

Jeremy: Oh yes. Yes.

Erica: Right? And that's something that users should never have done. But they did it because a lot of providers didn't have the right security isolation. But when you did provide the right security isolation, and you had your PHP application running as your own dedicated user, in your own container in your own VM, which we did, users still set their directories to chmod 777. It was completely unnecessary. I think it's the same struggle. You give users Docker containers, and almost every developer is going to do it wrong. Forcing them into the best practices means not giving them another choice. Don't give them a choice to do it wrong.

Jeremy: Right. And I totally agree with you on the artificial constraints thing. And that's one of the things I love so much about serverless: there was no state, right? So you had to just think about things differently. And there were circumstances where you're like, "Wow, it'd be really nice to have access to state to do this one specific thing," where it was a best practice or a good idea to use state. But under normal circumstances, let's say massively horizontal scaling, using state was a terrible idea, right? Because you just wouldn't get the performance. So then we get EFS integration with Lambda, and that changes quite a bit. Now, I think there are a lot of really, really good use cases for that, where Lambda would be a perfect fit for the workload. But back to your point, I think people can do some really bad things with this.

Erica: Oh gosh. I mean, true. I think a lot of users can do bad things with it. I've actually been thinking about some really awesome/awful things I can do with it. So I set up EFS, and I have my own VPN from my house into my AWS environment where I have an EFS, and I can locally, in my house, on my home computers, mount those NFS folders, which is amazing. And then I can run Lambda jobs against the data that I basically throw onto my NAS. So I can keep my photo libraries on EFS, like out of Aperture or out of Lightroom, right? So my Lightroom can now store on EFS, and then I can have Lambda process my images, my photos. That's a really powerful thing. But also, is that a way that we should be working with our things? I think there is value in the fact that we are enabling use cases and workloads.

Another application I've been working on has been email. And I did a whole talk on how I kind of failed at building out an email system. And one of the things I did not talk about was how EFS would make this better. Because EFS wasn't announced yet.

Jeremy: Wasn't an option, right. Yeah.

Erica: Yeah. But one of the things was, you can have SES write into S3, and have S3 trigger a Lambda to write those email messages into EFS. So now I'm using Lambda to actually write into EFS. My applications that are reading from EFS are actually container applications running on Fargate. And that's because an IMAP server cannot run behind API gateway. It can't really run anywhere serverlessly. If you want to run an IMAP server, you need to run it basically in Fargate or EC2. And so that was the model I picked. So now I have an IMAP server, Dovecot, running on Fargate reading off EFS, and the files in EFS are written to it from SES. And the only way to do that is with Lambda. My alternative would be to write, I guess, an S3 consumer that would pull from it, or put it in SQS and then write an SQS consumer that runs on Fargate.

And here, I could just write the Lambda, which is a lot easier, a lot more powerful, a lot less to maintain, and then I only have to have Fargate for the IMAP. And the other thing is, the IMAP side doesn't have to scale as much, right? Because the IMAP only has to scale for the number of people who are reading, say, a mailbox.
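A minimal sketch of that Lambda glue, assuming an EFS access point mounted on the function at /mnt/mail; real Maildir delivery would also need unique filenames and a tmp-to-new rename, which is elided here:

```python
import os

import boto3

s3 = boto3.client("s3")
MAILDIR = "/mnt/mail/new"  # assumed EFS mount path on the function

def handler(event, context):
    # Fired by S3 when SES drops a raw message; copy it onto EFS so the
    # Dovecot IMAP server on Fargate can read the same shared filesystem.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        with open(os.path.join(MAILDIR, os.path.basename(key)), "wb") as f:
            f.write(body)
```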

Jeremy: Mm-hmm (affirmative).

Erica: Writing the email messages from SES, I mean, that's the many-to-one problem, right? The IMAP side is a few-to-one.

Jeremy: Right. Right. Now, what I'm concerned about is that someone's going to be like, "Oh well, now I have a file system that's shared, and I can just connect Lambda to it. So now I can just use Lambda as a web server, right? And just load files off of that." Because you just know someone's going to do that, right? I mean, we've already talked about that in the past with serverless.

Erica: I will definitely do that. I will definitely do that.

Jeremy: Just for fun. But I think that, like you said, meeting consumers where they are. I mean, I wonder about EFS though, especially when you get down to things like machine learning and some of these other things, where you've got to load really large notebooks or you've got a lot of data that needs to be loaded in, and streaming that from S3 is just, one, expensive, and slow. Do you think that maybe EFS might dissuade people, or open up new possibilities where Docker containers might not be as needed?

Erica: Well, I think in my IMAP example, right? That's an example where I would have had to build that application entirely on top of Docker or EC2 previously. Now, with EFS, I could build a hybrid application, that is partially built on top of Lambda.

Jeremy: Yeah.

Erica: It doesn't get me all the way. And I guess there's an argument... well, containers on Lambda wouldn't solve that problem for me either, right? Unless they were able to give me arbitrary ports, which would be amazing. But until AWS gives me arbitrary TCP/IP, I'm going to be stuck having to at least run the non-HTTP services on Fargate. EFS definitely did enable me to take that particular application and not run half of it on containers.

Jeremy: Yeah, right. Right. All right. Well, let's move on to talking about maybe the future of the cloud, because I know you did a lot of work on OpenStack, and we've got Kubernetes, right? And maybe we bring this back to open source. So you've got all these big open source orchestration systems and cloud orchestration. Is that where we think it's moving? Do we think we're going to see the OpenStacks and the Kuberneteses being the dominant players in terms of how people are building cloud applications?

Erica: Oh gosh. I don't know. I've become very skeptical of open source over the years.

Jeremy: Okay.

Erica: I think there's a lot of traction for the fully hosted services. I mean, Lambda, excuse me. Lambda is interesting because it's completely closed source. I guess you can say Firecracker. I have hiccups. Firecracker is kind of a partial open sourcing of Lambda. But it's open sourcing of the pieces that are very much not serverless. Right?

Jeremy: Right, right.

Erica: It's open sourcing of something that looks a lot more like traditional architecture. But we also have Kubernetes and like EKS, alas the Google version, Google Container Service. It's interesting because they are shipping open source solutions, but part of this is like now AWS is charging you 10 cents per hour I think, to run an EKS cluster. And I mean, I think it comes out to $70 or something a month, just to run an EKS cluster. To run no applications on it. I want to tell you as an individual developer, I am not going to do that. It's not important to me as an individual. Now, of course, as a business, building business applications, it can very much make sense to pay $70 a month, to manage your application. As an individual, I mean, I can go buy a Raspberry Pi and put it in my garage. And I think that is, kind of important because even though that may not be the market that like AWS is looking for, it does mean that you have fewer developers experimenting and learning, with these technologies because, from a learner's perspective, they're more difficult to access.

And I do think that open source, in theory, provides a lot of opportunity for learning. But all of these solutions are way too complicated. I think Docker was a really great example of a successful open source project, in that it was very easy for developers to use it and learn it. Kubernetes is way too complicated for the majority of developers to pick up and run in their house on a Raspberry Pi or on a small server or a small VM, for them to experiment and play with. It's too much, it's too difficult, it's too expensive. And honestly, I don't think it's a fundamental problem with building orchestration solutions. I think it's a fundamental problem of these being corporate solutions. These are solutions that are being built by enterprises for enterprises. They're not being built for learners. They're not being built for developers who need to enter this industry or to get their next job. If anything, they actually create more barriers than they create solutions in some ways.

Jeremy: And I wonder, do you think they may be victims of their own success, right? It gets popular, and then you start getting a whole ecosystem around it, and then they get more complex and more complicated. And then you get things like EKS, where Amazon says, "This is too tough for any normal person to manage. So we're just going to build a service that abstracts that away." Is that something you think about?

Erica: It is. And I think that EKS can be strongly contrasted against ECS, right? Where you have a service that is fully managed for you. And for me to get started on ECS, I had to spin up an EC2 image, which honestly, I have to say, was a little harder than I thought. You could theoretically use Fargate, although I've had a lot of trouble with Fargate. I'll set everything up the way that it's supposed to work, and then I just get an error saying, "Oh no, Fargate can't actually run this workload." It just says no. It doesn't say why, it just says no. And I'm like, "Okay, well, I'm just going to spin up an EC2 image and run traditional ECS." But even then, it's what? An EC2 image and a Docker container. It's not, "Here's my cluster. Here's my configuration for that cluster." Whereas with EKS, you have AWS managing a service, and then there's still the service that you have to kind of manage yourself in there as well. And it's significantly more complicated and more costly.

And I don't necessarily want to run a cluster at all. I want to have the ECS experience for Kubernetes. Or maybe just no Kubernetes at all, as somebody who doesn't necessarily need it. I just want to run my applications. That's what I want to do. And I want to pay as little for them as possible. I want them to be as easy to set up and run, and easy to shut down. That's a lot of the reason I like Lambda. Because I don't have to worry about it or think about any of it. And for the majority of my applications, that is fine. That said, I also have a programmable open source networking switch in my basement that I have built my own operating system for. So, I can go a little bit both ways with this. But the thing is, that's a choice. I wanted to build an operating system for my networking switch and run that. And I don't want to do that with Kubernetes. I just don't.

Jeremy: Right. Well, I mean, it's fine for your own personal stuff. But if you've got enterprises that are relying on this stuff, then obviously it needs to be fully tested, and it needs to have lots of developers contributing to it. Which is another thing I think is interesting, or I guess an interesting trend: companies enforcing, well, forcing is probably the wrong word, but having some of their employees just work on open stuff or open source stuff, right? And I like the idea of companies dedicating some time and resources to help keep some of these open source projects up and running. But what are your thoughts on companies that have open source teams that are doing a lot of contributions?

Erica: I mean, I've been on one of those teams. I have been an employee who was just working on open source things. When I was working on OpenStack, the majority of the work I was doing was in the open. I also did a lot of building of Chef recipes, integrating those components together and making them work. And I think this is one of the things that was kind of touched on a little bit in the question: maybe this is less true with Kubernetes, although maybe not completely untrue, but it was definitely extremely true of OpenStack, that you had these loosely coupled components that, as an operator, you had to figure out how to put together and make work. Everything was tested, but nothing was really integrated. And you needed to have companies that integrate these things for you. That's why you have companies like Heptio and VMware and everything for Kubernetes as well. So I think that was an issue.

From a perspective of open source developers, though, my biggest issue is the culture. Every one of these open source projects, however small or big they are... and things like Kubernetes, right, are now multiple projects. You have things like Falco and so forth that are sub-projects or adjacent projects, or however you want to define them. But you have a community here that operates a certain way; they have their own culture. And that culture is potentially different from the culture that you, as a company founder or as HR or a manager or whoever, want your company to have, or your team to have. Right? And how do you resolve that difference? Because one of the other things is that a lot of people hire from these open source communities.

So if you are building a team that is going to work in open source, and you want to make this a diverse team, for instance, but it's not a diverse project, how does that work? Right? Are the project and the other people in that project going to discriminate against you, either implicitly or explicitly? It may not be intentional, right? There are implicit biases that exist. And I think it becomes very difficult, because when you have your own closed source application, and you're building things for your own self and your own teams, you have control over what you're building, how you're building it, and the construction of your team, etc. And I think that you lose a lot of that when you're working in an open community.

Because if you're only working on open source, it's almost like while you're employed by one company, your co-workers are almost in a sense, a set of people that are not hired by your company. That may not actually hold the same values that you or your company holds. And I don't have a solution for this. But it's something I think about a lot. And it's one of the reasons I no longer really contribute much to open source.

Jeremy: Yeah. Well, I mean, and that is where the problem is. You get brilliant developers and engineers like yourself, and then there's the culture that just exists in tech, which is in many cases pretty bad. Whether it's discriminatory, or, just like you said, maybe they don't accept that PR because, "Oh, it's from you," or whatever it is. And you don't know who those maintainers are sometimes or how they feel. And then there's no accountability, right? That's the other thing that I think is a challenge in open source. But that's too bad. I wish you would contribute more, rather than just writing your own operating system for your network switches. But anyways, I have one more question to ask you, just about this idea of open source versus proprietary systems. So I love Lambda, right? I think Lambda is a great product. It's got so many awesome features.

Yes, it doesn't do everything perfectly. Yes, there are constraints, some good, some bad. But then you've got Knative or, what's the other one there? OpenFaaS, and some of these other things. What do you think about that? I mean, I do love open source projects, and I do love what you can do with that open source stuff. But on the other side of the coin, having somebody making a profit off of it, constantly monitoring it and improving it and listening to customer feedback, I think that's important too. So where do you stand on those types of products?

Erica: Yeah. This is, again, the challenge of open source versus corporate engagement, going back to the reasons why I'm hesitant on open source. It is important, I think, to understand where your users are coming from and what your users need. And I think that a lot of those corporate interests are really good at having product-driven decisions. I don't think that it's necessarily a requirement. But on the other hand, some of the more successful open projects that do not have corporate sponsorship do also tend to be things that are more straightforward, where the use case is really well known. I think that some of the video game emulators, for example, are very great projects that maybe don't have as much corporate sponsorship as other projects in open source. But also, it's really obvious what you need to do, right? You need to make the thing work technically, to a defined standard. Whether that standard is written down or it's a black box, you're replicating it.

What was interesting for me with Knative and OpenWhisk and some of these others was that they didn't necessarily actually go to Lambda and look at, "How are we going to kind of emulate this service?" They kind of went and did it their own way, with their own product, Oryx. And they didn't necessarily learn the lessons that the other products had learned, or these other teams had learned. So yeah, I guess I'm a little conflicted on this, because I don't necessarily see corporate engagement always actually delivering the right product. Because I'm not actually sure that Knative is the right product. I don't think it's picked up the way that a lot of people hoped that it would pick up.

Jeremy: Interesting. Yeah. I just wonder too about that question of lock-in. I feel like that lock-in question is one so many people still ask, or that is still, I think, in a way, part of their decision-making process. But I just think of something as simple as compute with Lambda. And yes, it's got all these other great features, things that can connect to all the eventing that's built in. And then you look at something like Knative or OpenFaaS or OpenWhisk or any of these things that are open source implementations of these, and they have their sets of limits and their features and other things that they do as well. I mean, and I even hate to ask this question, but is lock-in a factor there? Or is it one of those things where moving a compute service is probably not as challenging as trying to design for the lowest common denominator?

Erica: I think that for open source, lock-in has a few factors. One is, what is the velocity of that project and its uptake? Because a lot of companies do not want to be the first ones to adopt something like Knative. And they don't want to be locked into it if it turns out that the project ultimately fails, right? Because now they are locked into something that is abandoned. And nobody wants to be locked into something that's abandoned. But increasingly, kind of going back to the culture thing a little bit, and I don't think most people think about this, but it's something I personally think about a lot: you're also locking yourself into that culture. Because if I, let's say, use Linux, I am locking myself into the Linux kernel community to a certain degree. And if that's not a culture that I want to be associated with, or if I don't feel comfortable with that culture, I'm now locked into an operating system that I don't feel I can contribute to. Then, is Linux open source if it's not accessible to me as a developer?

If I do not feel that I can contribute to that project successfully, for various reasons, is it actually open source? And is my ability to engage with that project and work with it really any better, or is it actually worse, than working with something like macOS? Or even Windows, where I can maybe build a business relationship with Microsoft or Apple that is non-discriminatory? These are really interesting questions that I've been asking myself recently. And I think it relates a lot to lock-in, because as soon as I choose a technology, I'm choosing the people that build it.

Jeremy: Yep. Yeah. No, I think that's a great point. I remember seeing, not that long ago, someone who posted on Twitter that, I think it was some SQL group or something like that, knowingly refused to address them by their proper pronouns, even though they knew what they were and knew that that's what that person preferred, and just ignored that fact. And I think it's little things like that that push people away. And again, it's hurtful, it's hateful, it's disgusting. And those things bother me too. So keep fighting the good fight on that. I'll do whatever I can to be as open and welcoming to these communities as possible. It's just one of the things I like about the serverless community: I feel like it has been very, very open and welcoming, and hopefully a safe space to be. So hopefully we can make more of these tech communities like that.

So anyways, so Erica, thanks again for joining me and giving me all of this insight into observability, as well as into this open source stuff. It is a lot of things that we need to be thinking about in 2020, that I think people have ignored for too long. So I appreciate your voice on this. So if people want to get a hold of you or find out more about what you're doing at New Relic, how do they do that?

Erica: Well, I have a Twitter. It's not only technology though. And I guess you could email me if you want to, personally at erica@windisch.us, or professionally at ewindisch@newrelic.com. If you want to reach out directly, find me on Twitter. Yeah, I guess those are the main places.

Jeremy: All right. And then newrelic.com, if you want to check out all the stuff they're doing with serverless there, right?

Erica: Yeah.

Jeremy: Awesome. All right. Well, I will put all that into the show notes. Thanks again, Erica. Appreciate you being here.

Erica: Great. Thank you.

This episode is sponsored by Amazon Web Services: Check out the How to Use Objects in Amazon S3 to Trigger Automated Workflows Using AWS Lambda Learning Series.