Serverless Horror Stories

In this article, we summarize some real-life horror stories that illustrate the potential problems of serverless in production — and how to avoid them.

Oct 8th, 2020 11:44am by Emrah Samdan

Feature image via Pixabay.

Emrah Samdan

Emrah is VP of Product at Thundra. He is enthusiastic about serverless, observability and chaos engineering.

Serverless is a powerful paradigm that lets application developers focus on business logic rather than scalability or server maintenance. It is important, however, to understand the underlying serverless characteristics that can have a devastating impact on performance or budget, or both.

In this article, we summarize some real-life horror stories that illustrate the potential problems of serverless in production — and how to avoid them. For a more detailed description of these case studies, download our white paper.

An Expensive Mistake

When you pay per server instance, the cost is constant whether the server is overloaded or idle. If overloaded, it may crash, but that won’t increase the bill.

Amazon Web Services’ Lambda functions, on the other hand, automatically scale to accommodate the workload. This is generally good for application resilience and sustaining sudden spikes, but it has to be monitored carefully, as we learned in Kevin Vandenborne’s article “Serverless: A lesson learned. The hard way.”

One morning, Kevin received an AWS notification about a detected infrastructure-cost overrun, from a budgeted $5.00 to a forecasted $83.28. By the time he logged into the AWS console, the actual balance was already at $206.14. A short investigation revealed a simple bug in his S3 bucket configuration that had created an infinite invocation loop and a larger-than-expected AWS bill.
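The exact misconfiguration is not detailed here, but a common way for such a loop to arise is a function that writes its output back to the same bucket location that triggers it. The following is a minimal sketch of one guard against that pattern, assuming the function reads from one prefix and writes to another. The prefix names `uploads/` and `processed/` and the helper names are hypothetical, and the handler only returns the keys it would write instead of calling S3.

```python
# Hypothetical prefixes for illustration; they are not taken from Kevin's setup.
INPUT_PREFIX = "uploads/"
OUTPUT_PREFIX = "processed/"


def should_process(key: str) -> bool:
    """Accept only objects under the input prefix, never our own output."""
    return key.startswith(INPUT_PREFIX) and not key.startswith(OUTPUT_PREFIX)


def output_key_for(key: str) -> str:
    """Map an input key to the corresponding key under the output prefix."""
    return OUTPUT_PREFIX + key[len(INPUT_PREFIX):]


def handler(event, context=None):
    """Return the output keys this invocation would write.

    A real function would upload the processed object here; the guard is
    what keeps that upload from triggering the function again.
    """
    written = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if not should_process(key):
            # Skip anything outside the input prefix, including our output.
            continue
        written.append(output_key_for(key))
    return written
```

Restricting the S3 event notification itself to the input prefix, or writing results to a separate bucket, addresses the same problem at the configuration level. The in-code check is only a second line of defense.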

This story teaches us the importance of carefully monitoring Lambda function workloads for unexpected traffic volumes that will drive up costs and make your cloud bill highly unpredictable. To this end, Thundra features active anomaly detection supported with insights modification to warn you when you have unexpected traffic.

Choose the Right Use Case

Serverless is brilliant when used for applications with varying workloads or unpredictable usage peaks, since serverless services scale up and down as needed, depending on the size of the workload. Thus, you spend more money to cover the peaks, but save money when the API is not being used.

However, Einar Egilsson discovered the hard way that for consistently heavy workloads, a serverless architecture can be slower and more expensive than provisioning a server or cluster of servers to handle the load. In a post titled “Serverless: 15% slower and 8x more expensive,” he described how a proof-of-concept (POC) migration of his company’s API layer from Linux-based servers to an AWS Lambda/API Gateway architecture resulted in 15% slower performance and eight times the cost.

Because Lambda functions run on a high abstraction layer, they tend to be somewhat slower than a handcrafted, optimized implementation. The dramatic difference in cost is consistent with the fact that the company’s API server handles a heavy and steady workload of ~10 million requests per day. For workloads like this, API Gateway can be even more expensive than the Lambda functions themselves.
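To make the cost argument concrete, here is a rough back-of-the-envelope sketch for a workload of 10 million requests per day. Every price and workload figure other than the request volume is an illustrative assumption, not a number from Einar's post: $3.50 per million API Gateway requests, $0.20 per million Lambda requests, $0.0000166667 per GB-second, 100 ms average duration and 512 MB of memory. Free tiers and volume discounts are ignored, so check current AWS pricing before relying on any of it.

```python
def monthly_request_cost(requests_per_day, price_per_million, days=30):
    """Cost of a per-request charge over one month."""
    return requests_per_day * days / 1_000_000 * price_per_million


def monthly_compute_cost(requests_per_day, avg_duration_s, memory_gb,
                         price_per_gb_second, days=30):
    """Cost of the Lambda duration charge (GB-seconds) over one month."""
    gb_seconds = requests_per_day * days * avg_duration_s * memory_gb
    return gb_seconds * price_per_gb_second


REQUESTS_PER_DAY = 10_000_000  # from the article

# Assumed list prices and workload shape (illustrative only).
gateway = monthly_request_cost(REQUESTS_PER_DAY, price_per_million=3.50)
lambda_requests = monthly_request_cost(REQUESTS_PER_DAY, price_per_million=0.20)
lambda_compute = monthly_compute_cost(
    REQUESTS_PER_DAY,
    avg_duration_s=0.1,
    memory_gb=0.5,
    price_per_gb_second=0.0000166667,
)

print(f"API Gateway:     ${gateway:,.2f}")
print(f"Lambda requests: ${lambda_requests:,.2f}")
print(f"Lambda compute:  ${lambda_compute:,.2f}")
```

Under these assumptions the gateway charge comes to roughly $1,050 per month, against roughly $310 for Lambda requests and compute combined. That matches the observation that API Gateway can dominate the bill for steady, heavy traffic.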

The Million-Dollar Engineering Problem

One of the well-known challenges of event-driven serverless architectures is the difficulty of tracing requests end to end when investigating performance issues.

A case in point is Segment, which ran into a problem with a popular serverless service, DynamoDB. Segment was experiencing a serious performance issue with its DynamoDB tables that was slowing down its entire system. To mitigate this, the company had to increase the provisioned throughput of the tables, which in turn vastly increased its AWS bill.

When Segment’s own troubleshooting efforts failed to uncover the problem, they asked AWS support for help. Using internal tools, AWS generated a partition heatmap of the DynamoDB tables. Although the heatmap was not very readable, it revealed a single partition that was having performance issues, clearly indicating that the workload was not distributed evenly across partitions.

It was still not clear, however, which records or keys were problematic. So Segment continued to investigate and found a relatively trivial bug that was very hard to spot but which, when fixed, reduced their provisioned DynamoDB capacity by a factor of four and saved them $300,000 annually.

The investigation likely would have taken much less time if they had been able to inspect the individual messages behind some of the requests. Thundra’s trace map shows the complete journey of each request and lets you notice such issues in a few clicks.
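As a rough illustration of how per-key request counts can expose an uneven workload, the sketch below tallies the partition keys seen in a sample of requests and reports any key that takes more than a chosen share of the traffic. This is not the tooling that Segment or AWS used. The function name `hot_keys`, the sample keys and the 50% threshold are all assumptions made for the example.

```python
from collections import Counter


def hot_keys(partition_keys, threshold=0.5):
    """Return (key, share) pairs whose share of requests is >= threshold.

    partition_keys: iterable of the partition key used by each request,
    for example taken from application logs or trace data.
    """
    counts = Counter(partition_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (key, count / total)
        for key, count in counts.most_common()
        if count / total >= threshold
    ]


# Hypothetical sample: one key receives 8 of 10 requests.
sample = ["tenant-1"] * 8 + ["tenant-2", "tenant-3"]
for key, share in hot_keys(sample):
    print(f"{key}: {share:.0%} of requests")
```

A heatmap shows that one partition is hot. A per-key count like this one is a way to see which key is responsible.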

The Queue of Death

A very similar story happened to Solita, which identified problems with its SQS and Lambda setup. After detailed testing, Solita noticed that some of the messages from the SQS queue were not being processed by Lambda functions and ended up in the dead letter queue.

Solita was baffled by the situation: The messages were valid, there were no errors in the application logs, and the issue apparently occurred only under a heavy load. Finally, they noticed that the invalid behavior occurred only when Lambda throttling was taking place at the same time.

After digging through the AWS documentation, they identified the root cause as a combination of SQS and Lambda configuration settings that caused messages to be rejected multiple times because of Lambda throttling. As a result, those messages could not be processed and ended up in the dead letter queue.
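The specific settings are not named here, but one interaction that can produce this behavior is a low `maxReceiveCount` in the queue's redrive policy combined with throttling: each delivery attempt counts as a receive, even when the function never runs. The sketch below is a deliberately simplified model of a single message, not a simulation of the real SQS poller, and the function name `final_destination` is hypothetical.

```python
def final_destination(max_receive_count: int, throttled_attempts: int) -> str:
    """Model one message that is throttled on its first N delivery attempts.

    Simplifying assumption: the message gets max_receive_count delivery
    attempts before the redrive policy moves it to the dead letter queue,
    and every attempt counts, including throttled ones.
    """
    receive_count = 0
    while receive_count < max_receive_count:
        receive_count += 1
        if receive_count > throttled_attempts:
            # This attempt was not throttled, so the function ran.
            return "processed"
    return "dead letter queue"


# A valid message, throttled twice, under two redrive settings.
print(final_destination(max_receive_count=2, throttled_attempts=2))
print(final_destination(max_receive_count=5, throttled_attempts=2))
```

In this model the same valid message is lost with a receive limit of 2 and processed with a limit of 5, without any application error being logged. That is consistent with Solita seeing the problem only under heavy load.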

Avoiding Serverless Production Horrors

Both the Solita and the Segment stories underscore how important it is to test serverless applications and catch potential issues — some of which can be quite subtle — before they become horror stories in the production environment.

Other ways to ensure smooth serverless deployments include:

Know thy services: Although serverless simplifies infrastructure management, it is not plug-and-play. You need to understand the configuration details of the serverless services that you use and how they work under the hood. In addition, you must be aware of the impact of the pay-per-use model, so that you can control infrastructure costs.
Monitoring and alerting: Serverless is a relatively new paradigm, with many asynchronous events and parallel executions, plus a high abstraction layer that stymies visibility. You have to make sure that you know how to track the state of your services, how to identify performance issues, and how to spot execution problems — with alerts to help prevent unexpected infrastructure cost overruns.
Next-generation tooling: Specialized AI-driven platforms like Thundra provide the real-time visualization and tracing that are essential for monitoring, debugging, troubleshooting and securing serverless applications.
