AWS Open Source Blog

Using Amazon Managed Grafana to troubleshoot a serverless application

Amazon Managed Grafana is a fully managed service that makes visualizing and analyzing operational data at scale easier. Many customers choose Amazon Managed Grafana because of an existing investment in Grafana, its deep integration with vendors they might already be using, consolidation of metrics across environments, and powerful visualizations for both in-cloud and on-premises workloads.

In a previous article, we showed how to get started with Amazon Managed Grafana. In this blog post, we explain not only how to integrate Amazon CloudWatch logs and metrics with Amazon Managed Grafana, but also how to troubleshoot a serverless application built using Amazon API Gateway and AWS Lambda. We also show how to visualize, analyze, alarm, and notify on metrics and logs across multiple data sources, all from within Grafana. All source code for this blog post is available on GitHub.

Consider a serverless application that exhibits a spike of error responses, and you want to quickly search your logs to figure out the root cause. Without the proper observability tooling, this can be a time-consuming process. To troubleshoot an application issue, you can begin by graphing error metrics and then querying log data to narrow down the problem. Although this task can be accomplished natively in CloudWatch, many customers prefer to use Grafana because of its ability to uniquely visualize and combine insights from multiple open source, cloud, and third-party data sources without moving the data.

Grafana simplifies unified observability by integrating with CloudWatch metrics and logs, allowing you to build a dashboard to troubleshoot operational issues.

Scenario and sample application architecture

Our scenario is a web application that exposes a public API. This API occasionally produces errors for which we would like to determine the root cause. We will use Grafana to correlate application metrics and logs to identify the problem. To simulate this scenario, the sample application generates synthetic traffic (including errors), metrics, and logs.

Our sample application represents a simplified serverless architecture: a public backend API built using Lambda and API Gateway. We define our application using the AWS Serverless Application Model (SAM).

To generate logs and metrics for this application, we need to send some HTTP traffic to the API. We will use a traffic generator Lambda function to repeatedly invoke our API Gateway endpoint, randomly injecting a contrived request error via querystring:

GET https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/?error=1

These errors will cause exceptions to be raised in our event handler Lambda function and will register in both CloudWatch metrics and logs. This process runs for 15 minutes, which is the maximum duration for a single Lambda function invocation; however, you can run the traffic generator function as often as you like. A full 15-minute run will generate a sufficient volume of metrics and logs for this exercise.

Building and deploying the sample application

To build and deploy the sample application, you will need an Amazon Managed Grafana workspace already deployed. You also need Git, Docker, and the SAM CLI installed. Docker is required so that you can build the application without a full local installation of the specific Python version and dependencies I use. Start by cloning the repository:

git clone https://github.com/aws-samples/grafana-serverless-blog
cd grafana-serverless-blog

Build

Build the application using a Lambda-like Docker container:

sam build --use-container

Deploy

Deploy the application to your AWS account:

sam deploy --guided

This command will package and deploy your application to AWS, with a series of prompts:

  • Stack Name: The name of the stack to deploy to AWS CloudFormation. This should be unique to your account and region. We will use amg-blog throughout this sample.
  • AWS Region: The AWS region you want to deploy your app to.
  • Confirm changes before deploy: If set to yes, any change sets will be shown to you before execution for manual review. If set to no, the AWS SAM CLI will automatically deploy application changes.
  • Allow SAM CLI IAM role creation: Many AWS SAM templates, including this example, create the AWS IAM roles that the included AWS Lambda function(s) require to access AWS services. By default, these are scoped down to the minimum required permissions. To deploy an AWS CloudFormation stack that creates or modifies IAM roles, the CAPABILITY_IAM value for capabilities must be provided. If permission isn’t granted through this prompt, you must explicitly pass --capabilities CAPABILITY_IAM to the sam deploy command to deploy this example (an example command follows this list).
  • For the prompt HttpHandlerFunction may not have authorization defined, Is this okay? [y/N], AWS SAM is informing you that the sample application configures an API Gateway API without authorization. When you deploy the sample application, AWS SAM creates a publicly available URL. You can safely acknowledge this notification by answering Y to the prompt. For information about configuring authorization, read Controlling access to API Gateway APIs.
  • Save arguments to samconfig.toml: If set to yes, your choices will be saved to a configuration file inside the project, so that in the future you can just re-run sam deploy without parameters to deploy changes to the application.
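
If you decline IAM role creation at the prompt, a later non-guided deployment needs the capability passed explicitly, for example:

sam deploy --capabilities CAPABILITY_IAM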

Note the output value for the GenerateTrafficFunction name, which will be of the form amg-blog-GenerateTrafficFunction-1234567ABCDEF:

CloudFormation outputs from deployed stack
-------------------------------------------------------------------
Outputs
-------------------------------------------------------------------
Key                 GenerateTrafficFunction
Description         Generate Traffic Lambda function name
Value               amg-blog-GenerateTrafficFunction-13ACNF91BKYRY
-------------------------------------------------------------------

Generate traffic

Now we will generate traffic to the API Gateway endpoint, which will invoke the HttpHandlerFunction Lambda function repeatedly for 15 minutes.

Invoke the Lambda function using the AWS CLI:

aws lambda invoke --function-name <your-function-name> --invocation-type Event /dev/null

Replace <your-function-name> with the output value you noted in the sam deploy step above.

We use an asynchronous (event-based) invocation here because we don’t need the function’s return value, and we send the output to /dev/null because we don’t need to work with it in a file. If the AWS CLI opens the JSON response in a pager, you can safely exit it.
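
Because this is an asynchronous invocation, the command returns immediately with a small acknowledgment rather than the function’s result. You should see output similar to the following, where the 202 status code indicates that Lambda accepted the event for processing:

{
    "StatusCode": 202
}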

While the Lambda function runs to completion, we can switch to Grafana, create a CloudWatch data source, and begin exploring our application’s metrics and logs.

Creating a CloudWatch data source

You can add CloudWatch as a data source by using the AWS data source configuration option in the Grafana workspace console. This feature simplifies adding CloudWatch as a data source by discovering your existing CloudWatch accounts and managing the configuration of the authentication credentials required to access CloudWatch. You can use this method to set up authentication and add CloudWatch as a data source, or you can manually set up the data source and the necessary authentication credentials using the same method that you would on a self-managed Grafana server.

To use AWS data source configuration, first use the Amazon Managed Grafana console to enable service-managed IAM roles that grant the workspace the IAM policies necessary to read the CloudWatch resources in your account or across your organizational units. Then use the AMG workspace console to add CloudWatch as a data source. For more information, read AMG permissions and policies for AWS data sources and notification channels.

Using AWS data source configuration to add CloudWatch as a data source

Sign in to the Grafana workspace console using AWS SSO if necessary. In the left navigation bar in the Grafana workspace console, choose the AWS icon and then choose Data sources.

Select the CloudWatch service and then select the default Region that you want the CloudWatch data source to query from.

Select Add data source. CloudWatch will be shown in the provisioned data sources:

Now we are ready to begin exploring metrics and logs by creating a new dashboard.

Exploring metrics and logs

Creating a dashboard

Create a new dashboard by selecting the + sign in the left navigation bar and selecting Dashboard.


Now select Add new panel, select the CloudWatch data source, leave Query Mode set to CloudWatch Metrics, and select the appropriate values as shown in the screenshot below. This will visualize the API Gateway metrics from CloudWatch Metrics.


You may need to adjust the relative time range to show the metrics captured while the sample application generates data. A traffic graph similar to the following should be displayed:


Here, the x-axis is time and the y-axis represents the number of HTTP requests to API Gateway. Notice how the duration of the traffic matches the maximum 15-minute runtime of the traffic generator Lambda function invocation.
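
If you want to cross-check this data outside Grafana, you can query the same metric with the AWS CLI. The following is a sketch that assumes the sample’s HTTP API publishes the standard AWS/ApiGateway Count metric with the ApiId dimension; substitute your own API ID and a time range that covers the traffic run:

aws cloudwatch get-metric-statistics \
    --namespace AWS/ApiGateway \
    --metric-name Count \
    --dimensions Name=ApiId,Value=<your-api-id> \
    --statistics Sum \
    --period 60 \
    --start-time <start-time> \
    --end-time <end-time>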

Now that we’ve generated a significant amount of traffic, let’s explore metrics to find errors.

Finding errors in CloudWatch metrics

Let’s edit our traffic graph; this time, we’ll examine our Lambda invocation error rate. Change the Namespace to AWS/Lambda, the Metric Name to Errors, and the FunctionName to the HTTP handler Lambda function, similar to the following:


Apply these changes and you will get a graph of errors whose shape mirrors the traffic graph. The values will reflect about 5 percent of the total traffic.

Finding errors in CloudWatch logs

We can add an additional panel, this time setting CloudWatch Logs as our data source and specifying the log group for our HTTP handler function (amg-blog-HttpHandlerFunction-***). You can search and analyze log data using a specialized query syntax. For the query, use the following, and refresh the panel with Shift+Enter.

fields @timestamp, @message
| filter @message like /error/


You may be prompted to switch to table mode, which is what we want here. You can switch the visualization type to table mode at any time in the panel properties:

You should now get the Lambda function error logs displayed in the panel like the following:

Because we are using the AWS Lambda Powertools Python Logger facility, the entire log message is represented as a JSON object. By logging in this manner, we can then query on any of the nested fields in the JSON structure. Nested JSON is flattened using dot notation, allowing us to access any attribute within the JSON structure.
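
For example, when the handler logs the incoming event with logger.error(event) (shown later in this post), the resulting CloudWatch Logs entry is a single JSON object shaped roughly like the following (fields abbreviated and values illustrative; the exact structure depends on the Powertools version):

{
    "level": "ERROR",
    "message": {
        "rawQueryString": "error=1",
        "headers": {
            "user-agent": "Malicious Agent 1.0"
        }
    },
    "timestamp": "...",
    "service": "..."
}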

Here, we can write a query to inspect the distribution of HTTP User-Agent headers in our error messages. Perhaps there’s a pattern to our errors related to the HTTP client making the request.

Modify the query we created above as follows. This will display the number of errors, sorted by HTTP User-Agent:

fields @timestamp, @message
| filter @message like /error/
| stats count(*) as Count by `message.headers.user-agent`

Notice we use the flattened dot notation in backticks. We should get a suspicious-looking User-Agent along with a count of errors.

As shown in the image, the User-Agent named Malicious Agent 1.0 is responsible for 100 percent of our errors, and about 5 percent of our overall traffic. You might notice in generate_traffic/app.py that I deliberately induce this error rate using the following block of code:

async def fetch(url, session):
    try:
        # Set 5% of invocations to error out due to a "bad actor"
        if random.randrange(0, 20) == 0:
            async with session.get(
                url,
                params={"error": "1"},
                headers={"User-Agent": "Malicious Agent 1.0"},
            ) ...

The HTTP handler Lambda function will raise an exception when it finds error=1 in the querystring:

def lambda_handler(event, context):
    # Simulate exception on "malicious" input
    if event["rawQueryString"] == "error=1":
        logger.error(event)
        raise Exception("Malicious input detected")

Now that we know how to identify the source of our errors, how can we be proactively notified when they occur? That’s where Grafana alerts and notifications can help.

Creating Grafana alerts and notifications

We can use alerts to identify problems in a system when they occur, helping to minimize disruptions to our services. Alerts consist of two parts: Alert rules, which define conditions that are regularly evaluated by Grafana, and a Notification channel, which defines how the alert is delivered. When the conditions of an alert rule are met, Grafana notifies the channels configured for that alert.

Let’s configure an alert to send a notification when a certain error threshold is reached. To add a notification to an alert rule, we first must add and configure a notification channel. Navigate to Alerting in the left navigation menu and select Notification channels.

Select Add channel on the next screen.

If you have enabled service-managed permissions and included Amazon Simple Notification Service (Amazon SNS) as a notification channel for your workspace, you only need to provide the SNS topic Amazon Resource Name (ARN) when you create your notification channel. The SNS topic must be prefixed with grafana for notifications to successfully publish to the topic. If you use customer-managed permissions, the IAM role that you supply should include SNS publish permissions for your SNS topic.
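
If you don’t already have a suitable topic, you can create one and subscribe your email address with the AWS CLI; for example (the topic name below satisfies the grafana prefix requirement, and the email address is a placeholder):

aws sns create-topic --name grafana-topic
aws sns subscribe --topic-arn <topic-arn-from-previous-command> --protocol email --notification-endpoint you@example.com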

I’ve already configured an SNS topic called grafana-topic and subscribed my email address to it. Now you just need to supply your topic’s ARN to complete the setup:

Now we can configure alerts using this notification channel. Alerts are added and configured in the Alert Tab of any dashboard graph panel, letting us build and visualize an alert using existing queries.

Let’s add a graph panel to track Lambda invocation errors for our HTTP handler function. We’ll use this panel to create an alert for these errors. Create a panel using the Namespace AWS/Lambda, Metric Name Errors, and Stats Sum.

Ensure that the FunctionName is set to your HTTP handler function.

Switch to the Alert tab and select Create Alert.

Rules have three components:

  • Name: Enter a descriptive name. The name will be displayed in the alert rule list.
  • Evaluate every: Specify how often the scheduler should evaluate the alert rule. This is referred to as the evaluation interval.
  • For: Specify how long the query needs to violate the configured thresholds before the alert notification triggers.

Here, we are using a query condition to determine when the alert is activated. This condition means: the average value of query A over the last minute (from one minute ago until now) is above 1000. In other words, if there are more than 1000 errors per minute, trigger the alert and send a notification.

Save the panel.

Now we need to re-run our traffic generator, which will induce HTTP errors above the threshold we just set. Run the following AWS CLI command again:

aws lambda invoke --function-name <your-function-name> --invocation-type Event /dev/null

Again, replace the function name with your own. Wait a few minutes, and you should receive an email notification from SNS.

Conclusion

Amazon Managed Grafana is a powerful tool for analyzing your serverless application’s metrics and logs. In this article, we demonstrated how to deploy a sample serverless application and generate traffic to it. We then created a CloudWatch data source in Grafana and used it to explore the metrics and logs the application generated, discovered errors in those metrics and logs, and uncovered their root cause. Finally, we configured alerts and notifications for these errors.

When you’re done exploring this sample application, you may delete the resources you created in this blog to avoid ongoing charges. You can use the AWS CLI, the AWS Management Console, or the AWS APIs to delete the CloudFormation stack deployed by SAM. You can also delete the CloudWatch log groups for both Lambda functions to avoid incurring charges there as well.
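
For example, assuming the amg-blog stack name used earlier, commands similar to the following remove the stack and then the Lambda log groups (adjust the log group names to match your deployed function names):

aws cloudformation delete-stack --stack-name amg-blog
aws logs delete-log-group --log-group-name /aws/lambda/<your-http-handler-function-name>
aws logs delete-log-group --log-group-name /aws/lambda/<your-generate-traffic-function-name>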

Mark Richman

Mark is a Senior Solutions Architect focused on SMB customers in Florida. He is passionate about serverless technologies and has a strong application development and architecture background. He likes working on distributed systems and is excited to talk to customers about serverless architecture design. When not helping customers, Mark enjoys spending time with his family, playing guitar, and cooking.