Monitor Amazon ElastiCache for Redis (cluster mode disabled) read replica endpoints using AWS Lambda, Amazon Route 53, and Amazon SNS

In Amazon ElastiCache for Redis, your applications use the provided endpoints to connect to an ElastiCache node or cluster. According to Amazon ElastiCache for Redis Components and Features in the ElastiCache for Redis User Guide, a multiple-node Redis (cluster mode disabled) cluster has two kinds of endpoints:

“The primary endpoint always connects to the primary node in the cluster, even if the specific node in the primary role changes. Use the primary endpoint for all writes to the cluster. The read endpoint in a Redis (cluster mode disabled) cluster always points to a specific node. Whenever you add or remove a read replica, you must update the associated node endpoint in your application.”

The primary endpoint feature in ElastiCache for Redis provides consistency by always resolving to the current primary node. AWS customers greatly appreciate this feature:

“A useful feature of ElastiCache Redis is the availability of a primary endpoint, which always points to the current primary node for a cluster. This endpoint does not change, even when the cluster experiences a failover, which means that the application does not need to change when the primary node changes. This feature is particularly beneficial in the event of an auto-failover.” —An AWS customer

As a best practice and for workload balancing, you should direct read requests to read replicas. However, if a failover occurs, a previously used read replica could be promoted to the primary role. Continuing to direct read requests to the same endpoint can increase the load on the new primary (the old read replica). In such cases, it is useful to have a read replica endpoint that always points to a replica, even after a failover.

You can do this by setting up an AWS Lambda function that can monitor and update read replica endpoints. The idea is to create and use custom CNAMEs for each of the read replicas in a private zone in Amazon Route 53 and then use these CNAMEs in the Redis client.

If a failover occurs, a notification is pushed to an Amazon Simple Notification Service (Amazon SNS) topic. A Lambda function listening on this SNS topic then updates the CNAMEs with the appropriate read replicas’ endpoints. As a result, your Redis client always has CNAMEs pointing to the read replicas’ endpoints in addition to the normal ElastiCache primary endpoint for your write operations on the primary node.

This post describes the steps to create an AWS Lambda function that listens to Amazon SNS and updates the CNAMEs used for the read replicas of an Amazon ElastiCache for Redis cluster (cluster mode disabled).

Solution overview

The structure of this solution is as follows:

Client application

In this example, we use the ElastiCache primary endpoint for the writes. For the reads, we use five read replicas with custom CNAMEs:

readonly1.private.redisdub.pl.
readonly2.private.redisdub.pl.
readonly3.private.redisdub.pl.
readonly4.private.redisdub.pl.
readonly5.private.redisdub.pl.

Amazon ElastiCache

We assign a dedicated SNS topic to the ElastiCache cluster (cluster mode disabled), which has one primary node and five replicas, as shown in this example:

Amazon Route 53

We create a DNS private zone (that is, private.redisdub.pl), and in this zone we use the following CNAMEs:

readonly1.private.redisdub.pl — testdns-002.6advcy.0001.euw1.cache.amazonaws.com
readonly2.private.redisdub.pl — testdns-003.6advcy.0001.euw1.cache.amazonaws.com
readonly3.private.redisdub.pl — testdns-004.6advcy.0001.euw1.cache.amazonaws.com
readonly4.private.redisdub.pl — testdns-005.6advcy.0001.euw1.cache.amazonaws.com
readonly5.private.redisdub.pl — testdns-006.6advcy.0001.euw1.cache.amazonaws.com

AWS Identity and Access Management (IAM)

We create an IAM role for the Lambda function with the policies that it needs to run:

AmazonElastiCacheReadOnlyAccess
AWSLambdaBasicExecutionRole
RedisReplica_Route53 (custom policy for Route 53 because you only need two API calls)

AWS Lambda

The Lambda function listens on the cluster’s SNS topic. Each time there is an event in this SNS topic, the function detects if it’s a failover or if a read replica has been added or removed. If one of these three events occurs, the Lambda function runs an API call to get the current structure of the Redis cluster (elasticache.describe_replication_groups).

Based on the response, the function executes another API call to update or create CNAMEs in your Route 53 private zone (route53.change_resource_record_sets). If it’s a failover, it updates your existing “Read” CNAMEs. If it’s a node creation or deletion, it adds or removes a CNAME accordingly.
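The mapping step described above can be sketched in a few lines. This is only an illustration, not the function used in this post: the helper name plan_cnames, the two-node sample data, and the example.internal addresses are hypothetical, and sorting the node IDs is an assumption made here to keep the numbering predictable.

```python
def plan_cnames(nodes, cname_pattern):
    """Map numbered CNAMEs to the read endpoints of the current replicas.

    nodes: {node_id: {'role': 'primary' or 'replica', 'addr': endpoint}}
    cname_pattern: for example 'readonly.private.redisdub.pl.'
    """
    first_label, rest = cname_pattern.split('.', 1)
    # Assumption: sort node IDs so the numbering is stable between runs.
    replicas = sorted(n for n in nodes if nodes[n]['role'] == 'replica')
    plan = {}
    for num, node_id in enumerate(replicas, start=1):
        plan['{}{}.{}'.format(first_label, num, rest)] = nodes[node_id]['addr']
    return plan


# Hypothetical two-node cluster, before and after a failover swaps the roles.
before = {
    'testdns-001': {'role': 'primary', 'addr': 'testdns-001.example.internal'},
    'testdns-002': {'role': 'replica', 'addr': 'testdns-002.example.internal'},
}
after = {
    'testdns-001': {'role': 'replica', 'addr': 'testdns-001.example.internal'},
    'testdns-002': {'role': 'primary', 'addr': 'testdns-002.example.internal'},
}
print(plan_cnames(before, 'readonly.private.redisdub.pl.'))
print(plan_cnames(after, 'readonly.private.redisdub.pl.'))
```

In this sketch, the single read CNAME moves from testdns-002 to testdns-001 once the roles are swapped, which is the behavior the solution aims for.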

In this scenario, the application always initiates read operations against read replicas in addition to the writes being executed on the primary node via the primary endpoint.
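On the client side, the split between writes and reads might look like the following sketch. The EndpointRouter class and the primary endpoint name are hypothetical, and round-robin selection is just one possible way to spread reads across the CNAMEs. You would pass the returned host name to your Redis client of choice.

```python
import itertools


class EndpointRouter(object):
    """Route writes to the primary endpoint and spread reads over replica CNAMEs."""

    def __init__(self, primary, read_cnames):
        if not read_cnames:
            raise ValueError('at least one read CNAME is required')
        self.primary = primary
        self._reads = itertools.cycle(read_cnames)

    def for_write(self):
        # Writes always go to the ElastiCache primary endpoint.
        return self.primary

    def for_read(self):
        # Round-robin over the replica CNAMEs.
        return next(self._reads)


router = EndpointRouter(
    'my-primary-endpoint.example.com',  # placeholder for your primary endpoint
    ['readonly1.private.redisdub.pl', 'readonly2.private.redisdub.pl'])
print(router.for_read())
print(router.for_read())
print(router.for_write())
```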

Results and benchmark

In the following test, we run a cron job on five client instances, executing the following Redis benchmark command:

redis-benchmark -n 10000 -k 0 -h readonly1.private.redisdub.pl -p 6379

-n 10000 executes 10,000 requests.
-k 0 reconnects for each request.
-h readonly1.private.redisdub.pl connects to one of the CNAMEs created for the replicas.

Each of the five clients is targeting one unique CNAME:

readonly1.private.redisdub.pl
readonly2.private.redisdub.pl
readonly3.private.redisdub.pl
readonly4.private.redisdub.pl
readonly5.private.redisdub.pl

The following screenshot shows the Amazon CloudWatch metric NewConnections, in which the requests generated by the benchmark are equally distributed across the read replicas:

If we look closely at this CloudWatch metric, we can see a failover triggered at 16:00, with the primary testdns-001 failing over to testdns-002.

We can see testdns-002 receiving the requests from the benchmark. At 16:00, when the failover is triggered, the number of requests drops because the CNAME records have been updated. Then testdns-002 becomes the new primary and stops receiving requests through the CNAME for read operations readonly1.private.redisdub.pl.

Before the 16:00 failover:

Primary endpoint –> primary testdns-001

readonly1.private.redisdub.pl –> replica testdns-002

After the 16:00 failover:

Primary endpoint –> new primary testdns-002

readonly1.private.redisdub.pl –> replica testdns-001

As in a normal failover scenario, the previous primary node testdns-001 is replaced. We can see that as soon as it is up and running, it starts receiving requests from the benchmark because readonly1.private.redisdub.pl is now pointing to testdns-001:

Note that testdns-001 was potentially able to receive requests from 16:06, but the next execution of the benchmark was at 16:10. Thus, there is the flat brown line between 16:06 and 16:10.

Detailed steps to implement this tool

Step 1: Create a private zone in Route 53 for the virtual private cloud (VPC) where your Redis cluster and the clients are located.

For details, see Creating a Private Hosted Zone in the Amazon Route 53 Developer Guide.

In addition, you need to create the CNAMEs for your read replicas; for example:

read1.myredis.com — The endpoint of your current Redis replica Node1
read2.myredis.com — The endpoint of your current Redis replica Node2

You can create as many CNAMEs as there are existing read replicas.

After creating the CNAMEs, you can use them in your client application to validate them.
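If you prefer to script the initial records rather than create them on the console, a sketch along the following lines could work. The function name create_read_cnames, the RecordingClient stand-in, the placeholder zone ID, and the example.com targets are all hypothetical. The route53 argument is assumed to be a boto3 Route 53 client, whose change_resource_record_sets call is the same one the Lambda function in step 4 uses.

```python
def create_read_cnames(route53, zone_id, records, ttl=10):
    """UPSERT one CNAME per (name, target) pair in a single change batch.

    route53 is expected to be a boto3 Route 53 client, for example
    boto3.client('route53'). It is passed in so that the function can be
    tried without AWS credentials.
    """
    changes = [{
        'Action': 'UPSERT',
        'ResourceRecordSet': {
            'Name': name,
            'Type': 'CNAME',
            'TTL': ttl,
            'ResourceRecords': [{'Value': target}],
        },
    } for name, target in sorted(records.items())]
    return route53.change_resource_record_sets(
        HostedZoneId=zone_id, ChangeBatch={'Changes': changes})


class RecordingClient(object):
    """Stand-in for the boto3 client that only records the call (dry run)."""

    def __init__(self):
        self.calls = []

    def change_resource_record_sets(self, **kwargs):
        self.calls.append(kwargs)
        return {'ChangeInfo': {'Status': 'PENDING'}}  # simplified response


dry_run = RecordingClient()
result = create_read_cnames(dry_run, 'ZONE_ID_PLACEHOLDER', {
    'read1.myredis.com.': 'node1-endpoint.example.com',
    'read2.myredis.com.': 'node2-endpoint.example.com',
})
print(result['ChangeInfo']['Status'])
```

To apply the change for real, replace RecordingClient() with a boto3 Route 53 client and use your own zone ID and replica endpoints.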

Step 2: Add an SNS topic to your cluster.

For details, see Managing ElastiCache Amazon SNS Notifications in the Amazon ElastiCache for Redis User Guide.

Step 3: Create an IAM role for the Lambda function with the following policies:

AmazonElastiCacheReadOnlyAccess
AWSLambdaBasicExecutionRole

Create a new policy RedisReplica_Route53 with the following content:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1511707556511",
      "Action": [
        "route53:GetHostedZone",
        "route53:ChangeResourceRecordSets"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:route53:::hostedzone/Z32WVXIKNNRFKK"
    }
  ]
}
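The hosted zone ID in the Resource ARN above belongs to the example environment, so you need to substitute your own. As a small convenience, the following sketch generates the same policy document for a given zone ID. The function name route53_policy and the YOUR_ZONE_ID placeholder are hypothetical, and the optional Sid is omitted.

```python
import json


def route53_policy(zone_id):
    """Return the RedisReplica_Route53 policy document scoped to one hosted zone."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': [
                'route53:GetHostedZone',
                'route53:ChangeResourceRecordSets',
            ],
            'Resource': 'arn:aws:route53:::hostedzone/{}'.format(zone_id),
        }],
    }


# 'YOUR_ZONE_ID' is a placeholder; replace it with the ID of your private zone.
print(json.dumps(route53_policy('YOUR_ZONE_ID'), indent=2))
```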

To learn more about IAM policies and roles, see the IAM User Guide.

Step 4: Create a Lambda function using the following code.

On the AWS Lambda console, choose Create function, and then choose Author from scratch. Name your function RedisReplica-autocname.

In the Role list, choose the option Choose an existing role. Then choose the role that you created in step 3.

Choose Create function. You will see a page with three tabs: Configuration, Triggers, and Monitoring. On the Configuration tab, for Runtime, choose Python 2.7. Then copy the following code and paste it in the editor:

from __future__ import print_function

import boto3
import re
import json
import os

AWS_REGION = os.environ['aws_region']
CNAME = os.environ['cname']
ZONE_ID = os.environ['zone_id']
CLUSTER = os.environ['cluster']


def aws_session(role_arn=None, session_name='my_session'):
    """
    If role_arn is given, assume that role and return a boto3 session;
    otherwise return a regular session with the current IAM user/role
    """

    if role_arn:
        client = boto3.client('sts')
        response = client.assume_role(
            RoleArn=role_arn, RoleSessionName=session_name)
        session = boto3.Session(
            aws_access_key_id=response['Credentials']['AccessKeyId'],
            aws_secret_access_key=response['Credentials']['SecretAccessKey'],
            aws_session_token=response['Credentials']['SessionToken'])
        return session
    else:
        return boto3.Session()


def get_nodes(cluster, session):
    """
    return a dictionary of the nodes that make up a cluster (role and read endpoint per node)
    """

    elasticache = session.client('elasticache', region_name=AWS_REGION)
    repgroups = elasticache.describe_replication_groups()['ReplicationGroups']
    nodes = {}
    for repgroup in repgroups:
        if repgroup['ReplicationGroupId'] == cluster:
            for nodegrp in repgroup['NodeGroups']:
                for cachecluster in nodegrp['NodeGroupMembers']:
                    node_id = cachecluster['CacheClusterId']
                    nodes[node_id] = {
                        'role': cachecluster['CurrentRole'],
                        'addr': cachecluster['ReadEndpoint']['Address'],
                    }
    return(nodes)


def update_cname(nodes, cname, zone, session):
    """
    update CNAME entries from a dictionary of nodes.
    """

    route53 = session.client('route53')
    dzone = route53.get_hosted_zone(Id=zone)
    dzonedomain = dzone["HostedZone"]["Name"]

    # CNAME must be a valid sub-domain of the hosted zone
    if not re.match(r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{1,63})*\.' + re.escape(dzonedomain), cname):
        return('Error, cname {} does not match domain {}'.format(cname, dzonedomain))

    response = {}
    num = 1
    for node_name in nodes.keys():
        node = nodes[node_name]
        if node['role'] == 'replica':
            realcname = '.'.join(
                [i + str(num) if enum == 0 else i for enum, i in enumerate(cname.split('.'))])
            dns_changes = {
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': realcname,
                            'Type': 'CNAME',
                            'TTL': 10,
                            'ResourceRecords': [
                                {
                                  'Value': node['addr'],
                                }
                            ],
                        }
                    }
                ]
            }
            print(
                "DEBUG - Updating Route53 to create CNAME {} for {}".format(realcname, node['addr']))
            response[node_name] = route53.change_resource_record_sets(
                HostedZoneId=zone,
                ChangeBatch=dns_changes
            )
            num += 1
    return(response)


def lambda_handler(event, context):
    """
    Main lambda function
    Parse and check the event validity
    """

    msg = json.loads(event['Records'][0]['Sns']['Message'])
    msg_type = list(msg.keys())[0]  # list() keeps this working on Python 3 as well
    msg_event = msg_type.split(':')[1]
    msg_node = msg[msg_type]

    events = ['CacheNodeReplaceComplete', 'TestFailoverApiCalled',
              'FailoverComplete', 'CacheClusterProvisioningComplete']

    if msg_event not in events:
        print('Event {} is not valid for RedisReplica-autocname function'.format(msg_type))
        return
    else:
        print(
            'Event {} is valid, processing with RedisReplica-autocname...'.format(msg_type))

    session = aws_session()
    nodes = get_nodes(CLUSTER, session)

    if msg_node not in nodes:
        print('{} not a node of cluster {}'.format(msg_node, CLUSTER))
        return

    dnsupdate = update_cname(nodes, CNAME, ZONE_ID, session)

    # update_cname returns a dict on success and a string on error
    if isinstance(dnsupdate, str):
        print(dnsupdate)
        return

    for response in dnsupdate.items():
        print("DNS record {} R53 status is {}".format(
            response[0], response[1]['ChangeInfo']['Status']))
    return

For more information about creating a function, see Create a Lambda Function in the AWS Lambda Developer Guide.

The next step is to set up the four environment variables (variable name/value). They are key-value pairs and should be set up as follows:

cluster: The name of your Redis cluster.
zone_id: You can get this information in Route 53 when you choose the Private zone (the ID is visible in the right pane).
aws_region: The AWS Region of your Redis cluster.
cname: The CNAME structure that you want to use for your read replicas. For example, if you want to use read1.myredis.com, read2.myredis.com, and so on for your CNAMEs, enter “read.myredis.com.” (note the period (.) at the end of the CNAME). The number in the CNAME is incremented automatically when a new node is created and a record is associated with it.
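Because the function reads these four variables at import time, a typo or a missing trailing period only shows up when the function runs. The following sketch checks the values up front. The helper name check_config and the sample values are hypothetical, and it only covers the two rules stated above (all four variables present, and a trailing period on cname).

```python
REQUIRED = ('aws_region', 'cname', 'zone_id', 'cluster')


def check_config(env):
    """Return a list of problems found in the variables (empty if none)."""
    problems = ['missing variable: {}'.format(key)
                for key in REQUIRED if not env.get(key)]
    cname = env.get('cname', '')
    if cname and not cname.endswith('.'):
        problems.append('cname must end with a period: {}'.format(cname))
    return problems


# Sample values; inside Lambda you would pass os.environ instead.
sample = {'aws_region': 'eu-west-1', 'cname': 'read.myredis.com',
          'zone_id': 'YOUR_ZONE_ID', 'cluster': 'my-redis-cluster'}
for problem in check_config(sample):
    print(problem)
```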

On the same Configuration tab on the console, set the timeout for the function to limit how long the code is allowed to run. We recommend that you set this timeout to 15 seconds. (In our test, the execution time was about 3 seconds, but it might vary based on your environment.)

Finally, on the Triggers tab, choose SNS, and then choose the topic that is associated with your Redis cluster.

Step 5: Test your environment.

The final step is to test the environment by executing a manual failover. This failover triggers an event in the SNS topic, and the Lambda function detects the failover. Consequently, it collects the new primary/replicas mapping and updates the CNAMEs in your private zone.
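You can start the manual failover from the console or through the API. The following sketch wraps the ElastiCache TestFailover call as exposed by boto3 (test_failover). The wrapper name, the DryRunElastiCache stand-in, and the cluster name are hypothetical. The default node group ID 0001 is an assumption based on the endpoints shown earlier in this post, and the call generally requires automatic failover to be enabled on the replication group.

```python
def trigger_test_failover(elasticache, replication_group_id, node_group_id='0001'):
    """Ask ElastiCache to fail over the primary of the given node group.

    elasticache is expected to be boto3.client('elasticache', region_name=...).
    """
    return elasticache.test_failover(
        ReplicationGroupId=replication_group_id,
        NodeGroupId=node_group_id)


class DryRunElastiCache(object):
    """Stand-in client that records the request instead of calling AWS."""

    def __init__(self):
        self.requests = []

    def test_failover(self, **kwargs):
        self.requests.append(kwargs)
        # Simplified stand-in for the real response.
        return {'ReplicationGroup': {
            'ReplicationGroupId': kwargs['ReplicationGroupId']}}


client = DryRunElastiCache()  # swap for a boto3 ElastiCache client to run it for real
trigger_test_failover(client, 'my-redis-cluster')
print(client.requests)
```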

As soon as the Time to Live (TTL) has expired (15 seconds for a private hosted zone in Amazon EC2), the client instances pick up the new DNS records. They connect to the new read replica (old primary) to execute the read operations without requiring any further changes in your application.
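To confirm from a client instance that a CNAME has switched, you could poll DNS until the canonical name changes. The following is a minimal sketch using only the standard library. The function name wait_for_cname, the retry settings, and the example.internal names are hypothetical, and how the canonical name is reported depends on the resolver of the instance.

```python
import socket
import time


def wait_for_cname(name, expected_target, resolve=socket.gethostbyname_ex,
                   sleep=time.sleep, attempts=10, delay=5):
    """Poll DNS until `name` resolves to a canonical name that starts
    with `expected_target`. Returns True if that happens in time."""
    for attempt in range(attempts):
        try:
            canonical, _aliases, _addresses = resolve(name)
        except socket.error:
            canonical = ''  # treat a failed lookup as "not there yet"
        if canonical.startswith(expected_target):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Offline demonstration with a stand-in resolver; drop the `resolve` and
# `sleep` arguments to query real DNS from an instance inside the VPC.
answers = iter([('testdns-002.example.internal', [], ['10.0.0.2']),
                ('testdns-001.example.internal', [], ['10.0.0.1'])])
print(wait_for_cname('readonly1.private.redisdub.pl', 'testdns-001',
                     resolve=lambda name: next(answers),
                     sleep=lambda seconds: None))
```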

Another test is to add a new node in your replication group. The Lambda function automatically creates a new CNAME. If you already have read1.myredis.com and read2.myredis.com, it creates read3.myredis.com, and you can add this new CNAME in your application. The function always keeps the same number of CNAMEs as the number of read replicas. So if you remove a node, it removes a CNAME.

About the Lambda function

The Lambda function listens directly on the SNS topic and performs several checks to filter on the following:

The event-type
The cluster-id

Be aware that each message in the SNS topic triggers the Lambda function, but only relevant messages trigger an action. To avoid the extra cost of these unnecessary invocations, we recommend that you use a dedicated SNS topic for your Redis cluster.
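The two filters can be tried locally with a hand-built event. The following sketch assumes the message format that the handler in step 4 parses (a JSON object with a single "ElastiCache:EventName" key whose value is the node ID) and uses the same list of event names. The helper names make_sns_event and is_relevant are hypothetical, and the event built here is a minimal stand-in that omits most fields of a real SNS notification.

```python
import json

RELEVANT_EVENTS = ('CacheNodeReplaceComplete', 'TestFailoverApiCalled',
                   'FailoverComplete', 'CacheClusterProvisioningComplete')


def make_sns_event(event_name, node_id):
    """Build a minimal SNS-style Lambda event carrying an ElastiCache message."""
    message = json.dumps({'ElastiCache:{}'.format(event_name): node_id})
    return {'Records': [{'Sns': {'Message': message}}]}


def is_relevant(event, cluster_nodes):
    """Apply the two filters: event type first, then cluster membership."""
    msg = json.loads(event['Records'][0]['Sns']['Message'])
    msg_type, node_id = next(iter(msg.items()))
    event_name = msg_type.split(':')[1]
    return event_name in RELEVANT_EVENTS and node_id in cluster_nodes


nodes = ['testdns-001', 'testdns-002']
# A failover on a node of this cluster passes both filters.
print(is_relevant(make_sns_event('FailoverComplete', 'testdns-002'), nodes))
# A snapshot event is not in the list of handled event types.
print(is_relevant(make_sns_event('SnapshotComplete', 'testdns-002'), nodes))
# A failover on a node of another cluster is ignored.
print(is_relevant(make_sns_event('FailoverComplete', 'othercluster-001'), nodes))
```

The same make_sns_event output can be pasted into a test event on the Lambda console to exercise the handler without waiting for a real notification.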

Summary

The primary endpoint feature in ElastiCache for Redis provides consistency by always connecting to the primary node in the cluster, even if the specific node in the primary role changes. When you are directing read requests to read replicas, it can be useful to have a read replica endpoint that always points to a replica, even after a failover. In this post, we described how to build an AWS Lambda function that can monitor and update read replica endpoints. After you implement this process, your Redis client will have CNAMEs pointing to the read replicas’ endpoints in addition to the normal ElastiCache primary endpoint for your write operations on the primary node.

With the practical knowledge that you have gained throughout this post, you can re-use the structure of this solution for other projects. You can change the event-type used to filter the Lambda function execution and add your own code to execute in response to an ElastiCache event.

About the Authors

Yann Richard is an AWS Cloud Support Engineer and ElastiCache Subject Matter Expert. On a more personal side, his goal is to make data transit in less than 4 hours and run a marathon in sub-milliseconds, or the opposite.

Julien Prigent is a Linux Cloud Support Engineer for AWS. He likes to explore the limit of his stamina, whether it be a technical deep dive session or a long distance trail run.

AWS Database Blog