Fast Cloudfront log queries using AWS Athena and Serverless

Ben Hoyt · Compass True North · Dec 2, 2019


Recently we’ve needed to run queries on our Cloudfront web logs, for example, to determine the number of Googlebot hits to certain page types. We configured Cloudfront to write logs to an S3 bucket, and we set up an AWS Athena table to query those logs.

(By the way, if you haven’t used Athena before, I think it’s nothing short of amazing — the ability to use SQL to query fields from a bunch of gzipped text files sitting in S3? That’s incredible.)

But over time our queries became really slow. For every query, Athena had to scan the entire log history, reading through all the log files in our S3 bucket. This churned through a lot of data (about 120 GB) and made the queries slow and pretty expensive — taking over 65 seconds to run and costing about $0.70 per query.

So we did some searching and came across this AWS blog post, which describes how to set up Athena partitioning to vastly speed up date-based queries. We decided to use AWS Lambda to partition our logs, along with the Serverless framework to easily deploy the Lambda code.

Partitioning

Under the hood, Athena uses the Presto distributed SQL engine and Apache Hive, which can make use of directory-based partitioning of your data. If you arrange your files in a special directory layout, namely k1=v1/k2=v2/filename, then you can use k1 and k2 as filters in the WHERE clause, and Athena will only have to scan through the relevant files instead of everything.

For example, when querying web logs you almost always want to use a date range, querying for only the last 7 or 30 days. We can break up our logs into year, month, and day partitions and vastly reduce our search space. Our Cloudfront logs arrive at paths like this:

s3://bucket/raw/E1234567890123.2019-10-30-00.f22296ba.gz 

All we need is a program that moves the log files from the original directory to a partition directory, for example:

s3://bucket/partitioned/year=2019/month=10/day=30/E1234...gz

Now if we run a query that includes something like WHERE year = '2019' AND month = '10' it’ll only scan the necessary fraction of our log files. Originally a query might look like this (and take 69 seconds to run):

SELECT count(*) as total,
       count(CASE WHEN status = 301 THEN 1 ELSE NULL END) as n301,
       count(CASE WHEN status = 404 THEN 1 ELSE NULL END) as n404,
       count(CASE WHEN status = 500 THEN 1 ELSE NULL END) as n500
FROM "cloudfront_logs"."unpartitioned"
WHERE date >= DATE('2019-10-28') AND date <= DATE('2019-11-03') AND
      user_agent LIKE '%Googlebot%';

Now the same query needs year/month/day filtering and looks like this:

SELECT count(*) as total,
       count(CASE WHEN status = 301 THEN 1 ELSE NULL END) as n301,
       count(CASE WHEN status = 404 THEN 1 ELSE NULL END) as n404,
       count(CASE WHEN status = 500 THEN 1 ELSE NULL END) as n500
FROM "cloudfront_logs"."production_partitioned"
WHERE year = '2019' AND ((month = '10' AND day >= '28') OR
                         (month = '11' AND day <= '03')) AND
      user_agent LIKE '%Googlebot%';

With partitioning, the same query ran in 5.2 seconds and scanned only 6.3 GB. That’s a 13x improvement in speed and a 19x improvement in query cost.

The original AWS blog entry also adds an “hour” partition. We decided not to do this: the finest granularity we query at is a day, and results we’d seen elsewhere suggested that having too many partitions can itself bog queries down.

The Lambda function

The AWS blog post demonstrates a Lambda function written using the Node.js runtime, but as the rest of our code for this project was written in Python, we decided to use the Python runtime (Python 3.7).

The Lambda function is called whenever Cloudfront stores a new log file in the log bucket. The function determines the year/month/day partition values based on the filename, and moves the file to the correct place. We use a regex to match on the original Cloudfront log path:

# Example log path: raw/E1234567890123.2019-10-30-00.f22296ba.gz
FILENAME_RE = re.compile(
    r"^.*\.(\d{4})-(\d{2})-(\d{2})-(\d{2})\.\w+\.gz$"
)
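
For example, running the regex against the path above in a Python shell captures exactly the pieces we need (the handler uses the first three groups and discards the hour):

>>> m = FILENAME_RE.match("raw/E1234567890123.2019-10-30-00.f22296ba.gz")
>>> m.groups()
('2019', '10', '30', '00')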

Once the Lambda handler has matched the filename, it uses boto3 to copy the file to the partitioned destination:

s3_client.copy_object(
    CopySource={"Bucket": bucket, "Key": key},
    Bucket=dest_bucket,
    Key=dest_key,
)

and then deletes the original:

s3_client.delete_object(Bucket=bucket, Key=key)

Finally it runs an Athena ALTER TABLE query to add the partition to the Athena metadata store (if the partition doesn’t already exist):

ALTER TABLE {ATHENA_TABLE}
ADD IF NOT EXISTS PARTITION
(year = '{year}', month = '{month}', day = '{day}');

The full Lambda handler just ties these pieces together.
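
A condensed sketch of such a handler (the environment variable names, bucket, table, and Athena results location below are placeholders rather than our real values):

import os
import re
import urllib.parse

import boto3

# Placeholder configuration, set via environment variables in the
# Serverless config -- these are not our real names.
ATHENA_TABLE = os.environ["ATHENA_TABLE"]  # e.g. cloudfront_logs.production_partitioned
ATHENA_RESULTS_LOCATION = os.environ["ATHENA_RESULTS_LOCATION"]  # e.g. s3://my-log-bucket/athena-results/
DEST_BUCKET = os.environ["DEST_BUCKET"]

# Example log path: raw/E1234567890123.2019-10-30-00.f22296ba.gz
FILENAME_RE = re.compile(r"^.*\.(\d{4})-(\d{2})-(\d{2})-(\d{2})\.\w+\.gz$")

s3_client = boto3.client("s3")
athena_client = boto3.client("athena")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        match = FILENAME_RE.match(key)
        if not match:
            print(f"Skipping unrecognized key: {key}")
            continue
        year, month, day, _hour = match.groups()

        # Copy the log file to its partitioned location, then delete the original
        filename = key.rsplit("/", 1)[-1]
        dest_key = f"partitioned/year={year}/month={month}/day={day}/{filename}"
        s3_client.copy_object(
            CopySource={"Bucket": bucket, "Key": key},
            Bucket=DEST_BUCKET,
            Key=dest_key,
        )
        s3_client.delete_object(Bucket=bucket, Key=key)

        # Register the partition with Athena (a no-op if it already exists);
        # Athena needs an output location for query results, even for DDL
        athena_client.start_query_execution(
            QueryString=(
                f"ALTER TABLE {ATHENA_TABLE} "
                f"ADD IF NOT EXISTS PARTITION "
                f"(year = '{year}', month = '{month}', day = '{day}');"
            ),
            ResultConfiguration={"OutputLocation": ATHENA_RESULTS_LOCATION},
        )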

Serverless

We used the Serverless framework to make it easier to configure and deploy the Lambda function — once you’ve set up your serverless.yaml config file, it’s as simple as typing serverless deploy. (Or add --stage=production: we have separate Lambda functions and Cloudfront log buckets for staging and production.)

Our Serverless config hooks the Lambda function up to the S3 bucket’s object-created notifications and defines the IAM role statements (which I’ve truncated here for brevity).
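
A trimmed-down sketch of such a config (service, bucket, and environment variable values are placeholders, and most of the IAM statements are omitted):

service: cloudfront-log-partitioner

provider:
  name: aws
  runtime: python3.7
  stage: ${opt:stage, 'staging'}
  iamRoleStatements:
    # Truncated: the role also needs the Athena and Glue permissions
    # used by the ALTER TABLE ... ADD PARTITION query
    - Effect: Allow
      Action:
        - s3:GetObject
        - s3:PutObject
        - s3:DeleteObject
      Resource: "arn:aws:s3:::my-log-bucket/*"

functions:
  partitionLogs:
    handler: handler.handler
    environment:
      ATHENA_TABLE: cloudfront_logs.production_partitioned
      ATHENA_RESULTS_LOCATION: s3://my-log-bucket/athena-results/
      DEST_BUCKET: my-log-bucket
    events:
      - s3:
          bucket: my-log-bucket
          event: s3:ObjectCreated:*
          rules:
            - prefix: raw/
            - suffix: .gz
          existing: true  # the log bucket already exists (Cloudfront writes to it)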

Creating the Athena table

Before turning on the Lambda, we also need to create the table and its partition configuration in Athena; the key piece is declaring the year, month, and day partition columns in the CREATE TABLE statement.
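
A CREATE TABLE along the lines of the standard Cloudfront table definition from the Athena docs, with the year/month/day partition columns added, looks something like this (the bucket name is a placeholder, and the column list should match your Cloudfront log format version):

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs.production_partitioned (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,
  result_type STRING,
  request_id STRING,
  host_header STRING,
  request_protocol STRING,
  request_bytes BIGINT,
  time_taken FLOAT,
  xforwarded_for STRING,
  ssl_protocol STRING,
  ssl_cipher STRING,
  response_result_type STRING,
  http_version STRING,
  fle_status STRING,
  fle_encrypted_fields INT
)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-log-bucket/partitioned/'
TBLPROPERTIES ('skip.header.line.count' = '2');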

Wrapping up

We decided to address the low-hanging fruit (partitioning) to get a significant increase in query speed for our web log queries. We reduced our average query time from 69 seconds to 5.2 seconds (more than 10x), and decreased cost by almost 20x.

There are more improvements we could make in future, for example changing the logs to Parquet format (a columnar, binary format) for more efficient reading. However, we decided that a more-than-10x improvement was a great start, and we can always optimize further in the future!

If you’re interested in this kind of work and other challenges, we’re hiring!
