CloudFront cache problems and how to solve them
How a misconfigured cache can lead to errors and security problems
CloudFront is a global CDN that sits between the visitors and the backend servers, called origins. It speeds up connections and also offers edge caching. Caching, especially when it happens close to the users, can provide an enormous speedup over non-cached requests. But caching also complicates the architecture and can easily introduce new problems.
In this article, we'll look at the most common problems with CloudFront caching, what causes them, and how to solve them.
The examples use an environment that you can reproduce in your own account with Terraform. Clone the code from the GitHub repo and deploy the stack:
terraform init
terraform apply
export DOMAIN=$(terraform output -raw domain)
Don't forget to delete the resources after you are done:
terraform destroy
Stale objects
By far the most common problem is an object that is returned from the cache when it shouldn't be. This is also easy to spot: users see outdated content, and even refreshing the page won't solve it.
To test it, the origin returns the time of the request:
$ curl -s https://$DOMAIN | jq
{
...
"time": "22/Sep/2021:07:41:42 +0000"
}
A later request returns the same timestamp, indicating that the response came from the CloudFront cache and the request did not reach the origin:
$ curl -s https://$DOMAIN | jq
{
...
"time": "22/Sep/2021:07:41:42 +0000"
}
Cause
The problem is usually in the cache TTL settings. Objects that are cached when they shouldn't be, or cached for longer than appropriate, get "stuck" in the cache.
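To make the effect of the TTL concrete, here is a minimal sketch of a time-based cache. It is a toy model and not how CloudFront is implemented; the `TtlCache` class and its parameters are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Toy edge cache: keeps one response per key for ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, fetch_from_origin: Callable[[], str]) -> str:
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            # Still within the TTL: the origin is not contacted,
            # even if the content has changed there.
            return entry[1]
        response = fetch_from_origin()
        self._store[key] = (now, response)
        return response
```

With a long TTL, every request inside that window gets the stored response, which is exactly the repeated timestamp seen in the example.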
How to solve it
Usually, objects fall into one of four categories with regard to caching:
- Per-user dynamic requests
- Anonymous dynamic requests
- Non-revved static files
- Revved static files
Per-user dynamic requests are unique to a user, such as the notifications you have on Facebook or the emails on Gmail. These are usually not cacheable and each request needs to go to the backend.
Anonymous dynamic requests are dynamic but user-independent. For example, a news portal shows the same breaking news to all users. These are dynamic requests, as the content comes from a database of some sort, but the response is the same for all visitors. Caching these requests for a short amount of time, say 1-5 seconds, greatly speeds up the webapp with a minimal impact on the freshness of the content.
Revving means that the file's name contains its hash. This is when WebPack renames app.js to app.d587bbd6e38337f5accd.js. This is implemented in most bundlers and it's a best practice to take advantage of it.
Non-revved static files, for example index.html and favicon.ico, should either not be cached at all or be cached only for a very short time. The problem with caching these assets surfaces only when you change them, such as when you deploy a new version of a web application. Since that is a (relatively) rare event and it usually "works on my machine", it leads to hard-to-notice problems.
Revved static files, on the other hand, can (and should) be cached for a very long time.
To solve stale caching, first identify which category the object falls into and set the cache time accordingly.
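As a rough illustration of the four categories, the sketch below picks a Cache-Control value per category. The function name, the hash-detecting regex, and the concrete header values are assumptions chosen for this example; it also assumes the distribution is configured to honor the origin's Cache-Control header.

```python
import re

# Assumption: a revved file name contains a hex hash of at least
# 8 characters between two dots, like app.d587bbd6e38337f5accd.js
REVVED = re.compile(r"\.[0-9a-f]{8,}\.")


def cache_control(path: str, dynamic: bool = False,
                  per_user: bool = False) -> str:
    """Return an illustrative Cache-Control value for an object."""
    if per_user:
        # Per-user dynamic requests: never store in a shared cache.
        return "private, no-store"
    if dynamic:
        # Anonymous dynamic requests: a few seconds is enough.
        return "public, max-age=5"
    if REVVED.search(path):
        # Revved static files: the name changes when the content does.
        return "public, max-age=31536000, immutable"
    # Non-revved static files: revalidate on every use.
    return "no-cache"
```

For example, `cache_control("app.d587bbd6e38337f5accd.js")` yields the long-lived value, while `cache_control("index.html")` yields `no-cache`.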
Information disclosure
Misconfigured caching can lead to security problems too. When a user can access responses that were meant for another user, sensitive information can be exposed.
Let's say there is a secret value sent in the query string by one user. The backend receives the value and might return protected information:
$ curl -s "https://$DOMAIN/info_exposure/query?secret=value" | jq
{
...
"queryStringParameters": {
"secret": "value"
}
}
But when a different user who does not know the secret value makes a request, they get back the same response:
$ curl -s "https://$DOMAIN/info_exposure/query" | jq
{
...
"queryStringParameters": {
"secret": "value"
}
}
If this affects cookies, then it opens an easy path to impersonation attacks:
$ curl -s "https://$DOMAIN/info_exposure/session" -H "Cookie: sessionid=id" | jq
{
...
"cookies": [
"sessionid=id"
]
}
$ curl -s "https://$DOMAIN/info_exposure/session" | jq
{
...
"cookies": [
"sessionid=id"
]
}
Cause
To understand what's causing this problem we need to first talk about two concepts: the cache key and the origin request.
Cache key
Caching uses cache keys to determine if a request is cached or not. The cache key is calculated from the request and the configuration determines which parts of the request are included. For example, the cache key might contain the query parameters but no headers.
When CloudFront receives a request it calculates the cache key. Then it checks its caches to see if the cache key has a response. If there is a stored response, it returns that. If there is no such value, it forwards the request to the origin and stores the response.
What is included in the cache key determines what requests are considered the same and which ones are considered different.
CloudFront supports Cache Policies that control what (query parameters, headers, cookies) goes into the cache key calculation. This allows fine-grained control over how the edge caching works.
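The idea of a configurable cache key can be sketched as follows. This is a simplified model, not CloudFront's actual algorithm; `KeyConfig` and `cache_key` are hypothetical names, and header names are assumed to be lowercased already.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class KeyConfig:
    """Which request elements take part in the cache key."""
    query_params: FrozenSet[str] = frozenset()
    headers: FrozenSet[str] = frozenset()
    cookies: FrozenSet[str] = frozenset()


def cache_key(path: str, query: Dict[str, str], headers: Dict[str, str],
              cookies: Dict[str, str], config: KeyConfig) -> Tuple:
    """Build a key from the path plus only the configured elements."""
    def pick(values: Dict[str, str], allowed: FrozenSet[str]) -> Tuple:
        # Sort so the order of parameters does not change the key.
        return tuple(sorted((k, v) for k, v in values.items()
                            if k in allowed))

    return (path,
            pick(query, config.query_params),
            pick(headers, config.headers),
            pick(cookies, config.cookies))
```

Two requests that differ only in elements left out of the config produce the same key, so the second one is served from the cache.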
Origin request
CloudFront is a proxy but that does not mean that requests are passing through it without modifications. CloudFront can remove query parameters and cookies, and remove and add headers.
With the new Origin Request Policy, you can set up which elements of the request are forwarded to the origin. This allows fine-grained control over what is available on the origin.
When the cache key and the origin request config both ignore or include a value then there is no potential security issue. But when an element is forwarded but ignored by the cache, there can be problems.
When one user provides a value and that value is missing from the cache key, then a later request without that value gets back the same response. If the backend makes a decision based on that value (such as a session ID or a token) then the second user might gain access to information meant only for the first user.
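The mismatch can be reproduced with a small simulation. The sketch below is a toy model with hypothetical names (`EdgeCache`, `origin`); it assumes, as CloudFront does, that everything in the cache key is also forwarded to the origin.

```python
from typing import Dict, FrozenSet, Tuple


def origin(path: str, cookies: Dict[str, str]) -> str:
    """Hypothetical backend that answers based on the session cookie."""
    return "session=" + cookies.get("sessionid", "anonymous")


class EdgeCache:
    def __init__(self, key_cookies: FrozenSet[str],
                 forwarded_cookies: FrozenSet[str]):
        self.key_cookies = key_cookies
        # Cache key elements are always forwarded as well.
        self.forwarded_cookies = key_cookies | forwarded_cookies
        self._store: Dict[Tuple, str] = {}

    def handle(self, path: str, cookies: Dict[str, str]) -> str:
        key = (path, tuple(sorted((k, v) for k, v in cookies.items()
                                  if k in self.key_cookies)))
        if key in self._store:
            return self._store[key]
        forwarded = {k: v for k, v in cookies.items()
                     if k in self.forwarded_cookies}
        response = origin(path, forwarded)
        self._store[key] = response
        return response


# The cookie is forwarded but not part of the key: the response leaks.
leaky = EdgeCache(frozenset(), frozenset({"sessionid"}))
first = leaky.handle("/info_exposure/session", {"sessionid": "id"})
second = leaky.handle("/info_exposure/session", {})
```

In this model `second` equals `first`, so the user without a cookie receives the response generated for the session holder. When the cookie is also part of the key, the two requests map to different cache entries.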
How to solve it
As the cause of the problem is that the origin request includes more elements than the cache key, the solution is to bring the two in line: either add everything the Origin Request Policy forwards to the cache key in the Cache Policy, or stop forwarding the elements that are not part of the cache key.
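A simple consistency check along these lines could look like the sketch below. The dictionary layout and the function name are assumptions made for this example and do not mirror the CloudFront API; treat each reported element as something to review rather than as a confirmed vulnerability.

```python
from typing import Dict, Iterable, List, Set

KINDS = ("query_params", "headers", "cookies")


def _names(kind: str, names: Iterable[str]) -> Set[str]:
    # Header names are case-insensitive; the others are not.
    return {n.lower() for n in names} if kind == "headers" else set(names)


def unkeyed_forwarded(cache_key: Dict[str, Iterable[str]],
                      origin_request: Dict[str, Iterable[str]]
                      ) -> Dict[str, List[str]]:
    """Return elements that reach the origin but are not in the cache key."""
    result: Dict[str, List[str]] = {}
    for kind in KINDS:
        forwarded = _names(kind, origin_request.get(kind, ()))
        keyed = _names(kind, cache_key.get(kind, ()))
        extra = forwarded - keyed
        if extra:
            result[kind] = sorted(extra)
    return result
```

An empty result means the origin never sees anything that the cache ignores.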
Missing values
Since CloudFront includes or omits the request elements sent to the origin, a misconfigured Distribution might remove things that the origin needs to process the request.
For example, this request includes a query parameter and a cookie, but neither of them is visible on the backend:
$ curl -s "https://$DOMAIN/missing_values/v?a=b" -H "Cookie: a=b" | jq
{
"headers": {
...
},
"time": "22/Sep/2021:07:46:52 +0000"
}
Cause
In a CloudFront Distribution, each Cache Behavior configures which elements of the request it forwards to the origin.
The "legacy cache settings" allow only some control over the cache key and the origin requests. The newer, policy-based approach allows more fine-grained control, but I have yet to find a use-case for this more complicated setup.
How to solve it
In this case, the query parameters and the cookies are not forwarded, so it's easy to fix: change the Cache Behavior to include them in the origin request.
But deciding on what to include and what to omit is usually not straightforward and requires knowledge of what the backend needs. Including too much leads to inefficient caching. Omitting something that should be included leads to bugs.
As a rule of thumb, distinguish between two types of requests:
- Static files
- Dynamic requests
For static files, don't include any query parameters, headers, or cookies, and add only the things that you know you'll need.
For dynamic requests, start with including all query parameters and cookies, and no headers. If you use token-based authorization (such as JWT) then add the Authorization header.
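The rule of thumb can be written down as a starting point in code. The function name and the shape of the returned settings are hypothetical and only summarize the two cases; they are not CloudFront configuration syntax.

```python
from typing import Dict, List, Union

Settings = Dict[str, Union[str, List[str]]]


def starting_forwarding(kind: str, token_auth: bool = False) -> Settings:
    """Initial forwarding settings for a Cache Behavior, to be refined."""
    if kind == "static":
        # Nothing by default; add elements only when you know
        # the origin needs them.
        return {"query_params": "none", "cookies": "none", "headers": []}
    if kind == "dynamic":
        headers = ["Authorization"] if token_auth else []
        return {"query_params": "all", "cookies": "all", "headers": headers}
    raise ValueError("kind must be 'static' or 'dynamic'")
```

From there, tighten the dynamic settings once you know which parameters and cookies the backend actually reads.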
Overwritten Host header
There is a specific header that can cause problems for many backend services: the Host header. For example, API Gateway requires that the Host matches the endpoint URL and will return a Forbidden error otherwise.
The Cache Behavior for this path does not forward the Host header to the origin:
$ curl -s "https://$DOMAIN/" | jq
{
"headers": {
"host": "sp9waenbfc.execute-api.eu-central-1.amazonaws.com",
...
},
...
}
But the one for /headers does, and the origin gets the Host that matches the CloudFront distribution and not the origin domain:
$ curl -s "https://$DOMAIN/headers" | jq
{
"headers": {
"Host": "d3ukyl73fwibnn.cloudfront.net",
...
}
}
In this example, the origin is the httpbin.org site that has no restrictions on the Host value. But API Gateway won't work with this config. This is a surprisingly common error that comes up frequently with AWS-based origins.
Cause
The Cache Behavior is set up to forward all headers, and that includes the Host. That means CloudFront uses the value from the request the browser sent to CloudFront, which contains the domain of the Distribution. On the other end, the origin expects this value to be the domain of the origin.
How to solve it
If CloudFront is configured not to forward the Host header it will use the origin's host. That means the request will be like the one sent directly to the origin and not through a proxy.
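The decision can be summarized in a few lines. This is a sketch of the behavior described above, not CloudFront code; the function name is hypothetical, and using "*" to mean "forward all headers" is a convention assumed for this example.

```python
from typing import Iterable


def origin_host(viewer_host: str, origin_domain: str,
                forwarded_headers: Iterable[str]) -> str:
    """Return the Host value the origin receives."""
    forwarded = {h.lower() for h in forwarded_headers}
    if "host" in forwarded or "*" in forwarded:
        # Host is forwarded: the origin sees the distribution's domain.
        return viewer_host
    # Host is not forwarded: the origin's own domain is used.
    return origin_domain
```

With the domains from the example, forwarding all headers gives the origin `d3ukyl73fwibnn.cloudfront.net`, while forwarding none gives it `sp9waenbfc.execute-api.eu-central-1.amazonaws.com`.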
Conclusion
Configuring caching requires a careful analysis of the backends. You'll need to consider which elements of the requests the origins need and which you can safely omit. You also need to consider the cache TTLs and how much staleness is tolerable for each type of object.