Edge cases in CloudFormation

Examples for when the declarative infrastructure management doesn't work that well

Tamás Sallai

CloudFormation's promise

CloudFormation is a declarative infrastructure management tool, meaning you describe what you want, upload it, and the service figures out what changes it needs to make so that reality matches the configuration.

This works well for some resources. For example, a Lambda function behind an API Gateway works as expected: it can be deployed, destroyed, updated, and rolled back without issues. Resources like these are the poster children of CloudFormation: "look how easy it is to describe your infrastructure! And CloudFormation will keep things in sync".

But most of my time is spent working around not-so-nice resource types. Let's see a couple of them!

Examples

S3 buckets

An S3 bucket that contains objects can't be deleted. The CDK works around this with the autoDeleteObjects field, but that deploys a Lambda function behind the scenes to delete the objects.

Without that, it usually just means that there will be some garbage left in the account.

DynamoDB GSI

You can't add more than one GSI to a DynamoDB table in a single update. There is a ticket for this, but it's unclear whether it will ever get resolved.

The problem here is that you update the develop environment many times, one change at a time, and it works. Then you try to update production and it fails because two of those changes each add a new GSI and they now arrive in the same update.
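
The usual way around this is to roll the indexes out one at a time. Below is a minimal sketch of that planning step. It assumes a hypothetical helper, planGsiDeployments, which is not part of the CDK or any AWS SDK, and it only handles additions (index removals are left out).

```typescript
// Hypothetical helper, not part of the CDK or any AWS SDK.
// Given the GSI names currently on the table and the names the template
// wants, return the index sets to deploy one after another so that each
// deployment adds exactly one new GSI. Only additions are considered.
function planGsiDeployments(current: string[], desired: string[]): string[][] {
  // Indexes that are wanted but do not exist on the table yet.
  const missing = desired.filter((name) => current.indexOf(name) === -1);

  const steps: string[][] = [];
  let deployed = current.slice();
  for (const name of missing) {
    // Each step is the previous state plus one new index.
    deployed = deployed.concat([name]);
    steps.push(deployed);
  }
  return steps;
}

// Example with made-up index names: production is two GSIs behind.
const gsiSteps = planGsiDeployments(
  ["byEmail"],
  ["byEmail", "byStatus", "byOwner"]
);
// gsiSteps is [["byEmail", "byStatus"], ["byEmail", "byStatus", "byOwner"]]
```

Each entry in the result is a full index list for one deployment, so production needs as many deployments as there are missing indexes.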

IAM roles

Then there are resources with names that must be unique across the whole account, regardless of region, for example, IAM Roles.

Usually, roles are created with a generated unique name, but there are exceptions. S3 bucket replication needs a role, but in the CDK it's optional, and a new one is created if an existing one is not provided. The problem is that it uses a hardcoded name, CDKReplicationRole, so if a role with that name already exists in the account, in any region, the deployment fails. And that can happen for many reasons: maybe another stack that uses S3 replication is already deployed, or something else created it. The result is the same: deployment failure.
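
When you create the role yourself, one way to avoid the clash is to derive its name from the stack and the region. The following is a rough sketch only: scopedRoleName and shortHash are made-up helpers, the stack name is invented, and the hash is a plain djb2 chosen for brevity rather than anything the CDK uses.

```typescript
// IAM role names can be at most 64 characters long.
const MAX_ROLE_NAME_LENGTH = 64;

// Small deterministic string hash (djb2), used only to build a short suffix.
function shortHash(input: string): string {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    // Keep the value in the unsigned 32-bit range after every step.
    hash = (hash * 33 + input.charCodeAt(i)) >>> 0;
  }
  // Left-pad to 8 hex digits so the suffix length is always the same.
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical helper: a role name that differs per stack and per region.
function scopedRoleName(base: string, stackName: string, region: string): string {
  const suffix = "-" + shortHash(stackName + "/" + region);
  // Trim the base so that the full name stays within the IAM limit.
  return base.slice(0, MAX_ROLE_NAME_LENGTH - suffix.length) + suffix;
}

// The same (invented) stack deployed to two regions gets two different names.
const roleNameUs = scopedRoleName("CDKReplicationRole", "my-stack", "us-east-1");
const roleNameEu = scopedRoleName("CDKReplicationRole", "my-stack", "eu-west-1");
```

A role created with such a name can then be passed in explicitly instead of relying on the default one.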

AppSync resolvers

In AppSync, you can't deploy a resolver that has no corresponding field in the schema.

That sounds reasonable, but it means there is a race condition between updating the schema and adding a new resolver. It is more of a nuisance than a problem, as it can be solved by adding an explicit dependency between the resolver and the schema, but it still breaks the promise of "describe what you need and CloudFormation handles the rest".
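
To see why the explicit dependency helps, here is a simplified model of dependency ordering. It is not how CloudFormation is implemented: ResourceNode and deploymentOrder are hypothetical names, and the model deploys resources one by one, while the real service creates independent resources in parallel. In a template, the matching tool is the DependsOn attribute on the resolver.

```typescript
// Hypothetical, simplified model of a template: every resource lists the
// logical IDs it has to wait for, similar in spirit to DependsOn.
interface ResourceNode {
  id: string;
  dependsOn: string[];
}

// Returns an order in which every resource comes after its dependencies
// (a depth-first topological sort).
function deploymentOrder(resources: ResourceNode[]): string[] {
  const byId: { [id: string]: ResourceNode } = {};
  for (const r of resources) byId[r.id] = r;

  const order: string[] = [];
  const visiting: { [id: string]: boolean } = {};
  const done: { [id: string]: boolean } = {};

  function visit(id: string): void {
    if (done[id]) return;
    if (visiting[id]) throw new Error("Circular dependency at " + id);
    const node = byId[id];
    if (!node) throw new Error("Unknown resource " + id);
    visiting[id] = true;
    for (const dep of node.dependsOn) visit(dep);
    visiting[id] = false;
    done[id] = true;
    order.push(id);
  }

  for (const r of resources) visit(r.id);
  return order;
}

// Without the dependency nothing forces the schema to come first.
const withoutDependency = deploymentOrder([
  { id: "Resolver", dependsOn: [] },
  { id: "Schema", dependsOn: [] },
]);
// With it, the schema is always in place before the resolver.
const withDependency = deploymentOrder([
  { id: "Resolver", dependsOn: ["Schema"] },
  { id: "Schema", dependsOn: [] },
]);
// withoutDependency is ["Resolver", "Schema"], withDependency is ["Schema", "Resolver"]
```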

IoT DomainConfiguration

The worst cases require a two-step deployment. For example, changing the ServerCertificateArns of an AWS::IoT::DomainConfiguration cannot be done in one step. This is because:

Currently you can't update the server certificate in your domain configuration. To change the certificate of a domain configuration, you must delete and recreate it.

In effect, that means if you use one certificate, for example one created with the deprecated DnsValidatedCertificate construct, and want to switch to a different one, such as a cross-region Certificate, you can't migrate without downtime. New deployments with either the old or the new certificate will work, but moving from one to the other won't.

The solution is a two-step deployment:

Start with one certificate configured
Delete the certificate and the domain configuration with it
Deploy with the other certificate configured

This is wrong. Not only does it need two deployments in a row, which is a departure from usual practices, but the first one also leaves the system in a broken state: without a domain configuration, the devices can't connect to the cloud.

Worse still, this creates a "barrier commit", as you cannot move from "before" to "after" or in reverse without deploying it. Instead of having a history of the system where you can recreate any point, you have these commits to keep in mind. It also needs to be communicated to the developers: everybody has to go through the barrier and do a two-phase deployment in their dev accounts.
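
The two-step process and the barrier it creates can be written down as a tiny state model. This is only an illustration: CertState and planCertificateChange are made-up names, a state is the certificate attached to the domain configuration, and null stands for "the domain configuration does not exist".

```typescript
// Hypothetical model: the certificate attached to the domain configuration,
// or null when the domain configuration is not deployed at all.
type CertState = string | null;

// Returns the states that have to be deployed, in order, to move from one
// state to another.
function planCertificateChange(from: CertState, to: CertState): CertState[] {
  if (from === to) return [];
  // Creating or deleting the domain configuration takes one deployment.
  if (from === null || to === null) return [to];
  // Swapping certificates has to pass through the "deleted" state first.
  return [null, to];
}

// Moving between two (invented) certificates needs the broken middle state.
const swapPlan = planCertificateChange("old-cert", "new-cert");
// swapPlan is [null, "new-cert"]
```

The null in the middle of the plan is the barrier: every environment that moves between the two certificates has to deploy it, in either direction.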

Also, this will prevent you from changing the logical resource ID of the certificate. Do you want to move it to a nested stack, or move it to a different CDK construct? You have to go through the same process again.

Cognito LogDeliveryConfiguration

The IoT DomainConfiguration is probably the worst I've encountered so far because it can't be updated in place. But AWS keeps making the mistake of enforcing uniqueness, which then breaks resource re-creation.

The latest I saw is the Cognito LogDeliveryConfiguration. It is configured for a UserPool, and if one is already configured for a pool, another one cannot be added. As a result, you can't move the resource around without first deleting it. You put it somewhere in your code and it stays there forever.

Conclusion

While many of the resources CloudFormation supports are "good citizens", most of my time is spent working around the edge cases.

Looking into the future, I don't see this changing either: new resources are getting added with the same limitations and existing cases are seldom fixed. We'll probably have the same limitations and need the same workarounds going forward.