Custom resources in CloudFormation templates: Lessons learned

Using custom resources requires some insight into how CloudFormation works. Here are some tips to avoid the common pitfalls.


Background

Working with custom resources opens up a new dimension of CloudFormation. Along with the built-in support for most AWS resources, you can add support for all sorts of other things. This also removes the limitation that CloudFormation can only handle resources in the AWS cloud; you can manage GitHub repositories, MailChimp campaigns, and many other third-party resources.

During my work with custom resources, I’ve learned some best practices.

Use a Lambda for the ServiceToken

You can choose between an SNS topic and a Lambda function when you implement the logic to handle the lifecycle. The former is somewhat more versatile, but I’ve found that using a Lambda defined in the same template is less problematic. The resources tend to create/update/delete well within the time limit of the execution, and having everything defined in a single place eliminates the guesswork about which code is run.
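As a rough sketch of what "handling the lifecycle" means inside such a Lambda, the function below picks a step based on the RequestType field that CloudFormation sends (Create, Update, or Delete). The names createResource, updateResource, deleteResource, and selectStep are hypothetical, and the step bodies are placeholders rather than a real implementation.

```javascript
// Hypothetical step implementations; real ones would call the AWS SDK
// or a third-party API instead of returning these placeholder objects.
const createResource = async (event) => ({ created: event.ResourceProperties });
const updateResource = async (event) => ({
	updated: event.ResourceProperties,
	previous: event.OldResourceProperties,
});
const deleteResource = async (event) => ({ deleted: event.PhysicalResourceId });

// Map the RequestType sent by CloudFormation to the matching step.
const selectStep = (requestType) => {
	switch (requestType) {
		case "Create": return createResource;
		case "Update": return updateResource;
		case "Delete": return deleteResource;
		default: throw new Error(`Unknown RequestType: ${requestType}`);
	}
};

// Usage inside the handler:
// const result = await selectStep(event.RequestType)(event);
```

Throwing on an unknown RequestType keeps an unexpected event from being silently ignored, which matters because CloudFormation keeps waiting until it gets a response.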

Can’t change the ServiceToken

If you have a custom resource, you can’t change its ServiceToken later. If you do need to change it, remove the resource from the template and create a new one with the new token. This is usually not a problem, as by the time you deploy to production you already know which Lambda to use, but it came up once.

As a consequence, you can’t rename the Lambda resource either, as that would change its ARN, which is what the ServiceToken is.

When you use a template like this:

Resources:
  CustomLambda:
    ...
  CustomRes:
    Type: Custom::Res
    Properties:
      ServiceToken: !GetAtt CustomLambda.Arn

And later you want to rename the Lambda resource without changing anything else, the update will fail:

Resources:
  AnotherCustomLambda:
    ...
  CustomRes:
    Type: Custom::Res
    Properties:
      ServiceToken: !GetAtt AnotherCustomLambda.Arn

Only the newest code will be run

If you create a custom resource with a given version of the Lambda code and later push a new version, only the new code will run from then on, even for resources the old code created. This can easily mean orphaned resources, especially when you change how you locate them.

For example, suppose the custom resource is an S3 object and you change how you calculate its key. In this case, even though the OldResourceProperties contains the old value, the new Lambda code will not be able to locate and delete the object. As an illustration, when you change the property from Filename to Key, you might need to check the bucket manually.

In the previous version, you used Filename as the key of the object:

Resources:
  S3File:
    Type: Custom::S3File
    Properties:
      Filename: example

But later changed it to Key:

Resources:
  S3File:
    Type: Custom::S3File
    Properties:
      Key: example

The updated code will look for Key and will not find the object defined by the Filename. As a result, the existing object might be unaccounted for in the future.
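One possible way to soften this, sketched here under the assumption that the handler reads the key from the resource properties, is to keep the new code able to understand the old property name for a while. The helper names resolveObjectKey and keysForUpdate are hypothetical.

```javascript
// Returns the S3 key from a set of resource properties, accepting both the
// current property name (Key) and the one used by earlier versions (Filename).
const resolveObjectKey = (properties = {}) => {
	const key = properties.Key ?? properties.Filename;
	if (key === undefined) {
		throw new Error("Neither Key nor Filename is set");
	}
	return key;
};

// During an Update, the previous object is located through
// OldResourceProperties, which may still use the old property name.
const keysForUpdate = (event) => ({
	oldKey: resolveObjectKey(event.OldResourceProperties),
	newKey: resolveObjectKey(event.ResourceProperties),
});
```

With this fallback, an update from Filename: example to Key: example resolves to the same key on both sides, so the existing object is still accounted for.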

Add a dummy parameter to force updates

Changes in the Properties section trigger an update to the resource. But if you only change the Lambda code, that does not change the resource itself, so you won’t see any effects.

During development, you should add a dummy parameter that you can change to force an update.

With a resource like this:

Resources:
  CustomRes:
    Type: Custom::Res
    Properties:
      ...
      Dummy: 1

When you change the Lambda code, make sure to change the parameter as well:

Resources:
  CustomRes:
    Type: Custom::Res
    Properties:
      ...
      Dummy: 2

This will trigger an update lifecycle step and you’ll be able to observe the effects of the new code.

Stuck create/update/delete

Since the handler function is fully asynchronous, CloudFormation does not learn the outcome from the Lambda invocation itself; it waits for a response sent to the ResponseURL. Without an explicit success or failure, the operation only ends when it eventually times out. The bad news is that this can take 30-60 minutes, during which you can’t even delete the stack.

On the bright side, it will time out eventually, so the stack does not stay stuck indefinitely. Moreover, if the stack deletion fails, CloudFormation offers to exclude the problematic resources from the deletion. Just don’t forget to delete them manually.

Always add error handling first

Because of the timeout, always add error handling first, right after you start writing your function. This safety mechanism makes sure an exception does not result in a lengthy timeout.

The easiest way is to wrap the body of your function in a try-catch:

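exports.index = async (event, context) => {
	try {
		...
	} catch (e) {
		// Report the failure to CloudFormation instead of waiting for the timeout.
		// Awaited in case sendFailure returns a promise, so the function
		// does not finish before the response is sent.
		await sendFailure(event, context, e);
	}
};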
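The sendFailure helper is not shown in this post. As a guess at what it needs to send, the sketch below builds the JSON body of a FAILED response from the fields CloudFormation puts in the event. The function name buildFailureBody is hypothetical, and falling back to the LogicalResourceId when there is no PhysicalResourceId yet (a failed Create) is an assumption, not a requirement.

```javascript
// Builds the JSON body of a FAILED response for a custom resource request.
// The body still has to be sent with an HTTP PUT to event.ResponseURL.
const buildFailureBody = (event, error) => JSON.stringify({
	Status: "FAILED",
	// Reason shows up in the stack events, so include the error message.
	Reason: error && error.message ? error.message : String(error),
	// Assumption: reuse the existing id, or fall back to the logical id on Create.
	PhysicalResourceId: event.PhysicalResourceId || event.LogicalResourceId,
	StackId: event.StackId,
	RequestId: event.RequestId,
	LogicalResourceId: event.LogicalResourceId,
});
```

StackId, RequestId, and LogicalResourceId are copied back unchanged from the request, which is how CloudFormation matches the response to the pending operation.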

Use variable naming for the underlying resource

Instead of naming the underlying resource directly from the template, you should also add a variable part. This makes sure that if you have two resources in the template, they map to different underlying resources.

Instead of naming the object example.txt, use something like example-gg3jfds.txt. You can see this in the built-in resources too, for example, when you create an S3 bucket.

It becomes especially important when you rename the resource in the template. CloudFormation first creates the new resource, then issues a delete for the old one. And since the parameters are the same, except for the LogicalResourceId, this process will delete the underlying resource, even though it is still in the template.
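A minimal sketch of adding such a variable part, assuming a random hex suffix is acceptable for the underlying resource's name. The function name generateObjectName is hypothetical.

```javascript
const crypto = require("crypto");

// Appends a random 8-character hex suffix to the base name,
// for example "example" + "txt" becomes something like "example-3fa91c0e.txt".
const generateObjectName = (baseName, extension) => {
	const suffix = crypto.randomBytes(4).toString("hex");
	return `${baseName}-${suffix}.${extension}`;
};
```

The generated name is different on every call, so it should be generated once in the create step and stored, for example as part of the PhysicalResourceId, so that later update and delete steps can find the same underlying resource.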

Don’t use the official example for the PhysicalResourceId

The PhysicalResourceId is the identifier you give to the resource during the create step. During an update, you have a chance to return a new one, in which case there will be a delete step for the old one after the update. Since this is something you need to plan for, you need to be mindful of what you use for this id.

But whatever you choose, do not use the official example of context.logStreamName, which unfortunately got copy-pasted into many other tutorials.

The name of the log stream is not tied to the lifecycle of the resource in any way, and it can change without notice. If, for example, you deploy a new Lambda version, it will change. You don’t want to delete your resources the next time you update your stack.

If you don’t want to give it much thought, use the LogicalResourceId. This makes sure that there will be no deletion unless you explicitly delete the resource from the template, which should be a good starting point.
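A small sketch of that starting point: use the LogicalResourceId on Create, and return the id CloudFormation already knows on Update and Delete so that no replacement is triggered. The function name physicalResourceIdFor is hypothetical.

```javascript
// Chooses the PhysicalResourceId to send back in the response.
const physicalResourceIdFor = (event) => {
	if (event.RequestType === "Create") {
		// Create requests carry no PhysicalResourceId yet, so pick a stable one.
		return event.LogicalResourceId;
	}
	// Update and Delete requests include the id returned earlier; echo it back
	// so CloudFormation does not treat the update as a replacement.
	return event.PhysicalResourceId;
};
```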

cfn-response module

The cfn-response module defines a function that implements the basic logic of sending the response to CloudFormation. The catch is, it might not be available, depending on how you defined the code in the template.

If you use the ZipFile property, the cfn-response module is available. But if the code is in S3, which is where aws cloudformation package puts it, the module is not available. In that case you can copy-paste it from the official docs.
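If you would rather write the sending part yourself, it comes down to an HTTP PUT of the JSON response body to the ResponseURL from the event. The sketch below uses only Node's built-in https module. The names buildPutOptions and putResponse are hypothetical, and the empty content-type header follows what the cfn-response module does, as far as I can tell.

```javascript
const https = require("https");

// Builds the request options for the pre-signed ResponseURL.
const buildPutOptions = (responseUrl, body) => {
	const url = new URL(responseUrl);
	return {
		hostname: url.hostname,
		port: 443,
		// The query string carries the signature, so it must be kept.
		path: url.pathname + url.search,
		method: "PUT",
		headers: {
			"content-type": "",
			"content-length": Buffer.byteLength(body),
		},
	};
};

// Sends the body and resolves with the HTTP status code of the reply.
const putResponse = (responseUrl, body) => new Promise((resolve, reject) => {
	const request = https.request(buildPutOptions(responseUrl, body), (response) => {
		response.on("end", () => resolve(response.statusCode));
		response.resume(); // drain the reply so the "end" event fires
	});
	request.on("error", reject);
	request.end(body);
});
```

In an async handler, the call would be awaited, for example await putResponse(event.ResponseURL, body), so the function does not return before the response has been delivered.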


08 January 2019