How to clear a DynamoDB table

Name: S3 and CloudFront Signed URLs - How to handle files in an AWS serverless environment
Author: Tamás Sallai

Ways to remove all elements in a reliable way

Tamás Sallai

8 mins

Want to learn AWS serverless development? Click here

Deleting all items

Especially during testing, it's a common requirement to clear the database. There is a handy command in SQL for this: DELETE FROM table, which clears all items in the table. But DynamoDB does not support this statement and there is no direct alternative either.

How to delete all items in a DynamoDB table then?

It turns out there is no easy way to do this and reliably deleting all items requires quite a bit of coding. But there is an alternative way too: deleting and recreating the table. This gets rid of the items inside and it also has a big advantage the other solutions don't have: it's free. You don't need to issue a delete request to each item and pay for the write capacity it consumes. The problem with this solution is that you need to make sure you recreate the table exactly as it were, with all the attributes and indices. And that includes all future features too. Failing to do so might break the app.

But let's consider a more traditional approach, one that does not modify the table only the items inside it!

The basic structure is simple: get all the items, then send a delete request for each of them. This could be a few lines of code, but the way DynamoDB and its API works makes it a bit more complicated.

Getting all elements

The scan operation reads the entire table and returns all the items. It consumes read capacity units for every element, so it is usually a wasteful operation that indicates that an index is missing. But when you need all the elements, it's the operation to use.

By default, it returns all fields of every item. But the delete operation requires only the keys (hash and sort keys). To make the scan transfer less data over the wire, it supports a ProjectionExpression that allows returning only a subset of attributes for each item. Note that it does not make the scan consume less read capacity, only saves bytes transferred on the wire.

But which attributes are the keys?

To know this, use the describeTable operation that returns this information:

const getKeyDefinitions = async (table) => {
	const tableInfo = (await ddb.describeTable({TableName: table}).promise()).Table;
	return tableInfo.KeySchema.map(({AttributeName, KeyType}) => {
		return {
			AttributeName,
			AttributeType: tableInfo.AttributeDefinitions.find((attributeDefinition) => attributeDefinition.AttributeName === AttributeName).AttributeType,
			KeyType,
		};
	});
};

const clearTable = async (table) => {
	const keys = await getKeyDefinitions(table);
	// [{AttributeName: "keyAttribute", AttributeType: "S", KeyType: "HASH"}, ...]

	// ...
}

The keys value contains all the necessary information for the keys.

The next step is to make a scan request and get the keys for all items. This is again a bit more complicated as scan is a paginated operation. It may not return all elements, in which case it has a LastEvaluatedKey field in the response. To get more elements, use the ExclusiveStartKey property for the next request.

How to paginate the AWS JS SDK using async generators

The AWS SDK returns the lists in batches. Learn how to use async generators to hide this

const _ = require("lodash/fp");

const getPaginatedResults = async (fn) => {
	const EMPTY = Symbol("empty");
	const res = [];
	for await (const lf of (async function*() {
		let NextMarker = EMPTY;
		while (NextMarker || NextMarker === EMPTY) {
			const {marker, results} = await fn(NextMarker !== EMPTY ? NextMarker : undefined);

			yield* results;
			NextMarker = marker;
		}
	})()) {
		res.push(lf);
	}

	return res;
};

const clearTable = async (table) => {
	const keys = await getKeyDefinitions(table);

	const allItems = await getPaginatedResults(async (LastEvaluatedKey) => {
		const items = await ddb.scan({
			TableName: table,
			ExclusiveStartKey: LastEvaluatedKey,
			ProjectionExpression: keys.map((_k, i) => `#K${i}`).join(", "),
			ExpressionAttributeNames: _.fromPairs(keys.map(({AttributeName}, i) => [`#K${i}`, AttributeName])),
		}).promise();

		return {
			marker: items.LastEvaluatedKey,
			results: items.Items,
		};
	});
	// allItems contains all the elements
}

The allItems value contains the keys for all elements.

Batch deletion

The next step is to issue a deleteItem for each key gathered in the previous step. Fortunately, DynamoDB supports the batchWriteItem operation that allows more than one DeleteRequest in a single call, reducing the HTTP overhead significantly.

The batchWriteItem supports batches of size 25. This means a request has to be made for every 25 items. To make these batches, lodash provides the chunk convenience function:

const _ = require("lodash/fp");

const clearTable = async (table) => {
	const keys = // ...;
	const allItems = // ...;

	const batches = _.chunk(25)(allItems);
	// [[item1, item2, ..., item25], [item26, item27, ...]]
}

With these batches, the next step is to prepare the arguments for the batchWriteItem. It supports deleting from multiple tables and it needs the types and values for all keys. An example request is:

{
	"TableName": [
		{
			"DeleteRequest": {
				"Key": {
					"HashKey": {"S": "key1"},
					"RangeKey": {"N": 5}
				}
			}
		}
	]
}

Programmatically, the code below generates this structure for each batch:

const clearTable = async (table) => {
	const keys = // ...;
	const allItems = // ...;

	const batches = _.chunk(25)(allItems);

	batches.map((batch) => {
		[table]: batch.map((obj) => {
			return {
				DeleteRequest: {
					Key: _.flow(
						_.map(({AttributeName}) => {
							return [AttributeName, obj[AttributeName]];
						}),
						_.fromPairs,
					)(keys)
				}
			};
		});
	});
}

Finally, sending the batchWriteItem operations also needs some extra plumbing. Since they contain multiple independent requests without a guarantee that all of them will be successful, a partial apply can happen. In this case, some items are deleted and some are not. DynamoDB indicates which requests in a batch are unsuccessful in the UnprocessedItems field in the response. To make sure that all the items are processed, we need to implement a retrying strategy.

How to use DynamoDB batch write with retrying and exponential backoff

Make sure that every element in a batch request is saved

const wait = (ms) => new Promise((res) => setTimeout(res, ms));

// retries the batch until all elements are written
const batchWrite = async (items, retryCount = 0) => {
	const res = await ddb.batchWriteItem({RequestItems: items}).promise();

	if(res.UnprocessedItems && res.UnprocessedItems.length > 0) {
		if (retryCount > 8) {
			throw new Error(res.UnprocessedItems);
		}
		await wait(2 ** retryCount * 10);

		return batchWrite(res.UnprocessedItems, retryCount + 1);
	}
};

const clearTable = async (table) => {
	const keys = // ...;
	const allItems = // ...;

	const batches = // ...;

	await Promise.all(batches.map((batch) => {
		// send the batch write
		return batchWrite({
			[table]: batch.map((obj) => {
				return {
					DeleteRequest: {
						Key: _.flow(
							_.map(({AttributeName}) => {
								return [AttributeName, obj[AttributeName]];
							}),
							_.fromPairs,
						)(keys)
					}
				};
			})
		});
	}));
}

The full implementation

It's quite a bit of coding, but this code implements a generic way to reliably delete all items in a table:

const AWS = require("aws-sdk");
const _ = require("lodash/fp");

const ddb = new AWS.DynamoDB();

// get all results, paginating until there are no more elements
const getPaginatedResults = async (fn) => {
	const EMPTY = Symbol("empty");
	const res = [];
	for await (const lf of (async function*() {
		let NextMarker = EMPTY;
		while (NextMarker || NextMarker === EMPTY) {
			const {marker, results} = await fn(NextMarker !== EMPTY ? NextMarker : undefined);

			yield* results;
			NextMarker = marker;
		}
	})()) {
		res.push(lf);
	}

	return res;
};

const wait = (ms) => new Promise((res) => setTimeout(res, ms));

// writes a batch of items, taking care of retrying the request when some elements are unprocessed
const batchWrite = async (items, retryCount = 0) => {
	const res = await ddb.batchWriteItem({RequestItems: items}).promise();

	if(res.UnprocessedItems && res.UnprocessedItems.length > 0) {
		if (retryCount > 8) {
			throw new Error(res.UnprocessedItems);
		}
		await wait(2 ** retryCount * 10);

		return batchWrite(res.UnprocessedItems, retryCount + 1);
	}
};

// returns the keys for a table
const getKeyDefinitions = async (table) => {
	const tableInfo = (await ddb.describeTable({TableName: table}).promise()).Table;
	return tableInfo.KeySchema.map(({AttributeName, KeyType}) => {
		return {
			AttributeName,
			AttributeType: tableInfo.AttributeDefinitions.find((attributeDefinition) => attributeDefinition.AttributeName === AttributeName).AttributeType,
			KeyType,
		};
	});
};

// clears a table
const clearTable = async (table) => {
	// get the key definitions
	const keys = await getKeyDefinitions(table);

	// get all items
	const allItems = await getPaginatedResults(async (LastEvaluatedKey) => {
		const items = await ddb.scan({
			TableName: table,
			ExclusiveStartKey: LastEvaluatedKey,
			ProjectionExpression: keys.map((_k, i) => `#K${i}`).join(", "),
			ExpressionAttributeNames: _.fromPairs(keys.map(({AttributeName}, i) => [`#K${i}`, AttributeName])),
		}).promise();
		return {
			marker: items.LastEvaluatedKey,
			results: items.Items,
		};
	});

	// make batches (batchWriteItem has a limit of 25 requests)
	const batches = _.chunk(25)(allItems);

	// send the batch deletes
	await Promise.all(batches.map((batch) => {
		return batchWrite({
			[table]: batch.map((obj) => {
				return {
					DeleteRequest: {
						Key: _.flow(
							_.map(({AttributeName}) => {
								return [AttributeName, obj[AttributeName]];
							}),
							_.fromPairs,
						)(keys)
					}
				};
			})
		});
	}));
};

Caveats

The above solution removes all elements in the table, but there are edge cases and things to note when using it.

First, it reads all the items in the table. This incurs read capacity costs. Then it writes all items, which use write capacity. Deleting a table this way costs more the more elements the table has.

Then this operation is not atomic. It first reads all the items, then deletes them. If items are written to the table in the meantime, they won't be deleted.

Also, it loads all keys into memory, which is a finite resource. It is practical up to several thousand (or potentially a few million) items, but after that, you need to implement a streaming version.