How to use DynamoDB batch write with retrying and exponential backoff

Make sure that every element in a batch request is saved

Tamás Sallai

The batchWriteItem operation

DynamoDB supports the batchWriteItem function that consolidates multiple item writes in a single request. It's a great performance improvement, as network overhead is a significant factor with DynamoDB and sending fewer requests reduces the CPU (and network) utilization of the caller. And especially with Lambda, where you are billed for the running time of the function, this translates into savings on the infrastructure.

The batch write supports up to 25 write requests. They can be either a PutRequest or a DeleteRequest, and you can also mix them inside a single operation. This makes it suitable for mass writes and deletes.
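
For example, a single request that puts one item and deletes another might look like this. This is a minimal sketch, assuming a table named "Table" with a string partition key attribute called Key (the same names as in the examples later in this article):

await ddb.batchWriteItem({
	RequestItems: {
		"Table": [
			// write (or overwrite) one item
			{PutRequest: {Item: {Key: {"S": "key1"}, Value: {"S": "val1"}}}},
			// and delete another one in the same request
			{DeleteRequest: {Key: {Key: {"S": "key2"}}}},
		],
	},
}).promise();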

Note that it differs from the transactWriteItems function. While both can contain multiple writes to different items, the batchWriteItem writes the items individually and does not provide atomicity for the operations. Think of batch writing as multiple individual writes/deletes in a single network call.

Errors during the write

batchWriteItem differs from individual writes/updates/deletes in how errors are handled. When you use, for example, a putItem, it either succeeds or fails, and the latter is indicated by an exception:

try {
	await ddb.putItem({TableName, Item: {/* ... */}}).promise();
	// the item is saved successfully
} catch (e) {
	// there was an error
}

The source of the error can be one of many things. Maybe the service is experiencing degradation, there is a problem with the request, or there was a problem writing the item. Errors occur and you need to prepare for them. And batching amplifies the probability of errors.

So, what happens when some of the items fail?

try {
	await ddb.batchWriteItem({/* ... */}).promise();

	// are all of the items saved successfully?
} catch (e) {
	// which items failed?
}

DynamoDB sends back an UnprocessedItems field in the response that indicates which items failed to write. The request still throws an exception if the error affects the whole batch, but if only some items are unprocessed, it will be successful.

try {
	const res = await ddb.batchWriteItem({/* ... */}).promise();

	// some items are written (but maybe not all of them)
	// res.UnprocessedItems
} catch (e) {
	// the whole batch failed
}

Retrying failed elements

The solution is to resend the batchWriteItem request with the unprocessed elements. And since the second batchWriteItem might also fail to process all items, a third attempt might be needed as well. It's a best practice to put an upper limit on the retries, after which the operation throws an exception so that the caller won't get stuck indefinitely.

As with all distributed systems, retrying should not happen immediately. AWS also recommends a backoff algorithm:

If DynamoDB returns any unprocessed items, you should retry the batch operation on those items. However, we strongly recommend that you use an exponential backoff algorithm. If you retry the batch operation immediately, the underlying read or write requests can still fail due to throttling on the individual tables. If you delay the batch operation using exponential backoff, the individual requests in the batch are much more likely to succeed.

Exponential backoff means the code waits longer and longer between retries. Increasing the delay exponentially keeps the first few tries in rapid succession while quickly reaching longer delays.

For example, wait 10 ms after the first request, then 20 ms after that, then increase to 40, 80, 160, 320, 640, 1280 ms, and so on. After just 8 tries, the wait is already more than a second.
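
In code, the delay before a given retry can be computed from the retry count. A small sketch, assuming the same 10 ms base delay as above:

// delay in ms before retry number n (counting from 0): 10, 20, 40, 80, ...
const backoffDelay = (n) => 2 ** n * 10;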

Implementation

Let's make a safer batchWriteItem that implements retrying with an exponential backoff!

Fortunately, the structure of the UnprocessedItems in the response matches the RequestItems in the request, so there is no transformation needed.
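
For illustration, if a single put didn't go through, UnprocessedItems might look like this (a sketch with the same assumed table and attribute names as the examples below):

{
	"Table": [
		{PutRequest: {Item: {Key: {"S": "key2"}, Value: {"S": "val2"}}}}
	]
}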

This is a recursive algorithm that calls itself with the remaining elements while also keeping track of how many tries it made:

// resolves after the given number of milliseconds
const wait = (ms) => new Promise((res) => setTimeout(res, ms));

const batchWrite = async (items, retryCount = 0) => {
	const res = await ddb.batchWriteItem({RequestItems: items}).promise();

	// UnprocessedItems is an object keyed by table name, so check its keys
	if(res.UnprocessedItems && Object.keys(res.UnprocessedItems).length > 0) {
		// give up after too many tries
		if (retryCount > 8) {
			throw new Error(`Unprocessed items: ${JSON.stringify(res.UnprocessedItems)}`);
		}
		// exponential backoff: 10 ms, 20 ms, 40 ms, ...
		await wait(2 ** retryCount * 10);

		// retry with only the unprocessed items
		return batchWrite(res.UnprocessedItems, retryCount + 1);
	}
};

To use it, pass the same RequestItems structure as you'd use for the batchWriteItem:

await batchWrite({
	"Table": [
		{
			PutRequest: {
				Item: {
					Key: {"S": "key1"},
					Value: {"S": "val1"},
				},
			}
		},
		{
			PutRequest: {
				Item: {
					Key: {"S": "key2"},
					Value: {"S": "val2"},
				},
			}
		}
	]
});

The if (retryCount > 8) part controls when the function gives up. You can increase this if you want more tries, which makes the function more likely to succeed. But keep in mind that there is a tradeoff between the probability of success and speed: each retry attempt (especially with the exponential backoff) lengthens the time until an exception is finally thrown.

The exponential backoff is the await wait(2 ** retryCount * 10); part. It starts with 10 ms after the first try (the initial request) and doubles the wait for every subsequent request.

Conclusion

Writing elements in a batch increases the probability that some of them won't be saved. The batchWriteItem returns a success response in this case, but it also indicates which elements weren't processed. Using that function without checking this response field runs the risk of skipping elements that should be written to the database.

A retrying algorithm with an exponential backoff makes batch writing safer, as unprocessed elements are automatically sent again. Make sure to always handle skipped items.

January 12, 2021