How to use DynamoDB batch write with retrying and exponential backoff
Make sure that every element in a batch request is saved
The batchWriteItem operation
DynamoDB supports the batchWriteItem function that consolidates multiple item writes in a single request. It's a great performance improvement, as network overhead is a significant factor with DynamoDB, and sending fewer requests reduces the CPU (and network) utilization of the caller. And especially with Lambda, where you are billed for the running time of the function, this translates into savings on the infrastructure.
The batch write supports up to 25 write requests. They can be either a PutRequest or a DeleteRequest, and you can also mix them inside a single operation. This makes it suitable for mass writes and deletes.
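For illustration, here is a sketch of a mixed batch that puts one item and deletes another in the same call. The table and attribute names are placeholders, matching the usage example later in this article:
await ddb.batchWriteItem({
  RequestItems: {
    "Table": [
      // insert (or overwrite) one item...
      {PutRequest: {Item: {Key: {"S": "key1"}, Value: {"S": "val1"}}}},
      // ...and delete another one in the same request
      {DeleteRequest: {Key: {Key: {"S": "key2"}}}},
    ],
  },
}).promise();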
Note that it differs from the transactWriteItems function. While both can contain multiple writes to different items, the batchWriteItem writes the items individually and does not provide atomicity for the operations. Think of batch writing as multiple individual writes/deletes in a single network call.
Errors during the write
batchWriteItem differs from individual writes/updates/deletes in how errors are handled. When you use, for example, a putItem, it either succeeds or fails, and the latter is indicated by an exception:
try {
  await ddb.putItem({TableName, Item: {/* ... */}}).promise();
  // the item is saved successfully
} catch (e) {
  // there was an error
}
The source of the error can be one of many things. Maybe the service is experiencing degradation, there is a problem with the request, or there was a problem writing the item. Errors occur and you need to prepare for them. And batching amplifies the probability of errors.
So, what happens when some of the items fail?
try {
  await ddb.batchWriteItem({/* ... */}).promise();
  // are all of the items saved successfully?
} catch (e) {
  // which items failed?
}
DynamoDB sends back an UnprocessedItems field in the response that indicates which items failed to write. The request still throws an exception if the error affects the whole batch, but if only some items are unprocessed, it will be successful.
try {
  const res = await ddb.batchWriteItem({/* ... */}).promise();
  // some items are written (but maybe not all of them)
  // res.UnprocessedItems
} catch (e) {
  // the whole batch failed
}
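Since the UnprocessedItems field mirrors the RequestItems structure of the request (a map keyed by table name), a simple way to detect leftovers is to check whether it has any keys. A minimal sketch, using the same low-level client as above:
const res = await ddb.batchWriteItem({/* ... */}).promise();
// UnprocessedItems is keyed by table name; an empty object means every item was written
if (res.UnprocessedItems && Object.keys(res.UnprocessedItems).length > 0) {
  // some writes were dropped and need to be sent again
}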
Retrying failed elements
The solution is to resend the batchWriteItem request with the unprocessed elements. And since the second batchWriteItem might also fail to process all items, a third attempt might be needed too. It's a best practice to put an upper limit on retries, after which the operation throws an exception so that the caller won't get stuck indefinitely.
As with all distributed systems, retrying should not happen immediately. AWS also recommends a backoff algorithm:
If DynamoDB returns any unprocessed items, you should retry the batch operation on those items. However, we strongly recommend that you use an exponential backoff algorithm. If you retry the batch operation immediately, the underlying read or write requests can still fail due to throttling on the individual tables. If you delay the batch operation using exponential backoff, the individual requests in the batch are much more likely to succeed.
Exponential backoff means the code waits longer and longer between retries. Increasing this time exponentially keeps the first few tries in rapid succession while reaching longer delays quickly.
For example, wait 10 ms after the first request, then 20 ms after that, then increase to 40, 80, 160, 320, 640, 1280, and so on. After just 8 tries it waits more than a second.
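To see that sequence in code, the delay for a given retry can be computed by doubling a 10 ms base; a quick sketch of the numbers above:
// 10 ms base delay, doubled on every retry
const delayFor = (retryCount) => 2 ** retryCount * 10;

console.log([...Array(8).keys()].map(delayFor));
// [ 10, 20, 40, 80, 160, 320, 640, 1280 ]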
Implementation
Let's make a safer batchWriteItem that implements retrying with an exponential backoff!
Fortunately, the structure of the UnprocessedItems in the response matches the RequestItems in the request, so there is no transformation needed.
This is a recursive algorithm that calls itself with the remaining elements while also keeping track of how many tries it made:
// a Promise-based setTimeout
const wait = (ms) => new Promise((res) => setTimeout(res, ms));

const batchWrite = async (items, retryCount = 0) => {
  const res = await ddb.batchWriteItem({RequestItems: items}).promise();

  // UnprocessedItems is a map of table name => write requests, so check its keys
  if (res.UnprocessedItems && Object.keys(res.UnprocessedItems).length > 0) {
    if (retryCount > 8) {
      throw new Error(`Unprocessed items: ${JSON.stringify(res.UnprocessedItems)}`);
    }
    // exponential backoff: 10 ms, 20 ms, 40 ms, ...
    await wait(2 ** retryCount * 10);

    return batchWrite(res.UnprocessedItems, retryCount + 1);
  }
};
And to use it, pass the same RequestItems structure as you'd use for the batchWriteItem:
await batchWrite({
  "Table": [
    {
      PutRequest: {
        Item: {
          Key: {"S": "key1"},
          Value: {"S": "val1"},
        },
      },
    },
    {
      PutRequest: {
        Item: {
          Key: {"S": "key2"},
          Value: {"S": "val2"},
        },
      },
    },
  ],
});
The if (retryCount > 8) part controls when the function gives up. You can increase this if you want more tries, as that would make the function more likely to succeed. But keep in mind that there is a tradeoff between the probability of success and speed: each retry attempt (especially with the exponential backoff) lengthens the time until an exception is finally thrown.
The exponential backoff is the await wait(2 ** retryCount * 10); part. It starts with 10 ms after the first try (the initial request) and doubles the wait for every subsequent request.
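Since the function throws once the retry limit is exceeded, the caller still needs to handle that case. With the numbers above it gives up after roughly 5 seconds of accumulated waiting (10 + 20 + ... + 2560 ms), assuming every attempt leaves items unprocessed. A hypothetical caller might look like this:
try {
  await batchWrite({
    "Table": [
      {PutRequest: {Item: {Key: {"S": "key1"}, Value: {"S": "val1"}}}},
    ],
  });
} catch (e) {
  // the batch could not be fully written: log, alert, or re-queue the items
  console.error("batch write failed", e);
}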
Conclusion
Writing elements in a batch increases the probability that some of them won't be saved. The batchWriteItem returns a success response in this case, but it also indicates which elements weren't processed. Using that function without checking this response field runs the risk of skipping elements that should be written to the database.
A retrying algorithm with an exponential backoff makes batch writing safer, as unprocessed elements are automatically sent again. Make sure to always handle skipped items.