Retries are still a best practice for serverless architectures

I watched Marc Brooker's re:Invent 2024 talk, Try again: The tools and techniques behind resilient systems (ARC403). In it, he talks about metastability and how retries can turn a transient failure into a systemic one.
His example is a spike in requests that overloads a server. Without retries, clients get errors, but after a while things go back to normal. The downside is that clients will see errors instead of just some delay.
Let's fix that by adding some retries!
Sounds reasonable, but now there is a bigger problem: the server never recovers. Every failed request comes back as one or more retries, so an already overloaded server gets even more traffic. According to Marc, this is the "effect behind some of the biggest outages of large-scale systems over the history of the industry".
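To see where the amplification comes from, here is a minimal sketch of the kind of naive client-side retry loop he warns about; the function name and parameters are my illustration, not code from the talk:

```typescript
// A naive retry loop: no backoff, no jitter, no retry budget.
async function callWithNaiveRetries<T>(
  doRequest: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await doRequest();
    } catch (err) {
      // Every failure immediately turns into another request against the
      // already-overloaded server.
      lastError = err;
    }
  }
  // With maxAttempts = 3, each client can triple its traffic exactly when
  // the server is least able to handle it.
  throw lastError;
}
```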
The rest of the talk is equally interesting (erasure coding for reducing tail latency? wow!), but I kept thinking about this.
I work primarily with serverless architectures, and I think they are fundamentally different: retries are not as harmful to them as they are to a server-based architecture.
The advantage of a serverless architecture is that it has no practical upper limit on scalability. DynamoDB or Lambda may need some time to scale out, but eventually they will, and then the increased traffic is handled just fine. The system won't be stuck in an endless overloaded state.
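Even in this world I would still back the retries off and add jitter, so the retried traffic spreads out while the service scales instead of arriving as a second spike. Here is a minimal sketch; the names and default values are my own assumptions:

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry with exponential backoff and full jitter.
async function retryWithBackoff<T>(
  doRequest: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
  maxDelayMs = 5_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await doRequest();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      // Full jitter: a random delay up to an exponentially growing cap,
      // which desynchronizes clients while Lambda or DynamoDB scales out.
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await sleep(Math.random() * cap);
    }
  }
}
```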
There is a huge caveat here, though. Lambda, S3, DynamoDB, AppSync, API Gateway, and similar services scale practically to infinity, but other parts might not. If the app uses any service, first- or third-party, in the critical path that has an upper limit, then suddenly all those nice scalability features go out the window.
And the worst part? You won't even know until the crash happens.
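To make that concrete, here is a back-of-the-envelope sketch with made-up numbers for a capped dependency sitting in the critical path:

```typescript
// Hypothetical numbers, only to illustrate the caveat.
const downstreamCapacityRps = 100; // fixed upper limit, does not scale
const incomingTrafficRps = 150;    // a spike the serverless tier absorbs fine
const maxAttempts = 3;             // every failed request is retried

// Requests over capacity fail and come back as retries, so the offered load
// climbs toward incomingTrafficRps * maxAttempts = 450 rps against a 100 rps
// limit. The serverless tier keeps scaling; the bottleneck never recovers.
const worstCaseOfferedRps = incomingTrafficRps * maxAttempts;
console.log({ downstreamCapacityRps, worstCaseOfferedRps });
```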