I'm changing my mind about serverless

Tamás Sallai
Photo by Aaron Burden on Unsplash

I keep track of an "ideal architecture": the one I would use if tasked to design a new system from scratch. For several years now this has been AWS serverless. The AWS part is personal: it's the stack I'm most familiar with. And serverless because it works the same at small scale and at large. It is a magical feeling to run a terraform apply and watch all the different parts come alive, ready to serve whatever load comes their way. A well-designed serverless application combines the best of all worlds: cost scales with traffic and there is no upper ceiling.

Now? I'm not sure anymore.

When I started as a professional developer, the general consensus was:

  • Compute is expensive, so we need elasticity. When traffic is lower, we can shut down machines and avoid paying for peak capacity.
  • A single machine puts a ceiling on scale, so we need horizontal scalability.
  • Physical failures happen so we need resilient systems.

Now it seems these are no longer true for the vast majority of systems. I read the One Big Server Is Probably Enough post and then The Small Data Manifesto, and at some point I checked how much a dedicated server would cost. A server with 192 cores, 3.1 TB RAM, and 25 Gbps unmetered bandwidth is ~$5k. In terms of cost, that's not a significant expense next to a team of developers, and in terms of capacity (to borrow words from Gemini): this server is a tank and it will be bored. And that's the top of the range, one that could probably handle a small country; a smaller server costs less.

How does that change the calculations?

  • Compute is cheap, so it's viable to provision for 10 times the peak. There is no need for elasticity.
  • There is a ceiling, but it's so high that unless you are Cloudflare or Google you won't hit it.
  • I don't think most services need higher reliability than what a single machine, or a primary-secondary setup, can provide.

Of course, backups, disaster recovery plans, and monitoring are still needed. I'd design a system that takes periodic backups offsite and also asynchronously replicates data as it is written. But these don't need fancy tools.

Moreover, I think the necessary reliability is way lower than most people think. "We need 5 9s!", I've heard. Last October AWS was down for hours, and the next month Cloudflare, bringing down a big part of the internet with them. I'm not saying downtimes are good. But setting expectations too high is unproductive. The server goes down once every 3 years and it takes 30 minutes to fail over to the secondary? An update brings the service down for 30 seconds? I believe these are entirely acceptable numbers for most cases.
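For scale, that failure budget already lands near four nines. A quick back-of-the-envelope (the weekly deploy cadence is my own assumption, not a number from above):

```python
# Availability implied by the failure budget above:
# one 30-minute failover every 3 years, plus a 30-second blip per update.
# The weekly update cadence is an assumption for illustration.

MINUTES_PER_YEAR = 365 * 24 * 60

failover_minutes_per_year = 30 / 3        # 30 minutes every 3 years
update_minutes_per_year = 52 * 30 / 60    # weekly deploys, 30 s each

downtime = failover_minutes_per_year + update_minutes_per_year
availability = 1 - downtime / MINUTES_PER_YEAR

print(f"{downtime:.0f} minutes of downtime per year "
      f"-> {availability:.5%} availability")
```

That prints roughly 36 minutes of downtime per year, about 99.993% availability, without any of the machinery that "5 9s" would demand.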

What are the upsides? No eventual consistency, local reproducibility, easy debugging, instant restarts: fixes for exactly the things that silently drain productivity. A more complex setup also increases the chance of logical errors, such as the ones behind the AWS and Cloudflare incidents.

I'll explore the alternatives in the future. What I have in mind is a Linux box with Postgres started from a NixOS configuration file that can be rigorously tested before deployment. There are several challenges to solve here, such as tenant isolation, replication, and a lot of configuration, but I think, set up properly, this can be the base for a more developer-friendly environment.
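As a sketch of what that configuration file could look like (services.postgresql is a real NixOS module; the database name and open ports are placeholders of my own, not a recommendation):

```nix
# configuration.nix (fragment): a declarative single-box setup.
{ config, pkgs, ... }:
{
  services.postgresql = {
    enable = true;
    ensureDatabases = [ "app" ];  # "app" is a placeholder name
  };

  # Every open port, package, and service lives in this file,
  # so the machine's state can be reviewed and rebuilt from code.
  networking.firewall.allowedTCPPorts = [ 80 443 ];
}
```

The appeal is that the same file can be built and tested before it ever touches the production machine.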

One downside is that it takes a lot of rigour to keep these upsides, and this is where I've seen such setups go wrong in the past: someone SSHs into the machine, changes some configuration or installs a package, and forgets to update the code to match. Or a process writes files to the filesystem instead of the database, and suddenly the replication no longer covers everything. These small things accumulate until nobody dares to touch the system anymore. This is why I'm particularly interested in learning NixOS: everything is code, so there are no one-off changes that are quickly forgotten. It can merge the advantages of IaC and local development.

I'd still use a serverless architecture for things that are less likely to change, things that are bounded in scope. My backup solution, for example, will stay serverless as it benefits from the cost structure of S3 and needs only a small monitoring function. But for a product that the team is actively iterating on, I'm exploring alternatives.

January 3, 2026
