Using worker pools in NodeJS

How to implement true parallelization

Tamás Sallai
6 mins
Photo by Pixabay: https://www.pexels.com/photo/top-view-of-bees-putting-honey-56876/

Why workers

JavaScript has a single-threaded execution model, which means that whatever code you write runs on only one CPU core at a time. It is nicely encapsulated in this quote:

everything runs in parallel, except your code

This makes programming easier as you no longer need to worry about things like "what happens between these two lines?" because it is guaranteed that no other code will run in between. A click handler writes an object? Because of the single-threaded model it cannot interfere with other code, avoiding a massive headache that is present in most other languages.

"So I don't have to worry about code accessing the same data structures at the same time?"

You got it! That's the entire beauty of JavaScript's single-threaded / event loop design!

Impact on performance

In practice, most NodeJS code is limited not by the CPU but by waiting for other things: a webserver needs to send a network request to the database to fetch some data or read a file from the filesystem. In the JavaScript code, these are short-running calls that kick off work that runs in parallel elsewhere. So even though the JavaScript code is single-threaded, it can drive many cores.
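As a rough illustration of that point, the sketch below starts three waits at once and measures the total time. `fakeIo` is a hypothetical stand-in for a database or filesystem call (it is a timer, not real I/O), and the 100 ms figure is arbitrary.

```javascript
// Hypothetical stand-in for a database or filesystem call: resolves after `ms` milliseconds.
const fakeIo = (name, ms) => new Promise((resolve) => setTimeout(() => resolve(name), ms));

const runConcurrently = async () => {
	const startedAt = performance.now();
	// all three waits are started before any of them finishes
	const results = await Promise.all([
		fakeIo("db query", 100),
		fakeIo("file read", 100),
		fakeIo("http call", 100),
	]);
	return {results, elapsedMs: performance.now() - startedAt};
};

const concurrentRun = runConcurrently();
concurrentRun.then(({results, elapsedMs}) => {
	console.log(results, `${Math.round(elapsedMs)} ms`);
});
```

Because the waiting happens outside the JavaScript thread, the three calls should finish in roughly 100 ms in total rather than 300 ms.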

That's the case for most backend applications: a request comes in, it needs some validation, a few calls to the database, then some transformation, and finally the response is sent back. The share of time spent running JavaScript code is very small compared to serving the whole request, which makes this setup able to serve many requests in parallel.

But in other cases the single thread is a limitation. When the JavaScript code does a lot of CPU-intensive work, such as parsing documents, it's common to see around 20% total CPU utilization even though the code is computing as fast as it can. Here, the limiting factor is the single CPU core that the JavaScript code can use.
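One way to see this limitation is to keep the thread busy and watch what happens to everything else. In this sketch, `busyWait` is a hypothetical stand-in for CPU-heavy work such as parsing, and the timings are arbitrary.

```javascript
// Hypothetical stand-in for CPU-heavy work: spins for roughly `ms` milliseconds.
const busyWait = (ms) => {
	const end = performance.now() + ms;
	while (performance.now() < end) {
		// burn CPU without yielding to the event loop
	}
};

// Schedules a 10 ms timer, blocks the thread, and reports when the timer really fired.
const measureTimerDelay = (blockMs) => new Promise((resolve) => {
	const scheduledAt = performance.now();
	setTimeout(() => resolve(performance.now() - scheduledAt), 10);
	busyWait(blockMs);
});

const delayPromise = measureTimerDelay(200);
delayPromise.then((delayMs) => {
	console.log(`a 10 ms timer fired after ${Math.round(delayMs)} ms`);
});
```

The timer asks for 10 ms but can only fire once the loop gives the thread back, so the measured delay is at least the 200 ms spent computing.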

Worker threads

Adding concurrency to the language is not simple. While it seems straightforward to add the ability to start a new thread, similar to other languages, that would run counter to the simplified programming model of NodeJS. To keep the single-threaded model but also add the ability for parallel processing, NodeJS provides worker threads.

Each worker has its own context and behaves as normal JS code: it has a single thread running independently from other workers. This solves the CPU utilization problem, but communication between threads becomes tricky. Ordinary objects cannot be shared between threads, as shared memory would nullify the simplified programming model. Instead, workers rely on events to send data to each other.
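To illustrate that messages are copies rather than shared objects, here is a minimal sketch that uses a `MessageChannel` inside a single thread as a stand-in for the boundary between two threads. The `settings` object is made up for the example.

```javascript
import {MessageChannel, receiveMessageOnPort} from "node:worker_threads";

const {port1, port2} = new MessageChannel();

// Made-up payload for the example.
const settings = {retries: 3};
port1.postMessage(settings);

// Mutating the original after sending does not affect the message in flight,
// because postMessage serialized a copy at the time of the call.
settings.retries = 99;

// Synchronously take the pending message from the other end of the channel.
const received = receiveMessageOnPort(port2).message;
console.log(received.retries, settings.retries);

port1.close();
```

The receiving side sees the value from the moment of sending, which is why two threads cannot step on each other's data this way.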

The way a worker can communicate with other workers or the main thread is via postMessage calls. Then the other end can add a listener that gets called sometime after a message is received:

// index.mjs
import {Worker} from "node:worker_threads";

const worker = new Worker("./worker.mjs");
worker.addListener("message", (msg) => {
	// handle message
});

// worker.mjs
import {parentPort} from "node:worker_threads";

parentPort.postMessage("Hello world!");
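For a version that can be pasted into a single file, the sketch below sends a message in both directions. It is only an illustration: the worker source is inlined with the `eval: true` option instead of living in `worker.mjs`, and the `shout` helper is hypothetical.

```javascript
import {Worker} from "node:worker_threads";

// Inline worker source so the sketch fits in one file. With `eval: true` the
// string runs as a CommonJS script, hence `require` inside it.
const echoWorkerSource = `
	const {parentPort} = require("node:worker_threads");
	parentPort.on("message", (text) => {
		parentPort.postMessage(text.toUpperCase());
	});
`;

// Hypothetical helper: starts a worker, sends one message, resolves with the reply.
const shout = (text) => new Promise((resolve, reject) => {
	const worker = new Worker(echoWorkerSource, {eval: true});
	worker.once("message", (reply) => {
		resolve(reply);
		worker.terminate(); // the worker is no longer needed after one reply
	});
	worker.once("error", reject);
	worker.postMessage(text);
});

const reply = await shout("hello from the main thread");
console.log(reply);
```

Starting and terminating a worker for every call is the simplest lifecycle, but it pays the startup cost each time.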

Using a worker pool

While it's possible to run worker threads that all do their own things and sometimes send messages to each other, this is not how they are usually used. Instead, workers are used to offload CPU-intensive tasks, making their lifecycle tied to the operation. For example, if the main thread needs to parse an HTML document and find various elements in it, it can create a new worker, hand off the processing to it, and then terminate the worker when it's done.

For this use-case it's better to have a pool of workers waiting for tasks and have a request-response-style communication between the main thread and the worker. This way, a pool manager can schedule tasks to available workers, keep track of a queue, and also start and stop threads when needed.
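To make the idea concrete, here is a deliberately small pool written by hand. It is a sketch of the concept, not how any particular library works: `TinyPool` and its methods are hypothetical names, the pool size is fixed, and a worker that throws is simply not reused.

```javascript
import {Worker} from "node:worker_threads";

// Inline worker (CommonJS because of `eval: true`): squares the number it receives.
const squareWorkerSource = `
	const {parentPort} = require("node:worker_threads");
	parentPort.on("message", (n) => parentPort.postMessage(n * n));
`;

class TinyPool {
	constructor(size) {
		this.idle = Array.from({length: size}, () => new Worker(squareWorkerSource, {eval: true}));
		this.queue = [];
	}

	// Queues a task and resolves with the worker's reply.
	run(input) {
		return new Promise((resolve, reject) => {
			this.queue.push({input, resolve, reject});
			this.dispatch();
		});
	}

	// Hands queued tasks to idle workers until one of the two runs out.
	dispatch() {
		while (this.idle.length > 0 && this.queue.length > 0) {
			const worker = this.idle.pop();
			const {input, resolve, reject} = this.queue.shift();
			const onMessage = (result) => {
				worker.off("error", onError);
				this.idle.push(worker); // the worker is free again
				resolve(result);
				this.dispatch();
			};
			const onError = (error) => {
				worker.off("message", onMessage);
				reject(error); // a crashed worker is not returned to the pool
			};
			worker.once("message", onMessage);
			worker.once("error", onError);
			worker.postMessage(input);
		}
	}

	// Terminates the idle workers; call it once all tasks have settled.
	async close() {
		await Promise.all(this.idle.map((worker) => worker.terminate()));
	}
}

const tinyPool = new TinyPool(2);
const squares = await Promise.all([1, 2, 3, 4, 5].map((n) => tinyPool.run(n)));
await tinyPool.close();
console.log(squares);
```

Five tasks are served by two workers here: the first two start immediately and the rest wait in the queue until a worker reports back.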

While worker threads are built into Node, managing a pool requires either custom code or a library. I searched for existing projects that provide this functionality and found a few.

I decided to go with Piscina, mainly because it supports only NodeJS worker threads, so it probably comes with less mental overhead.

An efficient way to implement workers is to separate the relevant code into 2 parts:

  • the runner code which is used by the main thread
  • the worker code that is the glue code between the worker and the rest of the codebase

Worker runner

First, create a pool of workers:

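import path from "node:path";
import Piscina from "piscina";

// only the main thread creates a pool; inside a worker the export is undefined
export const pool = Piscina.isWorkerThread ? undefined : new Piscina({
	filename: path.resolve(__dirname, "worker.js"),
});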

To make sure that the worker won't spawn more workers, effectively fork-bombing the process, the module checks whether it is imported by the main thread or not. As the pool can be undefined, it needs a check whenever it's used.
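One possible way to keep that check in a single place is a small accessor. This is only a sketch: `createPool` is a stand-in for the real `new Piscina({...})` call so the example runs without the library, `isMainThread` from `node:worker_threads` replaces Piscina's own check, and `getPool` is a hypothetical helper name.

```javascript
import {isMainThread} from "node:worker_threads";

// Stand-in for `new Piscina({...})`, so the sketch runs without the library.
const createPool = () => ({run: async (task) => `done: ${task}`});

// Same shape as the runner above: only the main thread gets a pool.
const pool = isMainThread ? createPool() : undefined;

// Hypothetical helper that centralizes the undefined check.
const getPool = () => {
	if (pool === undefined) {
		throw new Error("The worker pool is only available on the main thread");
	}
	return pool;
};

const outcome = await getPool().run("validate");
console.log(outcome);
```

Callers then use `getPool()` instead of `pool` and get a clear error if the code ever runs inside a worker.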

Worker code

The worker.ts then contains the functions that are the API of the worker. For example, I had a validateFile function that I wanted to run in the worker:

import {validateFile as validateFileOrig} from "./validate-file.js";

export const validateFile = async ({baseUrl, url, res, roles}: {...}) => {
	return validateFileOrig(baseUrl, url, res, roles);
};

Notice that the worker's function has only 1 argument: the object with the baseUrl, the url, res, and roles, but the original function has these as separate ones. This is because Piscina allows passing only one argument to the worker function.

Calling the worker

To call the function in the worker:

import {pool} from "./worker-runner.js";
import {validateFile} from "./worker.js";

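if (!pool) {
	// the pool is undefined when this module is loaded inside a worker thread
	throw new Error("The worker pool is only available on the main thread");
}

const allDocumentErrors = await pool.run(
	{baseUrl, url, res, roles} as Parameters<typeof validateFile>[0],
	{name: validateFile.name}
) as Awaited<ReturnType<typeof validateFile>>;

The nice thing is that type hints make this call type-safe. The argument is checked against Parameters<typeof validateFile>[0], and the compiler reports an error if the two are incompatible. Then the result is cast to the correct type with Awaited<ReturnType<typeof validateFile>>. The Awaited is needed because validateFile is async, so its return type is a Promise, while the awaited result of pool.run is the resolved value. With these, whenever the signature of the function in worker.ts changes, every usage that no longer matches results in a compile-time error.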

Transferable objects

The postMessage call uses the structured clone algorithm, which is a better version of JSON.parse(JSON.stringify(...)). It supports circular references, Dates, TypedArrays, Sets, Maps, and a few other types.

What is missing is support for functions. That means that while it's better than plain JSON, it's still limited to rather simple types.
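The difference is easy to try out with the global `structuredClone` function, which applies the same algorithm without involving a worker. The object below is made up for the example, and the JSON round trip is included only for comparison.

```javascript
// Made-up object with types that JSON cannot represent faithfully.
const original = {
	createdAt: new Date(0),
	tags: new Set(["a", "b"]),
	counts: new Map([["x", 1]]),
	bytes: new Uint8Array([1, 2, 3]),
};
original.loop = original; // circular reference

const copy = structuredClone(original);
console.log(copy.createdAt instanceof Date, copy.tags.has("a"), copy.loop === copy);

// JSON comparison: the cycle has to be removed first, as JSON.stringify throws on it.
const {loop, ...withoutCycle} = original;
const viaJson = JSON.parse(JSON.stringify(withoutCycle));
console.log(typeof viaJson.createdAt, viaJson.tags);

// Functions are rejected by the structured clone algorithm.
let functionError;
try {
	structuredClone({callback: () => {}});
} catch (error) {
	functionError = error.name;
}
console.log(functionError);
```

The clone keeps the Date, Set, Map, typed array, and the cycle intact, the JSON version turns the Date into a string and the Set into an empty object, and the object holding a function fails with a DataCloneError.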

February 6, 2024