Using worker pools in Node.js
How to implement true parallelization
Why workers
JavaScript has a single-threaded model, which means that whatever code you write will be run by only one CPU core. It is nicely encapsulated in this quote:
everything runs in parallel, except your code
This makes programming easier as you no longer need to worry about things like "what happens between these two lines?" because it is guaranteed that no other code will run in between. A click handler writes an object? Because of the single-threaded model it cannot interfere with other code, avoiding a massive headache that is present in most other languages.
"So I don't have to worry about code accessing the same data structures at the same time?"
You got it! That's the entire beauty of JavaScript's single-threaded / event-loop design!
Impact on performance
In practice, most Node.js code is limited not by the CPU but by waiting for other things: a webserver needs to send a network request to the database to fetch some data, or to read a file from the filesystem. In the code, these are short-running operations that kick off work outside the JavaScript thread. Even though the JavaScript code is single-threaded, it can still drive many cores through these asynchronous operations.
That's the case for most backend applications: a request comes in, it needs some validation, a few calls to the database, then some transformation, and finally the response is sent back. The time spent running JavaScript code is only a small fraction of serving the whole request, so this setup can serve many requests in parallel.
But in other cases the single thread is a limitation. When the JavaScript code does a lot of CPU-intensive work, such as parsing documents, it's common to see ~20% CPU utilization even though the code is computing as fast as it can. Here, the limiting factor is the single CPU core that the JavaScript code can use.
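To see the bottleneck, it helps to look at a purely CPU-bound function. The naive Fibonacci below is just an illustrative stand-in for heavier work like document parsing:

```typescript
// A synchronous, CPU-bound function: while it runs, the event loop is
// blocked and no timers, I/O callbacks, or incoming requests are served.
export function fib(n: number): number {
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

const start = Date.now();
fib(32); // occupies exactly one core for the whole duration
console.log(`blocked the event loop for ${Date.now() - start} ms`);
```

No matter how many cores the machine has, only one of them is busy while this runs.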
Worker threads
Adding concurrency to the language is not simple. While it seems straightforward to add the ability to start a new thread, as other languages do, that would run counter to the simplified programming model of Node.js. To keep single-threadedness but also add the ability for parallel processing, Node.js provides worker threads.
Each worker has its own context and behaves as normal JS code: it has a single thread running independently from other workers. This solves the CPU utilization problem, but communication between threads becomes tricky. Sending data between threads cannot be based on shared memory, as that would nullify the simplified programming model. Instead, workers need to rely on events to send data to each other.
A worker communicates with other workers or the main thread via postMessage calls. The other end can then add a listener that is called sometime after a message is received:
// index.mjs
import {Worker} from "node:worker_threads";
const worker = new Worker("./worker.mjs");
worker.addListener("message", (msg) => {
// handle message
});
// worker.mjs
import {parentPort} from "node:worker_threads";
parentPort.postMessage("Hello world!");
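For a self-contained illustration of a full round trip, the worker code can also be passed inline with the Worker constructor's eval: true option; in a real project the worker would live in its own file, as above:

```typescript
import {Worker} from "node:worker_threads";

// The worker replies to every message with its doubled value. It is
// passed as a string only to keep this example in a single file.
const workerCode = `
  const {parentPort} = require("node:worker_threads");
  parentPort.on("message", (n) => parentPort.postMessage(n * 2));
`;

export const doubleInWorker = (n: number): Promise<number> =>
  new Promise((resolve) => {
    const worker = new Worker(workerCode, {eval: true});
    worker.once("message", (result: number) => {
      resolve(result);
      void worker.terminate(); // one-shot worker: clean up after the reply
    });
    worker.postMessage(n);
  });
```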
Using a worker pool
While it's possible to run worker threads all doing their own things and occasionally sending messages to each other, this is not how they are usually used. Instead, workers are used to offload CPU-intensive tasks, tying their lifecycle to the operation. For example, if the main thread needs to parse an HTML document and find various elements in it, it can create a new worker, hand off the processing to it, and then terminate the worker when it's done.
For this use-case it's better to have a pool of workers waiting for tasks and have a request-response-style communication between the main thread and the worker. This way, a pool manager can schedule tasks to available workers, keep track of a queue, and also start and stop threads when needed.
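The bookkeeping such a manager does can be sketched in a few lines. SimplePool below is an illustrative toy, not how Piscina is implemented, and its "workers" are plain functions standing in for postMessage round trips:

```typescript
type Task<A, R> = {arg: A; resolve: (r: R) => void};

// Toy pool manager: idle workers pick tasks off a FIFO queue and are
// returned to the idle list once their result arrives.
export class SimplePool<A, R> {
  private idle: Array<(arg: A) => Promise<R> | R>;
  private queue: Task<A, R>[] = [];

  constructor(workers: Array<(arg: A) => Promise<R> | R>) {
    this.idle = [...workers];
  }

  run(arg: A): Promise<R> {
    return new Promise((resolve) => {
      this.queue.push({arg, resolve});
      this.dispatch();
    });
  }

  private dispatch(): void {
    while (this.idle.length > 0 && this.queue.length > 0) {
      const worker = this.idle.shift()!;
      const {arg, resolve} = this.queue.shift()!;
      Promise.resolve(worker(arg)).then((result) => {
        this.idle.push(worker); // worker is free again
        resolve(result);
        this.dispatch(); // there may be queued tasks waiting
      });
    }
  }
}
```

When more tasks arrive than there are workers, the extra ones simply wait in the queue until a worker frees up.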
While worker threads are built into Node, managing a pool requires either custom code or a library. I searched for existing projects that provide this functionality and there are a few:
- Piscina
- workerpool
- and a couple of others with varying popularity and maintenance activity
I decided to go with Piscina, mainly because it supports only Node.js worker threads, so there is probably less mental overhead needed for it.
An efficient way to implement workers is to separate the relevant code into 2 parts:
- the runner code which is used by the main thread
- the worker code that is the glue code between the worker and the rest of the codebase
Worker runner
First, create a pool of workers:
import path from "node:path";
import Piscina from "piscina";

export const pool = Piscina.isWorkerThread ? undefined : new Piscina({
  filename: path.resolve(__dirname, "worker.js"),
});
To make sure that the worker won't spawn more workers, effectively fork-bombing the process, it checks whether it is imported by the main thread or not. As the pool can be undefined, it needs a check whenever it's used.
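That check can live in one small helper so call sites don't have to repeat it. The getPool function and its error message are illustrative, not part of Piscina:

```typescript
// Hypothetical guard: narrows `pool` from possibly-undefined to defined,
// failing fast if worker code accidentally tries to use the pool.
export function getPool<T>(pool: T | undefined): T {
  if (pool === undefined) {
    throw new Error("The worker pool is only available on the main thread");
  }
  return pool;
}
```

Call sites can then write getPool(pool).run(...) without sprinkling undefined checks everywhere.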
Worker code
The worker.ts then contains the functions that form the API of the worker. For example, I had a validateFile function that I wanted to run in the worker:
import {validateFile as validateFileOrig} from "./validate-file.js";
export const validateFile = async ({baseUrl, url, res, roles}: {...}) => {
  return validateFileOrig(baseUrl, url, res, roles);
};
Notice that the worker's function has only one argument: an object with the baseUrl, the url, res, and roles, while the original function has these as separate parameters. This is because Piscina allows passing only one argument to the worker function.
Calling the worker
To call the function in the worker:
import {pool} from "./worker-runner.js";
import {validateFile} from "./worker.js";
const allDocumentErrors = await pool.run(
  {baseUrl, url, res, roles} as Parameters<typeof validateFile>[0],
  {name: validateFile.name}
) as Awaited<ReturnType<typeof validateFile>>;
The nice thing is that type hints make this call type-safe. The parameters are checked against Parameters<typeof validateFile>[0], and the compiler will throw an error if they don't match. Then the result value is cast back to the function's resolved return type. With these, whenever the function in worker.ts changes, all the usages will result in a compile-time error.
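These casts can also be centralized in a small generic wrapper so each call site stays clean. The runTask function and the RunnablePool type below are an illustrative sketch, not part of Piscina's API:

```typescript
// Minimal structural type covering the one pool method used here.
type RunnablePool = {
  run(arg: unknown, opts: {name: string}): Promise<unknown>;
};

// Hypothetical wrapper: ties pool.run()'s argument and result types to
// the worker function's signature in a single place.
export const runTask = async <F extends (arg: any) => any>(
  pool: RunnablePool | undefined,
  fn: F,
  arg: Parameters<F>[0],
): Promise<Awaited<ReturnType<F>>> => {
  if (pool === undefined) {
    throw new Error("The pool is only available on the main thread");
  }
  return (await pool.run(arg, {name: fn.name})) as Awaited<ReturnType<F>>;
};
```

With this in place, passing the wrong argument shape to any worker function is a compile-time error at the runTask call site.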
Transferable objects
The postMessage call uses the structured clone algorithm, which is a better version of JSON.parse(JSON.stringify(...)). It supports circular references, Dates, TypedArrays, Sets, Maps, and a few other types.
What is missing is functions. This means that while it's better than plain JSON, it's still limited to rather simple types.
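The boundary is easy to probe with Node's global structuredClone, which runs the same algorithm postMessage uses:

```typescript
// Rich built-in types survive the structured clone algorithm...
const original = {when: new Date(0), tags: new Set(["a", "b"])};
const cloned = structuredClone(original);
console.log(cloned.tags.has("a")); // true: the Set came through intact

// ...but anything containing a function cannot be cloned and throws.
let failed = false;
try {
  structuredClone({callback: () => {}});
} catch {
  failed = true;
}
console.log(failed); // true
```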