How to implement a persistent file-based cache in Node.js
Speed up calculations between restarts
Caching results
Caching is a powerful way to run a process only once and thus speed up an application. For example, generating images from a PDF yields the same result for the same input file, so there is no need to run the costly process from scratch every time. Saving previous results and reusing them when appropriate can turn an application that takes ages to run into a quick one.
Caching in memory is a good starting place. It works by keeping previous results in a variable so that they are available the next time a costly process runs. But since memory is cleared when the process exits, it cannot reuse results between restarts.
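As a minimal sketch of this starting point (the `memoryCache` and `costlyProcess` names are illustrative, and the "costly" work is a placeholder), in-memory caching can be as simple as a Map used for memoization:

```javascript
// Keep previous results in a Map: repeated calls with the same input
// skip the costly work. Everything here is lost when the process exits.
const memoryCache = new Map();

// placeholder for an expensive computation
const costlyProcess = (input) => input.toUpperCase();

const cached = (input) => {
	if (!memoryCache.has(input)) {
		memoryCache.set(input, costlyProcess(input));
	}
	return memoryCache.get(input);
};
```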
File-based caching is a good solution for this. Files are persistent across restarts, providing a durable place to store results. But they also come with an extra set of problems.
Structure
All file-based caching follows this general structure:
const cacheDir = // where to put cache files
const cacheKey = // calculate cache key for the input
const cacheFile = path.join(cacheDir, cacheKey);

if (exists(cacheFile)) {
	// the result is cached
	return fs.readFile(cacheFile);
} else {
	// calculate the result and store it
	const result = // run the process
	await fs.writeFile(cacheFile, result);
	return result;
}
It calculates the cache key and the cache directory, then checks whether a file exists at that location. If it does, it reads the contents (cache hit); if not, it calculates the result and writes the cache file (cache miss).
Let's break down each part!
Cache directory
The first question is: where to store the cache files? A good cache directory is excluded from version control and can safely be removed from time to time.
There is an attempt to standardize a persistent cache location for Node.js applications in node_modules/.cache. Its advantage over /tmp is that it survives machine restarts, while it still lives inside the node_modules directory, which is usually recreatable using the package-lock.json.
The find-cache-dir package provides an easy-to-use way to locate the cache directory.
To initialize and get the cache directory, use this code:
const findCacheDir = require("find-cache-dir");
const {promises: fs, constants} = require("fs");

const getCacheDir = (() => {
	// the name identifies this app's subdirectory under node_modules/.cache
	const cacheDir = findCacheDir({name: "app-cache"});
	let prom = undefined;
	return () => prom = (prom || (async () => {
		await fs.mkdir(cacheDir, {recursive: true});
		return cacheDir;
	})());
})();
This uses the async lazy initializer pattern to create the directory only when needed.
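The same pattern works for any one-time async setup. A minimal illustration of just the pattern (the `getResource` name and its body are hypothetical stand-ins for the directory creation above): the initializer runs at most once, on first call, and every caller awaits the same promise.

```javascript
// Async lazy initializer: the first call kicks off the setup and caches
// the promise; later (and concurrent) calls reuse that same promise.
let initCount = 0;

const getResource = (() => {
	let prom = undefined;
	return () => prom = (prom || (async () => {
		initCount++; // expensive one-time setup would go here
		return "ready";
	})());
})();
```

Because the promise itself is cached (not the resolved value), even concurrent calls issued before the setup finishes share a single initialization.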
Cache key
All caching depends on a good cache key. It must be known before running the calculation and must be different when the output is different. And, of course, it should be the same when the output is the same.
I found it a best practice to hash the parts before concatenation, then hash the result again. Since hashing produces a fixed-length string, it is resistant to concatenation problems (such as "ab" + "c" === "a" + "bc").
const crypto = require("crypto");
const sha = (x) => crypto.createHash("sha256").update(x).digest("hex");
What should be in the cache key? The input data is an obvious candidate, but unlike memory-based caching, some descriptor of the process should also be included. This is to make sure that new versions of the packages invalidate the caches.
For example, when I needed to cache the results of a PDF-to-images process, I needed to get the version of the external program that did the conversion (pdftocairo). The package provides a version() call that runs the process with the -v flag to print its version.
But not only the external program influences the result; so does the Node.js package itself. Its version is in the package.json.
The getVersionHash() function returns the hash of these versions:
const pjson = require("./package.json");
const {version} = require("node-pdftocairo");
const getVersionHash = (() => {
	let prom = undefined;
	return () => prom = (prom || (async () => sha(sha(await version()) + sha(pjson.version)))());
})();
The cache key is the hash of the version hash and the source hash: sha(await getVersionHash() + sha(source)).
Cache file
The cache file path is the cache directory joined with the cache key:
// source is the input
const cacheFile = path.join(await getCacheDir(), sha(await getVersionHash() + sha(source)));
Handle caches
First, the cache logic needs to determine whether the result is cached. This is a check for whether the file exists:
const fileExists = async (file) => {
	try {
		await fs.access(file, constants.F_OK);
		return true;
	} catch (e) {
		return false;
	}
};

if (await fileExists(cacheFile)) {
	// read and return
} else {
	// calculate and write
}
If the result is a single file or value, it's easy to handle the two cases:
if (await fileExists(cacheFile)) {
	// the result is cached
	return fs.readFile(cacheFile);
} else {
	// calculate the result and store it
	const result = // run the process
	await fs.writeFile(cacheFile, result);
	return result;
}
Cache multiple files
Storing multiple results is also possible: just zip what you want to cache and write the archive to the cache file. I prefer the JSZip library for handling archives in JavaScript:
const JSZip = require("jszip");
const {createWriteStream} = require("fs");
const stream = require("stream");
const util = require("util");

const finished = util.promisify(stream.finished);
if (await fileExists(cacheFile)) {
	const file = await fs.readFile(cacheFile);
	const zip = await JSZip.loadAsync(file);
	const files = await Promise.all(
		Object.values(zip.files)
			// make sure the result array contains the files in the same order
			.sort(({name: name1}, {name: name2}) =>
				new Intl.Collator(undefined, {numeric: true}).compare(name1, name2))
			.map((file) => file.async("nodebuffer"))
	);
	return files;
} else {
	const res = // calculate the result files
	const zip = new JSZip();
	res.forEach((file, i) => {
		zip.file(String(i), file);
	});
	await finished(
		zip.generateNodeStream({streamFiles: true})
			.pipe(createWriteStream(cacheFile))
	);
	return res;
}
With this solution, any number of files can be cached in a single zip file.
Conclusion
File-based caching is a powerful tool to speed up applications. But it also allows cache-related errors to survive restarts, so extra care is necessary when implementing it.