Dos and don'ts of collection processing
Why some projects are clean, easy-to-read, and performant, while others a convoluted mess? Why, when making a modification, in some codebases everything falls into place immediately, while in others it’s more like walking on a minefield?
Writing clean code is mostly about restricting yourself from cutting corners and thus enforcing guarantees. The freedom to approach the problems from many different directions brings the responsibility not only to make the code run and do so in a performant manner but also to make it maintainable in the long run.
This list is a compilation of best practices to follow and antipatterns to avoid in order to lower the long-time cost of code involving collections. They are mostly based on principles originating from functional programming, so if you have the same background, you might find them familiar or even trivial.
A deeper understanding of how collections work helped me transition from producing code that takes more time to understand than to write, to one that is easy-to-read. I hope this compilation helps you do the same.
Do: break functions into smaller parts
A common mistake is to write functions that do many things:
Instead, break them into smaller parts, each responsible for one part of the logic:
(To learn how to use these functions, click here)
Don’t: modify the arguments
Modifying the elements of the collection seems like an optimization, as you need one less iteration. But doing so offsets the WTFs/minute code quality indicator in the wrong direction.
Don’t modify the elements in the iteratee:
Instead, do a map and copy the object:
Treat the input data as immutable. Processing should not change it in any way.
As a speedup, you can use a collection implementation that embraces immutability. With structural sharing, they can offer better performance than simple copying.
Do: use object/array destructuring
When you need only a subset of the parameter objects, use destructuring.
Object destructuring only works with objects. There is a proposal to add support to arbitrary structures, but unfortunately, it had not gained traction.
Iterables, on the other hand, are fully supported. As a result, you can destructure not only Arrays, but for example, ImmutableJs Lists, generator functions, and custom types.
Don’t: rely on state
The iteratee should not modify anything and should use as little data bar its parameters as possible.
Instead of writing a state like this:
Use a different structure that doesn’t rely on it:
(To learn how to use constants and make your code more readable, click here)
Don’t: rely on the ordering of filter and map
Use the index argument most of the functions get:
Do: use the array parameter instead of the closed array
The closed array is usually accessible in the iteratee function but use the parameter instead.
Don’t do this to make a collection of points into a circle:
Do this instead:
This practice makes the code more portable. The less of the environment the functions are using, the better.
Don’t: optimize prematurely
Optimization is usually a tradeoff between speed and readability. It’s tempting to focus on the former at the expense of the latter, but it’s usually a bad practice.
Collections are usually smaller than you think they are, and the speed of the processing is usually faster. Unless you know something is slow, don’t make it faster. Clean but slow code is easier to speed up than a fast convoluted mess to maintain.
Prefer readability, then measure. If the profiler shows a bottleneck, optimize only that bit.
Don’t: use streaming processing (only when you know you need it)
Streaming–or lazy–processing is when only the minimum amount of processing is done to produce each result element, instead of running the steps to completion before starting the next one.
map an Array twice, all the elements are processed each time:
Before the second
map runs, all the elements had been processed by the preceding step.
Contrast that with a streaming implementation, using ImmutableJs:
Every element is processed completely before work on the next one starts.
(To learn how to use ES6 generator functions, click here)
What seems like a performance boost and some savings in terms of memory, in practice, most of the time, the difference is negligible. On the other hand, you can easily end up with an infinite loop, more complex code, or even degrading performance.
Unless you know the performance benefits, or you work with structures you otherwise can’t (for example, infinite ones), avoid streaming processing.
(Want to know how to use generator functions with ImmutableJs? click here)
Do: use the Array functions if they are all you need
Arrays have most of the basic processing functions. If all you need is a
some, use them directly on the Array without any library or wrapper. If, in the future, you’ll need a function that is missing, you can refactor them easily.
Do: use an extensible pipeline implementation
In cases when you need to use a library of functions, opt for one that is extensible. It might not seem important the first time you want to use
last or a similarly absent function, but later when you have complex steps, you’ll find it increasingly difficult to migrate to one.
ImmutableJs’s update and Lodash/FP are my go-to choices.
(To learn how Lodash/FP works, click here)
Don’t: build an Array in reduce
reduce is the most versatile function used to process collections. It can emulate the others, as the return value can be anything.
But that doesn’t mean you should use it when a more specialized function is available.
While this code, using
Is the same as this one, using a
filter and a
The latter communicates the intention far better.
As a rule of thumb, whenever you write a
reduce that returns a collection instead of a value, consider whether a different function would be a better alternative. As usual, there are some exceptions, but using them usually results in more readable code.
Do: use a flat structure
A flat structure is when most of the code is on the same indentation level. Using collection pipelines, it is usually a feature you get for free:
In some cases, this flat structure is compromised:
The flat structure is gone, and as a result, the order of operations is messed up. This should be a red flag that something is wrong.
This is the scenario when using an extensible collection pipeline implementation pays off. Those support adding functions while retaining the structure. For example:
As usual, there are exceptions. For example, when you need processing inside a processing step:
In complex use cases, structures like this emerge. If you want, you can move the iteratees to separate functions:
Whether you want to move complex steps to separate functions or not is a matter of preference and style. Some people prefer a lot of small functions with descriptive names, some argue that having everything in one place is better.
As for myself, I prefer monoliths as long as I can understand the functionality without much difficulty. When all the functions follow a purely functional style, refactoring them is trivial.
There is an exception though. If you need the same step in multiple places, move that to a separate function instead of copy-pasting.
Do: provide an initial value for reduce
reduce has an optional second argument: the initial value. It seems like it is only useful when
undefined is not suitable, but there is a critical difference. When an Array has no elements,
reduce without an initial value does not return
undefined, but throws an Error instead:
But with an initial value set, the result is that value:
You should always supply an initial value for
reduce if there is a sensible one.
Do: use the shorter form of arrow functions
Arrow functions have shorter forms, making them more suitable for one-liners. Instead of writing out the full structure:
If there is only a
return statement, the block is optional:
For simple steps, prefer the latter. This usually makes them one-liners, and those are easier to scan.
The parentheses around the parameters are also optional, making the construct even shorter.
Do: reuse parts of the pipeline
If you need a piece of functionality in several places, move that to a separate function and use that, don’t copy-paste.
For reusing complex steps (for example,
map), use a pipeline implementation that supports that.
Do: know the complexity of the operations
Knowing the algorithmic cost of each function helps estimate the overall complexity of the pipeline.
A few commonly used functions:
O(n), but look out what you do in the iteratee function. If you build an array, it can easily be
O(n * log(n))
O(n), and they can short-circuit on the first
Generally, the complexities are added for subsequent steps and multiplied for nested ones.
For a flat structure like this:
The complexity is
O(n + n + n), which is still
For a nested structure:
The complexity is
O(n * n).
As a rule of thumb, having
O(n * log(n)) should be scalable,
O(n^2) will only work for small arrays, anything bigger can easily be a bottleneck.
Apart from a few general guidelines, best practices are the result of the environment too. If, for example, you work with memory-constrained embedded systems, your rules might be entirely different.
Also, software craftsmanship is more of an art than exact science, and as such, everything you write has your distinctive style. But I believe if you have no well-defined reason not to, following the principles above will help you write better code.
Is there anything you’d add? Let me know, and I’ll update this post.