Better collection processing with collection pipelines

Why moving past loops is an important step

Tamás Sallai

Collections, collections everywhere

Collection processing is an everyday task. So much so that most program logic is about transforming, searching, and ordering data. Mastering it is therefore an essential skill for moving up the programmer ladder.

Whether you work on a leaderboard that shows users ordered by score, a company dashboard that shows the headlines of the latest news, or even a chess game that draws the board, at its core each of them is collection processing.

This series covers the basics and how to do it right.

The `for` loop

The traditional approach (that is, how everyone did it 20 years ago) is to use a for loop.

The typical scenario is that you have an array of elements and want to do something with each of them, such as adding three:

const array = [5, 10, 15];
for (let i = 0; i < array.length; i++) {
	array[i] += 3;
}
console.log(array); // 8, 13, 18

The problem

This seems like a good solution, but unfortunately, there are several problems with this approach.

The first is the side effect: when this code runs, it modifies the original array. This is not immediately apparent as a problem, and on the surface the loop looks like the best solution.

But when you pass the same array around, it quickly becomes a problem.

Consider the following example, where the first call modifies the array, and thus the second function returns the wrong result:

const users = [
	{active: true},
	{active: true},
	{active: false}
];
const active = countActiveUsers(users); // 2
const all = countAllUsers(users); // 2 !!!

If the first function call removes non-active users, the second one can no longer count the total.
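
The article does not show the two functions, so the following is only a sketch of how such a bug could come about. Both implementations are hypothetical: `countActiveUsers` is assumed to strip inactive users from the array it receives before counting, and `countAllUsers` is assumed to simply return the length.

```javascript
// Hypothetical implementation: removes inactive users from the array it was given.
const countActiveUsers = (users) => {
	// Iterate backward so that splice does not shift the unvisited elements.
	for (let i = users.length - 1; i >= 0; i--) {
		if (!users[i].active) {
			users.splice(i, 1); // side effect: mutates the caller's array
		}
	}
	return users.length;
};

// Hypothetical implementation: reports how many elements the array has.
const countAllUsers = (users) => users.length;

const users = [
	{active: true},
	{active: true},
	{active: false}
];
const active = countActiveUsers(users);
const all = countAllUsers(users);
console.log(active, all); // 2 2
```

Neither function is wrong in isolation. The wrong total only appears because both calls share the same mutable array.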

While changing the order of the calls solves this particular problem, it also creates a constraint. For larger codebases, these constraints accumulate, and simple changes might break totally unrelated parts of the program.

Ever felt that no matter what you change, something breaks? Now you know one quite common cause.

Avoiding side effects

Let's move on to step two! How do we solve the side-effect problem?

Do not modify the original array, but build a new one:

const array = [5, 10, 15];
const result = [];
for (let num of array) {
	result.push(num + 3);
}
console.log(result); // 8, 13, 18

Note: Since the index is no longer used, the for..of loop provides a shorter version.

This approach avoids the side effect. Problem solved.
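
The same idea can be applied to the user-counting example from earlier. This is a minimal sketch under the same assumptions as before (both function bodies are hypothetical, as the article does not show them): the counting function only reads the array, so the order of the calls no longer matters.

```javascript
// Hypothetical non-mutating version: counts without touching the input.
const countActiveUsers = (users) => {
	let count = 0;
	for (const user of users) {
		if (user.active) {
			count++;
		}
	}
	return count;
};

// Hypothetical implementation: reports how many elements the array has.
const countAllUsers = (users) => users.length;

const users = [
	{active: true},
	{active: true},
	{active: false}
];
const active = countActiveUsers(users);
const all = countAllUsers(users);
console.log(active, all); // 2 3
```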

Runaway complexity

The other problem is that for loops can quickly get out of control. This is especially true for nested loops, and it gets worse with each additional level of nesting:

const array = [5, 10, 15];
const result = [];

for (let num of array) {
	if (num % 2 !== 0) {
		result.push(num < 10 ? num * 2 : num);
	}
}
console.log(result); // 10, 15

The above code keeps only the odd numbers and doubles those that are less than 10. The specification is simple, but it is not immediately obvious from looking at the code.

A page-long for loop nested several levels deep tends to survive forever: no one dares to refactor it, and the brave souls who try fail spectacularly.

Towards a solution

Fortunately, most real-world problems require only a handful of operations. And by combining them, most of the use cases can be covered (you'd be surprised by how many!).

We'll look into two of these basic building blocks: the map and the filter.

The `map` function

map is the operation we've already used: it transforms each element of a collection and returns a new collection with the results.

The tricky part is how to write a generalized function that can transform the elements in every possible way.

The solution is the iteratee function. map gets a function as one of its parameters, and that function does the transformation. Its signature is (element) => newElement.

A simple map implementation:

const map = (coll, iter) => {
	const result = [];
	for (let e of coll) {
		result.push(iter(e));
	}
	return result;
}

And its usage:

const array = [5, 10, 15];
map(array, (i) => i + 3); // 8, 13, 18

The `filter` function

filter keeps a subset of the elements, chosen by a predicate. The predicate follows the same principles as the iteratee for map, but its signature is (element) => bool. Only the elements for which the predicate returns a truthy value will be present in the result.

A simple filter implementation is:

const filter = (coll, pred) => {
	const result = [];
	for (let e of coll) {
		if (pred(e)) {
			result.push(e);
		}
	}
	return result;
};

And its usage:

const array = [5, 10, 15];
filter(array, (i) => i % 2 === 0); // 10
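
Because the result depends on truthiness rather than on a strict boolean, the predicate may return any value. The sketch below repeats the simple filter from above (with the parameter named `pred`, an illustrative choice) and uses sample values made up for this example. It also checks that the source array is left untouched.

```javascript
// Same simple filter as above, repeated so the example is self-contained.
const filter = (coll, pred) => {
	const result = [];
	for (const e of coll) {
		if (pred(e)) {
			result.push(e);
		}
	}
	return result;
};

// Sample values for illustration: 0, "" and null are falsy.
const values = [0, 1, "", "a", null, 2];

// The predicate returns the element itself, so only truthy elements are kept.
const truthy = filter(values, (v) => v);
console.log(truthy); // 1, "a", 2
console.log(values.length); // 6, the source array is unchanged
```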

Composition

Now that we have the building blocks, let's consider how to compose them for more complex processing!

Just call them one after the other, passing the intermediary result:

const array = [5, 10, 15];

filter(
	map(
		array,
		(i) => i + 3
	),
	(i) => i % 2 === 0
) // 8, 18

Well, this looks ugly, not to mention that it is written backward: the map runs first, yet it appears inside the filter call.

Instead, store the intermediary results:

const array = [5, 10, 15];

const mapped = map(array, (i) => i + 3); // 8, 13, 18
const filtered = filter(mapped, (i) => i % 2 === 0);
console.log(filtered); // 8, 18

Collection pipelines

Usually, you don't need the intermediary results, and their only purpose is to prevent writing the operations backward. Since having many one-shot variables around is error-prone, it would be better to define the processing steps once, then pass in the source collection and get back the result.

These are collection pipelines.
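
To make the idea concrete, here is one possible shape such a pipeline could take. It is a minimal sketch, not one of the implementations the series goes on to discuss: the `pipeline` helper and the `addThreeKeepEven` name are hypothetical, and the map and filter functions are the simple versions from above, repeated so the example runs on its own.

```javascript
// Simple map and filter, as defined earlier in the article.
const map = (coll, iter) => {
	const result = [];
	for (const e of coll) {
		result.push(iter(e));
	}
	return result;
};

const filter = (coll, pred) => {
	const result = [];
	for (const e of coll) {
		if (pred(e)) {
			result.push(e);
		}
	}
	return result;
};

// Hypothetical helper: takes processing steps (collection => collection)
// and returns a function that runs them in the order they were given.
const pipeline = (...steps) => (coll) => {
	let current = coll;
	for (const step of steps) {
		current = step(current);
	}
	return current;
};

// The steps read top to bottom, and no intermediary variables are needed.
const addThreeKeepEven = pipeline(
	(coll) => map(coll, (i) => i + 3),
	(coll) => filter(coll, (i) => i % 2 === 0)
);

const array = [5, 10, 15];
const result = addThreeKeepEven(array);
console.log(result); // 8, 18
```

The steps are defined once and can be reused with any source collection, which is the property the composition examples above were missing.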

Implementations

There are several ways to define collection pipelines. In the next episodes of this series, we'll look at the different implementations and detail their strengths and weaknesses.