A few days ago I was collecting content for an app of mine, and after a while I had quite a lot. I wanted to filter out lines I had found multiple times from different sources, but the data set was too large to do it manually. I googled for a while for Google Docs extensions that would do that, but none of them worked. I have some previous experience with text similarity, so I wrote a simple script to analyze the data set for me.
So the problem is: I have a text file with many lines and I want to know the most similar lines in it. Sadly, this cannot be fully automated. For example, the words the and teh are very different, although they are just a typo away from each other, while French King Louis II and French King Louis VII seem very similar, yet they are actually 258 years away from each other. So an algorithm can only suggest possible duplicates.
Finding the closest pairs
Note: I’m using underscore.js when working with collections, as it makes it much clearer and easier.
Let’s assume we have read the file into a variable called lines, an array that contains one element per line. First, we need to construct all the pairs from the texts. For this, we map each element to an array of the pairs it takes part in, keeping in mind that we don’t want to include the same pair in reverse order too. This is done with the following snippet:
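A minimal sketch of this pairing step could look like the one below. It uses the built-in Array.prototype.map instead of underscore.js so that it runs on its own (the callback receives the same element, index and list arguments), and both the sample `lines` array and the `buildNestedPairs` name are made up for illustration.

```javascript
// Hypothetical sample data; in the real script `lines` comes from the text file.
const lines = ['apple', 'aple', 'banana'];

// For each line, pair it with every line that comes after it in the list.
// Working with the index (not the string value) keeps reversed pairs and
// self-pairs out of the result, even when the list contains duplicate lines.
function buildNestedPairs(list) {
  return list.map((element, index, whole) =>
    whole.slice(index + 1).map((other) => [element, other])
  );
}

const nested = buildNestedPairs(lines);
console.log(JSON.stringify(nested));
// [[["apple","aple"],["apple","banana"]],[["aple","banana"]],[]]
```

The result is still nested: one inner array per line, and the last one is empty because the last line has nothing after it to pair with.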
Note the map function’s arguments: the first is the current element, the second is its index, and the third is the whole list. All three are needed here. Also, we cannot rely on comparing the strings themselves to keep the reversed pairs out of the result, because the list may contain duplicate lines. So we need to use the index here. This produces the following:
In order to get the pairs in a single array, we need to flatten the array by one level:
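As a sketch of the flattening step, the built-in Array.prototype.flat can stand in for the underscore.js call here. The `nestedPairs` input below is hypothetical sample data in the shape the pairing step produces.

```javascript
// Hypothetical nested result of the pairing step for three sample lines.
const nestedPairs = [
  [['apple', 'aple'], ['apple', 'banana']],
  [['aple', 'banana']],
  []
];

// Flatten exactly one level: the outer per-line grouping disappears,
// but each [first, second] pair stays intact.
const pairs = nestedPairs.flat(1);

console.log(JSON.stringify(pairs));
// [["apple","aple"],["apple","banana"],["aple","banana"]]
```

Flattening any deeper would break the pairs apart into individual strings, which is why the depth is limited to one.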
This gives us:
Next, calculate the distance between the two lines of every pair and sort the pairs by increasing distance, so the most similar ones come first:
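The sketch below shows one way to do this. It assumes the Levenshtein edit distance as the metric, which is my assumption for the illustration and only one of several possible choices. The `levenshtein` and `rankByDistance` names and the sample pairs are hypothetical too.

```javascript
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions and substitutions needed to turn `a` into `b`.
function levenshtein(a, b) {
  // previous[j] = distance between the processed prefix of `a` and b.slice(0, j)
  let previous = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);

  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current.push(Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost  // substitution (or match)
      ));
    }
    previous = current;
  }
  return previous[b.length];
}

// Attach a distance to every pair, then sort ascending so that the
// most similar pairs (smallest distance) come first.
function rankByDistance(pairList) {
  return pairList
    .map((pair) => ({ pair: pair, distance: levenshtein(pair[0], pair[1]) }))
    .sort((x, y) => x.distance - y.distance);
}

// Hypothetical flattened pairs from the previous step.
const samplePairs = [['apple', 'aple'], ['apple', 'banana'], ['aple', 'banana']];
const ranked = rankByDistance(samplePairs);

console.log(ranked[0]);
// { pair: [ 'apple', 'aple' ], distance: 1 }
```

Note that with this plain edit distance the and teh are 2 apart, since swapping two neighbouring letters counts as two substitutions.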
This gives us:
So far so good, it’s almost done. But let’s look at a corner case! Consider the following lines:
This gives us the three-letter words first, and only then the longer lines, although the short words are very different and the longer lines are very similar. To overcome this, we need to normalize the distance so that it is proportional to the length of the lines (the longer one in the pair, actually). To do this, simply divide the distance by the length of the longer text:
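Here is a sketch of the normalization, again assuming the Levenshtein distance as the metric. The sample pairs are made up to reproduce the corner case: two unrelated three-letter words, and two long lines that differ only in a trailing word. The helper names are hypothetical.

```javascript
// Levenshtein edit distance (assumed metric for this sketch).
function levenshtein(a, b) {
  let previous = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost));
    }
    previous = current;
  }
  return previous[b.length];
}

// Distance divided by the length of the longer line: 0 means identical,
// 1 means every character of the longer line had to be edited.
function normalizedDistance(a, b) {
  const longer = Math.max(a.length, b.length);
  return longer === 0 ? 0 : levenshtein(a, b) / longer;
}

// Rank pairs ascending by the given metric, most similar first.
function rankBy(pairList, metric) {
  return pairList
    .map((pair) => ({ pair: pair, score: metric(pair[0], pair[1]) }))
    .sort((x, y) => x.score - y.score);
}

// Hypothetical corner-case data.
const longA = 'the quick brown fox jumps over the lazy dog';
const longB = 'the quick brown fox jumps over the lazy dog again';
const cornerPairs = [['cat', 'dog'], [longA, longB]];

const rawRanking = rankBy(cornerPairs, levenshtein);
const normalizedRanking = rankBy(cornerPairs, normalizedDistance);

console.log(rawRanking[0].pair);        // [ 'cat', 'dog' ], raw distance 3
console.log(normalizedRanking[0].pair); // the two long lines, raw distance 6
```

With the raw distance the cat/dog pair wins with 3 against 6, while the normalized score is 3/3 = 1 for cat/dog and 6/49 for the long lines, so the long lines correctly come first.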
This way we get the correct result:
The number of pairs is clearly O(n^2) in the number of lines, and we need to calculate the distance for every one of them before we can order them. For a few hundred lines it ran in a few seconds, and I think it should be alright for a few thousand at least.