Detecting possibly duplicate Strings

This is a blog post about the algorithm of detecting possible duplicates from a set of Strings

Background

A few days ago I was collecting content for an app of mine, and after a while I had quite a lot. I wanted to filter out lines I had found multiple times from different sources, but the data set was beyond the size to do it manually. I’ve googled for a while for some Google Docs extensions that would do that, but none of them was working. I’ve some previous experience with text similarity, so I wrote a simple script to analyze the data set for me.

The problem

So the problem is: I have a text file with many lines and I want to know the most similar lines in it. Sadly this can not be fully automatized. For example, the world the and teh are very different, although they are just a mistype away, while French King Louis II and French King Louis VII seem very similar, they are actually 258 years away from each other. So an algorithm can only provide possible duplicates.

Finding similarities

Levenshtein distance

The central part of the algorithm is the Levenshtein distance. It basically calculates a distance of two texts, where every character deleted, inserted or modified means 1 step. So if the distance from 2 strings are small, there is only a few changes needed to make one from the other, whereas if the distance is large, there are many changes needed. This wiki page has a comprehensive set of languages the algorithm is implemented in. As I’m working with a Node.js script, I used the Javascript one:

// Compute the edit distance between the two given strings
function getEditDistance(a, b) {
  if(a.length === 0) return b.length;
  if(b.length === 0) return a.length;

  var matrix = [];

  // increment along the first column of each row
  var i;
  for(i = 0; i <= b.length; i++){
    matrix[i] = [i];
  }

  // increment each column in the first row
  var j;
  for(j = 0; j <= a.length; j++){
    matrix[0][j] = j;
  }

  // Fill in the rest of the matrix
  for(i = 1; i <= b.length; i++){
    for(j = 1; j <= a.length; j++){
      if(b.charAt(i-1) == a.charAt(j-1)){
        matrix[i][j] = matrix[i-1][j-1];
      } else {
        matrix[i][j] = Math.min(matrix[i-1][j-1] + 1, // substitution
                                Math.min(matrix[i][j-1] + 1, // insertion
                                         matrix[i-1][j] + 1)); // deletion
      }
    }
  }

  return matrix[b.length][a.length];
};

Finding the closest pairs

Note: I’m using underscore.js when working with collections, as it makes it much clearer and easier.

Let’s assume we have read the lines to a variable called lines, and it is an array and contains one element per line. First, we need to construct all the pairs from the texts. For this, we need to map the elements to return an array of the pairs, keeping in mind that we don’t want to include the same pair in a reverse order too. This is done with the following snippet:

_.chain(lines)
    .map(function(line,idx,list){
        return _.chain(list)
            .rest(idx+1)
            .map(function(other){return [line,other]})
            .value()
    })

Note the map function’s arguments: the first is the actual element, the second is the actual index, and the third is the whole list. These all are needed here. Also, we can not compare the strings in order to avoid the reverse pairs to end up in the result, as there can be duplicates. So we need to use the index here. This produces the following:

['a','b','c'] => [[['a','b'],['a','c']],[['b','c']],[]]

In order to get the pairs in an array, we need to flatten the array 1 level:

    .flatten(true)

This gives us:

['a','b','c'] => [['a','b'],['a','c'],['b','c']]

Next, calculate the distance between all pairs and order them in a decreasing way:

	.map(function(pair){
        return [pair[0],pair[1],getEditDistance(pair[0],pair[1])];
    })
    .sortBy(function(qs){
        return qs[2];
    })

This gives us:

['a','b','c'] => [['a','b',1],['a','c',1],['b','c',1]]

So far so good, it’s almost done. But let’s look at a corner case! Consider the following lines:

this is a very long line
this is a very very long line
egg
ham
bee
cup

This gives us the 3 letter words first, and only then the longer lines, although they are very different and the longer lines are very similar. To overcome this, we need to normalize the distance to be proportional to the length of the lines (the longer one in the pair actually). To do this, simply divide the distance with the longer text:

		return [pair[0],pair[1],getEditDistance(pair[0],pair[1]) / Math.max(pair[0].length,pair[1].length)];

This way we get the correct result:

[ [ 'this is a very long line', 'this is a very very long line', 0.1724137931034483 ],
  [ 'this is a very long line', 'egg', 0.9166666666666666 ],
  ...
]

Performance

The running time is clearly O(n^2) as we need to calculate all pairs because we need to order them. For a few hundred lines it ran in a few seconds, and I think it should be alright for a few thousands at least.

03 February 2015

Interesting article?

Get hand-crafted emails on new content!