Find Duplicate File Names in CouchDB

I was stumped for a bit, trying to figure out how to help my editorial staff avoid uploading the same file twice. In a repository spanning tens of thousands of titles in over a hundred different collections, our staff can’t easily tell whether a document is already in a collection or not.

Turns out that finding duplicate attachments is fairly easy. First create the view:

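function(doc) {
  // Only documents that have attachments are indexed.
  if (doc._attachments) {
    // Each property name in _attachments is an attachment's file name.
    for (var filename in doc._attachments) {
      // Composite key: [collection, file name]; value: the owning document's ID.
      emit([doc.collection, filename], doc._id);
    }
  }
}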

Each row the view returns has a JSON key that looks like this:

["collection name", "filename.rtf"]

So all I have to do to find the duplicates is query that view using the composite key and see if it returns any rows:

http://my.couchdb.server:5984/database-name/_design/my-listings/_view/attachment-exists?key=["collection name","filename.rtf"]
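In practice the key has to be valid JSON (straight double quotes) and URL-encoded before it goes into the query string. Here is a minimal sketch of that lookup. It assumes a JavaScript runtime with a global fetch (a browser, or a recent Node.js), and the function names attachmentExistsUrl and attachmentExists are made up for illustration; only the design document and view names come from the URL above.

```javascript
// Hypothetical helper: build the view URL for one [collection, filename] key.
// The key is serialized as JSON first, then URL-encoded.
function attachmentExistsUrl(baseUrl, collection, filename) {
  var key = JSON.stringify([collection, filename]);
  return baseUrl +
    "/_design/my-listings/_view/attachment-exists?key=" +
    encodeURIComponent(key);
}

// Hypothetical helper: resolves to true when the view returns at least one row,
// i.e. a file with that name is already attached somewhere in the collection.
// Assumes a global fetch is available.
function attachmentExists(baseUrl, collection, filename) {
  return fetch(attachmentExistsUrl(baseUrl, collection, filename))
    .then(function (res) { return res.json(); })
    .then(function (body) { return body.rows.length > 0; });
}
```

Calling attachmentExists("http://my.couchdb.server:5984/database-name", "collection name", "filename.rtf") before an upload would tell the staff whether that name is already taken.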

I could do the same with MD5 checksums, too, but I won’t. The problem is that a single changed byte is enough to give a file a different checksum. So if someone opens their copy of a file and Word changes the metadata in it, it’s no longer byte-for-byte identical, even though the text has not changed. This means that the number of false negatives (i.e. duplicate files that are NOT found) would be too high for people to rely on.
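For the record, here is roughly what the checksum version of the view might look like. This is a sketch, not something I run: it relies on my understanding that CouchDB keeps a digest string (of the form "md5-...") on each attachment stub, and the name mapByDigest is made up so the function can be exercised outside CouchDB. In a design document it would go in as a plain function(doc).

```javascript
// Hypothetical map function: index attachments by their stored digest
// instead of by file name.
function mapByDigest(doc) {
  if (doc._attachments) {
    for (var name in doc._attachments) {
      // Assumption: each attachment stub carries a digest such as "md5-<base64>".
      var digest = doc._attachments[name].digest;
      if (digest) {
        // Composite key: [collection, digest]; value: the owning document's ID.
        emit([doc.collection, digest], doc._id);
      }
    }
  }
}
```

Two byte-for-byte identical files in the same collection would land on the same key, but a copy that Word has re-saved would get a different digest and slip through.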

What I’d really like to find is an algorithm that determines whether the textual content of two documents is significantly similar….