The paper spends most of its time defining upper and lower bounds on families of min-wise independent permutations, but the introduction and first couple sections are very interesting and give their motivation: detecting duplicate documents in the AltaVista index.
First, their definition of min-wise independent permutations:
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ Sn is min-wise-independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we haveIn other words, given a permutation, every element has the same probability of being its subset's smallest element.
Pr( min{ π(X) } = π(x) ) = 1 / |X|
In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π.
Why is this an interesting and important class of permutations? Suppose you have a large set of documents, and you want to figure out which ones are duplicates (say, you're AltaVista, among the first to look at web-scale datasets, and want to get rid of all the identical documents in your index put up by spammers). Here's a sketch of how AltaVista attacked it:
Represent a document D by a bag of words, and represent those words by natural numbers (say, 'an' = 1, 'the' = 2, 'supercalifragilisticexpialidocious' = 3, ...). Then, a document D can be represented as a set of natural numbers: D ⊆ N. For two documents, D1 and D2, their similarity can be measured as the size of their set intersection divided by their set union:
( D1 ∩ D2 ) / ( D1 ∪ D2 )
Pr( min{ π(D1) } = min{ π(D2) } ) = ( D1 ∩ D2 ) / ( D1 ∪ D2 )
So, to determine how similar D1 and D2 are without using a computationally-expensive distance function on the two, we can generate independent random permutations of each document. So, D1 is represented as, say, 100 sets of natural numbers { S1, S2, S3, ..., S100} where ( S1 ∪ S2 ∪ S3 ∪ ... ∪ S100 ) = D. For each set Sn, compute its minimum hash value, and store the set of minimum hash values. So, now D1 is represented as a set of minimum hash values { H1, H2, H3, ..., H100 }. By the relationship between minimum hash values and their original intersected document word sets detailed above, we can quickly compute the approximate similarity of D1 and D2 by computing the set intersection and set union of their minhash values.
Google News' Minhash-based personalization algorithms exploit this method in a slightly different way. Instead of computing the similarity of documents, Google News computes the similarity of news-article-view records. The direct corollary is:
Internet Document : Words :: Google News User : Google News Articles Viewed.
Brilliant. See the News Personalization Paper (pdf) for details.