Thursday, September 27, 2007

Min-Wise Independent Permutations

I recently took some time to read through Min-Wise Independent Permutations (pdf), by Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. It was referenced by the Google News Personalization paper (see my previous post here) as one of the fundamental principles behind one of their large-scale personalization algorithms.

The paper spends most of its time defining upper and lower bounds on families of min-wise independent permutations, but the introduction and first couple sections are very interesting and give their motivation: detecting duplicate documents in the AltaVista index.

First, their definition of min-wise independent permutations:
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ Sn is min-wise-independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have

Pr
( min{ π(X) } = π(x) ) = 1 / |X|

In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π.
In other words, given a permutation, every element has the same probability of being its subset's smallest element.

Why is this an interesting and important class of permutations? Suppose you have a large set of documents, and you want to figure out which ones are duplicates (say, you're AltaVista, among the first to look at web-scale datasets, and want to get rid of all the identical documents in your index put up by spammers). Here's a sketch of how AltaVista attacked it:

Represent a document D by a bag of words, and represent those words by natural numbers (say, 'an' = 1, 'the' = 2, 'supercalifragilisticexpialidocious' = 3, ...). Then, a document D can be represented as a set of natural numbers: D ⊆ N. For two documents, D1 and D2, their similarity can be measured as the size of their set intersection divided by their set union:
( D1 ∩ D2 ) / ( D1 ∪ D2 )
Though it isn't demonstrated in the paper, they then claim that:
Pr( min{ π(D1) } = min{ π(D2) } ) = ( D1 ∩ D2 ) / ( D1 ∪ D2 )
That is, the probability of equal minimum hash values of two sets equals the similary measure specified above.

So, to determine how similar D1 and D2 are without using a computationally-expensive distance function on the two, we can generate independent random permutations of each document. So, D1 is represented as, say, 100 sets of natural numbers { S1, S2, S3, ..., S100} where ( S1 ∪ S2 ∪ S3 ∪ ... ∪ S100 ) = D. For each set Sn, compute its minimum hash value, and store the set of minimum hash values. So, now D1 is represented as a set of minimum hash values { H1, H2, H3, ..., H100 }. By the relationship between minimum hash values and their original intersected document word sets detailed above, we can quickly compute the approximate similarity of D1 and D2 by computing the set intersection and set union of their minhash values.

Google News' Minhash-based personalization algorithms exploit this method in a slightly different way. Instead of computing the similarity of documents, Google News computes the similarity of news-article-view records. The direct corollary is:

Internet Document : Words :: Google News User : Google News Articles Viewed.

Brilliant. See the News Personalization Paper (pdf) for details.

1 comments:

Anonymous said...

I don't know how choosing permutations «at random» works out. There are many kinds of random!

As far as constructing min-wise independent permutations, I've found this implementation, I'm not sure what construction it uses; and this paper that gives a construction algorithm, referencing Brodner's method and Indyk's method. I am interested in more.