Thursday, September 27, 2007

Min-Wise Independent Permutations

I recently took some time to read through Min-Wise Independent Permutations (pdf), by Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. It was referenced by the Google News Personalization paper (see my previous post here) as one of the fundamental principles behind one of their large-scale personalization algorithms.

The paper spends most of its time defining upper and lower bounds on families of min-wise independent permutations, but the introduction and first couple sections are very interesting and give their motivation: detecting duplicate documents in the AltaVista index.

First, their definition of min-wise independent permutations:
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ Sn is min-wise-independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have

Pr
( min{ π(X) } = π(x) ) = 1 / |X|

In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π.
In other words, given a permutation, every element has the same probability of being its subset's smallest element.

Why is this an interesting and important class of permutations? Suppose you have a large set of documents, and you want to figure out which ones are duplicates (say, you're AltaVista, among the first to look at web-scale datasets, and want to get rid of all the identical documents in your index put up by spammers). Here's a sketch of how AltaVista attacked it:

Represent a document D by a bag of words, and represent those words by natural numbers (say, 'an' = 1, 'the' = 2, 'supercalifragilisticexpialidocious' = 3, ...). Then, a document D can be represented as a set of natural numbers: D ⊆ N. For two documents, D1 and D2, their similarity can be measured as the size of their set intersection divided by their set union:
( D1 ∩ D2 ) / ( D1 ∪ D2 )
Though it isn't demonstrated in the paper, they then claim that:
Pr( min{ π(D1) } = min{ π(D2) } ) = ( D1 ∩ D2 ) / ( D1 ∪ D2 )
That is, the probability of equal minimum hash values of two sets equals the similary measure specified above.

So, to determine how similar D1 and D2 are without using a computationally-expensive distance function on the two, we can generate independent random permutations of each document. So, D1 is represented as, say, 100 sets of natural numbers { S1, S2, S3, ..., S100} where ( S1 ∪ S2 ∪ S3 ∪ ... ∪ S100 ) = D. For each set Sn, compute its minimum hash value, and store the set of minimum hash values. So, now D1 is represented as a set of minimum hash values { H1, H2, H3, ..., H100 }. By the relationship between minimum hash values and their original intersected document word sets detailed above, we can quickly compute the approximate similarity of D1 and D2 by computing the set intersection and set union of their minhash values.

Google News' Minhash-based personalization algorithms exploit this method in a slightly different way. Instead of computing the similarity of documents, Google News computes the similarity of news-article-view records. The direct corollary is:

Internet Document : Words :: Google News User : Google News Articles Viewed.

Brilliant. See the News Personalization Paper (pdf) for details.

Saturday, September 22, 2007

Compressed Sensing

Via a link posted by Jeff Hammerbacher, an interesting guy from Facebook, comes my current fascination with Compressive Sensing.

Compressive Sensing is a subfield of remote sensing study that attempts to resolve some of the difficult issues that arise when distributing a large number of small, failure-prone, and power-limited sensing devices. Suppose you have 1000 sensors, which need to work on a single battery for several years, and of which the expected failure rate is 20%. How to do that and get useable data at the end of the survey without breaking the bank is a difficult question.

Compressive Sensing then enters the picture. Taking queues from post-data-collection lossy compression, such as JPEG, which substantially reduces the amount of information required to faithfully represent the original information, Compressive Sensing asks "If we're going to throw all that data away in the end, after collecting the data, could we simply not collect the data to begin with?," which is tantamount to asking "Can we only record what will be important in the end?"

Of course, intuitively if you know what's going to be important in the end, there's really not that much point in recording it. We don't record things that we already understand, so how can you determine ahead of time what will be needed? It seems to go against all fundamentally held conceptions in data mining -- collect EVERYTHING, figure out what's important later.

Compressed Sensing, however, makes the incredibly counterintuitive leap that one should actually record only a completely random subset of the data random linear combination of the data (see update below), not paying any attention to the question of what will be important in the end. Linear algebra will come to your rescue and reveal anything you want to know by discerning its tertiary effects on what was actually collected.

Some useful links:

How does this relate to Machine Learning, and why am I interested in it here? Well, partially it's just mad cool, and uses some incredible mathematics, plus I'm in love with Terence and drip over his every word. But also, perhaps more legitimately, Compressive Sensing has many interesting applications to dimensionality reduction. Just as one can represent an image by recording a completely random subset of the image, one can represent a high-dimensional feature space by taking completely random projections. That set of projections can then be used in your favorite algorithms, and Compressive Sensing comes to the rescue at the end when it's time to re-interpret the algorithm's output.

Igor Carron of Machine Learning (Theory) gives a nice overview of its potential impact: http://hunch.net/?p=273

Update: Igor Carron points out in the comments below that it's not a completely random subset of the data that's needed, but specifically a projection of the source data with a basis incoherent with that of the source data. A set of random linear projections turns out to do the trick.

Saturday, September 8, 2007

Stalled Netflix work

I started looking at working on the Netflix dataset today, but got stalled when trying to load the data into a MySQL database.  Python doesn't have built-in database connections, and I didn't relish the idea of spending my Saturday wading around in unix path hell.  Another approach, perhaps, tomorrow.

Sunday, September 2, 2007

Netflix Prize readings.

A pretty interesting set of blog articles from people participating in the Netflix Prize:
  • The Netflix Prize: 300 Days Later - Is it even possible to reach a root-mean-squared-error of 0.85-ish on the dataset they provided? How much better for the customers is that than just taking the average rating for the movie? Not too much, and we may need a lot more information about the users than their simple rating history to make further improvements. Should we be investing so much hope in Recommender Systems? Perhaps not, and it could even be dangerous.
  • Netflix Update: Try This at Home - Some simple approaches to the dataset by reducing movies and users to engagement with an arbitrary number of attributes. Like action movies, and hate chick flicks? You might like movies that are very action-y and very much not chick-flick-y.
  • Netflix Prize Results and Source Code - An account of the author's efforts in the prize. Lots of references to interesting subjects at the bottom. He brings up the idea of clustering the movies as a potential boost to the recommendations, which brings it in line with the Netflix Update article above. Seems reasonable to me, if you like romantic comedies a lot, you'll probably rate romantic comedies pretty highly, provided that other people who like romantic comedies rate it highly as well. Yay, I've invented Naive-Bayes!
Now I'm intrigued. Bad news for a person with no free time.