Thursday, September 27, 2007

Min-Wise Independent Permutations

I recently took some time to read through Min-Wise Independent Permutations (pdf), by Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. It was referenced by the Google News Personalization paper (see my previous post here) as one of the fundamental principles behind one of their large-scale personalization algorithms.

The paper spends most of its time defining upper and lower bounds on families of min-wise independent permutations, but the introduction and first couple sections are very interesting and give their motivation: detecting duplicate documents in the AltaVista index.

First, their definition of min-wise independent permutations:
We define and study the notion of min-wise independent families of permutations. We say that F ⊆ Sn is min-wise-independent if for any set X ⊆ [n] and any x ∈ X, when π is chosen at random in F we have

Pr
( min{ π(X) } = π(x) ) = 1 / |X|

In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of X under π.
In other words, given a permutation, every element has the same probability of being its subset's smallest element.

Why is this an interesting and important class of permutations? Suppose you have a large set of documents, and you want to figure out which ones are duplicates (say, you're AltaVista, among the first to look at web-scale datasets, and want to get rid of all the identical documents in your index put up by spammers). Here's a sketch of how AltaVista attacked it:

Represent a document D by a bag of words, and represent those words by natural numbers (say, 'an' = 1, 'the' = 2, 'supercalifragilisticexpialidocious' = 3, ...). Then, a document D can be represented as a set of natural numbers: D ⊆ N. For two documents, D1 and D2, their similarity can be measured as the size of their set intersection divided by their set union:
( D1 ∩ D2 ) / ( D1 ∪ D2 )
Though it isn't demonstrated in the paper, they then claim that:
Pr( min{ π(D1) } = min{ π(D2) } ) = ( D1 ∩ D2 ) / ( D1 ∪ D2 )
That is, the probability of equal minimum hash values of two sets equals the similary measure specified above.

So, to determine how similar D1 and D2 are without using a computationally-expensive distance function on the two, we can generate independent random permutations of each document. So, D1 is represented as, say, 100 sets of natural numbers { S1, S2, S3, ..., S100} where ( S1 ∪ S2 ∪ S3 ∪ ... ∪ S100 ) = D. For each set Sn, compute its minimum hash value, and store the set of minimum hash values. So, now D1 is represented as a set of minimum hash values { H1, H2, H3, ..., H100 }. By the relationship between minimum hash values and their original intersected document word sets detailed above, we can quickly compute the approximate similarity of D1 and D2 by computing the set intersection and set union of their minhash values.

Google News' Minhash-based personalization algorithms exploit this method in a slightly different way. Instead of computing the similarity of documents, Google News computes the similarity of news-article-view records. The direct corollary is:

Internet Document : Words :: Google News User : Google News Articles Viewed.

Brilliant. See the News Personalization Paper (pdf) for details.

Saturday, September 22, 2007

Compressed Sensing

Via a link posted by Jeff Hammerbacher, an interesting guy from Facebook, comes my current fascination with Compressive Sensing.

Compressive Sensing is a subfield of remote sensing study that attempts to resolve some of the difficult issues that arise when distributing a large number of small, failure-prone, and power-limited sensing devices. Suppose you have 1000 sensors, which need to work on a single battery for several years, and of which the expected failure rate is 20%. How to do that and get useable data at the end of the survey without breaking the bank is a difficult question.

Compressive Sensing then enters the picture. Taking queues from post-data-collection lossy compression, such as JPEG, which substantially reduces the amount of information required to faithfully represent the original information, Compressive Sensing asks "If we're going to throw all that data away in the end, after collecting the data, could we simply not collect the data to begin with?," which is tantamount to asking "Can we only record what will be important in the end?"

Of course, intuitively if you know what's going to be important in the end, there's really not that much point in recording it. We don't record things that we already understand, so how can you determine ahead of time what will be needed? It seems to go against all fundamentally held conceptions in data mining -- collect EVERYTHING, figure out what's important later.

Compressed Sensing, however, makes the incredibly counterintuitive leap that one should actually record only a completely random subset of the data random linear combination of the data (see update below), not paying any attention to the question of what will be important in the end. Linear algebra will come to your rescue and reveal anything you want to know by discerning its tertiary effects on what was actually collected.

Some useful links:

How does this relate to Machine Learning, and why am I interested in it here? Well, partially it's just mad cool, and uses some incredible mathematics, plus I'm in love with Terence and drip over his every word. But also, perhaps more legitimately, Compressive Sensing has many interesting applications to dimensionality reduction. Just as one can represent an image by recording a completely random subset of the image, one can represent a high-dimensional feature space by taking completely random projections. That set of projections can then be used in your favorite algorithms, and Compressive Sensing comes to the rescue at the end when it's time to re-interpret the algorithm's output.

Igor Carron of Machine Learning (Theory) gives a nice overview of its potential impact: http://hunch.net/?p=273

Update: Igor Carron points out in the comments below that it's not a completely random subset of the data that's needed, but specifically a projection of the source data with a basis incoherent with that of the source data. A set of random linear projections turns out to do the trick.

Saturday, September 8, 2007

Stalled Netflix work

I started looking at working on the Netflix dataset today, but got stalled when trying to load the data into a MySQL database.  Python doesn't have built-in database connections, and I didn't relish the idea of spending my Saturday wading around in unix path hell.  Another approach, perhaps, tomorrow.

Sunday, September 2, 2007

Netflix Prize readings.

A pretty interesting set of blog articles from people participating in the Netflix Prize:
  • The Netflix Prize: 300 Days Later - Is it even possible to reach a root-mean-squared-error of 0.85-ish on the dataset they provided? How much better for the customers is that than just taking the average rating for the movie? Not too much, and we may need a lot more information about the users than their simple rating history to make further improvements. Should we be investing so much hope in Recommender Systems? Perhaps not, and it could even be dangerous.
  • Netflix Update: Try This at Home - Some simple approaches to the dataset by reducing movies and users to engagement with an arbitrary number of attributes. Like action movies, and hate chick flicks? You might like movies that are very action-y and very much not chick-flick-y.
  • Netflix Prize Results and Source Code - An account of the author's efforts in the prize. Lots of references to interesting subjects at the bottom. He brings up the idea of clustering the movies as a potential boost to the recommendations, which brings it in line with the Netflix Update article above. Seems reasonable to me, if you like romantic comedies a lot, you'll probably rate romantic comedies pretty highly, provided that other people who like romantic comedies rate it highly as well. Yay, I've invented Naive-Bayes!
Now I'm intrigued. Bad news for a person with no free time.

Friday, August 31, 2007

MinHash Clustering in Google News Personalization

I've begun looking at the Google News Personalization paper that was presented at the 2007 WWW conference. There are two algorithms evaluated for web-scale clustering of users for use in other collaborative filtering systems: MinHash and Probabilistic Latent Semantic Indexing.

You can find the program notes here: http://www2007.org/paper570.php
and (linked to from the program notes) a PDF version of their paper here: http://www2007.org/papers/paper570.pdf

I find MinHash counterintuitive and confusing, but I'm looking at it more closely now. Some of the resources I've found so far are:
I'll try to look through all these papers and give a summary of what MinHash is, and how suitable it might be for feature spaces of very large dimensionality, which is a problem space that I happen to be working in at present.

Friday, June 29, 2007

Stats 202: Statistical Aspects of Data Mining

There's a new lecture series that began at Google last Tuesday mirroring one taught at Stanford.  Statistics 202: Statistical Aspects of Data Mining.  You can find the course website at http://www.stats202.com/, and videos of the lectures on Google Video with the search [Statistical Aspects of Data Mining].  I highly recommend following the course (there will be new videos posted frequently, soon after each lecture on Tuesdays and Fridays).

Saturday, June 23, 2007

To begin...

At first, I thought I would begin this blog by defining, and then quickly summarizing, machine learning. After some thought and poor starts on that first post, though, I've found that the concept's so vague and broad that it defies a clear definition. Here's my best shot though, a quick and dirty attempt:

Applications of Machine Learning, or Statistical Learning (SL) as I'll refer to it in these first few posts, are generally attempts to apply statistical models to large datasets in an attempt to draw out patterns that are too complex to be represented algorithmically, and use that model to properly interpret new information.


There's a curve that I rode when I started learning about Statistical Learning: first, I had some difficulty understanding how one could possibly use a statistical model to predict something that seems far too random or subtle for direct modeling. Second, I began to hopefully suspect that SL could be used to solve almost every computational problem in the world given enough data. Third, and currently, I've realized that the world and its processes are generally far more complex than we'll be able to model, even statistically, for a very, very long time.

This blog will hopefully take the readers along that path as well, or at least chronicle my experiences along the way. Expect disillusionment, and a fairly thorough explanation of how the entire field shouldn't be called a field, can't be termed a study, but is instead an amusing collection of tricks pulled from a statistician's hat -- tricks that are incredibly useful, and in some cases far more effective than any alternative, but simple tricks nonetheless.

Anything I discuss in detail will have the clearest explanation of any background knowledge that I can provide, though I will assume a rudimentary knowledge of statistics and programming. Anywhere that ability in those two fields is necessary, I'll provide links to useful internet articles or books that have helped me.

I'll also frequently post links to papers and talks from the field, with some minor comments, though the reader is left the difficult task of getting any background necessary to read them.

In the next few posts, a vague discussion of how one can design a Statistical Learning system by self-examination, and how SL can make an amusing, though poor approximation to a master author.