Friday, August 31, 2007

MinHash Clustering in Google News Personalization

I've begun looking at the Google News Personalization paper that was presented at the 2007 WWW conference. It evaluates two algorithms for web-scale clustering of users, for use in other collaborative filtering systems: MinHash and Probabilistic Latent Semantic Indexing (PLSI).

You can find the program notes here: http://www2007.org/paper570.php
and (linked to from the program notes) a PDF version of their paper here: http://www2007.org/papers/paper570.pdf

I find MinHash counterintuitive and confusing, but I'm looking at it more closely now. Some of the resources I've found so far are:
I'll try to read through all of these papers and summarize what MinHash is and how suitable it might be for feature spaces of very high dimensionality, which is the problem space I happen to be working in at present.