progress on the netflix front
Mood: determined
Posted on 2006-10-11 08:39:00
Tags: netflixprize
Words: 493

So, although djedi's parents were here most of the last week, I managed to make some progress on the Netflix Prize. (probably at the cost of being a bit rude - sorry!) Most of the time I spent was building up a test framework so I could easily test rating schemes, so there wasn't a whole lot of note. The one big things was that I restored the data provided to us in a binary format, so it takes up 500 MB on disk instead of 2 GB, which means the loading time is down to 15-20 minutes instead of 40 minutes.

So the way it works is that there is a real set of ratings to be made that counts, and there is a "probe" set of ratings to be made. The "probe" ratings are all already known (they're included in the 500 MB of data), so it makes it easy to exclude that data, run your algorithm, and then see how good your data is. Last night was the first time I got to actually run some algorithms and see how they did on the probe data. The scoring method used is RMSE (root mean squared error), so lower is better. For comparison, Cinematch (Netflix's algorithm) scores 0.9514, if you get below 0.9419 (1% improvement) you can win $50,000, and if you get below 0.8563 (10% improvement) you can win $1,000,000. (here's the current leaderboard - note that someone qualifies to win $50,000 already!)

Just taking the average movie rating for all movies and applying that gives an RMSE of 1.13 or so. Taking the average movie rating for each movie and using that on a per-movie basis gives 1.05. The two other things I tried were first to take the average movie rating and average user rating and average them - this gives a modest improvement to 1.015. I then weighted them by the inverse of their standard deviation (since a higher standard deviation would mean there was more variability in that data and thus it would be less reliable), but that only improved it to 1.013.

So now I'm calculating the correlation between every pair of movies using the dot product. (I ran this overnight and it took around 7 hours, but I need to store the data in a different format so I started that before leaving for work). Once I have that data it should be fairly straightforward to apply that to all the other movies a user has rated and come up with a new rating. I think this might push me below 1.0, which would be really nice...and I might even submit that even though right now it needs to be below 0.9884 to make it on the leaderboard.

I also found a good paper that I'm reading through and I might try next, although it looks computationally really expensive.

Anyway, that's most of what's been on my mind lately. I like getting to play with raw data!


9 comments

Comment from djedi:
2006-10-11T17:21:22+00:00

Sounds interesting and tough. Stuff liek that is totally not my area. Maybe we should try to find some time to talk about it though. Man, it's sad that we might have to sort of schedule in time to talk about it. :P

Comment from gregstoll:
2006-10-11T17:54:17+00:00

Just to be clear, I'm happy to talk about it anytime. Even if you're not interested! (or present!)

Comment from abstractseaweed:
2006-10-11T20:23:43+00:00

I took a look at that challenge, and it's really intriguing. I've wondered about that type of computation before, like what Yahoo Launchcast uses to determine which songs you might like. I will have to add this to my list of math-type topics to explore.

My current mathnerd project involves a using a "Bees" algorithm to find the best Settlers of Katan strategy. Just starting it, but I totally know what you mean about having it on your mind.

Comment from gregstoll:
2006-10-11T21:43:25+00:00

Ooh, that sounds interesting! What's a "Bees" algorithm? (and does it apply to all Settlers strategy or settlement placement or when to upgrade to cities or...?)

Comment from abstractseaweed:
2006-10-11T22:15:37+00:00

It basically starts with random guesses as to the best solution, then generates new random guesses in the vicinity of the best ones plus some new completely random guesses, and in that way "hones in" on the best solution.

A couple links:
one
two

In my experiment I'm starting very very very simple, stripping out most of the complexity in Settlers. The goal is to choose the best place to build after the appropriate resources have been obtained to do so. At first there will be no upgrades to cities, no robbers, and no ports. I'll gradually add in complexity after I get the basics working.

Comment from dplass:
2006-10-19T21:12:06+00:00

(Full disclosure: I'm "working" on the Netflix prize too).

All I've done is some thinking (and loading the data into mysql)

What do you mean by "calculating the correlation between every pair of movies using the dot product"? Specifically, "correlation" and "Once I have that data it should be fairly straightforward to apply that to all the other movies a user has rated and come up with a new rating" - how?

If you don't want to reveal your thoughts/algorithms, that's cool.

Comment from gregstoll:
2006-10-19T23:25:01+00:00

Cool!

So there are many ways (from what I understand) that you can calculate the correlation between two movies. So let's say you have movies M1 and M2. I looked at all users U that have rated both M1 and M2. The correlation is roughly (U's rating of M1 - U's average rating)*(U's rating of M2 - U's average rating). If you think of those two terms as elements of two vectors V1 and V2, you're calculating V1 dot V2/(magnitude(V1)*magnitude(V2)), which is (in 2 dimensions at least) the cosine of the angle between them. That gives you a value between -1 and 1.

Now if you want to approximate what a user U would rate a movie M, just look at the correlation scores of M, and use that as weights to average U's ratings of other movies.

It's pretty straightforward, but I'm tired and a little hazy right now so my explanation doesn't seem clear even to me. If you're interested I can explain more over email :-)

Comment from dplass:
2006-10-20T07:55:44+00:00

I think I understand. But what i was thinking about more was to correlate the users and not the movies. In other words, "find users who are like me" and ask "how did *they* rate this movie?" It's collaborative filtering (see http://en.wikipedia.org/wiki/Collaborative_filtering) where the steps are:

1. Look for users who share the same rating patterns with the active user (the user who the prediction is for).
2. Use the ratings from those like-minded users found in step 1 to calculate a prediction for the active user

Comment from gregstoll:
2006-10-20T08:29:00+00:00

Ah, yup, that's what I'm working on now. It's kinda the same thing but reverse the users and movies, really...

This backup was done by LJBackup.