Friday, July 3, 2009

I have not kept up with the blog, but now that the Grand Prize Threshold has been passed, it seems time to do an update.

Sometime back in April, I made a decent jump from a 0.93 score to 0.9142. This was the result of switching from a gradient hill climb (the kind I have been using for 2 1/2 decades) to the instantaneous update based on Simon Funk's blog. I still have some of my own tweaks in it, such as a dynamic stepping (aka learning) rate. With the gradient, I was using the "fit a quadratic curve in the direction of the gradient" approach. With the instantaneous update that is not possible, so I switched to a simpler scheme of changing the rate by a factor: 1.1 after a good step and 0.2 after a bad step.
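For anyone curious what that looks like in code, here is a minimal sketch of a Funk-style instantaneous update for a single feature, with that multiplicative rate adjustment. The variable names, the regularization term, and using the training rmse to decide whether an epoch was a good or bad step are illustrative assumptions, not the exact code I run.

    def train_feature(ratings, user_f, movie_f, lrate=0.001, reg=0.015, epochs=50):
        """ratings: list of (user, movie, residual) triples for this feature.
        user_f, movie_f: single-feature values keyed by user / movie id."""
        best_rmse = float("inf")
        for _ in range(epochs):
            sq_err = 0.0
            for u, m, r in ratings:
                err = r - user_f[u] * movie_f[m]             # instantaneous prediction error
                sq_err += err * err
                uf, mf = user_f[u], movie_f[m]
                user_f[u] += lrate * (err * mf - reg * uf)   # update both factors right away
                movie_f[m] += lrate * (err * uf - reg * mf)
            rmse = (sq_err / len(ratings)) ** 0.5
            # dynamic stepping rate: grow it after a good step, shrink it hard after a bad one
            lrate = lrate * 1.1 if rmse < best_rmse else lrate * 0.2
            best_rmse = min(best_rmse, rmse)
        return user_f, movie_f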

Since then, I have been trying a number of things -- none of which have helped very much.

The thing I had the most hope for was a clustering algorithm of users.

I define the center of a cluster as the vector which is essentially the average of the movie ratings over the users in the cluster. To give it a bit of inertia, I also toss the overall average over all users onto the sum of ratings before dividing by the number of ratings + 1.

Then define the distance of any user from any cluster as the correlation coefficient of that user's ratings against the center of the cluster. Use that distance to adjust the cluster memberships. Of course, once you adjust the memberships, the centers change. Hence, rinse and repeat until things are moderately stable.
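Roughly, the loop looks like the sketch below, treating a higher correlation as "closer." The random initialization, the number of clusters, the iteration count, and the data layout (a dict of user -> {movie: rating}) are illustrative assumptions, not the exact code I run.

    import random

    def correlation(ratings, center):
        """Pearson correlation of a user's ratings against a cluster center,
        taken over the movies that user has rated."""
        xs = [ratings[m] for m in ratings]
        ys = [center[m] for m in ratings]
        n = len(xs)
        if n < 2:
            return 0.0
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy) if vx > 0 and vy > 0 else 0.0

    def cluster_users(user_ratings, n_clusters=20, n_iters=10):
        movies = sorted({m for r in user_ratings.values() for m in r})
        # per-movie average over all users, used as the "inertia" term in each center
        overall = {m: 0.0 for m in movies}
        counts = {m: 0 for m in movies}
        for r in user_ratings.values():
            for m, v in r.items():
                overall[m] += v
                counts[m] += 1
        overall = {m: overall[m] / counts[m] for m in movies}

        membership = {u: random.randrange(n_clusters) for u in user_ratings}
        for _ in range(n_iters):
            # center = (sum of member ratings + overall average) / (number of ratings + 1)
            centers = []
            for c in range(n_clusters):
                sums = dict(overall)            # the overall average acts as one extra rating
                cnts = {m: 1 for m in movies}   # ...so each count starts at 1
                for u, r in user_ratings.items():
                    if membership[u] != c:
                        continue
                    for m, v in r.items():
                        sums[m] += v
                        cnts[m] += 1
                centers.append({m: sums[m] / cnts[m] for m in movies})
            # reassign each user to the cluster whose center correlates best with their ratings
            for u, r in user_ratings.items():
                membership[u] = max(range(n_clusters),
                                    key=lambda c: correlation(r, centers[c]))
        return membership, centers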

That seems to define reasonable clusters. The problem comes in what to do with them. I've tried adding a correction score for members of a cluster: no big help. I've tried using a separate (feature, movie) matrix for each cluster in an SVD-like algorithm. That adds a lot of parameters and doesn't help much. Perhaps someone else reading this will think of a better way of using the clusters thus formed and do something worthwhile with them.

Friday, February 13, 2009

I have managed to do a bit better. Instead of just over the threshold, I have moved up to an rmse score of 0.9308. This run used 21 features, climbing one feature at a time, with an eye on the probe rmse to prevent too much overtraining.
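Roughly, the outer loop looks like this. The train_epoch and probe_rmse callables are hypothetical stand-ins for the real training pass and the held-out probe evaluation, and the stopping rule shown is only an approximation of what I actually do.

    def train_all_features(model, train_epoch, probe_rmse, n_features=21, max_epochs=200):
        for f in range(n_features):
            best = float("inf")
            for _ in range(max_epochs):
                train_epoch(model, feature=f)    # one training pass for this feature
                current = probe_rmse(model)      # rmse on the probe set
                if current >= best:              # probe stopped improving: freeze this feature
                    break
                best = current
        return model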

Back to the think pad now to see what else to try.

Saturday, January 3, 2009

Netflix contest, on the leader board

This is a simple blog to record any thoughts on the netflix contest. FYI, I am working on this in my free time only. I am a retired mathematician and do part time consulting work in information security through the team of Information Security Systems Inc (ISSI).
My first entry that got past the minimal threshold and put me onto the leader board, at spot 1343 or so, is based on a simple SVD model with 12 features plus four linear terms which measure global effects for movies, customers, time and gross average.
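For concreteness, the prediction that model makes looks roughly like the sketch below. The names and the way the time effect is looked up by date are illustrative assumptions, not the exact code.

    def predict(u, m, date, gross_avg, movie_eff, cust_eff, time_eff, user_f, movie_f):
        # four global-effect terms plus a 12-feature dot product
        baseline = gross_avg + movie_eff[m] + cust_eff[u] + time_eff[date]
        return baseline + sum(user_f[u][f] * movie_f[m][f] for f in range(12))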
I have some other ideas that I'll be trying from time to time.
I've made a few observations about the data on the netflix forum, and will continue posting there.