Friday, July 3, 2009

I have not kept up on the blog -- but now that the Grand Prize Threshold has been pasted, it seems time to do an update.

Sometime back in April, I made a decent jump from a 0.93 score to 0.9142. This was result of switching from gradient hill climb (like I have been using for 2 1/2 decades) to the instantaneous update based on Simon Funk's blog. I still have some of my own tweaks in it, such as a dynamic stepping (aka learning) rate. With the gradient, I was using the "fit a quadratic curve in direction of gradient". With the instantaneous update, that is not possible and so I switched to a more simple version of change rate by a factor. Factor = 1.1 if it was a good step and 0.2 if it was a bad step.

Since then, I have been trying a number of things -- none of which have helped very much.

The thing I had the most hope for was a clustering algorithm of users.

I define the center of a cluster as the vector which is the essentially the average of the movie ratings for each user in the cluser. To give a bit of inertia, I also toss in the overall average over all users onto the sum of ratings before dividing by number of ratings + 1.

Then define the distance of any user from any cluster as the correlation coefficient of that user's ratings against the center of the cluster. Use that distance to adjust the cluster membership. Of course, once you adjust the cluster membership -- the center changes. Hence, rinse and repeat -- until moderately stable.

That seems to define reasonable clusters. The problem comes in what to do with them. I've tried adding on a correction score for members of a cluster -- no big help. I've tried using a separate (feature,movie) matrix for each cluster in an SVD like algorithm. That adds a lot of parameters, and doesn't help much. Perhaps someone else reading this will think of a better way of using the clusters thus formed and do something worth while with them.