Netflix Prize Summary: Factorization Meets the Neighborhood

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

This is a summary of Koren’s 2008 Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model.

There are two approaches to collaborative filtering: neighborhood methods and latent factor models.

  • Neighborhood models are most effective at detecting very localized relationships (e.g., that people who like X-Men also like Spiderman), but poor at detecting a user’s overall signals.
  • Latent factor models are best at estimating overall structure (e.g., that a user likes horror movies), but are poor at detecting strong associations among small sets of closely related items.

Since the two approaches have complementary strengths and weaknesses, we should integrate the two; this integration is the focus of this paper.

As mentioned in previous papers, we should normalize out common effects from movies. Throughout the rest of this paper, Koren uses a baseline estimate of overall rating mean + user deviation from average + movie deviation from average for the rating of user i on movie i; estimation of the latter two parameters are done by solving a regularized least squares problem.

Koren then describes using a binary matrix (1 for rated, 0 for not rated) as a source of implicit feedback. This is useful because the mere fact that a user rated many science fiction movies (say) suggests that the user likes science fiction movies.

Recall the previous paper, where we modeled each rating $r_ui$ as

$$r_ui = b_ui+ sum_N in N(i; u) (r_uj – b_uj) w_ij,$$

where $N(i; u)$ is the k items most similar to i among the items user u rated, and the $w_ij$ are parameters to be learned by solving a regularized least squares problem.

This paper makes several enhancements to that model. First, we replace $N(i; u)$ with $R^k(i; u)$, the intersection of the k items most similar to i (among all items) intersected with the items user u rated. Also, we denote by $N^k(i; u)$ the intersection of the k items most similar to i with the items user u has provided implicit feedback for. This gives us

$$r_ui = b_ui + sum_j in R^k(i; u) (r_uj – b_uj) w_ij + sum_j in N^k(i; u) c_ij,$$

where the $c_ij$ are another set of parameters to learn.

Notice that by taking the intersection of the k items most similar to i with the items user u rated (giving perhaps a set of size less than k), rather than taking the k items most similar to i among the items user u rated, we let our model be influenced not only by what a user rates, but also by what a user does not rate. For example, if a user does not rate LOTR 1 or LOTR 2, his predicted rating for LOTR 3 is penalized.

This implies that our current model encourages greater deviations from baseline estimates for users that provided many ratings or plenty of implicit feedback. In other words, for well-modeled users with a lot of input, we are willing to predict quirkier and less common recommendations; users we have less information about, on the other hand, receive safer, baseline estimates.

Nonetheless, this dichotomy between power users and newbie users is perhaps overemphasized by our current model, so we moderate the dichotomy by modifying our model to be

$$r_ui = b_ui + |R^k(i; u)|^-0.5 sum_j in R^k(i; u) (r_uj – b_uj) w_ij + |N^k(i; u)|^-0.5 sum_j in N^k(i; u) c_ij.$$

Parameters are determined by solving a regularized least squares problem.

Typical SVD approaches are based on the following rule:

$$r_ui = b_ui + p_u^T q_i,$$

where $p_u$ is a user-factors vector and $q_i$ is an item-factors vector. We describe two enhancements.


One suggestion is to replace $p_u$ with

$$|R(u)|^-0.5 + sum_j in R(u) (r_uj – b_uj) x_j + |N(u)|^-0.5 sum_j in N(u) y_j,$$

where $R(u)$ is the set of items user u has rated, and $N(u)$ is the set of items user u has provided implicit feedback for. In other words, this model represents users through the items they prefer, rather than expressing users in a latent feature space. This model has several advantages:

  • Asymmetric-SVD does not parameterize users, so we do not need to wait to retrain the model when a user comes in. Instead, we can handle new users as soon as they provide feedback.
  • Predictions are a direct function of past feedback, so we can easily explain predictions. (When using a pure latent feature solution, however, explainability is difficult.)

As usual, parameters are learned via a regularized least-squares minimization.


Another approach is to continue modeling users as latent features, while adding implicit feedback. Thus, we replace $p_u$ with $p_u + |N(u)|^-0.5 sum_j in N(u) y_j$. While we lose the easily explainability and immediate feedback of the Asymmetric-SVD model, this approach is likely more accurate.

An integrated model incorporating baseline estimates, the neighborhood approach, and the latent factor approach is as follows:

$$r_ui = left[mu + b_u + b_iright] +left[q_i^T big(p_u + sqrtN(u)sum_j in N(u) y_j big)right] + left[sqrtR^k(i;u) sum_j in R^k(i; u)(r_uj – b_uj)w_ij+sqrt sum_j in N^k(i; u) c_ijright].$$

Note that we have used $(mu + b_u + b_i)$ as our baseline estimate. We also used the SVD++ model, but we could use the Asymmetric-SVD model instead.

This rule provides a 3-tier model for recommendations:

  • The first baseline group describes general properties of the item and user. For example, it may say that “The Sixth Sense” movie is known to be a good movie in general, and that Joe rates like the average user.
  • The next latent factor group may say that since “The Sixth Sense” and Joe rate high on the Psychological Thrillers Scale, Joe may like The Sixth Sense because he likes this genre of movies in general.
  • The final neighborhood tier makes fine-grained adjustments that are hard to file, such as the fact that Joe rated low the movie “Signs”, a similar psychological thriller by the same director.

As usual, model parameters are determined by minimizing the regularized squared error function through gradient descent.