This project explores how matrix factorization (MF), enhanced with side information, can improve movie recommendations from sparse rating data. We trained an MF model that learns latent user 👤 and item 🎬 representations, guided by metadata such as genres 🎞️ and release years 📅.
Our best configuration reached an RMSE of ≈ 0.86, and structured hyperparameter tuning followed by an ablation study revealed that:
- 👍 Removing side features degrades performance (+0.22 RMSE), confirming that metadata encode structure the latent factors alone cannot capture.
- 👍 Genre and year features are complementary, each adding small but consistent gains.
- 👎 Graph regularization and popularity-aware shrinkage show negligible impact in this setup.
While none of these effects reach statistical significance (given the limited number of folds), the findings suggest that simple, well-regularized models can remain interpretable, stable, and competitive. Beyond accuracy, the focus is on building a clean, reproducible workflow designed to serve as a solid baseline for future recommender projects.
This project follows a four-stage pipeline:
1️⃣ Create & freeze folds - build reproducible train/validation splits
2️⃣ Prepare & normalize features - process metadata such as genres or years
3️⃣ Tune hyperparameters - run Optuna to find the best configuration
4️⃣ Evaluate & ablate - benchmark and interpret model variants
To reproduce everything end-to-end, simply run:
```bash
bash scripts/run_all.sh
```
This script executes all four stages in sequence, reproducing the full experiment with the default configuration.
Each module can also be executed independently with your own parameters:
1. Create folds
```bash
python -m scripts.create_folds \
    --input data/ratings.npy \
    --output artifacts/folds/entrywise_folds_seed_<SEED>.npz \
    --n_splits <K> \
    --seed <SEED>
```
2. Prepare features
```bash
python -m scripts.prepare_features \
    --genres_path data/genres.npy \
    --years_path data/years.npy \
    --genres_norm_method <row_l1|col_zscore|none> \
    --years_norm_method <row_l1|col_zscore|none> \
    --output_path data/features.npz
```
3. Tune hyperparameters
```bash
python -m scripts.tune_params \
    --R_path data/ratings.npy \
    --folds_path artifacts/folds/entrywise_folds_seed_<SEED>.npz \
    --features_path data/features.npz \
    --study_name <study_name> \
    --n_trials <N> \
    --seed <SEED>
```
4. Evaluate or ablate models
```bash
python -m scripts.evaluate_models \
    --R_path data/ratings.npy \
    --folds_path artifacts/folds/entrywise_folds_seed_<SEED>.npz \
    --best_params_path results/tuning/<study_name>_best_params.json \
    --features_path data/features.npz \
    --out_dir results/ablations \
    --n_pop_bins <N_BINS>
```
Each of the corresponding scripts is (hopefully) well-documented, so feel free to check their internal docstrings and comments if you have questions or want to adapt them to your own setup.
Imagine building a recommendation system that predicts what movies you might like based on your past ratings and those of similar users. Intuitively, this involves filling in a large matrix where rows represent users, columns represent movies, and the entries are ratings. However, most of this matrix is empty because users only rate a small fraction of movies. So, how do we predict those missing ratings knowing that we have very little data for many users and items?
This is a classic problem in data science known as collaborative filtering. It looks for patterns in user behavior and item characteristics to infer preferences, even when direct ratings are sparse. This repository contains a project that implements this kind of model, using matrix factorization techniques to predict user–item ratings in a sparse matrix.
We aim to predict unobserved user–item ratings in a sparse matrix $R \in \mathbb{R}^{m \times n}$, of which only a small subset of entries $\Omega$ is observed. The model must generalize to unseen user–item pairs.

We measure predictive quality via the Root Mean Squared Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(u,i) \in \Omega} \left( r_{u,i} - \hat{r}_{u,i} \right)^2}$$

Our final model represents the predicted rating as

$$\hat{r}_{u,i} = \mu + b_u + b_i + U_u^\top \left( V_i + \sum_f W_f^\top x_{i,f} \right)$$

where:

- $U \in \mathbb{R}^{m \times k}$ : user latent factors
- $V \in \mathbb{R}^{n \times k}$ : item latent factors
- $x_{i,f}$ : feature vector for item $i$ in feature group $f$ (e.g. genre, year)
- $W_f \in \mathbb{R}^{d_f \times k}$ : projection from feature space to latent space
- $\mu$ : global mean
- $b_u, b_i$ : user and item biases
- $k$ : latent dimension
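To make the prediction rule concrete, here is a minimal NumPy sketch of how a single rating could be computed under this parameterization (function and array names are illustrative, not the project's actual API):

```python
import numpy as np

def predict_rating(u, i, mu, b_user, b_item, U, V, W, item_features):
    """r_hat[u, i] = mu + b_u + b_i + <U_u, V_i + sum_f W_f^T x_{i,f}>.

    U: (m, k) user factors; V: (n, k) item factors.
    W: dict mapping feature group name -> (d_f, k) projection matrix.
    item_features: dict mapping feature group name -> (n, d_f) feature matrix.
    """
    item_vec = V[i].copy()
    for f, W_f in W.items():
        item_vec += item_features[f][i] @ W_f   # project x_{i,f} into the latent space
    return mu + b_user[u] + b_item[i] + U[u] @ item_vec
```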
👉 Details of different regularization terms used in the loss function
(click to expand)
The loss minimized over observed ratings is enriched with several regularization terms to incorporate side information. Some of these regularizations are optional and can be deactivated, but the most complete formula of the loss function is:

$$\mathcal{L} = \sum_{(u,i) \in \Omega} \left( r_{u,i} - \hat{r}_{u,i} \right)^2 + \textcolor{teal}{\lambda_u \|U\|_F^2 + \sum_i \lambda_{v,i} \|V_i\|^2} + \textcolor{olive}{\sum_f \lambda_{w_f} \|W_f\|_F^2} + \textcolor{brown}{\lambda_{b_u} \sum_u b_u^2 + \lambda_{b_i} \sum_i b_i^2} + \textcolor{purple}{\alpha \, \mathrm{Tr}(V^\top L V)}$$

where:
- $\large \textcolor{teal}{\text{User or item shrinkage } (\lambda_u, \lambda_{v,i}):}$ This is the classic $L_2$ penalty that prevents latent factors from growing arbitrarily large. It ensures that users and items with few observations don't overfit their small amount of data. In our case, the item regularization $\lambda_{v,i}$ can optionally be scaled inversely with the item's popularity (we trust popular items more): $\lambda_{v,i} = \frac{\lambda_v}{\sqrt{c_i + 1}},\quad c_i = |\{u : (u,i)\in\Omega\}|$.
- $\large \textcolor{olive}{\text{Feature projections } (\lambda_{w_f}):}$ Each $W_f$ projects an item's raw feature vector (e.g. genre, year) into the same latent space as $V_i$. The corresponding regularization penalizes large deviations, acting like a prior that discourages the feature mappings from dominating the latent representation. Intuitively, it keeps the learned feature influence "gentle" rather than letting side information completely override collaborative patterns.
- $\large\textcolor{brown}{\text{Bias regularization } (\lambda_{b_u}, \lambda_{b_i}):}$ These terms keep user and item biases (i.e., consistent rating offsets) from absorbing too much variance. Without this constraint, biases could fit individual noise, especially for users who rate few items. In practice, bias regularization improves the stability and convergence speed of the ALS updates.
- $\large \textcolor{purple}{\text{Item similarity } (\alpha \, \mathrm{Tr}(V^\top L V)):}$ This is the graph Laplacian regularization, which enforces that similar items (according to metadata) should have similar embeddings. It minimizes the smoothness term $\mathrm{Tr}(V^\top L V) = \frac{1}{2} \sum_{i,j} S_{ij}\|V_i - V_j\|^2$, where $S$ encodes cosine similarity between items (e.g. shared genres) and $L$ is its graph Laplacian. In practice, it acts like a "soft constraint" that pulls neighboring movies closer in latent space, which improves cold-item generalization when side metadata are available.
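As an illustration, here is a small, dense NumPy sketch of how such a similarity graph and its Laplacian could be built from genre features; the `topk` and `eps` arguments mirror the `S_topk` and `S_eps` options, but the function names are hypothetical and the project code may construct the graph differently:

```python
import numpy as np

def build_graph_laplacian(X, topk=20, eps=1e-6):
    """Cosine-similarity graph over items from a feature matrix X (n_items, d),
    sparsified to the top-k neighbours per item, then turned into a Laplacian."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = Xn @ Xn.T                      # cosine similarities between items
    np.fill_diagonal(S, 0.0)
    S[S < eps] = 0.0                   # epsilon cutoff on weak edges
    for i in range(S.shape[0]):        # keep only the top-k strongest edges per item
        drop = np.argsort(S[i])[:-topk]
        S[i, drop] = 0.0
    S = np.maximum(S, S.T)             # symmetrize
    L = np.diag(S.sum(axis=1)) - S     # unnormalized graph Laplacian
    return S, L

def smoothness(V, L):
    """Tr(V^T L V) = 1/2 * sum_ij S_ij ||V_i - V_j||^2."""
    return np.trace(V.T @ L @ V)
```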
Even though the global objective is non-convex, it becomes convex with respect to each parameter group when the others are fixed. This property enables an Alternating Least Squares (ALS) algorithm: at each step, one block of parameters is updated while keeping the rest constant, and each subproblem has a closed-form least-squares solution.
In simple terms:
- We alternate between updating users, items, feature projections, and biases.
- Each update “refits” that part of the model to the residuals left by the others.
- This process quickly drives the model toward a stationary point of the objective.
The updates repeat for a fixed number of iterations, but an early-stopping mechanism is applied to halt training once the validation RMSE stops improving, avoiding overfitting and saving computation.
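A minimal sketch of what such a validation-based stopping rule could look like, using the `es_tol` and `es_min_iters` settings mentioned in the tuning section (the exact criterion in the training code may differ):

```python
def should_stop(val_rmse_history, es_tol=1e-4, es_min_iters=10):
    """Stop once the latest improvement in validation RMSE drops below es_tol,
    but only after a minimum number of ALS iterations have run."""
    if len(val_rmse_history) < max(es_min_iters, 2):
        return False
    return (val_rmse_history[-2] - val_rmse_history[-1]) < es_tol
```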
👉 Detailed update equations
(click to expand)
At each iteration, we update each parameter block in turn, solving the corresponding normal equations with a Cholesky decomposition for efficiency. Each block update corresponds to solving a small, independent least-squares problem. Since all updates have closed-form solutions, the method avoids costly gradient computations, making ALS both stable and interpretable.
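As an illustration of the block-update pattern, the sketch below shows a simplified user-factor update with a Cholesky solve. It deliberately ignores the feature, graph, and popularity terms, so it demonstrates the general ALS step rather than the project's exact update equations:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def update_user_factors(R, mask, U, V, b_u, b_i, mu, lambda_u):
    """One ALS half-step: refit each user's factors to the residual ratings,
    holding the item factors and biases fixed (simplified sketch)."""
    k = V.shape[1]
    for u in range(U.shape[0]):
        items = np.where(mask[u])[0]           # indices of items rated by user u
        if items.size == 0:
            continue
        V_u = V[items]                         # (n_u, k)
        resid = R[u, items] - mu - b_u[u] - b_i[items]
        A = V_u.T @ V_u + lambda_u * np.eye(k) # regularized normal equations
        c, low = cho_factor(A)                 # Cholesky factorization
        U[u] = cho_solve((c, low), V_u.T @ resid)
    return U
```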
From our solution above, we can see that there are several hyperparameters to tune. So, once the model was implemented, we performed a hyperparameter search using Optuna with 3-fold cross-validation on the frozen folds that we created from the dataset. Each trial minimizes the mean validation RMSE across folds. We used a TPE sampler and MedianPruner, with early stopping enabled inside ALS (es_tol = 1e−4, es_min_iters = 10). The optimization ran for 150 trials.
- Best validation RMSE: 0.8663
- Mean training iterations: 13 (all folds early-stopped)
- Features used: genres, years
- Graph source: genres (cosine similarity)
- Regularization mode: inverse-sqrt popularity scaling
The best model combined feature-based item embeddings, graph regularization, and popularity-aware item shrinkage, suggesting that integrating side information improves generalization.
👉 Details on the hyperparameters and the used search space
(click to expand)
| Hyperparameter | Description | Search Space |
|---|---|---|
| `n_factors` | Latent dimensionality ($k$) | [1, 150] |
| `lambda_u`, `lambda_v` | User/item $L_2$ regularization | [1e−4, 1e4] (log) |
| `lambda_bu`, `lambda_bi` | Bias regularization | [1e−4, 1e4] (log) |
| `lambda_w_<f>` | Feature regularization (for genres, years) | [1e−4, 1e4] (log) |
| `alpha` | Graph regularization strength | [0, 100] |
| `S_topk` | Graph similarity top-K | [1, 610] |
| `S_eps` | Graph epsilon cutoff | [1e−10, 1e−4] |
| `pop_reg_mode` | Item popularity weighting | {None, inverse_sqrt} |
| `update_w_every` | Frequency of W updates | [1, 60] |
| `n_iters` | Max iterations | 100 (fixed because of early stopping) |
Table 1. Hyperparameter search space used for Optuna optimization (150 trials, 3-fold CV).
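For concreteness, a condensed Optuna sketch of how such a study could be set up with a TPE sampler and median pruning, using a few of the hyperparameters from Table 1. The objective body here is a dummy placeholder; in the real pipeline it would train the ALS model on each frozen fold and return the mean validation RMSE:

```python
import optuna

def objective(trial):
    params = {
        "n_factors": trial.suggest_int("n_factors", 1, 150),
        "lambda_u": trial.suggest_float("lambda_u", 1e-4, 1e4, log=True),
        "lambda_v": trial.suggest_float("lambda_v", 1e-4, 1e4, log=True),
        "alpha": trial.suggest_float("alpha", 0.0, 100.0),
        "pop_reg_mode": trial.suggest_categorical("pop_reg_mode", [None, "inverse_sqrt"]),
    }
    # Dummy objective so the sketch runs end-to-end; the real pipeline would
    # train on each frozen fold with `params` and return the mean validation RMSE.
    return 0.86 + 1e-3 * params["n_factors"]

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(),
)
study.optimize(objective, n_trials=150)
```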
Since the best model combined all available components, we performed an ablation study to better understand each component's contribution to the overall performance: we disabled or altered one component at a time and measured the impact. All variants were evaluated on the same folds. The main metric is Test RMSE, complemented with per-popularity-bin RMSE to assess cold-item performance. We also report paired sign-test p-values against the baseline, as well as FDR-corrected p-values to account for multiple comparisons. The variants tested are summarized in the table below:
| Variant | Description | Disabled / Altered Components |
|---|---|---|
| full | All components enabled (baseline) | – |
| no_features | Remove all side features | |
| only_genres | Keep only genre features | |
| only_years | Keep only year features | |
| no_graph | Disable Laplacian regularization | |
| graph_feature=year | Graph built from year similarity | replace source feature for similarity computation |
| no_pop_reg | Uniform item regularization | pop_reg_mode=None |
Table 2. Description of model variants tested in the ablation study and their altered components.
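For reference, a minimal sketch of how the paired sign test and the FDR correction mentioned above could be computed from per-fold RMSEs. The per-fold values below are hypothetical placeholders, and scipy/statsmodels are assumed; the actual evaluation script may implement this differently:

```python
import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.multitest import multipletests

def paired_sign_test(baseline_rmse, variant_rmse):
    """Two-sided sign test on per-fold RMSE differences (variant vs. baseline)."""
    diffs = np.asarray(variant_rmse) - np.asarray(baseline_rmse)
    diffs = diffs[diffs != 0]          # ties are discarded
    if diffs.size == 0:                # identical on every fold
        return 1.0
    n_worse = int((diffs > 0).sum())
    return binomtest(n_worse, n=diffs.size, p=0.5).pvalue

# Hypothetical per-fold test RMSEs for the baseline and two ablation variants.
baseline = [0.86, 0.87, 0.85, 0.86]
variants = {"no_features": [1.08, 1.09, 1.07, 1.08],
            "no_graph":    [0.86, 0.87, 0.85, 0.86]}

raw_p = {name: paired_sign_test(baseline, rmse) for name, rmse in variants.items()}
# Benjamini-Hochberg FDR correction across all comparisons.
_, fdr_p, _, _ = multipletests(list(raw_p.values()), method="fdr_bh")
print(raw_p, dict(zip(raw_p, fdr_p)))
```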
👉 Full results table
(click to expand)
| Variant | Time | Train RMSE | Test RMSE | Cold Bin RMSE | Popular Bin RMSE | Raw p-value | FDR Corrected p-value |
|---|---|---|---|---|---|---|---|
| full | 37.2270 | 0.7900 | 0.8618 | 0.9541 | 0.8495 | – | – |
| no_features | 52.3489 | 0.7083 | 1.0834 | 1.0602 | 1.1109 | 0.0625 | 0.1250 |
| only_genres | 36.1364 | 0.7869 | 0.8714 | 0.9831 | 0.8602 | 0.0625 | 0.1250 |
| only_years | 59.9810 | 0.7117 | 1.0034 | 1.0191 | 1.0135 | 0.0625 | 0.1250 |
| no_graph | 26.4977 | 0.7900 | 0.8618 | 0.9541 | 0.8495 | 1.000 | 1.000 |
| graph_feature=year | 42.0569 | 0.7900 | 0.8618 | 0.9541 | 0.8495 | 1.000 | 1.000 |
| no_pop_reg | 47.1860 | 0.7900 | 0.8618 | 0.9541 | 0.8495 | 1.000 | 1.000 |
Table 3. Quantitative results of the ablation study. Reported metrics are mean ± std across folds. “Cold Bin RMSE” refers to items with ≤2 ratings; “Popular Bin RMSE” to items with 15+ ratings.
The most visible effect appears when side features are removed: without genres and years, the test RMSE increases by about +0.22 (Fig. 1 (a)), which suggests that these metadata capture structure that the latent factors alone fail to model. Using a single feature weakens performance: only_genres stays close to the baseline while only_years drops more, meaning that the two features look complementary rather than interchangeable. Disabling the graph term (no_graph) or switching its similarity source (graph_feature=year) produces almost no change in RMSE, but consistently shortens training (Fig. 1 (b)), because constructing the similarity graph takes time without any visible impact on final performance. Removing popularity-aware shrinkage (no_pop_reg) does not affect RMSE much either, though the idea of popularity awareness could be explored further (here we only tested uniform vs. inverse-sqrt scaling).
Overall, side features seem to contribute most to predictive accuracy. Looking at per-popularity-bin RMSE (Fig.1 (c)), we see that RMSE mainly drops when years and genres are added, but the other components do not seem to affect the performance even in the cold bin (1–2 ratings).
However, even this small effect does not pass the significance threshold (raw sign-test p = 0.0625, FDR-corrected p = 0.125), which is unsurprising given the small number of folds available for the paired comparisons.
Fig. 1: (a) Test RMSE by variant; (b) Training time by variant; (c) Test RMSE per item-popularity bin (mean ± std across folds).
This project implemented a recommender system based on matrix factorization, extended with graph regularization, side-feature projections, and popularity-dependent item shrinkage. The results suggest that feature-based item embeddings contribute most to predictive quality, while graph and popularity terms primarily help with stability and calibration.
Although the method itself is simple, particular care was taken to implement a reliable workflow for model selection, hyperparameter tuning, and ablation analysis. The code was structured with reusability in mind: it should be straightforward to add new features, modify the similarity graph, or test temporal extensions. The project can therefore serve as a solid baseline that can be reused and extended toward more complex models in future work.


