A Comprehensive Review on Non-Neural Networks Collaborative Filtering Recommendation Systems (2106.10679v2)

Published 20 Jun 2021 in cs.IR, cs.AI, and cs.LG

Abstract: Over the past two decades, recommender systems have attracted a lot of interest due to the explosion in the amount of data in online applications. A particular attention has been paid to collaborative filtering, which is the most widely used in applications that involve information recommendations. Collaborative filtering (CF) uses the known preference of a group of users to make predictions and recommendations about the unknown preferences of other users (recommendations are made based on the past behavior of users). First introduced in the 1990s, a wide variety of increasingly successful models have been proposed. Due to the success of machine learning techniques in many areas, there has been a growing emphasis on the application of such algorithms in recommendation systems. In this article, we present an overview of the CF approaches for recommender systems, their two main categories, and their evaluation metrics. We focus on the application of classical Machine Learning algorithms to CF recommender systems by presenting their evolution from their first use-cases to advanced Machine Learning models. We attempt to provide a comprehensive and comparative overview of CF systems (with python implementations) that can serve as a guideline for research and practice in this area.

Authors (5)

Carmel Wenga
Majirus Fansi
Sébastien Chabrier
Jean-Martial Mari
Alban Gabillon

Summary

The paper presents a comprehensive review of non-neural collaborative filtering methods, detailing both memory-based and model-based approaches.
It demonstrates that techniques like ratings normalization and explainability significantly boost prediction accuracy, with EMF achieving the lowest MAE.
The review offers practical Python implementations and comparative experiments on benchmark datasets, guiding both practitioners and researchers.

This paper, "A Comprehensive Review on Non-Neural Networks Collaborative Filtering Recommendation Systems" (Wenga et al., 2021), provides an extensive overview of classical (non-neural network) collaborative filtering (CF) techniques, their evolution, practical implementations, and evaluation. It aims to serve as a guide for practitioners and researchers by detailing various algorithms, offering Python implementations, and comparing their performance on benchmark datasets.

The review begins by introducing collaborative filtering as a process where the preferences of a group of users are used to predict the unknown preferences of others. User preferences, often represented as ratings in a user-item matrix, form the basis for these predictions.

Memory-Based Collaborative Filtering

Memory-based CF algorithms directly use the entire user-item interaction matrix to make predictions. They are primarily categorized into:

User-based CF:
- Concept: Identifies users similar to an active user based on their rating patterns. Recommendations are then generated from items liked by these similar users but not yet rated by the active user.
- Similarity Computation: Common metrics include:
  - Pearson Correlation: Measures the linear relationship between the ratings of two users on co-rated items.
    
    $W_{u,v} = \frac{\sum_{i \in I} (R_{u,i} - \bar{R}_u) (R_{v,i} - \bar{R}_v)}{\sqrt{\sum_{i \in I} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{i \in I} (R_{v,i} - \bar{R}_v)^2}}$
    
    where $I$ is the set of co-rated items, $R_{u,i}$ is user $u$'s rating on item $i$, and $\bar{R}_u$ is user $u$'s average rating on co-rated items.
  - Cosine Similarity: Measures the cosine of the angle between two user rating vectors.
    
    $W_{u,v} = \frac{\sum_{i \in I} R_{u,i} R_{v,i}}{\sqrt{\sum_{i \in I} (R_{u,i})^2} \sqrt{\sum_{i \in I} (R_{v,i})^2}}$

- Prediction: Predicted rating $\hat{R}_{u,i}$ for user $u$ on item $i$ is often a weighted average:

  $\hat{R}_{u,i} = \bar{R}_u + \frac{\sum_{v \in N_u} (R_{v,i} - \bar{R}_v) \cdot W_{u,v}}{\sum_{v \in N_u} |W_{u,v}|}$

  where $N_u$ is the set of neighbors of user $u$ who have rated item $i$.
- Top-N Recommendation: Identifies the $k$ most similar users and recommends the $N$ most frequent/highly-rated items among them that the active user hasn't interacted with.
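
The user-based similarity and prediction formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `pearson` and `predict_user_based`, the use of NaN for missing ratings, and the choice to take each user's mean over all of their ratings in the prediction step are assumptions made for this example.

```python
import numpy as np

def pearson(ru, rv):
    """Pearson correlation over co-rated items; NaN marks a missing rating."""
    mask = ~np.isnan(ru) & ~np.isnan(rv)
    if mask.sum() < 2:
        return 0.0  # too few co-rated items to measure a correlation
    du = ru[mask] - ru[mask].mean()
    dv = rv[mask] - rv[mask].mean()
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0

def predict_user_based(R, u, i, k=2):
    """Mean-centered weighted average over the k most similar users who rated item i."""
    mean_u = float(np.nanmean(R[u]))
    candidates = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    sims = {v: pearson(R[u], R[v]) for v in candidates}
    neighbors = sorted(candidates, key=lambda v: sims[v], reverse=True)[:k]
    num = sum(sims[v] * (R[v, i] - np.nanmean(R[v])) for v in neighbors)
    den = sum(abs(sims[v]) for v in neighbors)
    # Fall back to the user's own mean when no usable neighbor exists
    return mean_u if den == 0 else float(mean_u + num / den)

# Toy matrix: rows are users, columns are items
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, 4.0, 1.0],
    [1.0, 1.0, 5.0, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])
print(predict_user_based(R, u=0, i=2, k=2))
```

Because the prediction adds a weighted deviation to the user's mean, it can fall outside the rating scale; clipping to the valid range is a common follow-up step.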

Item-based CF:
- Concept: Identifies items similar to those an active user has liked in the past and recommends those similar items.
- Similarity Computation: Similar metrics are used, but applied to item vectors (columns of the user-item matrix).
  - Adjusted Cosine Similarity: Addresses the issue that different users have different rating scales by subtracting each user's average rating from their ratings before computing cosine similarity between items.
    
    $W_{i,j} = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u) (R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}$
    
    where $U$ is the set of users who rated both items $i$ and $j$.
- Prediction: Predicted rating $\hat{R}_{u,i}$ for user $u$ on item $i$ is a weighted average of user $u$'s ratings on items similar to $i$:
  
  $\hat{R}_{u,i} = \frac{\sum_{j \in S(i)} R_{u,j} \cdot W_{i,j}}{\sum_{j \in S(i)} |W_{i,j}|}$
  
  where $S(i)$ is the set of items similar to item $i$ that user $u$ has rated.
- Top-N Recommendation: For items $I_u$ purchased by user $u$, candidate items $C$ are formed by taking the union of $k$ most similar items for each item in $I_u$ (excluding items already in $I_u$). Similarities are aggregated, and items are sorted to get the top-N.
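
A minimal sketch of adjusted cosine similarity and the item-based prediction is given below. The names `adjusted_cosine` and `predict_item_based` are hypothetical, missing ratings are assumed to be NaN, and restricting $S(i)$ to the $k$ items with positive similarity is an assumption of this example (it keeps the weighted average inside the range of the user's own ratings), not something stated in the review.

```python
import numpy as np

def adjusted_cosine(R, i, j):
    """Cosine similarity between items i and j after subtracting each user's mean rating."""
    user_means = np.nanmean(R, axis=1)
    mask = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])  # users who rated both items
    if not mask.any():
        return 0.0
    di = R[mask, i] - user_means[mask]
    dj = R[mask, j] - user_means[mask]
    denom = np.sqrt((di ** 2).sum()) * np.sqrt((dj ** 2).sum())
    return float((di * dj).sum() / denom) if denom > 0 else 0.0

def predict_item_based(R, u, i, k=2):
    """Weighted average of user u's ratings on the k items most similar to i."""
    rated = [j for j in range(R.shape[1]) if j != i and not np.isnan(R[u, j])]
    sims = {j: adjusted_cosine(R, i, j) for j in rated}
    # Assumption: keep only positively similar items as neighbors
    top = sorted((j for j in rated if sims[j] > 0), key=lambda j: sims[j], reverse=True)[:k]
    den = sum(abs(sims[j]) for j in top)
    if den == 0:
        return float(np.nanmean(R[u]))  # fallback: the user's mean rating
    return float(sum(sims[j] * R[u, j] for j in top) / den)

R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 4.0, 2.0],
    [1.0, 2.0, 1.0, 5.0],
    [2.0, 1.0, 2.0, np.nan],
])
print(adjusted_cosine(R, 0, 1), predict_item_based(R, u=0, i=2))
```

Since the similarities depend only on the rating matrix, they could be computed once for all item pairs and cached, which is the offline pre-computation that makes item-based CF scale better.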

Implementation Considerations for Memory-based CF:

- Sparsity: A major challenge, as similarity scores can be unreliable or undefined if there are few co-rated items/users. Imputation techniques can be used but might introduce bias.
- Scalability: User-based CF can be computationally expensive for large datasets as neighborhood search happens at runtime. Item-based CF often scales better because item-item similarities can be pre-computed offline.
- Cold Start: Difficulty in making recommendations for new users or new items with no interaction data.

The paper notes that the authors provide Python/Numpy/Pandas implementations for these models on GitHub.

Model-Based Collaborative Filtering

Model-based CF methods learn a model from the user-item interactions, which is then used for predictions. This review focuses on dimensionality reduction techniques.

Singular Value Decomposition (SVD):
- Concept: Decomposes the $m \times n$ rating matrix $R$ into $R = P \Sigma Q^T$, where $P$ and $Q$ are orthogonal matrices representing user and item latent factors, and $\Sigma$ is a diagonal matrix of singular values. Dimensionality is reduced by keeping only the $k$ largest singular values ($\Sigma_k$).
- Prediction: $\hat{R}_k = P_k \Sigma_k Q_k^T$. The predicted rating for user $u$ on item $i$ is $\hat{R}_{u,i} = p_u^T \sqrt{\Sigma_k} \sqrt{\Sigma_k} q_i$, where $p_u$ and $q_i$ are the rows of $P_k$ and $Q_k$ for user $u$ and item $i$.
- Implementation: Requires imputing missing values in $R$ (e.g., with item means). Normalizing ratings (e.g., by subtracting user means) can improve accuracy.
- Algorithm Steps:
  1. Normalize the rating matrix: $R \rightarrow R_{norm}$.
  2. Factor $R_{norm}$ to get $P$, $\Sigma$, and $Q$.
  3. Reduce $\Sigma$ to $\Sigma_k$.
  4. Compute $P_k \sqrt{\Sigma_k}$ and $\sqrt{\Sigma_k} Q_k^T$ for predictions.
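
The four SVD steps can be sketched with `numpy.linalg.svd` as shown below. The function name `svd_predict` is hypothetical, and the specific choices (impute with item means, then subtract user means, then add the user means back to the reconstruction) are one reasonable reading of the description, assumed here for illustration.

```python
import numpy as np

def svd_predict(R, k):
    """Rank-k SVD prediction; NaN marks a missing rating."""
    # Impute missing entries with item (column) means
    item_means = np.nanmean(R, axis=0)
    filled = np.where(np.isnan(R), item_means, R)
    # Step 1: normalize by subtracting each user's mean
    user_means = filled.mean(axis=1, keepdims=True)
    R_norm = filled - user_means
    # Step 2: factor R_norm = P diag(s) Q^T
    P, s, Qt = np.linalg.svd(R_norm, full_matrices=False)
    # Step 3: keep the k largest singular values
    sqrt_sk = np.diag(np.sqrt(s[:k]))
    # Step 4: user and item factor matrices
    user_factors = P[:, :k] @ sqrt_sk
    item_factors = sqrt_sk @ Qt[:k, :]
    # Undo the normalization to return to the rating scale
    return user_factors @ item_factors + user_means

R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, 4.0, 1.0],
    [1.0, 1.0, 5.0, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])
print(np.round(svd_predict(R, k=2), 2))
```

With $k$ equal to the full rank, the reconstruction reproduces the imputed matrix exactly; smaller $k$ trades fidelity for a smoother, lower-dimensional model.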
Matrix Factorization (MF) / Regularized SVD:
- Concept: Directly learns latent factor vectors $P_u \in \mathbb{R}^k$ for each user $u$ and $Q_i \in \mathbb{R}^k$ for each item $i$. The predicted rating is $\hat{R}_{u,i} = Q_i^T P_u$.
- Learning: Minimizes a regularized squared error cost function over known ratings:
  
  $J(P,Q) = \sum_{(u,i) \in K} \left[ \frac{1}{2} (R_{u,i} - Q_i^T P_u)^2 + \frac{\lambda}{2} (||P_u||^2 + ||Q_i||^2) \right]$
  
  where $K$ is the set of known ratings and $\lambda$ is the regularization parameter.
- Optimization: Typically uses Stochastic Gradient Descent (SGD) with update rules:
  
  $e_{u,i} = R_{u,i} - Q_i^T P_u$
  
  $Q_i \leftarrow Q_i + \alpha (e_{u,i} P_u - \lambda Q_i)$
  
  $P_u \leftarrow P_u + \alpha (e_{u,i} Q_i - \lambda P_u)$
  
  where $\alpha$ is the learning rate.
- Advantage: Handles missing values directly without imputation, often more accurate than traditional SVD.
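
The SGD update rules above translate almost line for line into code. The sketch below is an illustration under assumptions: the name `train_mf`, the `(user, item, rating)` triple format, the small Gaussian initialization, and the default hyperparameters are all choices made for this example rather than values from the paper.

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=2, alpha=0.01, lam=0.02, epochs=200, seed=0):
    """Regularized matrix factorization trained with SGD over known ratings only."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors P_u
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors Q_i
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            e = r - Q[i] @ P[u]            # e_{u,i} = R_{u,i} - Q_i^T P_u
            q_old = Q[i].copy()            # use the pre-update Q_i when updating P_u
            Q[i] += alpha * (e * P[u] - lam * Q[i])
            P[u] += alpha * (e * q_old - lam * P[u])
    return P, Q

# Known ratings as (user, item, rating) triples; missing entries are simply absent
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = train_mf(ratings, n_users=3, n_items=3)
print(round(float(Q[0] @ P[0]), 2))  # prediction for user 0 on item 0
```

Only observed triples are visited, which is why no imputation is needed.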
Probabilistic Matrix Factorization (PMF):
- Concept: A probabilistic approach where ratings are assumed to be drawn from a Gaussian distribution with mean $Q_i^T P_u$ . Gaussian priors are placed on $P_u$ and $Q_i$ .
- Likelihood: $Pr(R|P,Q,\sigma^2) = \prod_{u=1}^m \prod_{i=1}^n [\mathcal{N}(R_{u,i} | P_u^T Q_i, \sigma^2)]^{I_{u,i}}$
- Learning: Maximizes the log-posterior, equivalent to minimizing a sum-of-squared-errors objective similar to MF but with potentially different regularization terms $\lambda_P, \lambda_Q$ derived from prior variances:
  
  $J(P,Q) = \frac{1}{2} \sum_{(u,i) \in K} (R_{u,i} - Q_i^T P_u)^2 + \frac{\lambda_P}{2} \sum_{u} ||P_u||_{Frob}^2 + \frac{\lambda_Q}{2} \sum_{i} ||Q_i||_{Frob}^2$
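
How the regularization weights arise from the prior variances can be made explicit with a short derivation. It assumes zero-mean spherical Gaussian priors with variances $\sigma_P^2$ and $\sigma_Q^2$; these two symbols are introduced here for illustration and do not appear in the summary above.

```latex
% Assumed priors: P_u ~ N(0, \sigma_P^2 I) and Q_i ~ N(0, \sigma_Q^2 I)
-\ln \Pr(P, Q \mid R, \sigma^2, \sigma_P^2, \sigma_Q^2)
  = \frac{1}{2\sigma^2} \sum_{(u,i) \in K} (R_{u,i} - Q_i^T P_u)^2
  + \frac{1}{2\sigma_P^2} \sum_{u} ||P_u||^2
  + \frac{1}{2\sigma_Q^2} \sum_{i} ||Q_i||^2
  + \text{const}

% Multiplying through by \sigma^2 (which does not change the minimizer) gives J(P,Q) with
\lambda_P = \frac{\sigma^2}{\sigma_P^2}, \qquad \lambda_Q = \frac{\sigma^2}{\sigma_Q^2}
```

Under these assumptions, a tighter prior (smaller $\sigma_P^2$ or $\sigma_Q^2$) corresponds to stronger regularization of the matching factors.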
Non-negative Matrix Factorization (NMF):
- Concept: Constrains the elements of factor matrices $P$ and $Q$ to be non-negative ( $P \ge 0, Q \ge 0$ ). This allows for a parts-based representation and more interpretable latent factors.
- Interpretation: $P_{u,l}$ can represent the probability user $u$ belongs to group $l$, and $Q_{i,l}$ the probability users in group $l$ like item $i$.
- Learning: Uses multiplicative update rules to maintain non-negativity while minimizing a similar objective function to PMF/MF.
  
  $P_{u,l} \leftarrow P_{u,l} \frac{\sum_{i \in I_u} Q_{i,l} R_{u,i}}{\sum_{i \in I_u} Q_{i,l} \hat{R}_{u,i} + \lambda_P |I_u| P_{u,l}}$
  
  $Q_{i,l} \leftarrow Q_{i,l} \frac{\sum_{u \in U_i} P_{u,l} R_{u,i}}{\sum_{u \in U_i} P_{u,l} \hat{R}_{u,i} + \lambda_Q |U_i| Q_{i,l}}$
  
  (Note: The paper's equations 19 and 20 for the NMF updates differ slightly; the form shown here is a common one. The paper uses $\hat{R}_{u,i}$ in the denominator, which is the current prediction $Q_i^T P_u$.)
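
The multiplicative updates in the form shown above can be vectorized over all users and items. The following is a sketch under assumptions, not the paper's code: `train_nmf` is a hypothetical name, unknown ratings are indicated by a boolean `mask`, factors are initialized uniformly in a positive range, and a small `eps` is added to the denominator to avoid division by zero.

```python
import numpy as np

def train_nmf(R, mask, k=2, lam_p=0.01, lam_q=0.01, epochs=200, seed=0, eps=1e-9):
    """Regularized NMF with multiplicative updates over known ratings only."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    P = rng.uniform(0.1, 1.0, size=(m, k))   # non-negative user factors
    Q = rng.uniform(0.1, 1.0, size=(n, k))   # non-negative item factors
    Rm = np.where(mask, R, 0.0)              # unknown ratings contribute nothing
    n_u = mask.sum(axis=1, keepdims=True)    # |I_u|: items rated by each user
    n_i = mask.sum(axis=0)[:, None]          # |U_i|: users who rated each item
    for _ in range(epochs):
        R_hat = (P @ Q.T) * mask             # predictions on known entries only
        P *= (Rm @ Q) / (R_hat @ Q + lam_p * n_u * P + eps)
        R_hat = (P @ Q.T) * mask
        Q *= (Rm.T @ P) / (R_hat.T @ P + lam_q * n_i * Q + eps)
    return P, Q

R = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
mask = R > 0                                 # here 0 encodes "not rated"
P, Q = train_nmf(R, mask)
print(np.round(P @ Q.T, 2))
```

Every factor in each update is non-negative, so $P$ and $Q$ stay non-negative without any explicit projection step.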
Explainable Matrix Factorization (EMF):
- Concept: Incorporates neighborhood-based explanations into the MF model to improve accuracy and provide justifications. An item $i$ is explainable for user $u$ if many of $u$'s neighbors rated $i$.
- Explainability Score (User-based): $Expl_{u,i} = E(R_{v,i} | N_u) = \sum_x x \cdot Pr(R_{v,i} = x | v \in N_u)$.
- Explanation Weight: $W_{u,i}$ is derived from $Expl_{u,i}$, thresholded to indicate significant explainability.
- Objective Function: Adds an explainability regularization term to the MF objective:
  
  $J(P,Q) = \sum_{(u,i) \in K} (R_{u,i} - \hat{R}_{u,i})^2 + \beta (||P_u||^2 + ||Q_i||^2) + \lambda \sum_{(u,i) \in K} (P_u - Q_i)^2 W_{u,i}$
  
  (The paper's EMF objective structure (Eq. 25) for the third term is $(P_u - Q_i)^2 W_{u,i}$, which appears unusual; typically the term would relate to how well the model aligns with the explainability score, or how similar latent factors of explainable items/users are).
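
The user-based explainability score is an expectation over the neighbors' ratings, which the sketch below computes directly. The function names, the 1 to 5 rating scale, and the threshold parameter `theta` are assumptions for illustration; neighbors who did not rate the item (NaN) contribute zero to the expectation.

```python
import numpy as np

def explainability_score(R, neighbors, i, rating_values=(1, 2, 3, 4, 5)):
    """Expl_{u,i}: expected rating of item i among the neighbors N_u of user u."""
    col = R[neighbors, i]
    # Pr(R_{v,i} = x | v in N_u) is the fraction of neighbors who gave rating x
    return float(sum(x * np.mean(col == x) for x in rating_values))

def explanation_weight(score, theta=0.0):
    """W_{u,i}: the score itself when it exceeds the (assumed) threshold, else 0."""
    return score if score > theta else 0.0

# One item column; users 1..4 are taken to be the neighbors of user 0
R = np.array([[np.nan], [5.0], [5.0], [np.nan], [3.0]])
score = explainability_score(R, neighbors=[1, 2, 3, 4], i=0)
print(score, explanation_weight(score, theta=3.0))  # 3.25 3.25
```

An item that few neighbors rated gets a low score and, after thresholding, a zero weight, so it does not influence the explainability term of the objective.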

Limitations of MF techniques:

- Large number of parameters, potentially leading to overfitting and slow training.
- Making predictions for new users/items requires re-optimization.
- Linear transformations may not capture complex non-linear patterns.

Evaluation Metrics

The paper categorizes evaluation metrics as:

Prediction Accuracy: For rating prediction tasks.
- Mean Absolute Error (MAE): $\frac{1}{|T|} \sum_{(u,i) \in T} |R_{u,i} - \hat{R}_{u,i}|$
- Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (R_{u,i} - \hat{R}_{u,i})^2}$
- Coverage: Percentage of items for which the system can provide predictions.
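
As a small worked example of the two error metrics, the sketch below evaluates MAE and RMSE on three hypothetical test ratings (the values are made up for illustration).

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over the test set T."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error over the test set T."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

actual = [4.0, 3.0, 5.0]
predicted = [3.0, 3.0, 4.0]
print(round(mae(actual, predicted), 4))   # 0.6667
print(round(rmse(actual, predicted), 4))  # 0.8165
```

RMSE is never smaller than MAE on the same data because squaring weights large errors more heavily.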
Quality of Set of Recommendations: For evaluating relevance of a recommended set.
- Precision@N: $\frac{|\text{Relevant Recommended Items}|}{N}$
- Recall@N: $\frac{|\text{Relevant Recommended Items}|}{|\text{Total Relevant Items}|}$
- F1-score@N: Harmonic mean of Precision and Recall.
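
The set-based metrics can be computed from a ranked recommendation list and a set of relevant items. The function name and the toy item identifiers below are hypothetical.

```python
def precision_recall_f1_at_n(recommended, relevant, n):
    """Precision@N, Recall@N and F1@N for one user's ranked recommendation list."""
    top_n = recommended[:n]
    hits = len(set(top_n) & set(relevant))      # relevant recommended items
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

recommended = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p, r, f1 = precision_recall_f1_at_n(recommended, relevant, n=4)
print(p, round(r, 4), round(f1, 4))  # 0.5 0.6667 0.5714
```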
Quality of List of Recommendations: Considers the ranking of recommended items.
- Mean Average Precision (MAP): Mean of average precision scores over all users.
- Half-life Utility Rate: Assumes exponential decay in user interest down the list.
- Discounted Cumulative Gain (DCG): Assigns higher value to relevant items at the top, with logarithmic decay. $DCG_k = \sum_{i=1}^k \frac{rel_i}{\log_2(i+1)}$ . Normalized DCG (nDCG) is often preferred.
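
The DCG formula and its normalized variant can be sketched as follows. The normalization divides by the DCG of the same relevance scores sorted in descending order, which is the usual definition of nDCG; the function names and relevance values are illustrative assumptions.

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG_k = sum over positions i = 1..k of rel_i / log2(i + 1)."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))  # log2(i + 1) for i = 1..k
    return float(np.sum(rels / discounts))

def ndcg_at_k(rels, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

rels = [3, 2, 3]  # graded relevance of the items at ranks 1, 2, 3
print(round(dcg_at_k(rels, 3), 4))   # 5.7619
print(round(ndcg_at_k(rels, 3), 4))
```

nDCG lies in $[0, 1]$ for non-negative relevance scores, which makes lists of different lengths and users with different numbers of relevant items comparable.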
Novelty and Diversity:
- Novelty: Measures how new or surprising recommended items are to the user.
  
  $novelty_i = \frac{1}{|Z_u|-1} \sum_{j \in Z_u, j \neq i} (1 - sim(i,j))$

- Diversity: Measures how different items within a recommendation list are from each other.
  
  $diversity_{Z_u} = \frac{1}{|Z_u|(|Z_u|-1)} \sum_{i \in Z_u} \sum_{j \in Z_u, j \neq i} (1 - sim(i,j))$
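
Both formulas average pairwise dissimilarities within a recommendation list $Z_u$. In the sketch below, the item-item similarity matrix `sim` and its values are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def novelty(sim, Z, i):
    """Average dissimilarity between item i and the other items in the list Z."""
    others = [j for j in Z if j != i]
    return float(np.mean([1.0 - sim[i, j] for j in others]))

def diversity(sim, Z):
    """Average pairwise dissimilarity over all ordered pairs of distinct items in Z."""
    n = len(Z)
    total = sum(1.0 - sim[i, j] for i in Z for j in Z if j != i)
    return float(total / (n * (n - 1)))

# Hypothetical symmetric item-item similarity matrix for three items
sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 1.0],
    [0.0, 1.0, 1.0],
])
Z = [0, 1, 2]
print(novelty(sim, Z, 0))  # 0.75
print(diversity(sim, Z))   # 0.5
```

With these definitions, the diversity of a list equals the mean of the novelty values of its items.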

Comparative Experimentation

Experiments were conducted on the MovieLens ML-100K and ML-1M datasets, with MAE as the evaluation metric.

User-based vs. Item-based CF:
- Cosine similarity generally outperformed Euclidean distance for both.
- Item-based CF showed lower MAE than User-based CF (e.g., on ML-1M with Cosine: 0.42 for Item-based vs. 0.73 for User-based). This supports the idea that item-item similarities are more stable.
Importance of Ratings Normalization:
- For MF, normalizing ratings (e.g., by subtracting user mean) significantly reduced MAE (e.g., from ~1.48 to ~0.82 on ML-1M, a ~45% reduction).
- NMF cannot be trained on standard normalized ratings if they become negative, due to the non-negativity constraint.
- EMF showed less difference between raw and normalized ratings, suggesting its explainability component helps handle biases.
Performance of MF, NMF, EMF (on raw ratings, k=10, 10 epochs):
- EMF achieved the lowest MAE (e.g., ~0.76 on ML-1M).
- NMF performed better than MF (e.g., NMF ~0.9567 vs. MF ~1.482 on ML-1M).
- From best to worst (lowest to highest MAE), the ranking was EMF, NMF, MF, attributed to the benefits of explainability (for EMF) and interpretable non-negative factors (for NMF).

Conclusion and Future Work

The review concludes that memory-based methods are simple but struggle with sparsity and scalability. Model-based methods, particularly MF and its variants (NMF, EMF), address these issues by learning latent factors. Experimentally, EMF showed the best performance, followed by NMF, then MF. Normalization is crucial for MF.

The authors propose a future research direction: Non-negative Explainable Matrix Factorization (NEMF), hypothesizing that combining NMF's interpretable non-negative factors with EMF's explicit explainability mechanism could further improve performance and provide two-stage explanations.

Resources

The paper highlights the availability of Python implementations for all discussed models on GitHub (https://github.com/nzhinusoftcm/review-on-collaborative-filtering), including Jupyter notebooks that can be run on Google Colaboratory. This is a key practical contribution for developers looking to implement these CF techniques.

This review serves as a valuable practical guide by:

- Clearly explaining various non-neural CF algorithms.
- Discussing their mathematical foundations and update rules.
- Providing insights into their strengths and weaknesses.
- Detailing relevant evaluation metrics for different recommendation goals.
- Presenting comparative experimental results on standard datasets.
- Offering open-source code for hands-on implementation.
