- The paper systematically reviews neural recommendation models categorized by collaborative filtering, content-enriched, and temporal approaches.
- The paper highlights innovative methodologies like attention mechanisms, GNNs, and autoencoders to improve user-item representation and interaction modeling.
- The paper discusses future directions including benchmarking, multi-objective optimization, and enhanced reproducibility for robust recommender systems.
This survey, "A Survey on Accuracy-oriented Neural Recommendation: From Collaborative Filtering to Information-rich Recommendation" (2104.13030), provides a systematic review of neural recommender models, focusing on how different types of data are used to improve recommendation accuracy. It categorizes these models based on their data usage into three main types: collaborative filtering, content-enriched recommendation, and temporal/sequential recommendation.
The core problem in recommendation is framed as learning a prediction function y^u,i,c=f(Du,Di,Dc), which estimates the likelihood of user u favoring item i under context c, given data Du (describing the user), Di (describing the item), and Dc (describing the context).
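As a purely illustrative reading of this abstraction, the sketch below implements f as a PyTorch module that embeds a user ID, an item ID, and a dense context feature vector and scores the triple with an MLP; all names, sizes, and the choice of an MLP are assumptions, not part of the survey.

```python
import torch
import torch.nn as nn

class ScoreModel(nn.Module):
    """Illustrative f(D_u, D_i, D_c): here D_u and D_i are reduced to IDs and
    D_c to a dense feature vector; richer models replace these inputs."""
    def __init__(self, n_users, n_items, n_context_feats, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.ctx_proj = nn.Linear(n_context_feats, dim)
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, user_ids, item_ids, ctx_feats):
        z = torch.cat([self.user_emb(user_ids),
                       self.item_emb(item_ids),
                       self.ctx_proj(ctx_feats)], dim=-1)
        return self.mlp(z).squeeze(-1)  # \hat{y}_{u,i,c}
```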
1. Collaborative Filtering (CF) Models
CF models primarily leverage user-item interaction data, effectively ignoring Dc and using only IDs or interaction history for Du and Di. The development in neural CF is divided into representation learning and interaction modeling.
A. Representation Learning
The goal is to learn user embeddings (P) and item embeddings (Q).
- History Behavior Attention Aggregation Models: These models improve upon classical latent factor models (which use free embeddings for user/item IDs) by incorporating a user's interaction history. Instead of simple pooling (like in FISM) or adding ID embeddings (like SVD++), attention mechanisms assign different weights to historical items.
Attentive Collaborative Filtering (ACF): Assigns user-aware attentive weights to historical items. The user representation is a sum of their ID embedding and an attention-weighted sum of their interacted item embeddings.
r^ui = (pu + ∑_{j∈Ru} α(u,j) qj)^T qi
where α(u,j) = exp(F(pu,qj)) / ∑_{j′∈Ru} exp(F(pu,qj′)). F(⋅,⋅) can be an MLP or an inner product.
Neural Attentive Item Similarity (NAIS): Makes the attention target item-aware, meaning the influence of a historical item depends on the item being predicted.
r^ui = (∑_{j∈Ru} α(i,j) qj)^T qi
where α(i,j) = exp(F(qi,qj)) / [∑_{k∈Ru} exp(F(qi,qk))]^β, with β smoothing the denominator for users with long histories. (A minimal sketch of this attention aggregation appears after this list.)
- Autoencoder based Representation Learning: These models use autoencoders to learn latent representations by reconstructing the input (e.g., a user's interaction vector). Variants include denoising autoencoders (CDAE) and variational autoencoders (Mult-VAE). Some models use parallel encoders for users and items.
- Graph based Representation Learning: User-item interactions are viewed as a bipartite graph. Graph Neural Networks (GNNs) are used to learn embeddings by propagating information from neighbors.
The (l+1)th order user embedding pu(l+1) is updated by aggregating its connected items' lth order embeddings qjl:
au(l+1)=Agg(qjl∣j∈Ru)
pu(l+1)=ρ(Wl[pul,au(l+1)])
Models like GC-MC and NGCF use graph convolutions. Simpler models like LightGCN remove non-linearities and transformations, often achieving strong performance by focusing on neighborhood aggregation.
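A minimal PyTorch sketch of the item-aware attention aggregation above (NAIS-style). The MLP attention function, embedding sizes, and batch handling are assumptions for illustration, not the authors' implementation; padding of short histories is ignored.

```python
import torch
import torch.nn as nn

class NAISLikeScorer(nn.Module):
    """Sketch: score r_ui by attending over the user's interacted items,
    with attention weights that depend on the target item (item-aware)."""
    def __init__(self, n_items, dim=64, beta=0.5):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.att_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.beta = beta  # smooths the softmax denominator for long histories

    def forward(self, history, target):
        # history: (B, H) ids of items in R_u; target: (B,) id of the item i being scored
        q_hist = self.item_emb(history)                    # (B, H, d)
        q_i = self.item_emb(target).unsqueeze(1)           # (B, 1, d)
        logits = self.att_mlp(q_i * q_hist).squeeze(-1)    # F(q_i, q_j) for each j, (B, H)
        weights = torch.exp(logits)
        alpha = weights / weights.sum(dim=1, keepdim=True).pow(self.beta)  # smoothed softmax
        p_u = (alpha.unsqueeze(-1) * q_hist).sum(dim=1)    # attention-weighted history, (B, d)
        return (p_u * q_i.squeeze(1)).sum(dim=-1)          # \hat{r}_{ui}, (B,)
```

And a sketch of LightGCN-style propagation on the user-item bipartite graph: only normalized neighborhood aggregation, no feature transforms or non-linearities, with per-layer embeddings averaged at the end. Building the normalized sparse adjacency is assumed given and omitted here.

```python
def lightgcn_propagate(adj_norm, user_emb, item_emb, n_layers=3):
    """adj_norm: symmetrically normalized sparse (|U|+|I|) x (|U|+|I|) adjacency of the
    user-item graph; user_emb/item_emb: layer-0 embedding matrices."""
    x = torch.cat([user_emb, item_emb], dim=0)
    layers = [x]
    for _ in range(n_layers):
        x = torch.sparse.mm(adj_norm, x)            # aggregate neighbor embeddings only
        layers.append(x)
    out = torch.stack(layers, dim=0).mean(dim=0)    # combine layers by simple averaging
    return out[: user_emb.size(0)], out[user_emb.size(0):]
```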
B. Interaction Modeling
This component estimates the preference score r^ui from the learned user (pu) and item (qi) embeddings.
- Inner Product: The most common method: r^ui=puTqi. It is efficient, but it can violate the triangle inequality and captures only linear interactions.
- Distance based Metrics: Address the triangle-inequality issue by measuring preference with distances in a metric space.
- CML: Minimizes Euclidean distance: dui=∣∣pu−qi∣∣22.
- TransRec: Uses a translation principle for sequential behavior: qj+pu≈qi, where j is the user's previously consumed item and i is the next (target) item, i.e., the user "translates" from one item to the next.
- LRML: Introduces relation vectors e learned via attention over a memory matrix: sui=∣∣pu+e−qi∣∣F2.
- Neural Network based Metrics: Capture complex, non-linear interactions.
- NCF: Uses an MLP on concatenated embeddings: r^ui=fMLP(pu∣∣qi). It is often combined with a generalized matrix factorization (GMF) branch based on the element-wise product of embeddings (a minimal NeuMF-style sketch appears after this list).
- CNN-based: Use outer product of embeddings to create an interaction map, then apply CNNs (e.g., ONCF).
- Autoencoder-based: The decoder part directly reconstructs the interaction matrix (e.g., AutoRec).
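A minimal sketch of the NCF/NeuMF-style fusion mentioned above, assuming separate embedding tables for the GMF and MLP branches and illustrative layer sizes; the training loss and negative sampling are omitted.

```python
import torch
import torch.nn as nn

class NeuMFLike(nn.Module):
    """Sketch: a GMF branch (element-wise product of embeddings) and an MLP branch
    (on concatenated embeddings), fused by a final linear layer."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_gmf = nn.Embedding(n_users, dim)
        self.item_gmf = nn.Embedding(n_items, dim)
        self.user_mlp = nn.Embedding(n_users, dim)
        self.item_mlp = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.out = nn.Linear(2 * dim, 1)

    def forward(self, u, i):
        gmf = self.user_gmf(u) * self.item_gmf(i)                       # GMF branch
        mlp = self.mlp(torch.cat([self.user_mlp(u), self.item_mlp(i)], dim=-1))
        return self.out(torch.cat([gmf, mlp], dim=-1)).squeeze(-1)      # \hat{r}_{ui}
```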
2. Content-enriched Recommendation
These models incorporate auxiliary information (side information) associated with users and items, such as profiles, social networks, item attributes (text, images), and knowledge graphs.
A. Modeling General Feature Interactions
These models focus on categorical or numerical features often found in CTR prediction.
- Factorization Machines (FM): A baseline that models second-order feature interactions efficiently: y^x = w0 + ∑_d wd xd + ∑_{d<d′} xd xd′ ⟨vd,vd′⟩ (a minimal sketch appears after this list).
- MLP based High Order Modeling: Embed features, then use MLPs to implicitly learn high-order interactions (e.g., NFM, DeepCrossing). Wide & Deep models combine these deep MLP paths with shallow, linear paths.
- Cross Network for K-th Order Modeling: Explicitly model feature interactions up to a defined order K (e.g., DCN, xDeepFM). DCN uses a cross layer: xk = x0 (xk−1^T wk) + bk + xk−1.
- Tree Enhanced Modeling: Use decision trees to extract explicit cross-features, then feed their embeddings into an attention model (e.g., TEM).
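To make the FM formula above concrete, here is a minimal sketch of a second-order FM over a dense feature vector. It uses the standard reformulation ∑_{d<d′} xd xd′ ⟨vd,vd′⟩ = ½ ∑_f [(∑_d v_{d,f} xd)² − ∑_d v_{d,f}² xd²], which makes the pairwise term linear in the number of features. Shapes and initialization are assumptions.

```python
import torch
import torch.nn as nn

class FactorizationMachine(nn.Module):
    """Sketch of a second-order FM: global bias + linear term + factorized pairwise term."""
    def __init__(self, n_feats, k=16):
        super().__init__()
        self.w0 = nn.Parameter(torch.zeros(1))                  # global bias w_0
        self.w = nn.Linear(n_feats, 1, bias=False)              # first-order weights w_d
        self.v = nn.Parameter(torch.randn(n_feats, k) * 0.01)   # factor vectors v_d

    def forward(self, x):                                       # x: (B, n_feats)
        linear = self.w0 + self.w(x).squeeze(-1)
        xv = x @ self.v                                         # (B, k): sum_d v_d x_d
        x2v2 = (x * x) @ (self.v * self.v)                      # (B, k): sum_d v_d^2 x_d^2
        pairwise = 0.5 * (xv.pow(2) - x2v2).sum(dim=-1)
        return linear + pairwise                                # \hat{y}(x)
```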
B. Modeling Textual Content
Leverages NLP techniques for item descriptions, user reviews, etc.
- Autoencoder based Models: Use autoencoders (e.g., Stacked Denoising Autoencoders in CDL) to learn item content representations. The item embedding qi can be a combination of content-derived representation and a free latent vector: qi=fe(xi)+θi.
- Leveraging Word Embeddings for Recommendation: Use pre-trained or jointly trained word embeddings with models like CNNs or RNNs.
- ConvMF: Integrates TextCNN into probabilistic matrix factorization to derive item embeddings from text. Item latent vector qi is drawn from a Gaussian centered around TextCNN(W,xi).
- DeepCoNN: Uses two parallel TextCNNs to model user reviews and item reviews, then a Factorization Machine for interaction: r^ui=FM(TextCNN(Du),TextCNN(Di)) (a TextCNN-based sketch appears after this list).
- Attention Models: Assign weights to different parts of text (words, sentences, aspects) to create more informative representations.
- Text Explanations for Recommendation:
- Extraction-based: Select important text pieces (e.g., via attention weights) as explanations.
- Generation-based: Generate natural language explanations using encoder-decoder architectures (e.g., NRT predicts ratings and generates reviews simultaneously).
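A minimal sketch of the TextCNN-style encoder behind ConvMF/DeepCoNN: word embeddings, 1-D convolutions with several window sizes, max-over-time pooling, and a projection. Window sizes, dimensions, and the dot-product interaction shown in the usage comment (standing in for DeepCoNN's Factorization Machine) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TextCNNEncoder(nn.Module):
    """Sketch of a TextCNN document encoder for review/description text."""
    def __init__(self, vocab_size, emb_dim=128, n_filters=64, windows=(2, 3, 4), out_dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in windows)
        self.proj = nn.Linear(n_filters * len(windows), out_dim)

    def forward(self, tokens):                        # tokens: (B, T) word ids
        x = self.word_emb(tokens).transpose(1, 2)     # (B, emb_dim, T)
        pooled = [torch.relu(c(x)).max(dim=-1).values for c in self.convs]  # max over time
        return self.proj(torch.cat(pooled, dim=-1))   # document representation

# DeepCoNN-style usage (simplified): two parallel towers, with a dot product
# replacing the FM interaction layer of the original model.
# user_enc, item_enc = TextCNNEncoder(V), TextCNNEncoder(V)
# r_hat = (user_enc(user_review_tokens) * item_enc(item_review_tokens)).sum(-1)
```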
C. Modeling Multimedia Content
Utilizes visual (images, videos) and audio information.
- Image Information:
- Content-based: Extract visual features using CNNs, then project users and items into this visual space.
- Hybrid Models: Combine CF signals with visual features.
- VBPR (Visual Bayesian Personalized Ranking): Extends BPR by incorporating visual features. The preference score is the sum of a collaborative term and a visual term: r^ui=puTqi+wuTf(CNN(xi)), where f(CNN(xi)) is the item's visual representation and wu is the user's visual preference vector. (A minimal sketch appears after this list.)
- GNNs: Model relationships in item-item graphs where nodes have visual features (e.g., PinSage).
- Video Recommendation: Often involves extracting frame-level features, then using attention or RNNs to aggregate them. Audio features can also be incorporated using fusion techniques. ACF uses attention with visual inputs for multimedia.
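A minimal sketch of VBPR-style scoring as described above, assuming pre-extracted CNN features (e.g., 4096-d) per item; the bias terms of the original model and the BPR training loss are omitted, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class VBPRLikeScorer(nn.Module):
    """Sketch: collaborative term p_u^T q_i plus a visual term w_u^T E(cnn_feats)."""
    def __init__(self, n_users, n_items, dim=32, visual_in=4096, visual_dim=32):
        super().__init__()
        self.p = nn.Embedding(n_users, dim)            # latent user factors
        self.q = nn.Embedding(n_items, dim)            # latent item factors
        self.w = nn.Embedding(n_users, visual_dim)     # visual preference vectors w_u
        self.E = nn.Linear(visual_in, visual_dim, bias=False)  # projects CNN features

    def forward(self, u, i, cnn_feats):                # cnn_feats: (B, visual_in)
        cf = (self.p(u) * self.q(i)).sum(-1)
        visual = (self.w(u) * self.E(cnn_feats)).sum(-1)
        return cf + visual                             # \hat{r}_{ui}
```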
D. Modeling Social Network
Exploits social connections (trust, friendship) assuming social influence affects preferences.
- Social Correlation Enhancement and Regularization: User embedding pu is a fusion of an item domain embedding eu and a social embedding hu derived from social connections: pu=f(eu,g(u,S)). Social structure can also act as a regularizer, encouraging connected users to have similar embeddings.
- GNN Based Approaches: Model the social diffusion process more explicitly.
DiffNet: Simulates recursive social influence. The user embedding hu^k at diffusion step k combines the previous embedding hu^(k−1) with the aggregated influence of the user's social neighbors, hSu^(k−1).
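A minimal sketch of DiffNet-style influence diffusion, assuming a pre-normalized sparse social adjacency matrix; the original model uses a learned fusion at each step and also fuses interest derived from item interactions, both of which are omitted here.

```python
import torch

def social_diffusion(h, social_adj_norm, n_steps=2):
    """h: (|U|, d) user embeddings; social_adj_norm: row-normalized sparse social adjacency.
    Each step combines a user's state with the averaged state of their social neighbors."""
    for _ in range(n_steps):
        neighbor_avg = torch.sparse.mm(social_adj_norm, h)  # aggregated social influence
        h = h + neighbor_avg                                # simple additive fusion (assumption)
    return h
```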
E. Modeling Knowledge Graph (KG)
Leverages structured knowledge about items and their attributes (e.g., movie -[has_director]-> director).
- Path Based Methods: Exploit paths (sequences of entities and relations) between users and items in the KG to infer preferences. Models like KPRN embed paths and pool them. RippleNet constructs "ripple sets" (multi-hop KG neighbors) for users.
- Regularization Based Methods: Use KG embedding (KGE) techniques (e.g., TransE, TransR) to learn entity representations. The KGE loss acts as a regularizer for the recommendation model.
- CKE (Collaborative Knowledge Base Embedding): Item embedding is a sum of its ID embedding and KGE-derived embedding: qi=fEmbed(i)+fKGE(i∣G).
- GNN Based Methods: Apply GNNs to a "collaborative knowledge graph" (user-item graph + KG).
- KGAT (Knowledge Graph Attention Network): Recursively propagates embeddings on this unified graph, using attention to weigh neighbor contributions. User embedding pu=fGNN(u,G).
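A minimal sketch of KGAT-style attentive propagation for a single head entity, assuming dense neighbor tensors and one shared relation projection (the original model uses per-relation projections and mini-batched sparse propagation over the whole graph).

```python
import torch
import torch.nn as nn

class KGATLikeAttention(nn.Module):
    """Sketch: score each (relation r, tail t) neighbor of head h with a relation-aware
    attention, then aggregate the neighborhood with softmax-normalized weights."""
    def __init__(self, dim=64):
        super().__init__()
        self.W_r = nn.Linear(dim, dim, bias=False)   # relation projection (shared here)

    def forward(self, e_h, e_r, e_t):
        # e_h: (d,) head entity; e_r: (N, d) relation embeddings; e_t: (N, d) tail entities
        scores = (self.W_r(e_t) * torch.tanh(self.W_r(e_h) + e_r)).sum(-1)  # attention logits
        alpha = torch.softmax(scores, dim=0)                                # weights over neighbors
        return (alpha.unsqueeze(-1) * e_t).sum(0)                           # aggregated neighborhood
```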
3. Temporal/Sequential Models
These models account for the dynamic nature of user preferences and the order of interactions.
- Temporal based recommendation: Focuses on the timestamp of interactions [u,i,rui,tui] to model evolving preferences.
RRN (Recurrent Recommender Networks): Use RNNs (e.g., LSTMs) to model the evolution of user (put) and item (qit) dynamic embeddings over time:
put=RNN(pu(t−1),Wxut)
qit=RNN(qi(t−1),Wxit)
where xut is the user's rating vector in the current time interval.
Memory Networks: Use external memory components to store and update user historical states, aiming to capture long-term dependencies better than standard RNNs.
- Session based recommendation: Models sequences of item interactions within a session [i1,i2,...,i∣S∣], often for anonymous users.
- GRU4Rec: Uses GRUs to predict the next item in a session based on the preceding items (a minimal sketch appears at the end of this section).
- Translation-based models (e.g., TransRec): Model transitions as qi+pu≈qj, where i is the current item and j is the next item; this is the same translation principle described for TransRec in the CF section, with i and j simply playing swapped roles there (i being the target item scored).
- Self-Attention Models (e.g., SASRec): Use self-attention to capture dependencies between all items in a sequence directly, without recurrence.
- GNNs (e.g., SR-GNN): Construct a graph from all session sequences (nodes are items, edges represent co-occurrence or transitions). GNNs then learn item embeddings from this graph structure.
- Temporal and session based recommendation: Combines user identity with temporal sequences of sessions [u,s,t].
- Hierarchical Models: Often use two levels of RNNs or attention: one to model item interactions within a session (short-term interest), and another to model session evolution for a user over time (long-term interest). SHAN uses hierarchical attention.
- CNN-based Models (e.g., Caser): Treat the sequence of recent items/sessions as an "image" and apply 2D convolutions to capture local sequential patterns.
- GNN-based Models: Construct dynamic graphs or hypergraphs evolving over time (e.g., HyperRec).
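A minimal sketch of a GRU4Rec-style session model (referenced earlier in this section): a GRU reads the session's item sequence and its final hidden state scores every candidate item via a dot product with the (tied) item embedding table. Padding, session-parallel mini-batches, and the pairwise ranking losses of the original work are omitted.

```python
import torch
import torch.nn as nn

class GRU4RecLike(nn.Module):
    """Sketch of next-item prediction within an anonymous session."""
    def __init__(self, n_items, dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, session):                          # session: (B, T) item ids
        x = self.item_emb(session)                       # (B, T, d)
        _, h = self.gru(x)                               # h: (1, B, d) final hidden state
        return h.squeeze(0) @ self.item_emb.weight.T     # (B, n_items) next-item scores
```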
Discussion and Future Directions
- Recommendation Benchmarking: Need for standardized datasets and evaluation protocols to reliably track progress.
- Graph Reasoning & Self-supervised Learning: Leveraging GNNs for complex relational data and using self-supervised tasks to pre-train or augment representation learning for sparsity issues.
- Multi-Objective Goals for Social Good: Moving beyond accuracy to consider fairness, diversity, explainability, and multi-stakeholder satisfaction.
- Reproducibility: Acknowledges challenges in reproducing results due to sensitivity to hyperparameters, dataset splits, and evaluation metrics, and calls for transparency and robust evaluation protocols.
This survey provides a comprehensive roadmap of how neural networks have been applied to various recommendation scenarios, emphasizing the modeling of different data sources to enhance predictive accuracy. It highlights common techniques like attention mechanisms, GNNs, RNNs, and autoencoders, and their adaptations for specific recommendation tasks.