Pre-training Recommender Systems via Reinforced Attentive Multi-relational Graph Neural Network (2111.14036v1)

Published 28 Nov 2021 in cs.IR

Abstract: Recently, Graph Neural Networks (GNNs) have proven their effectiveness for recommender systems. Existing studies have applied GNNs to capture collaborative relations in the data. However, in real-world scenarios, the relations in a recommendation graph can be of various kinds. For example, two movies may be associated either by the same genre or by the same director/actor. If we use a single graph to elaborate all these relations, the graph can be too complex to process. To address this issue, we bring the idea of pre-training to process the complex graph step by step. Based on the idea of divide-and-conquer, we separate the large graph into three sub-graphs: user graph, item graph, and user-item interaction graph. Then the user and item embeddings are pre-trained from user and item graphs, respectively. To conduct pre-training, we construct the multi-relational user graph and item graph, respectively, based on their attributes. In this paper, we propose a novel Reinforced Attentive Multi-relational Graph Neural Network (RAM-GNN) to the pre-train user and item embeddings on the user and item graph prior to the recommendation step. Specifically, we design a relation-level attention layer to learn the importance of different relations. Next, a Reinforced Neighbor Sampler (RNS) is applied to search the optimal filtering threshold for sampling top-k similar neighbors in the graph, which avoids the over-smoothing issue. We initialize the recommendation model with the pre-trained user/item embeddings. Finally, an aggregation-based GNN model is utilized to learn from the collaborative relations in the user-item interaction graph and provide recommendations. Our experiments demonstrate that RAM-GNN outperforms other state-of-the-art graph-based recommendation models and multi-relational graph neural networks.

Authors (7)
  1. Xiaohan Li (33 papers)
  2. Zhiwei Liu (114 papers)
  3. Stephen Guo (15 papers)
  4. Zheng Liu (312 papers)
  5. Hao Peng (291 papers)
  6. Philip S. Yu (592 papers)
  7. Kannan Achan (45 papers)
Citations (13)

Summary

This paper, "Pre-training Recommender Systems via Reinforced Attentive Multi-relational Graph Neural Network" (Li et al., 2021), proposes a novel framework to address challenges in applying Graph Neural Networks (GNNs) to recommender systems with complex, multi-relational data. The core idea is to leverage a pre-training step on attribute-based user and item graphs before fine-tuning on the user-item interaction graph, mitigating issues like over-smoothing and effectively incorporating side information.

The framework divides the complex graph into three sub-graphs: a multi-relational user graph, a multi-relational item graph, and a user-item interaction graph. User and item embeddings are first pre-trained using a specialized GNN model, called Reinforced Attentive Multi-relational Graph Neural Network (RAM-GNN), on their respective attribute graphs. These pre-trained embeddings are then used to initialize the node embeddings in a standard aggregation-based GNN model trained on the user-item interaction graph for the final recommendation task.

RAM-GNN for Pre-training

The RAM-GNN model is designed to learn embeddings from multi-relational graphs where multiple relation types and values can exist between a pair of nodes (e.g., two movies sharing the same director and the same genre). The paper models these connections as quadruplets $(i, t, v, j)$, representing head item $i$, relation type $t$, relation value $v$, and tail item $j$.
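
To make the quadruplet view concrete, the sketch below builds $(i, t, v, j)$ tuples by linking items that share an attribute value; the attribute names and the `build_quadruplets` helper are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict
from itertools import combinations

def build_quadruplets(item_attrs: dict[int, dict[str, str]]) -> list[tuple]:
    """Create (head_item, relation_type, relation_value, tail_item) quadruplets
    by linking every pair of items that share an attribute value."""
    buckets = defaultdict(list)  # (relation_type, relation_value) -> items
    for item, attrs in item_attrs.items():
        for rel_type, rel_value in attrs.items():
            buckets[(rel_type, rel_value)].append(item)

    quads = []
    for (rel_type, rel_value), items in buckets.items():
        for i, j in combinations(items, 2):
            quads.append((i, rel_type, rel_value, j))
            quads.append((j, rel_type, rel_value, i))  # keep the graph symmetric
    return quads

# Toy example: two movies sharing a director yield two directed quadruplets.
movies = {
    1: {"share_director": "Nolan", "share_genre": "SciFi"},
    2: {"share_director": "Nolan", "share_genre": "Drama"},
}
print(build_quadruplets(movies))
# [(1, 'share_director', 'Nolan', 2), (2, 'share_director', 'Nolan', 1)]
```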

RAM-GNN incorporates two key modules:

  1. Relation-level Attention: This module learns the importance of different relation types for a given node pair. It uses a self-attention mechanism to compute relation-specific attention weights $\alpha_{jn}$ based on the embeddings of the tail node ($\mathbf{e}_j$), the relation type ($\mathbf{e}_t$), and the relation value ($\mathbf{e}_v$). The relation type and value embeddings are concatenated ($\mathbf{e}_n = \mathbf{e}_t \parallel \mathbf{e}_v$) before being used in the attention calculation and the subsequent composition/aggregation. The attention weights $\alpha_{jn}$ are then applied to the output of a composition operation $\phi(\mathbf{e}_j, \mathbf{e}_n)$ to weigh its contribution to the head node's embedding. The paper explores three composition operations for $\phi$: addition, multiplication, and circular-correlation, finding circular-correlation slightly superior in experiments, although all three perform similarly.

    $\mathbf{a}_{jn} = \mathbf{p}^T \sigma(\mathbf{W}_{\text{key}} \mathbf{e}_j + \mathbf{W}_{\text{qry}} \mathbf{e}_n + \mathbf{b}), \quad \alpha_{jn} = \text{Softmax}(\mathbf{a}_{jn}), \quad \mathbf{e}_i = \alpha_{jn} \mathbf{W}_{\text{val}}\, \phi(\mathbf{e}_j, \mathbf{e}_n)$

  2. Reinforced Neighbor Sampler (RNS): This module addresses the uneven distribution of neighbors across different relation types (e.g., many items sharing a genre, few sharing a specific director). Aggregating all neighbors can lead to over-smoothing. RNS uses a Reinforcement Learning (RL) process, formulated as a Bernoulli Multi-armed Bandit (BMAB), to learn an optimal filtering threshold $k_t$ for each relation type $t$. For a given node $i$, RNS samples only the top-$k_t$ neighbors $j$ based on a learned similarity score $s(\mathbf{e}_i, \mathbf{e}_j)$, which is derived from the distance $d(\mathbf{e}_i, \mathbf{e}_j)$ computed by an MLP on their embeddings:

    $d(\mathbf{e}_i, \mathbf{e}_j) = \| \sigma(\text{MLP}(\mathbf{e}_i)) - \sigma(\text{MLP}(\mathbf{e}_j)) \|_1, \quad s(\mathbf{e}_i, \mathbf{e}_j) = 1 - d(\mathbf{e}_i, \mathbf{e}_j)$

    The RL agent adjusts $k_t$ to maximize a reward that balances minimizing the Average Neighbor Distance (AND) among sampled neighbors against keeping as many neighbors as possible. This adaptive sampling prunes irrelevant connections; a minimal sketch of the sampler follows this list.
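
Below is a minimal sketch of the sampler: an MLP-based distance, top-$k_t$ selection, and a simplified bandit-style threshold update. The reward and update rule here are assumed stand-ins for the paper's BMAB formulation, intended only to show the mechanics.

```python
import torch
import torch.nn as nn

class NeighborSampler(nn.Module):
    """Sketch of RNS: score neighbors with an MLP-based distance, keep the
    top-k_t per relation type, and nudge k_t with a simple bandit-style rule
    (assumed form, not the paper's exact BMAB update)."""

    def __init__(self, dim: int, init_k: int = 10):
        super().__init__()
        self.mlp = nn.Linear(dim, dim)
        self.k = init_k  # current threshold k_t for one relation type

    def similarity(self, e_i: torch.Tensor, e_neighbors: torch.Tensor) -> torch.Tensor:
        # d(e_i, e_j) = || sigma(MLP(e_i)) - sigma(MLP(e_j)) ||_1,  s = 1 - d
        d = (torch.sigmoid(self.mlp(e_i)) - torch.sigmoid(self.mlp(e_neighbors))).abs().sum(-1)
        return 1.0 - d

    def sample(self, e_i: torch.Tensor, e_neighbors: torch.Tensor) -> torch.Tensor:
        """Return indices of the top-k_t most similar neighbors."""
        s = self.similarity(e_i, e_neighbors)
        k = min(self.k, e_neighbors.size(0))
        return s.topk(k).indices

    def update_threshold(self, avg_neighbor_distance: float, target: float = 0.5):
        # Reward trades off neighbor quality (low AND) against neighbor count:
        # grow k_t while sampled neighbors stay close, shrink it otherwise.
        self.k = self.k + 1 if avg_neighbor_distance < target else max(1, self.k - 1)

# Usage: filter one node's neighbors under the current threshold.
sampler = NeighborSampler(dim=64, init_k=5)
e_i, e_nbrs = torch.randn(64), torch.randn(20, 64)
kept = sampler.sample(e_i, e_nbrs)
```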

The RAM-GNN layer aggregates information from the filtered neighbors based on the composed embeddings and relation-level attention:

$\mathbf{e}^{(l)}_i = \sigma \Big( \alpha_{jn} \mathbf{W}^{(l)}_{\text{val}} \big( \underset{j \in \mathcal{N}_{k_t}(i)}{\mathrm{Aggr}} \big( \phi(\mathbf{e}^{(l-1)}_j, \mathbf{e}^{(l-1)}_n) \big) \oplus \mathbf{e}^{(l-1)}_i \big) \Big)$
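
The sketch below implements one such layer in PyTorch, reading $\oplus$ as concatenation, applying the relation-level attention inside the aggregation, projecting $\mathbf{e}_n = \mathbf{e}_t \parallel \mathbf{e}_v$ back to the embedding dimension before composition, and taking $\sigma$ to be the sigmoid; all of these readings, and the tensor layout, are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def circular_correlation(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Circular-correlation composition phi(a, b), computed via FFT."""
    return torch.fft.irfft(torch.conj(torch.fft.rfft(a)) * torch.fft.rfft(b), n=a.size(-1))

class RAMGNNLayer(nn.Module):
    """Sketch of one RAM-GNN layer: compose each neighbor with its relation
    embedding, weight the results with relation-level attention, aggregate,
    and combine with the head node's previous embedding."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_key = nn.Linear(dim, dim, bias=False)
        self.w_qry = nn.Linear(dim, dim, bias=True)
        self.w_val = nn.Linear(2 * dim, dim)
        self.w_rel = nn.Linear(2 * dim, dim)   # assumption: project e_n = e_t || e_v back to dim
        self.p = nn.Linear(dim, 1, bias=False)

    def forward(self, e_i, e_j, e_t, e_v):
        # e_i: [dim]; e_j, e_t, e_v: [num_sampled_neighbors, dim]
        e_n = self.w_rel(torch.cat([e_t, e_v], dim=-1))                # relation embedding e_n
        a = self.p(torch.sigmoid(self.w_key(e_j) + self.w_qry(e_n)))  # attention logits a_jn
        alpha = F.softmax(a, dim=0)                                    # relation-level attention
        composed = circular_correlation(e_j, e_n)                      # phi(e_j, e_n)
        aggregated = (alpha * composed).sum(dim=0)                     # attention-weighted Aggr
        out = self.w_val(torch.cat([aggregated, e_i], dim=-1))         # "⊕" read as concatenation
        return torch.sigmoid(out)

# Usage with toy tensors (5 sampled neighbors, 64-dim embeddings).
layer = RAMGNNLayer(64)
e_i_next = layer(torch.randn(64), torch.randn(5, 64), torch.randn(5, 64), torch.randn(5, 64))
```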

The pre-training is unsupervised, minimizing a combined loss $\mathcal{L}_{final}$ that includes a GNN loss $\mathcal{L}_{GNN}$ (similar to the similarity loss but computed over all nodes) and the similarity loss $\mathcal{L}_{sim}$ used to train the distance metric in RNS, plus regularization.
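
Spelled out, and with the caveat that how the loss weights $\lambda_1$ and $\lambda_\gamma$ (introduced in the tuning notes below) enter is an assumption rather than something stated in this summary, the objective has the form:

$\mathcal{L}_{final} = \mathcal{L}_{GNN} + \lambda_1 \mathcal{L}_{sim} + \lambda_\gamma \|\Theta_1\|^2$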

Recommendation Fine-tuning

After pre-training RAM-GNN on the user and item attribute graphs to obtain $\mathbf{e}_u$, $\mathbf{e}_i$ and relation value embeddings $\mathbf{e}_v$, these are used to initialize a separate GNN model on the user-item interaction graph. This second GNN focuses on collaborative signals.

The item representation $\mathbf{x}_i$ in this phase is formed by concatenating the pre-trained item embedding $\mathbf{e}_i$ and its relation value embeddings $\mathbf{e}_v$ from the attribute graph, followed by a linear transformation:

$\mathbf{x}^{(0)}_i = \mathbf{W}_3(\mathbf{e}_i \parallel \mathbf{e}_v)$

Similarly, the user embedding is initialized from its pre-trained embedding $\mathbf{e}_u$. The GNN layers then propagate information through the user-item interaction graph: user embeddings aggregate neighboring item embeddings, and item embeddings aggregate neighboring user embeddings, refining both representations based on collaborative behavior:

$\mathbf{x}^{(l)}_u = \sigma \Big( \mathbf{W}^{(l)}_4 \big( \underset{i \in \mathcal{N}(u)}{\mathrm{Aggr}}(\mathbf{x}^{(l-1)}_i) \oplus \mathbf{x}^{(l-1)}_u \big) \Big), \quad \mathbf{x}^{(l)}_i = \sigma \Big( \mathbf{W}^{(l)}_4 \big( \underset{u \in \mathcal{N}(i)}{\mathrm{Aggr}}(\mathbf{x}^{(l-1)}_u) \oplus \mathbf{x}^{(l-1)}_i \big) \Big)$
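
A minimal sketch of this stage follows: items are initialized via $\mathbf{W}_3$ from the concatenated pre-trained embeddings, and one propagation layer applies the two update equations with mean aggregation and concatenation for $\oplus$; these readings, and the toy data layout, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class InteractionGNNLayer(nn.Module):
    """One propagation layer on the user-item interaction graph (sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w4 = nn.Linear(2 * dim, dim)  # W_4^{(l)}, shared by the user and item updates

    def forward(self, x_u, x_i, user_items, item_users):
        # x_u: [num_users, dim], x_i: [num_items, dim]
        # user_items[u] / item_users[i]: index lists of interacted neighbors
        new_u = torch.stack([
            torch.sigmoid(self.w4(torch.cat([x_i[user_items[u]].mean(0), x_u[u]], -1)))
            for u in range(x_u.size(0))])
        new_i = torch.stack([
            torch.sigmoid(self.w4(torch.cat([x_u[item_users[i]].mean(0), x_i[i]], -1)))
            for i in range(x_i.size(0))])
        return new_u, new_i

# Initialization from the pre-trained RAM-GNN embeddings:
# x_i^(0) = W_3 (e_i || e_v); users start directly from e_u.
dim = 64
w3 = nn.Linear(2 * dim, dim)
e_u, e_i, e_v = torch.randn(3, dim), torch.randn(4, dim), torch.randn(4, dim)
x_u, x_i = e_u, w3(torch.cat([e_i, e_v], dim=-1))

# Toy interaction graph: user u interacted with items user_items[u].
user_items = [[0, 1], [1, 2], [3]]
item_users = [[0], [0, 1], [1], [2]]
layer = InteractionGNNLayer(dim)
x_u, x_i = layer(x_u, x_i, user_items, item_users)
```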

The final embeddings $\mathbf{x}_u$, $\mathbf{x}_i$ are used to predict user-item interaction scores $\hat{y}_{u,i} = \mathbf{x}_i^T \mathbf{x}_u$, and the model is trained using the BPR loss:

$\mathbb{L}(y_{u,i}, \hat{y}_{u,i}) = - \ln{\sigma (y_{u,i} - \hat{y}_{u,i})} + \lambda_2 \|\Theta_2\|$
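
The scoring and loss can be sketched as below; note this uses the standard pairwise BPR formulation (an observed item scored against a sampled negative), which is the usual reading of a BPR objective and is stated here as an assumption rather than taken verbatim from the equation above.

```python
import torch

def bpr_loss(x_u, x_pos, x_neg, params, lam=1e-4):
    """Pairwise BPR loss with L2 regularization (illustrative sketch)."""
    pos_scores = (x_u * x_pos).sum(-1)   # y_hat(u, i+) = x_i^T x_u
    neg_scores = (x_u * x_neg).sum(-1)   # y_hat(u, i-)
    reg = lam * sum(p.pow(2).sum() for p in params)
    return -torch.log(torch.sigmoid(pos_scores - neg_scores)).mean() + reg

# Toy batch: 32 users, each with one observed item and one sampled negative.
x_u = torch.randn(32, 64, requires_grad=True)
x_pos, x_neg = torch.randn(32, 64), torch.randn(32, 64)
loss = bpr_loss(x_u, x_pos, x_neg, params=[x_u])
loss.backward()
```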

Implementation Considerations and Practical Implications

  1. Data Preparation: A crucial step is constructing the multi-relational user and item graphs from available attribute data. This involves defining relation types and values and creating edges between nodes that share these attributes. For instance, connecting movies by "share_director" with the director's name as the relation value. User graphs can be constructed based on shared demographics like age or occupation.
  2. Computational Cost: While the RL-based RNS adds complexity during training initialization, its purpose is to improve efficiency and embedding quality once the optimal $k_t$ is found, by pruning irrelevant neighbors. Training the distance MLP and the RL agent requires additional computation. The pre-training and fine-tuning steps are separate, which might allow for distributed training but requires careful management of the embeddings passed between phases.
  3. Scalability: Pre-training on attribute graphs might be more scalable than running a GNN directly on a massive combined knowledge graph. The RNS module limits the number of neighbors aggregated, which is critical for scaling GNNs to dense graphs with high-degree nodes. However, generating and storing the potentially numerous quadruplets $(i, t, v, j)$ could be demanding for datasets with rich attributes.
  4. Hyperparameter Tuning: The paper provides guidance on embedding dimensions (optimal 60-80, depending on dataset size) and composition operations (circular-correlation slightly best). The RL parameters ($\epsilon$, initial $k_t$), the loss weights ($\lambda_\gamma$, $\lambda_1$, $\lambda_2$), and the RNS convergence condition all need tuning; a hypothetical configuration sketch is given after this list.
  5. Flexibility: The framework's modularity allows different GNN architectures to be potentially used in both the pre-training and fine-tuning steps. The composition operations offer flexibility in how relation information is integrated.
  6. Attribute Importance: The relation-level attention explicitly models the varying importance of different attribute relationships, providing interpretability and improving embedding quality.
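
As a rough starting point for item 4 above, a hypothetical configuration might look like the following; names are illustrative, and values marked as placeholders are not reported in the paper.

```python
# Hypothetical hyperparameter configuration; values follow the paper's guidance
# where reported (embedding dim 60-80, circular-correlation composition) and
# are placeholders otherwise.
config = {
    "embedding_dim": 64,                     # optimal range reported as 60-80
    "composition": "circular_correlation",   # vs. "addition", "multiplication"
    "rns": {
        "epsilon": 0.1,                      # RL exploration parameter (placeholder)
        "initial_k": 10,                     # starting threshold k_t per relation (placeholder)
    },
    "loss_weights": {
        "lambda_gamma": 1e-4,                # pre-training regularization (placeholder)
        "lambda_1": 1.0,                     # similarity-loss weight (placeholder)
        "lambda_2": 1e-4,                    # fine-tuning regularization (placeholder)
    },
}
```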

Experimental results on the MovieLens and KKBox datasets demonstrate that the full RAM-GNN framework significantly outperforms various state-of-the-art recommendation models, including standard matrix factorization, feature-based models, and other GNN-based recommenders like LightGCN and KGPolicy. The ablation study confirms the positive impact of both the relation-level attention and the Reinforced Neighbor Sampler within the pre-training phase. The RNS module is shown to learn relation-specific thresholds that converge, varying appropriately with dataset size and relation density. The paper highlights the practical advantage of pre-training on attribute information to enrich embeddings before incorporating collaborative signals.