BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer (1904.06690v2)

Published 14 Apr 2019 in cs.IR and cs.LG

Abstract: Modeling users' dynamic and evolving preferences from their historical behaviors is challenging and crucial for recommendation systems. Previous methods employ sequential neural networks (e.g., Recurrent Neural Network) to encode users' historical interactions from left to right into hidden representations for making recommendations. Although these methods achieve satisfactory results, they often assume a rigidly ordered sequence which is not always practical. We argue that such left-to-right unidirectional architectures restrict the power of the historical sequence representations. For this purpose, we introduce a Bidirectional Encoder Representations from Transformers for sequential Recommendation (BERT4Rec). However, jointly conditioning on both left and right context in deep bidirectional model would make the training become trivial since each item can indirectly "see the target item". To address this problem, we train the bidirectional model using the Cloze task, predicting the masked items in the sequence by jointly conditioning on their left and right context. Comparing with predicting the next item at each position in a sequence, the Cloze task can produce more samples to train a more powerful bidirectional model. Extensive experiments on four benchmark datasets show that our model outperforms various state-of-the-art sequential models consistently.

Authors (7)
  1. Fei Sun (151 papers)
  2. Jun Liu (606 papers)
  3. Jian Wu (315 papers)
  4. Changhua Pei (19 papers)
  5. Xiao Lin (181 papers)
  6. Wenwu Ou (37 papers)
  7. Peng Jiang (274 papers)
Citations (1,843)

Summary

  • The paper introduces a bidirectional Transformer model (BERT4Rec) that enhances sequential recommendations with a novel Cloze task training method.
  • It outperforms state-of-the-art methods, demonstrating an average 11.03% NDCG@10 improvement across multiple datasets.
  • Ablation studies confirm that components like positional embeddings and multi-head self-attention are essential for the model’s performance.

BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer

This paper introduces BERT4Rec, a sequential recommendation model that leverages bidirectional encoder representations from the Transformer architecture to model user behavior sequences. The authors argue that traditional unidirectional models constrain the representational power of historical sequences because of their left-to-right encoding. BERT4Rec mitigates this limitation by employing the Transformer's bidirectional self-attention mechanism.

Model Architecture

BERT4Rec utilizes a multi-layer bidirectional Transformer to encode user behavior sequences. The model stacks L Transformer layers, each containing a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer. The self-attention mechanism allows the model to capture dependencies between items across the entire sequence without regard to their positional distance. This results in a global receptive field that directly captures sequential patterns more efficiently than traditional RNNs or CNNs.
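The following PyTorch sketch illustrates this layer stacking using an off-the-shelf `nn.TransformerEncoder`; the hyperparameter names and values (`n_items`, `d_model`, `n_heads`, `n_layers`, `max_len`) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BERT4RecEncoder(nn.Module):
    """Minimal sketch of the bidirectional encoder: L stacked Transformer layers,
    each with multi-head self-attention and a position-wise feed-forward network."""

    def __init__(self, n_items, d_model=64, n_heads=2, n_layers=2, max_len=200, dropout=0.1):
        super().__init__()
        # id 0 is reserved for padding, id n_items + 1 for the [mask] token
        self.item_emb = nn.Embedding(n_items + 2, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)           # learnable positions
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.dropout = nn.Dropout(dropout)

    def forward(self, item_ids):                                # (batch, seq_len)
        positions = torch.arange(item_ids.size(1), device=item_ids.device)
        h = self.item_emb(item_ids) + self.pos_emb(positions)   # add positional info
        h = self.dropout(h)
        pad_mask = item_ids.eq(0)                                # ignore padding only;
        # no causal mask: every position attends to both left and right context
        return self.encoder(h, src_key_padding_mask=pad_mask)   # (batch, seq_len, d_model)
```

Note that no causal (look-ahead) mask is applied; this is what makes the encoder bidirectional, and it is also why the Cloze training objective described next is needed.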

Cloze Task for Training

A key feature of BERT4Rec is its training methodology using the Cloze task, which is adapted to prevent information leakage inherent in bidirectional models when predicting future items. Instead of training by predicting the next item in a sequence, the Cloze task randomly masks some items in the input sequence and predicts these masked items based on their context. This task trains the model to encode the sequence effectively by leveraging both left and right contexts, a significant departure from unidirectional training objectives seen in models like GRU4Rec and SASRec.
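A minimal sketch of such Cloze-style masking is shown below, assuming items are integer-encoded with 0 reserved for padding and a dedicated `mask_token` id; the exact sampling and special-token handling in the paper may differ.

```python
import torch

def cloze_mask(seq, mask_token, rho=0.2):
    """Randomly replace a proportion rho of the items with [mask] (Cloze task).
    Returns the corrupted sequence and per-position labels (0 where no loss is taken)."""
    seq = seq.clone()
    labels = torch.zeros_like(seq)
    candidates = seq.ne(0)                                     # never mask padding
    chosen = candidates & (torch.rand_like(seq, dtype=torch.float) < rho)
    labels[chosen] = seq[chosen]                               # predict the original item
    seq[chosen] = mask_token                                   # hide it from the encoder
    return seq, labels
```

At inference time, the paper appends the mask token to the end of a user's interaction history and ranks candidate items by the model's prediction at that final position.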

Experimental Results

The efficacy of BERT4Rec was evaluated on four distinct datasets: Amazon Beauty, Steam, MovieLens-1M (ML-1m), and MovieLens-20M (ML-20m). The model consistently outperformed various state-of-the-art methods such as GRU4Rec, Caser, and SASRec. For instance, BERT4Rec yielded an average of 11.03% improvement in NDCG@10 over the strongest baselines across datasets.
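For reference, under the paper's leave-one-out protocol each user has a single held-out ground-truth item, so NDCG@10 reduces to a reciprocal-log form; the sketch below assumes a plain Python list of ranked item ids.

```python
import math

def ndcg_at_k(ranked_item_ids, ground_truth_item, k=10):
    """NDCG@k with one relevant item per user: IDCG = 1, so the score is
    1 / log2(rank + 2) if the true item is in the top k, else 0."""
    top_k = ranked_item_ids[:k]
    if ground_truth_item in top_k:
        rank = top_k.index(ground_truth_item)   # 0-based position in the ranking
        return 1.0 / math.log2(rank + 2)
    return 0.0
```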

Analysis and Ablation Studies

Several analyses were conducted to isolate the contributions of different model components:

  1. Bidirectional Context Encoding: Comparative studies revealed that the bidirectional nature of BERT4Rec significantly enhanced performance over unidirectional models. The improved results emphasize the advantage of conditioning on context from both directions within sequences.
  2. Mask Proportion in Cloze Task: Experiments varying the mask proportion ρ in the Cloze task indicated that the optimal value depends on the typical sequence length of the dataset. Well-chosen ρ values led to statistically significant performance gains, particularly on longer sequences.
  3. Hyperparameter Sensitivity: Various hyperparameters, notably hidden dimensionality d and maximum sequence length N, were scrutinized. The results showed that larger dimensionality stabilized model performance, while excessively long sequences could introduce noise and overfitting.
  4. Ablation of Components: Removing or altering elements like positional embeddings, position-wise feed-forward layers, layer normalization, and residual connections produced distinct performance drops, validating their necessity. For instance, removing positional embeddings resulted in substantial degradation, especially on long-sequence datasets such as ML-1m and ML-20m (a small check after this list illustrates why).
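The importance of positional embeddings (item 4 above) follows from the fact that self-attention alone is permutation-equivariant: shuffling the interaction history merely permutes the outputs, so without positional information the model cannot distinguish recent items from old ones. A toy check with an untrained attention layer (all names and values illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 5, 8)                       # a sequence of 5 item embeddings
perm = torch.tensor([3, 0, 4, 1, 2])           # an arbitrary reordering

out_original, _ = attn(x, x, x)
out_shuffled, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Permuting the inputs only permutes the outputs: order carries no information.
print(torch.allclose(out_original[:, perm], out_shuffled, atol=1e-5))  # True
```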

Implications and Future Work

BERT4Rec introduces an impactful approach to sequential recommendation, pushing the envelope in how bidirectional context and self-attention can be utilized in recommendation systems. Practically, this has implications for developing more accurate and user-responsive recommendation engines, particularly as consumer behavior data becomes increasingly comprehensive and temporally rich.

Theoretically, this work extends the applicability of Transformer architectures beyond NLP into recommendation systems, highlighting the flexibility and adaptability of deep learning methodologies.

Future work may explore integrating rich item metadata (such as product features and prices) and explicit user modeling (possibly considering user sessions in broader contexts). These augmentations could further refine BERT4Rec’s recommendation accuracy and generalizability.

In conclusion, BERT4Rec represents a significant advancement in using deep bidirectional self-attention mechanisms for sequential recommendation tasks, proving the merit of exploring and adapting state-of-the-art NLP approaches to recommendation systems.
