Enhancing ID and Text Fusion via Alternative Training in Session-based Recommendation (2402.08921v1)

Published 14 Feb 2024 in cs.IR and cs.AI

Abstract: Session-based recommendation has gained increasing attention in recent years, with its aim to offer tailored suggestions based on users' historical behaviors within sessions. To advance this field, a variety of methods have been developed, with ID-based approaches typically demonstrating promising performance. However, these methods often face challenges with long-tail items and overlook other rich forms of information, notably valuable textual semantic information. To integrate text information, various methods have been introduced, mostly following a naive fusion framework. Surprisingly, we observe that fusing these two modalities does not consistently outperform the best single modality by following the naive fusion framework. Further investigation reveals a potential imbalance issue in naive fusion, where the ID modality dominates and the text modality is undertrained. This suggests that the unexpected observation may stem from naive fusion's failure to effectively balance the two modalities, often over-relying on the stronger ID modality. This insight suggests that naive fusion might not be as effective in combining ID and text as previously expected. To address this, we propose a novel alternative training strategy, AlterRec. It separates the training of ID and text, thereby avoiding the imbalance issue seen in naive fusion. Additionally, AlterRec designs a novel strategy to facilitate the interaction between the two modalities, enabling them to mutually learn from each other and integrate the text more effectively. Comprehensive experiments demonstrate the effectiveness of AlterRec in session-based recommendation. The implementation is available at https://github.com/Juanhui28/AlterRec.


Summary

  • The paper introduces AlterRec, which alternates training of separate ID and text networks to overcome the dominance of a single modality.
  • It employs hard negative sampling based on the partner network's predictions to fine-tune each modality effectively.
  • Experimental results demonstrate that AlterRec improves recommendation accuracy and enhances long-tail item performance compared to naive fusion methods.

This paper addresses the challenge of effectively combining item ID information and textual information (like titles and descriptions) for session-based recommendation (2402.08921). While ID-based methods are often effective, they struggle with sparse data (long-tail items) and ignore rich semantic signals in text. Methods integrating text commonly use a "naive fusion" approach: encode IDs and text separately, merge the embeddings (e.g., sum or concatenate), and train jointly.
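
For contrast, here is a rough PyTorch-style sketch of such a naive fusion model, assuming element-wise summation of the two embeddings, mean pooling as the sequence encoder, and inner-product scoring; all class and variable names are illustrative rather than taken from the paper's code.

import torch
import torch.nn as nn

class NaiveFusionSketch(nn.Module):
    """Naive fusion: merge ID and (projected) text embeddings, train one joint objective."""
    def __init__(self, num_items: int, dim: int, sbert_item_emb: torch.Tensor):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, dim)                # learned ID embeddings
        self.register_buffer("text_emb", sbert_item_emb)          # frozen per-item text features
        self.text_proj = nn.Linear(sbert_item_emb.size(1), dim)   # trainable projection

    def fused_item_table(self) -> torch.Tensor:
        # one fused representation per item: ID embedding + projected text embedding
        return self.id_emb.weight + self.text_proj(self.text_emb)

    def forward(self, session_item_ids: torch.Tensor) -> torch.Tensor:
        fused = self.fused_item_table()[session_item_ids]   # (batch, seq_len, dim)
        session_vec = fused.mean(dim=1)                      # mean-pooled session embedding
        return session_vec @ self.fused_item_table().T       # joint scores over all items

Because a single loss drives both embedding tables through the same fused representation, the gradient signal can be dominated by whichever modality is stronger early in training, which is exactly the imbalance the paper observes.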

The authors conduct a preliminary study and find, counterintuitively, that this naive fusion often performs similarly to or even worse than training solely on IDs. Further analysis (using their naive-fusion implementation NFRec and existing models such as UniSRec and FDSA) suggests an "imbalance issue": the ID modality, often being stronger initially, dominates the joint training process, leaving the text modality undertrained so that it contributes little to the final performance. This aligns with findings in broader multi-modal learning research indicating that fusion does not always beat the best single modality.

To overcome this imbalance and effectively leverage both modalities, the paper proposes AlterRec, a novel framework based on alternative training. Instead of joint fusion and training, AlterRec maintains two separate uni-modal networks:

  1. ID Uni-modal Network: Uses an ID embedding layer followed by a sequence encoder (mean pooling or a Transformer) to generate session representations and predict the next item based solely on IDs.
  2. Text Uni-modal Network: Uses a text encoder (Sentence-BERT, kept frozen to limit computational cost, followed by a trainable MLP) and a sequence encoder (mean pooling or a Transformer) to generate session representations and predict the next item based solely on text features. A minimal sketch of both networks follows this list.
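
A minimal PyTorch-style sketch of the two uni-modal networks, assuming mean pooling as the sequence encoder, inner-product scoring, and precomputed (frozen) Sentence-BERT item embeddings; the names are illustrative, not the authors' released code.

import torch
import torch.nn as nn

class IDNetwork(nn.Module):
    """ID uni-modal network: learned item embeddings, mean-pooled into a session vector."""
    def __init__(self, num_items: int, dim: int):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)

    def forward(self, session_item_ids: torch.Tensor) -> torch.Tensor:
        session_vec = self.item_emb(session_item_ids).mean(dim=1)   # (batch, dim)
        return session_vec @ self.item_emb.weight.T                 # scores over all items

class TextNetwork(nn.Module):
    """Text uni-modal network: frozen Sentence-BERT item embeddings plus a trainable MLP."""
    def __init__(self, sbert_item_emb: torch.Tensor, dim: int):
        super().__init__()
        self.register_buffer("sbert_emb", sbert_item_emb)            # fixed, never updated
        self.proj = nn.Sequential(                                   # trainable projection MLP
            nn.Linear(sbert_item_emb.size(1), dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, session_item_ids: torch.Tensor) -> torch.Tensor:
        item_vecs = self.proj(self.sbert_emb)                        # (num_items, dim)
        session_vec = item_vecs[session_item_ids].mean(dim=1)        # (batch, dim)
        return session_vec @ item_vecs.T                             # scores over all items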

The core innovation lies in the training strategy:

  • Separate Training: The ID and text networks are trained separately, avoiding the optimization imbalance seen in naive fusion.
  • Alternative Updates: The training alternates between the ID and text networks (e.g., train the ID network for $m_{gap}$ epochs, then the text network for $m_{gap}$ epochs, and repeat).
  • Prediction-based Interaction: Crucially, the networks learn from each other implicitly. When training one network (e.g., text network), predictions from the other network (e.g., ID network) are used to generate training signals:
    • Hard Negative Sampling: Instead of random negatives, items ranked highly by the ID network (but not the true target) are used as hard negatives for training the text network (ranks $k_1$ to $k_2$). This forces the text network to learn distinctions relevant to the ID network's perspective. The same process applies vice versa.
    • Positive Sample Augmentation (Optional): Items ranked in the top $p$ by the ID network can be used as additional positive targets (weighted by $\beta$) when training the text network, providing extra signal, especially for sparse items. This variant is called AlterRec_aug.

The loss function for each network is typically cross-entropy, calculated using the true target(s) and the hard negative samples derived from the other network's predictions.
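
The pseudocode below sketches one such update for the text network, driven by the ID network's predictions; helper names such as calculate_cross_entropy_loss are illustrative rather than taken from the released implementation.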

# One update of the text network, using the current ID network's predictions
# to supply hard negatives (and optionally extra positives).
id_scores = id_network.predict(sessions)            # scores over all items
id_rankings = argsort(id_scores, descending=True)   # items ranked by the ID network

hard_negatives = id_rankings[:, k1:k2]              # highly ranked non-target items
augmented_positives = id_rankings[:, 0:p]           # optional extra positive targets

text_scores = text_network.predict(sessions)

loss = calculate_cross_entropy_loss(
    text_scores,
    true_targets,
    hard_negatives,
    augmented_positives,  # optional
    beta                  # optional weight for augmented positives
)

optimizer_text.zero_grad()
loss.backward()
optimizer_text.step()

During inference, final prediction scores are a weighted sum of the scores from the converged ID and text networks:

$$y_{\mathbf{s},i} = \alpha \cdot y^{ID}_{\mathbf{s},i} + (1-\alpha) \cdot y^{text}_{\mathbf{s},i}$$

where $\alpha$ is a hyperparameter balancing the contribution of each modality.
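
A small, self-contained sketch of this inference-time fusion (function and variable names are illustrative):

import torch

def fuse_scores(id_scores: torch.Tensor, text_scores: torch.Tensor,
                alpha: float = 0.5, top_n: int = 10) -> torch.Tensor:
    """Weighted sum of the converged ID and text networks' scores, then rank items."""
    final_scores = alpha * id_scores + (1 - alpha) * text_scores
    return final_scores.topk(top_n, dim=-1).indices   # top-n recommended item indices

# Example with random scores for 2 sessions over a 1000-item catalog:
recommendations = fuse_scores(torch.randn(2, 1000), torch.randn(2, 1000), alpha=0.5)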

Implementation Considerations:

  • Text Encoder: Using pre-trained Sentence-BERT and keeping it fixed during training significantly reduces computational cost. An MLP is used to project SBERT embeddings to the desired dimension.
  • Sequence Encoders: The choice between mean pooling and a Transformer for the session embedding ($g_{mean}$ vs. $g_{Trans}$) depends on the dataset (empirical choice).
  • Hyperparameters: Key parameters include the negative sampling ranks ($k_1$, $k_2$), the augmentation rank ($p$), the augmentation weight ($\beta$), the alternation gap ($m_{gap}$), and the final score weighting ($\alpha$). Parameter analysis shows reasonable stability, with $\alpha = 0.5$ often performing well.
  • Training Stages: The paper uses an initial phase ($m_{random}$ epochs) where the networks train with random negatives before switching to the alternating hard-negative strategy (a short sketch of this schedule follows the list).
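
The schedule itself can be summarized by a small helper; the exact phase boundaries below are an assumption inferred from the paper's description (warm-up with random negatives, then alternation every $m_{gap}$ epochs), not the released code.

def phase_for_epoch(epoch: int, m_random: int, m_gap: int) -> str:
    """Which network is trained at a given epoch under the warm-up-then-alternate schedule."""
    if epoch < m_random:
        return "both networks, random negatives"
    block = (epoch - m_random) // m_gap                 # index of the current m_gap block
    active = "ID" if block % 2 == 0 else "text"
    return f"{active} network, hard negatives from the other network"

# e.g., with m_random = 2 and m_gap = 3, the first ten epochs look like:
for epoch in range(10):
    print(epoch, phase_for_epoch(epoch, m_random=2, m_gap=3))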

Experiments & Results:

  • Datasets: Homedepot (private) and Amazon-M2 (public, using Spanish, French, Italian subsets).
  • Baselines: Included ID-based (SASRec, BERT4Rec, SR-GNN, CORE, HG-GNN) and text-integrated (UniSRec, FDSA, S³-Rec, LLM2BERT4Rec variants).
  • Findings:
    • AlterRec and AlterRec_aug consistently outperformed strong baselines across datasets and metrics (Hits@N, NDCG@N).
    • Ablation studies confirmed the effectiveness of using hard negatives (vs. random negatives, which mimics independent training) and the contribution of both ID and text modalities within AlterRec.
    • Analysis showed AlterRec successfully trains both modalities without the imbalance issue observed in naive fusion (Figure 8).
    • AlterRec demonstrated significant improvements, particularly for long-tail items (Figure 9), indicating effective use of textual information where ID interactions are sparse.

Conclusion: The paper identifies a critical imbalance issue in naive fusion methods for session-based recommendation and proposes AlterRec, an effective alternative training strategy. By training ID and text networks separately but facilitating interaction through shared hard negative (and optionally positive) samples derived from each other's predictions, AlterRec achieves better integration of textual information, leading to state-of-the-art performance, especially on long-tail items. The implementation is available publicly.
