MMGRec: Multimodal Generative Recommendation with Transformer Model (2404.16555v1)

Published 25 Apr 2024 in cs.IR

Abstract: Multimodal recommendation aims to recommend user-preferred candidates based on a user's historically interacted items and the associated multimodal information. Previous studies commonly employ an embed-and-retrieve paradigm: learning user and item representations in the same embedding space, then retrieving similar candidate items for a user via the embedding inner product. However, this paradigm suffers from inference cost, interaction modeling, and false-negative issues. To this end, we propose a new model, MMGRec, which introduces a generative paradigm into multimodal recommendation. Specifically, we first devise a hierarchical quantization method, Graph RQ-VAE, to assign a Rec-ID to each item from its multimodal and collaborative filtering (CF) information. Consisting of a tuple of semantically meaningful tokens, the Rec-ID serves as the unique identifier of each item. Afterward, we train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences. The paradigm is generative in that the model predicts, token by token in an autoregressive manner, the tuple of tokens identifying the recommended item. Moreover, a relation-aware self-attention mechanism is devised for the Transformer to handle non-sequential interaction sequences; it explores pairwise relations between elements to replace absolute positional encoding. Extensive experiments evaluate MMGRec's effectiveness compared with state-of-the-art methods.
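
To make the idea of a Rec-ID as a tuple of tokens more concrete, the following is a minimal sketch of plain residual quantization, the generic technique underlying RQ-VAE-style methods. It is not the paper's Graph RQ-VAE: the graph encoder and the training of the codebooks are omitted, and the function names (`nearest`, `residual_quantize`), the toy codebooks, and the two-dimensional item vector are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codeword closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)),
    )


def residual_quantize(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[Tuple[int, ...], List[float]]:
    """Quantize x level by level; each level encodes what the previous levels missed.

    Returns the tuple of chosen codeword indices (one token per level)
    and the residual left over after the last level.
    """
    residual = list(x)
    tokens = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        # Subtract the chosen codeword so the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tuple(tokens), residual


# Hypothetical toy codebooks: two levels, three codewords each (fixed, not learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

# A made-up item representation standing in for the fused multimodal/CF feature.
rec_id, residual = residual_quantize([1.2, 1.0], codebooks)
print(rec_id)  # (1, 0)
```

In this toy setting the first token picks the coarse region of the embedding space and the second token refines it, which is the sense in which such identifiers are hierarchical. A generative recommender in the spirit of the abstract would then predict these tokens one at a time rather than scoring every item embedding.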

Authors (6)
  1. Han Liu (340 papers)
  2. Yinwei Wei (36 papers)
  3. Xuemeng Song (30 papers)
  4. Weili Guan (35 papers)
  5. Yuan-Fang Li (90 papers)
  6. Liqiang Nie (191 papers)
Citations (6)