MMGRec: Multimodal Generative Recommendation with Transformer Model (2404.16555v1)
Abstract: Multimodal recommendation aims to recommend candidates a user will prefer, based on the user's historically interacted items and the associated multimodal information. Previous studies commonly employ an embed-and-retrieve paradigm: they learn user and item representations in the same embedding space, then retrieve similar candidate items for a user via the embedding inner product. However, this paradigm suffers from issues with inference cost, interaction modeling, and false negatives. To this end, we propose MMGRec, a model that introduces a generative paradigm into multimodal recommendation. Specifically, we first devise a hierarchical quantization method, Graph RQ-VAE, to assign a Rec-ID to each item from its multimodal and collaborative filtering (CF) information. Consisting of a tuple of semantically meaningful tokens, the Rec-ID serves as the unique identifier of each item. Afterward, we train a Transformer-based recommender to generate the Rec-IDs of user-preferred items based on historical interaction sequences. The paradigm is generative because the model predicts the tuple of tokens identifying the recommended item in an autoregressive manner. Moreover, a relation-aware self-attention mechanism is devised for the Transformer to handle non-sequential interaction sequences; it exploits pairwise relations between elements in place of absolute positional encoding. Extensive experiments evaluate MMGRec's effectiveness compared with state-of-the-art methods.
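To make the idea of a Rec-ID as a tuple of tokens more concrete, the following is a minimal sketch of plain residual quantization, the lookup step that hierarchical quantizers of the RQ-VAE family build on. It is not the paper's Graph RQ-VAE: the codebooks here are hand-written rather than learned, no graph or multimodal encoder is involved, and the names `nearest_code`, `residual_quantize`, and the toy values are assumptions made for illustration only.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(vector, codebooks):
    """Pick one token per level; each level quantizes what the previous level left over."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        # Subtract the chosen code so the next level refines the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(tokens), residual


# Hypothetical toy setup: 2 levels, 2 codes per level, 2-dimensional embeddings.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.5, -0.5]],   # finer level, applied to the residual
]
item_embedding = [1.4, 0.6]      # stand-in for a fused item representation

rec_id, residual = residual_quantize(item_embedding, codebooks)
print(rec_id)
```

In this sketch the tuple of per-level indices plays the role of the item identifier, and a generative recommender would emit those indices one at a time. How the paper learns the codebooks and guarantees uniqueness of each Rec-ID is not shown here.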
Authors:
- Han Liu
- Yinwei Wei
- Xuemeng Song
- Weili Guan
- Yuan-Fang Li
- Liqiang Nie