
Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey (2404.00621v2)

Published 31 Mar 2024 in cs.IR and cs.MM

Abstract: Personalized recommendation serves as a ubiquitous channel for users to discover information tailored to their interests. However, traditional recommendation models primarily rely on unique IDs and categorical features for user-item matching, potentially overlooking the nuanced essence of raw item contents across multiple modalities such as text, image, audio, and video. This underutilization of multimodal data poses a limitation to recommender systems, especially in multimedia services like news, music, and short-video platforms. The recent advancements in large multimodal models offer new opportunities and challenges in developing content-aware recommender systems. This survey seeks to provide a comprehensive exploration of the latest advancements and future trajectories in multimodal pretraining, adaptation, and generation techniques, as well as their applications in enhancing recommender systems. Furthermore, we discuss current open challenges and opportunities for future research in this dynamic domain. We believe that this survey, alongside the curated resources, will provide valuable insights to inspire further advancements in this evolving landscape.

Authors (9)
  1. Qijiong Liu (22 papers)
  2. Jieming Zhu (69 papers)
  3. Yanting Yang (10 papers)
  4. Quanyu Dai (39 papers)
  5. Zhaocheng Du (17 papers)
  6. Xiao-Ming Wu (91 papers)
  7. Zhou Zhao (219 papers)
  8. Rui Zhang (1140 papers)
  9. Zhenhua Dong (77 papers)
Citations (9)

