Multimodal Pretraining and Generation for Recommendation: A Tutorial (2405.06927v1)
Abstract: Personalized recommendation stands as a ubiquitous channel for users to explore information or items aligned with their interests. Nevertheless, prevailing recommendation models predominantly rely on unique IDs and categorical features for user-item matching. While this ID-centric approach has witnessed considerable success, it falls short in comprehensively grasping the essence of raw item contents across diverse modalities, such as text, image, audio, and video. This underutilization of multimodal data poses a limitation to recommender systems, particularly in the realm of multimedia services like news, music, and short-video platforms. The recent surge in pretraining and generation techniques presents both opportunities and challenges in the development of multimodal recommender systems. This tutorial seeks to provide a thorough exploration of the latest advancements and future trajectories in multimodal pretraining and generation techniques within the realm of recommender systems. The tutorial comprises three parts: multimodal pretraining, multimodal generation, and industrial applications and open challenges in the field of recommendation. Our target audience encompasses scholars, practitioners, and other parties interested in this domain. By providing a succinct overview of the field, we aspire to facilitate a swift understanding of multimodal recommendation and foster meaningful discussions on the future development of this evolving landscape.