
Towards Unified Multi-Modal Personalization: Large Vision-Language Models for Generative Recommendation and Beyond (2403.10667v2)

Published 15 Mar 2024 in cs.IR, cs.AI, cs.CL, and cs.MM

Abstract: Developing a universal model that can effectively harness heterogeneous resources and respond to a wide range of personalized needs has been a longstanding community aspiration. Our daily choices, especially in domains like fashion and retail, are substantially shaped by multi-modal data, such as pictures and textual descriptions. These modalities not only offer intuitive guidance but also cater to personalized user preferences. However, the predominant personalization approaches mainly focus on the ID-based or text-based recommendation problem, failing to comprehend the information spanning various tasks or modalities. In this paper, our goal is to establish a Unified paradigm for Multi-modal Personalization systems (UniMP), which effectively leverages multi-modal data while eliminating the complexities associated with task- and modality-specific customization. We argue that the advancements in foundational generative modeling have provided the flexibility and effectiveness necessary to achieve the objective. In light of this, we develop a generic and extensible generative personalization framework that can handle a wide range of personalized needs, including item recommendation, product search, preference prediction, explanation generation, and user-guided image generation. Our methodology enhances the capabilities of foundational LLMs for personalized tasks by seamlessly ingesting interleaved cross-modal user history information, ensuring a more precise and customized experience for users. To train and evaluate the proposed multi-modal personalized tasks, we also introduce a novel and comprehensive benchmark covering a variety of user requirements. Our experiments on the real-world benchmark showcase the model's potential, outperforming competitive methods specialized for each task.

Authors (11)
  1. Tianxin Wei
  2. Bowen Jin
  3. Ruirui Li
  4. Hansi Zeng
  5. Zhengyang Wang
  6. Jianhui Sun
  7. Qingyu Yin
  8. Hanqing Lu
  9. Suhang Wang
  10. Jingrui He
  11. Xianfeng Tang