NoteLLM-2: Multimodal Large Representation Models for Recommendation (2405.16789v1)
Abstract: Large language models (LLMs) have demonstrated exceptional text understanding, and existing work has explored their application to text embedding tasks. However, few works utilize LLMs to assist multimodal representation tasks. In this work, we investigate the potential of LLMs to enhance multimodal representation in multimodal item-to-item (I2I) recommendation. One feasible approach is to transfer Multimodal LLMs (MLLMs) to representation tasks. However, pre-training MLLMs usually requires collecting high-quality, web-scale multimodal data, resulting in complex training procedures and high costs. This leads the community to rely heavily on open-source MLLMs, hindering customized training for representation scenarios. Therefore, we aim to design an end-to-end training method that customizes the integration of any existing LLM and vision encoder to construct efficient multimodal representation models. Preliminary experiments show that LLMs fine-tuned in this end-to-end manner tend to overlook image content. To overcome this challenge, we propose NoteLLM-2, a novel training framework designed specifically for multimodal representation. We propose two ways to enhance the focus on visual information. The first operates at the prompt level: multimodal content is separated into visual content and textual content, and NoteLLM-2 adopts multimodal In-Context Learning to teach LLMs to attend to both modalities and aggregate key information. The second operates at the model architecture level: a late fusion mechanism directly fuses visual information into the textual representation. Extensive experiments validate the effectiveness of our method.
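To make the two ideas in the abstract concrete, below is a minimal sketch, assuming a CLIP-style vision encoder and a decoder-only LLM. The prompt template, the special tokens (`<IMG>`, `<IMG_EMB>`, `<NOTE_EMB>`), the gated late-fusion formula, and all module and function names are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of (1) a prompt that separates visual and textual content and compresses
# each modality into its own special token, and (2) a gated late-fusion module that
# injects the pooled visual embedding directly into the text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedLateFusion(nn.Module):
    """Fuse a pooled visual embedding into the LLM's note embedding (assumed design)."""

    def __init__(self, text_dim: int, vis_dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)    # align visual and text dimensions
        self.gate = nn.Linear(2 * text_dim, text_dim)   # per-dimension fusion gate

    def forward(self, text_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        v = self.vis_proj(vis_emb)
        g = torch.sigmoid(self.gate(torch.cat([text_emb, v], dim=-1)))
        fused = g * text_emb + (1.0 - g) * v            # late fusion of the two modalities
        return F.normalize(fused, dim=-1)               # unit-norm embedding for I2I retrieval


def build_micl_prompt(title: str, topic: str, content: str) -> str:
    """Assumed multimodal in-context prompt: visual content first, textual content second,
    each followed by its own compression token."""
    return (
        "Note image: <IMG> Compress the image into one word: <IMG_EMB>.\n"
        f"Note text: title: {title}, topic: {topic}, content: {content}. "
        "Compress the note into one word: <NOTE_EMB>."
    )
```

In this sketch, the hidden states at `<IMG_EMB>` and `<NOTE_EMB>` would serve as the visual and textual summaries, the fused embedding would feed a contrastive loss over co-clicked note pairs for I2I recommendation, and the gate lets the model keep attending to image content even when the text dominates.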
Authors: Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Yan Gao, Yao Hu, Enhong Chen