What Makes Multimodal In-Context Learning Work? (2404.15736v2)
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, including the ability to acquire new skills quickly, for instance through In-Context Learning (ICL) from only a few demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study reveals several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When combined with an advanced ICL strategy such as RICES, M-ICL performs no better than a simple majority vote over the context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code is available at https://gitlab.com/folbaeni/multimodal-icl
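To make finding (2) concrete, below is a minimal, hypothetical sketch of the two strategies being compared: RICES-style demonstration retrieval (ranking candidate examples by image-embedding similarity to the query, e.g., using CLIP features) against the model-free majority-vote baseline that simply predicts the most frequent label among the retrieved demonstrations. All names here (rices_select, majority_vote, the random stand-in embeddings) are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: RICES-style retrieval vs. a majority-vote baseline.
# Embeddings and labels are random stand-ins for real CLIP features and data.
from collections import Counter

import numpy as np


def rices_select(query_emb: np.ndarray, pool_embs: np.ndarray, k: int) -> list[int]:
    """Return indices of the k pool examples most similar to the query.

    RICES ranks candidate demonstrations by cosine similarity between
    image embeddings (e.g., CLIP) of the query and each candidate.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    return list(np.argsort(-sims)[:k])


def majority_vote(labels: list[str]) -> str:
    """Baseline: predict the most frequent label among the retrieved
    demonstrations, ignoring the query image entirely."""
    return Counter(labels).most_common(1)[0][0]


# Usage sketch: retrieve 8 demonstrations for one query, then take the
# model-free majority-vote answer over their labels.
rng = np.random.default_rng(0)
pool_embs = rng.normal(size=(100, 512))               # stand-in for CLIP embeddings
pool_labels = [f"class_{i % 5}" for i in range(100)]  # stand-in labels
query_emb = rng.normal(size=512)

top_k = rices_select(query_emb, pool_embs, k=8)
print(majority_vote([pool_labels[i] for i in top_k]))
```

The comparison the abstract draws is between this majority vote and feeding the same retrieved demonstrations to a multimodal model as in-context examples; when the two agree in accuracy, the model adds little beyond the retrieval step itself.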
Authors: Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski