What Makes Multimodal In-Context Learning Work? (2404.15736v2)

Published 24 Apr 2024 in cs.CV and cs.AI

Abstract: LLMs have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at https://gitlab.com/folbaeni/multimodal-icl
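
As a rough illustration of finding (2), the sketch below contrasts RICES-style demonstration retrieval (selecting in-context examples by image similarity) with the majority-voting baseline over those same retrieved examples. This is a minimal sketch, assuming precomputed, L2-normalized CLIP image embeddings; the function names and the choice of k are illustrative and are not taken from the authors' released code.

```python
from collections import Counter

import numpy as np


def rices_retrieve(query_emb, demo_embs, k=8):
    """Pick the k demonstrations whose (L2-normalized) CLIP image
    embeddings are most similar to the query image embedding."""
    sims = demo_embs @ query_emb            # cosine similarity for normalized embeddings
    return np.argsort(-sims)[:k]            # indices of the k most similar demonstrations


def majority_vote_baseline(query_emb, demo_embs, demo_answers, k=8):
    """Ignore the multimodal model entirely: predict the most frequent
    answer among the k retrieved demonstrations. This is the simple
    baseline that RICES-based M-ICL is compared against."""
    top_idx = rices_retrieve(query_emb, demo_embs, k)
    votes = Counter(demo_answers[i] for i in top_idx)
    return votes.most_common(1)[0][0]
```

In this view, if the model prompted with RICES-selected demonstrations does no better than `majority_vote_baseline`, the in-context examples are contributing mainly through their labels rather than through genuine multimodal reasoning.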

Authors (5)
  1. Folco Bertini Baldassini (2 papers)
  2. Mustafa Shukor (27 papers)
  3. Matthieu Cord (129 papers)
  4. Laure Soulier (39 papers)
  5. Benjamin Piwowarski (38 papers)
Citations (12)