Towards Multimodal In-Context Learning for Vision & Language Models (2403.12736v2)

Published 19 Mar 2024 in cs.CV

Abstract: State-of-the-art Vision-Language Models (VLMs) ground the vision and language modalities primarily by projecting the vision tokens from the encoder into language-like tokens, which are fed directly to the LLM decoder. While these models have shown unprecedented performance on many downstream zero-shot tasks (e.g., image captioning, question answering), little emphasis has been put on transferring one of the core LLM capabilities, In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task from a few example demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed-modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) underperform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of present VLMs, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading to up to a 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines across a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over prior art.
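
To make the ICL setup concrete, the following is a minimal illustrative sketch (not the paper's code) of how a multimodal few-shot prompt is typically assembled: k image-question-answer demonstrations are interleaved with the final query, whose answer the model must complete. The `Demo` dataclass and `build_icl_prompt` helper are hypothetical names introduced here for illustration.

```python
# Hypothetical sketch of interleaved image-text few-shot (ICL) prompt construction.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Demo:
    image_path: str   # path to a demonstration image
    question: str     # question posed about the demonstration image
    answer: str       # ground-truth answer shown in-context

def build_icl_prompt(demos: List[Demo],
                     query_image: str,
                     query_question: str) -> List[Union[str, dict]]:
    """Interleave k demonstrations with the final query, in the order the VLM consumes them."""
    segments: List[Union[str, dict]] = []
    for d in demos:
        segments.append({"type": "image", "path": d.image_path})
        segments.append(f"Question: {d.question}\nAnswer: {d.answer}\n")
    # The query repeats the same format but leaves the answer for the model to generate.
    segments.append({"type": "image", "path": query_image})
    segments.append(f"Question: {query_question}\nAnswer:")
    return segments

if __name__ == "__main__":
    demos = [
        Demo("cat.jpg", "What animal is shown?", "A cat."),
        Demo("dog.jpg", "What animal is shown?", "A dog."),
    ]
    for segment in build_icl_prompt(demos, "rabbit.jpg", "What animal is shown?"):
        print(segment)
```

The resulting segment list would then be converted into the specific interleaved image/text input format expected by a given VLM; the paper's finding is that without explicit ICL instruction tuning, models often fail to exploit such demonstrations.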

Authors (8)
  1. Sivan Doveh (20 papers)
  2. Shaked Perek (3 papers)
  3. M. Jehanzeb Mirza (15 papers)
  4. Amit Alfassy (9 papers)
  5. Assaf Arbelle (26 papers)
  6. Shimon Ullman (32 papers)
  7. Leonid Karlinsky (79 papers)
  8. Wei Lin (207 papers)
Citations (12)