VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning (2403.13164v2)
Abstract: LLMs famously exhibit emergent in-context learning (ICL) -- the ability to rapidly adapt to new tasks from a few examples provided in the prompt, without updating the model's weights. Built on top of LLMs, vision LLMs (VLLMs) have advanced significantly in areas such as recognition, reasoning, and grounding. However, investigations into \emph{multimodal ICL} have predominantly focused on few-shot visual question answering (VQA) and image captioning, which we will show neither exploit the strengths of ICL nor test its limitations. The broader capabilities and limitations of multimodal ICL remain under-explored. In this study, we introduce VL-ICL Bench, a comprehensive benchmark for multimodal in-context learning that encompasses a broad spectrum of tasks involving both images and text as inputs and outputs, and spanning different types of challenges, from perception to reasoning and long context length. We evaluate state-of-the-art VLLMs against this benchmark suite, revealing their diverse strengths and weaknesses and showing that even the most advanced models, such as GPT-4, find the tasks challenging. By highlighting a range of new ICL tasks, together with the associated strengths and limitations of existing models, we hope that our dataset will inspire future work on enhancing the in-context learning capabilities of VLLMs, as well as new applications that leverage VLLM ICL. The code and dataset are available at https://github.com/ys-zong/VL-ICL.
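As a concrete illustration of the setup the abstract describes, below is a minimal Python sketch of how an n-shot multimodal ICL prompt can be assembled: interleaved image/text demonstrations followed by a query, with no weight updates to the model. The function, class, and field names here are hypothetical and are not taken from the VL-ICL Bench codebase.

```python
# Illustrative sketch only (not the VL-ICL Bench implementation): building an
# n-shot multimodal in-context prompt as interleaved image/text segments.
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ImageRef:
    """Placeholder for an image the VLLM will consume (path or URL)."""
    path: str


def build_icl_prompt(
    instruction: str,
    support_set: List[Dict[str, str]],  # each shot: {"image": ..., "answer": ...}
    query_image: str,
) -> List[Union[str, ImageRef]]:
    """Interleave few-shot (image, answer) demonstrations before the query image."""
    segments: List[Union[str, ImageRef]] = [instruction]
    for shot in support_set:
        segments.append(ImageRef(shot["image"]))
        segments.append(f"Answer: {shot['answer']}")
    # The query image comes last, with the answer left for the model to complete.
    segments.append(ImageRef(query_image))
    segments.append("Answer:")
    return segments


# Example: a 2-shot prompt for a hypothetical task mapping images to labels.
prompt = build_icl_prompt(
    instruction="Induce the labeling rule from the examples and answer the query.",
    support_set=[
        {"image": "shot1.png", "answer": "label_a"},
        {"image": "shot2.png", "answer": "label_b"},
    ],
    query_image="query.png",
)
# `prompt` would then be serialized into whatever interleaved image/text input
# format the chosen VLLM accepts.
```

The key point the sketch conveys is that the task specification lives entirely in the prompt (demonstrations plus query), which is what distinguishes ICL from fine-tuning.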
Authors: Yongshuo Zong, Ondrej Bohdal, Timothy Hospedales