All in an Aggregated Image for In-Image Learning (2402.17971v2)
Abstract: This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I$^2$L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into a single aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) on multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or feeding visual input into LLMs, I$^2$L consolidates all information into an aggregated image and leverages the image processing, understanding, and reasoning abilities of large multimodal models. This has several advantages: it reduces inaccurate textual descriptions of complex images, offers flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I$^2$L-Hybrid, a method that combines the strengths of I$^2$L with other ICL methods, using an automatic strategy to select the most suitable method (I$^2$L or another ICL method) for each task instance. We conduct extensive experiments to assess the effectiveness of I$^2$L and I$^2$L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate how image resolution, the number of demonstration examples in a single image, and their positions within the aggregated image affect the effectiveness of I$^2$L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
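To make the aggregated-image idea concrete, the sketch below tiles demonstration images, their chain-of-thought text, and the test query into one composite image that could then be sent to a multimodal model in a single turn. It is a minimal illustration assuming the Pillow library; the function name `build_aggregated_image`, the vertical panel layout, the file names, and the prompt strings are hypothetical and not taken from the paper's released code.

```python
# Minimal sketch of an I^2L-style aggregated prompt image (assumptions noted above).
from PIL import Image, ImageDraw, ImageFont

def build_aggregated_image(demo_paths, demo_rationales, test_path, question,
                           cell_size=(512, 384), pad=16):
    """Stack demonstration images (each with its chain-of-thought text) and the
    test image into one aggregated image, so a single picture carries the
    full in-context prompt."""
    panels = list(zip(demo_paths, demo_rationales)) + [(test_path, question)]
    w, h = cell_size
    canvas = Image.new("RGB", (w, (h + pad) * len(panels)), "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    for i, (path, text) in enumerate(panels):
        img = Image.open(path).convert("RGB")
        img.thumbnail((w, h - 60))          # leave a strip below each image for its text
        top = i * (h + pad)
        canvas.paste(img, (0, top))
        draw.text((4, top + h - 56), text, fill="black", font=font)
    return canvas

# Hypothetical usage: two worked examples plus the query become ONE image,
# which is then passed to an LMM (e.g., GPT-4V) with a short instruction.
# agg = build_aggregated_image(
#     ["demo1.png", "demo2.png"],
#     ["Q: ... Reasoning: ... Answer: 4", "Q: ... Reasoning: ... Answer: 7"],
#     "test.png",
#     "Q: How many objects remain? Answer with reasoning.",
# )
# agg.save("aggregated_prompt.png")
```

A hybrid variant in the spirit of I$^2$L-Hybrid would add a small selection step that, per task instance, chooses between this aggregated-image prompt and an alternative ICL prompt; the selection criterion used in the paper is not reproduced here.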
Authors: Lei Wang, Wanyu Xu, Zhiqiang Hu, Yihuai Lan, Shan Dong, Hao Wang, Roy Ka-Wei Lee, Ee-Peng Lim