All in an Aggregated Image for In-Image Learning (2402.17971v2)

Published 28 Feb 2024 in cs.CV, cs.AI, and cs.CL

Abstract: This paper introduces a new in-context learning (ICL) mechanism called In-Image Learning (I$^2$L) that combines demonstration examples, visual cues, and chain-of-thought reasoning into a single aggregated image to enhance the capabilities of Large Multimodal Models (e.g., GPT-4V) on multimodal reasoning tasks. Unlike previous approaches that rely on converting images to text or incorporating visual input into LLMs, I$^2$L consolidates all information into one aggregated image and leverages the model's image processing, understanding, and reasoning abilities. This has several advantages: it reduces inaccurate textual descriptions of complex images, provides flexibility in positioning demonstration examples, and avoids multiple input images and lengthy prompts. We also introduce I$^2$L-Hybrid, a method that combines the strengths of I$^2$L with other ICL methods: it uses an automatic strategy to select the most suitable method (I$^2$L or another specific ICL method) for each task instance. We conduct extensive experiments to assess the effectiveness of I$^2$L and I$^2$L-Hybrid on MathVista, which covers a variety of complex multimodal reasoning tasks. Additionally, we investigate how image resolution, the number of demonstration examples in a single image, and the positions of these demonstrations within the aggregated image affect the effectiveness of I$^2$L. Our code is publicly available at https://github.com/AGI-Edgerunners/IIL.
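To make the aggregated-image idea concrete, below is a minimal sketch of how such a prompt could be assembled and sent to a multimodal model. It is not the authors' implementation (their released code is at the repository linked above); the helper names (`build_aggregated_image`, `ask_multimodal_model`), file paths, layout constants, and the use of Pillow plus the OpenAI Python client are illustrative assumptions only.

```python
# Illustrative sketch of in-image learning, NOT the authors' released code.
# Assumes Pillow and the OpenAI Python client (>=1.0) are installed; paths,
# layout sizes, and the model name are placeholders.
import base64
import io

from PIL import Image, ImageDraw
from openai import OpenAI


def build_aggregated_image(demos, test_image_path, cell_w=512, cell_h=512):
    """Paste demonstration images (each with its chain-of-thought text) and the
    test image into one vertical canvas, so a single image carries the whole
    in-context prompt."""
    canvas = Image.new("RGB", (cell_w, cell_h * (len(demos) + 1)), "white")
    draw = ImageDraw.Draw(canvas)
    for i, (img_path, cot_text) in enumerate(demos):
        demo = Image.open(img_path).convert("RGB")
        demo.thumbnail((cell_w, cell_h // 2))
        canvas.paste(demo, (0, i * cell_h))
        # Write the worked reasoning (question, chain of thought, answer)
        # directly beneath the demonstration image.
        draw.text((10, i * cell_h + demo.height + 10), cot_text, fill="black")
    test = Image.open(test_image_path).convert("RGB")
    test.thumbnail((cell_w, cell_h))
    canvas.paste(test, (0, len(demos) * cell_h))
    return canvas


def ask_multimodal_model(aggregated, question, model="gpt-4o"):
    """Send the single aggregated image plus a short textual question to a
    large multimodal model."""
    buf = io.BytesIO()
    aggregated.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


# Hypothetical usage: two worked demonstrations, then the test instance.
demos = [
    ("demo1.png", "Q: ...  Reasoning: ...  A: ..."),
    ("demo2.png", "Q: ...  Reasoning: ...  A: ..."),
]
aggregated = build_aggregated_image(demos, "test.png")
print(ask_multimodal_model(aggregated, "Answer the question in the last panel."))
```

The point of the sketch is that, unlike text-conversion or multi-image ICL pipelines, only one image and a short instruction are sent to the model; where each demonstration lands on the canvas corresponds to the demonstration-position factor studied in the paper.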

Authors (8)
  1. Lei Wang
  2. Wanyu Xu
  3. Zhiqiang Hu
  4. Yihuai Lan
  5. Shan Dong
  6. Hao Wang
  7. Roy Ka-Wei Lee
  8. Ee-Peng Lim