Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models (2403.19322v2)
Abstract: The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, owing to limitations in their image tokenization, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution samples. To overcome this limitation, we introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G exploits the tool-usage potential of MLLMs to employ expert agents that ground reasoning on the fly in critical visual and textual elements of the image, thereby enabling deliberate reasoning through multimodal prompting. We further develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images. Extensive experiments on visual reasoning tasks demonstrate the superiority of P2G, which achieves performance comparable to GPT-4V on P2GB with a 7B backbone. Our work underscores the potential of grounding reasoning with external agents in MLLMs, presenting a promising alternative to mere model scaling.
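The abstract describes the grounding loop only at a high level. Below is a minimal sketch of how such a plug-and-play pipeline could be wired together, assuming the MLLM first attempts an answer, signals when it needs grounding, and is then re-prompted with evidence returned by expert agents (an open-set object detector and an OCR engine). All function names, the `<need_grounding>` control token, and the `Evidence` structure are illustrative placeholders introduced here, not the authors' actual interfaces.

```python
# Illustrative sketch of a plug-and-play grounding loop for an MLLM.
# Every callable below is a hypothetical stand-in, not the P2G implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    """A grounded snippet: a detected object or recognized text plus its box."""
    kind: str            # "object" or "text"
    content: str         # object label or OCR string
    bbox: tuple          # (x1, y1, x2, y2) in image coordinates


def ground_and_answer(
    image,
    question: str,
    mllm: Callable[[str, object, List[Evidence]], str],
    detector: Callable[[object, str], List[Evidence]],
    ocr: Callable[[object], List[Evidence]],
) -> str:
    """Answer a visual question, deferring to expert agents when the
    MLLM signals that it cannot ground its reasoning in the image."""
    # First pass: ask the MLLM directly, with no extra evidence.
    draft = mllm(question, image, [])

    # If the model does not ask for grounding, return its answer as-is.
    # (The control token is an assumed mechanism for illustration.)
    if "<need_grounding>" not in draft:
        return draft

    # Otherwise invoke the pluggable expert agents: an open-set detector
    # for object-level evidence and an OCR engine for textual evidence.
    evidence = detector(image, question) + ocr(image)

    # Second pass: re-prompt the MLLM with the grounded evidence
    # appended as part of a multimodal prompt.
    return mllm(question, image, evidence)
```

The point this sketch tries to make explicit is that grounding is invoked only on demand, so the expert agents remain swappable ("plug-and-play") and the base MLLM itself is left unchanged.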