Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models (2403.19322v2)

Published 28 Mar 2024 in cs.CV and cs.CL

Abstract: The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, due to limitations in their image tokenization processes, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution samples. To overcome this limitation, we introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G utilizes the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images, thereby enabling deliberate reasoning through multimodal prompting. Additionally, we develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images. Extensive experiments on visual reasoning tasks demonstrate the superiority of P2G, achieving performance comparable to GPT-4V on P2GB with a 7B backbone. Our work underscores the potential of grounding reasoning with external agents in MLLMs, presenting a promising alternative to mere model scaling.
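
The abstract describes an agentic workflow: the MLLM first attempts the question, and when fine-grained textual or visual detail is missing it calls expert agents (e.g., an OCR tool for text, an open-set detector for objects) and then reasons again with the grounded evidence folded into its prompt. The snippet below is a minimal illustrative sketch of such a plug-and-play grounding loop, not the authors' implementation; the mllm, ocr_agent, and grounding_agent interfaces and the NEED_OCR / NEED_GROUNDING tool-request convention are assumptions made purely for illustration.

```python
# Illustrative sketch of a plug-and-play grounding loop in the spirit of P2G.
# The mllm, ocr_agent, and grounding_agent objects are hypothetical interfaces,
# not the paper's actual API.

from dataclasses import dataclass, field


@dataclass
class GroundingEvidence:
    """Auxiliary evidence gathered by expert agents for one image."""
    ocr_text: list[str] = field(default_factory=list)         # transcribed text snippets
    object_regions: list[dict] = field(default_factory=list)  # e.g. {"label": ..., "box": ...}


def answer_with_grounding(mllm, ocr_agent, grounding_agent, image, question):
    """Ask the MLLM; if it requests grounding, call expert agents and re-prompt."""
    # First pass: the MLLM either answers directly or asks for help,
    # e.g. by emitting "NEED_OCR" or "NEED_GROUNDING: <object description>".
    draft = mllm.generate(image=image, prompt=question)

    if "NEED_OCR" not in draft and "NEED_GROUNDING" not in draft:
        return draft  # fine-grained detail was not required

    evidence = GroundingEvidence()
    if "NEED_OCR" in draft:
        # Textual grounding: an OCR expert recovers text that the image
        # tokenizer may have lost, especially at high resolution.
        evidence.ocr_text = ocr_agent.read(image)
    if "NEED_GROUNDING" in draft:
        # Visual grounding: an open-set detector localizes the queried objects.
        targets = [line.split(":", 1)[1].strip()
                   for line in draft.splitlines()
                   if line.startswith("NEED_GROUNDING")]
        evidence.object_regions = grounding_agent.detect(image, targets)

    # Second pass: re-prompt the MLLM with the question plus the grounded
    # evidence, anchoring its reasoning to concrete text and object locations.
    grounded_prompt = (
        f"{question}\n"
        f"OCR results: {evidence.ocr_text}\n"
        f"Detected regions: {evidence.object_regions}\n"
        "Answer using this evidence."
    )
    return mllm.generate(image=image, prompt=grounded_prompt)
```

Per the abstract, the grounded evidence is fed back through multimodal prompting, so the main design decision in a loop like this is how agent outputs (text snippets, boxes, or crops) are serialized back into the model's context.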
