VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Published 27 May 2024 in cs.CV, cs.AI, and cs.CL | (2405.16919v3)

Abstract: While large multi-modal models (LMMs) have exhibited impressive capabilities across diverse tasks, their effectiveness on complex tasks has been limited by the prevailing single-step reasoning paradigm. To address this, this paper proposes VoCoT, a multi-step, visually grounded, object-centric chain-of-thought reasoning framework tailored for inference with LMMs. VoCoT is characterized by two key features: (1) object-centric reasoning paths that revolve around cross-modal shared object-level information, and (2) visually grounded representation of object concepts in a multi-modal interleaved and aligned manner, which effectively bridges the modality gap within LMMs during long-term generation. To adapt LMMs to reason with VoCoT, we further construct an instruction-tuning dataset. By combining VoCoT with prevalent open-source LMM architectures, we develop a VoCoT-based model, VolCano. With only 7B parameters and limited input image resolution, VolCano demonstrates excellent performance across various scenarios. On benchmarks such as CLEVR and EmbSpatial, which demand complex reasoning capabilities, VolCano outperforms SOTA models, including the powerful GPT-4V. Related code, data, and models are released at https://github.com/RupertLuo/VoCoT.
