MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning (2405.18358v1)

Published 28 May 2024 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Recent advancements in Multi-modal LLMs (MLLMs) have significantly improved their performance in tasks combining vision and language. However, challenges persist in detailed multi-modal understanding, comprehension of complex tasks, and reasoning over multi-modal information. This paper introduces MMCTAgent, a novel multi-modal critical thinking agent framework designed to address the inherent limitations of current MLLMs in complex visual reasoning tasks. Inspired by human cognitive processes and critical thinking, MMCTAgent iteratively analyzes multi-modal information, decomposes queries, plans strategies, and dynamically evolves its reasoning. Additionally, MMCTAgent incorporates critical thinking elements such as verification of final answers and self-reflection through a novel approach that defines a vision-based critic and identifies task-specific evaluation criteria, thereby enhancing its decision-making abilities. Through rigorous evaluations across various image and video understanding benchmarks, we demonstrate that MMCTAgent (with and without the critic) outperforms both foundational MLLMs and other tool-augmented pipelines.
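The abstract describes an iterative plan-then-verify architecture: a planner that decomposes the query, selects tools, and drafts an answer, and a vision-based critic that checks the draft against task-specific criteria before the agent refines its reasoning. As a rough illustration of that control flow only, here is a minimal Python sketch; the paper does not publish this code, and every name in it (Planner, VisionCritic, mmct_loop, max_rounds) is a hypothetical stand-in, not the authors' API.

```python
# Illustrative sketch of a planner/critic loop in the spirit of MMCTAgent.
# All classes and signatures below are hypothetical, not from the paper.
from dataclasses import dataclass, field

@dataclass
class Answer:
    text: str
    evidence: list[str] = field(default_factory=list)

class Planner:
    """Decomposes the query, picks tools, and drafts an answer."""
    def propose(self, query: str, media, feedback: str | None = None) -> Answer:
        # In MMCTAgent this step would call an MLLM with access to vision
        # tools (e.g. OCR, object detection, video-frame retrieval); stubbed.
        raise NotImplementedError

class VisionCritic:
    """Verifies a candidate answer against task-specific criteria."""
    def review(self, query: str, media, answer: Answer) -> tuple[bool, str]:
        # Would prompt a vision-capable model to check the draft against the
        # visual evidence and return (accepted, feedback); stubbed here.
        raise NotImplementedError

def mmct_loop(query, media, planner: Planner, critic: VisionCritic,
              max_rounds: int = 3) -> Answer:
    """Draft an answer, have the critic verify it, refine until accepted."""
    answer = planner.propose(query, media)
    for _ in range(max_rounds):
        accepted, feedback = critic.review(query, media, answer)
        if accepted:
            break
        # Self-reflection: the critic's feedback conditions the next draft.
        answer = planner.propose(query, media, feedback)
    return answer
```

The point the sketch tries to capture is the abstract's critical-thinking element: the critic's verdict is not discarded but fed back into the planner, so the agent either converges on a verified answer or stops after a bounded number of refinement rounds.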

Authors (4)
  1. Somnath Kumar
  2. Yash Gadhia
  3. Tanuja Ganu
  4. Akshay Nambi