
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM (2404.16033v1)

Published 24 Apr 2024 in cs.CV and cs.CL

Abstract: With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, this paradigm faces the challenge of potential "determining hallucinations" in decision-making, caused by insufficient visual information and by low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal LLMs (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator, integrating visual inputs to analyze the image and problem together and ensuring closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page: https://ggg0919.github.io/cantor/ .
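The perception-decision architecture described in the abstract can be sketched as a three-stage prompting loop: a decision stage in which the MLLM sees the image and assigns sub-tasks to expert roles, an execution stage in which the same MLLM plays each expert, and a synthesis stage that combines the expert findings. The sketch below is a minimal illustration under stated assumptions: the function names, prompt wording, and plan format are hypothetical, not the authors' actual implementation; only the general pipeline shape and the expert-role idea come from the paper.

```python
# Minimal sketch of a Cantor-style perception-decision pipeline.
# All names, prompts, and the "Role: task" plan format are illustrative
# assumptions; the paper does not prescribe this exact API.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Decision:
    """Output of the decision-generation stage (illustrative)."""
    rationale: str               # analysis of image + question
    expert_tasks: Dict[str, str]  # expert role -> sub-task description


def cantor_answer(
    image: bytes,
    question: str,
    mllm: Callable[[bytes, str], str],
) -> str:
    # Stage 1: decision generation. The MLLM sees the image directly while
    # planning, which is what keeps the plan aligned with the actual context.
    decision_prompt = (
        "Analyze the image and question, then assign sub-tasks to experts "
        "(one 'Role: task' per line).\n"
        f"Question: {question}"
    )
    plan = mllm(image, decision_prompt)

    # Stage 2: execution. The same MLLM plays each expert role in turn,
    # producing higher-level summaries rather than raw perception output.
    expert_outputs: List[str] = []
    for line in plan.splitlines():
        if ":" in line:
            role, task = line.split(":", 1)
            expert_outputs.append(f"{role}: {mllm(image, task.strip())}")

    # Stage 3: synthesis. Combine the expert findings into a final answer.
    synthesis_prompt = (
        f"Question: {question}\n"
        "Expert findings:\n" + "\n".join(expert_outputs) +
        "\nGive the final answer."
    )
    return mllm(image, synthesis_prompt)
```

Note that all three stages reuse one MLLM callable with different prompts; this mirrors the paper's claim that no fine-tuning or external low-level tools are required, since the experts are just role-played cognitive functions of the same model.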

Authors (11)
  1. Timin Gao
  2. Peixian Chen
  3. Mengdan Zhang
  4. Chaoyou Fu
  5. Yunhang Shen
  6. Yan Zhang
  7. Xiawu Zheng
  8. Xing Sun
  9. Liujuan Cao
  10. Rongrong Ji
  11. ShengChuan Zhang
Citations (6)