Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning (2404.12966v4)

Published 19 Apr 2024 in cs.CV and cs.AI

Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open problem. To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench) in this paper. Interestingly, we find that most prevalent MLLMs can be easily fooled by the introduction of a presupposition into the question, whereas such presuppositions appear naive to human reasoning. In addition, we propose a simple yet effective method, Active Deduction (AD), which encourages the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning abilities without compromising its general-purpose question-answering performance. We also provide extensive evaluations of both open-source and private MLLMs on MARS-Bench, along with experimental analyses of the AD method.
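
The abstract only names the Active Deduction (AD) prompting strategy without reproducing its template, so the snippet below is a minimal sketch of the idea under an assumption: that AD amounts to instructing the model to check the question's presupposition against the image before committing to an answer. The `query_mllm` callable is a hypothetical stand-in for whatever MLLM API is in use, not part of the paper.

```python
# Minimal sketch of an Active Deduction (AD)-style prompt wrapper.
# Assumption: AD is approximated here as a single instruction that asks the
# model to verify the question's presupposition before answering; the paper's
# actual template may differ. `query_mllm` is a hypothetical MLLM API callable.

AD_INSTRUCTION = (
    "Before answering, first check whether the question's presupposition is "
    "consistent with the image. If it is not, point out the false assumption; "
    "otherwise, reason step by step and then give the final answer."
)

def active_deduction_answer(query_mllm, image, question: str) -> str:
    """Wrap an assumptive question with an active-deduction instruction."""
    prompt = f"{AD_INSTRUCTION}\n\nQuestion: {question}"
    return query_mllm(image=image, prompt=prompt)

# Example: a question whose presupposition ("the dog wears a collar") may be false.
# answer = active_deduction_answer(query_mllm, img, "What color is the dog's collar?")
```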

Authors (6)
  1. Yian Li (7 papers)
  2. Wentao Tian (2 papers)
  3. Yang Jiao (127 papers)
  4. Jingjing Chen (99 papers)
  5. Yu-Gang Jiang (223 papers)
  6. Na Zhao (54 papers)
Citations (9)