Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning (2404.12966v4)
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved significant success across multiple disciplines due to their exceptional instruction-following capabilities and extensive world knowledge. However, whether these MLLMs possess human-like compositional reasoning abilities remains an open problem. To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench). Interestingly, we find that most prevalent MLLMs are easily fooled by the introduction of a presupposition into the question, even though such presuppositions appear trivial to human reasoning. In addition, we propose a simple yet effective method, Active Deduction (AD), which encourages the model to actively perform composite deduction before reaching a final decision. Equipped with the proposed AD method, an MLLM demonstrates significant improvements in assumptive reasoning without compromising its general-purpose question-answering performance. We also provide extensive evaluations of both open-source and proprietary MLLMs on MARS-Bench, along with experimental analyses of the AD method.
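As a rough illustration of the Active Deduction idea only: the abstract does not give the paper's actual prompt, so the template wording, the choice of the OpenAI Python client, the model name, and the helper function below are all assumptions. The sketch simply wraps an assumptive question in a "verify the presupposition first, then answer" instruction before sending it to a multimodal model.

```python
# Minimal sketch of an Active-Deduction-style prompt, assuming an OpenAI-compatible
# multimodal chat API. The prompt text and model name are illustrative, not the
# paper's actual method.
from openai import OpenAI

AD_PROMPT = (
    "Before answering, deduce step by step whether the assumption embedded in the "
    "question actually holds for the image. If it does not hold, point that out "
    "explicitly; otherwise, answer the question.\n\nQuestion: {question}"
)

def ask_with_active_deduction(client: OpenAI, image_url: str, question: str) -> str:
    """Query a multimodal model with a deduction-first prompt (hypothetical helper)."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice for illustration
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": AD_PROMPT.format(question=question)},
            ],
        }],
    )
    return response.choices[0].message.content

# Example usage (hypothetical image and assumptive question):
# client = OpenAI()
# print(ask_with_active_deduction(client, "https://example.com/kitchen.jpg",
#                                 "What color is the cat sleeping on the stove?"))
```

The intent of such a prompt matches the abstract's description of AD: the model is pushed to check the presupposition (here, whether a cat is actually present) before committing to an answer, rather than answering the loaded question directly.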
Authors: Yian Li, Wentao Tian, Yang Jiao, Jingjing Chen, Yu-Gang Jiang, Na Zhao