MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception (2401.07529v3)
Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding. However, these models also suffer from hallucinations, which limit their reliability as AI systems. We believe that these hallucinations stem partly from the models' difficulty in understanding what they can and cannot perceive from images, a capability we refer to as self-awareness in perception. Despite its importance, this aspect of MLLMs has been overlooked in prior studies. In this paper, we aim to define and evaluate the self-awareness of MLLMs in perception. To do this, we first introduce the knowledge quadrant in perception, which helps define what MLLMs know and do not know about images. Using this framework, we propose a novel benchmark, Self-Awareness in Perception for MLLMs (MM-SAP), specifically designed to assess this capability. We apply MM-SAP to a variety of popular MLLMs, offering a comprehensive analysis of their self-awareness and providing detailed insights. The experimental results reveal that current MLLMs possess limited self-awareness, highlighting a crucial direction for the future development of trustworthy MLLMs. Code and data are available at https://github.com/YHWmz/MM-SAP.
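The abstract describes the evaluation only at a high level, so the sketch below illustrates one plausible scoring loop under stated assumptions: a model earns credit on answerable questions by choosing the gold option, and on questions beyond what the image can show only by declining to answer. `SAPItem`, `score_self_awareness`, and the `refuse` sentinel are hypothetical names introduced here for illustration; they are not the benchmark's actual code or rubric.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical item and scoring sketch for a self-awareness benchmark.
# The field names, the "refuse" sentinel, and the combined-accuracy metric
# are illustrative assumptions, not MM-SAP's published interface.

@dataclass
class SAPItem:
    question: str
    options: List[str]   # multiple-choice options shown to the model
    answer: str          # gold option for answerable items
    answerable: bool     # False if the image cannot support an answer

def score_self_awareness(
    items: List[SAPItem],
    ask: Callable[[str, List[str]], str],
) -> float:
    """Accuracy over answerable and unanswerable items combined.

    Answerable items are correct when the model picks the gold option;
    unanswerable items are correct only when the model declines ("refuse"),
    i.e. it recognizes the question lies beyond what the image shows.
    """
    correct = 0
    for item in items:
        prediction = ask(item.question, item.options)
        if item.answerable:
            correct += int(prediction == item.answer)
        else:
            correct += int(prediction == "refuse")
    return correct / len(items)

# Example: a model that always answers earns no credit on unanswerable items.
items = [
    SAPItem("What color is the car?", ["red", "blue", "refuse"], "red", True),
    SAPItem("What is the car's price?", ["$10k", "$20k", "refuse"], "refuse", False),
]
print(score_self_awareness(items, lambda q, opts: opts[0]))  # 0.5
```

Under this rule, a model that never says "I don't know" is capped by the share of answerable items, which matches the paper's framing of self-awareness as knowing the limits of one's own perception.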