Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs (2405.15683v2)
Abstract: Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well studied, their exact causes remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts (those that require simple descriptions of visual elements) but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and to effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD first generates a detailed description of the image and appends it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2%-33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.
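The abstract describes VDGD's two-stage mechanism (describe-then-prefix, followed by KL-guided token selection) but gives no implementation details, so the snippet below is only a minimal, self-contained sketch of the idea under stated assumptions: a nucleus of plausible next-token candidates is re-ranked by an assumed KL-based proxy for "divergence to the description," comparing each candidate (as a near-one-hot distribution) against distributions derived from the description tokens. The names `vdgd_step` and `description_distributions`, and the exact penalty form, are hypothetical and are not taken from the paper's released code.

```python
# Minimal sketch of VDGD-style decoding as summarized in the abstract:
# (1) generate an image description and prepend it to the instruction, then
# (2) at each decoding step, re-rank plausible candidate tokens by their KL
#     divergence to the description, preferring lower-divergence candidates.
# All interfaces and the penalty form below are illustrative assumptions,
# not the authors' implementation.

import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) with clipping to avoid log(0).
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def vdgd_step(logits, description_distributions, top_p=0.9):
    """Pick the next token: keep a nucleus of plausible candidates, then
    choose the one whose (near-one-hot) distribution has the lowest average
    KL divergence to the description-token distributions -- an assumed
    proxy for 'divergence to the description'."""
    probs = softmax(logits)
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cumulative, top_p) + 1]

    best_token, best_score = None, np.inf
    for tok in nucleus:
        candidate = np.full_like(probs, 1e-9)
        candidate[tok] = 1.0
        candidate /= candidate.sum()
        # Lower average divergence to the description => preferred candidate.
        score = np.mean([kl_divergence(candidate, d)
                         for d in description_distributions])
        if score < best_score:
            best_token, best_score = tok, score
    return best_token

if __name__ == "__main__":
    vocab = 8
    rng = np.random.default_rng(0)
    step_logits = rng.normal(size=vocab)
    # Stand-in for distributions obtained while encoding the image description.
    desc_dists = [softmax(rng.normal(size=vocab)) for _ in range(4)]
    print("chosen token id:", vdgd_step(step_logits, desc_dists))
```

In a real LVLM, the description distributions would come from the model itself while it encodes the generated image description prefixed to the instruction; random stand-ins are used here only so the example runs end to end.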
Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha