Where do Large Vision-Language Models Look at when Answering Questions? (2503.13891v1)

Published 18 Mar 2025 in cs.CV and cs.CL

Abstract: Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architectures (e.g., multiple encoders and multi-resolution inputs) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answer and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus regions and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
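To make the idea of "visually relevant tokens" concrete, the sketch below assumes one plausible scoring rule: rank answer tokens by how much their log-probability drops when the image is withheld or masked. The function name, tensors, and scoring rule are illustrative assumptions, not the authors' implementation; the actual method and token-selection criterion are in the linked repository.

```python
import torch

def select_visually_relevant_tokens(logprobs_with_image: torch.Tensor,
                                     logprobs_without_image: torch.Tensor,
                                     top_k: int = 5) -> torch.Tensor:
    """Rank generated answer tokens by the drop in their log-probability
    when the visual input is removed. Tokens with the largest drop are
    treated as visually relevant. This is an illustrative heuristic, not
    the paper's exact selection criterion."""
    # Per-token confidence drop when the image is withheld.
    relevance = logprobs_with_image - logprobs_without_image
    # Indices of the top-k most image-dependent answer tokens.
    return torch.topk(relevance, k=min(top_k, relevance.numel())).indices

# Toy usage: 12 answer tokens with made-up per-token log-probabilities.
with_img = torch.log(torch.rand(12).clamp_min(1e-6))
without_img = torch.log(torch.rand(12).clamp_min(1e-6))
print(select_visually_relevant_tokens(with_img, without_img, top_k=3))
```

In practice the two log-probability vectors would come from running the same LVLM twice on the same question, once with the original image and once with a masked or blank image, before feeding the selected tokens into a heatmap method such as iGOS++.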

Authors (9)
  1. Xiaoying Xing (6 papers)
  2. Chia-Wen Kuo (14 papers)
  3. Li Fuxin (36 papers)
  4. Yulei Niu (32 papers)
  5. Fan Chen (85 papers)
  6. Ming Li (787 papers)
  7. Ying Wu (134 papers)
  8. Longyin Wen (45 papers)
  9. Sijie Zhu (27 papers)