Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models

Published 22 Sep 2025 in cs.CV, cs.AI, and cs.LG | (2509.17588v1)

Abstract: Large Vision-Language Models (LVLMs) answer visual questions by transferring information from images to text through a series of attention heads. While this image-to-text information flow is central to visual question answering, its underlying mechanism remains difficult to interpret due to the simultaneous operation of numerous attention heads. To address this challenge, we propose head attribution, a technique inspired by component attribution methods, to identify consistent patterns among attention heads that play a key role in information transfer. Using head attribution, we investigate how LVLMs rely on specific attention heads to identify and answer questions about the main object in an image. Our analysis reveals that a distinct subset of attention heads facilitates the image-to-text information flow. Remarkably, we find that the selection of these heads is governed by the semantic content of the input image rather than its visual appearance. We further examine the flow of information at the token level and discover that (1) text information first propagates to role-related tokens and the final token before receiving image information, and (2) image information is embedded in both object-related and background tokens. Our work provides evidence that image-to-text information flow follows a structured process, and that analysis at the attention-head level offers a promising direction toward understanding the mechanisms of LVLMs.

Summary

  • The paper introduces head attribution to systematically measure each attention head's contribution in transferring image information to text.
  • It reveals that the information flow is distributed across many heads and layers, with mid-to-late-layer heads contributing most, and that a head's contribution does not necessarily track high attention weights on image tokens.
  • Experiments on COCO images include a token-level analysis, showing that text information first reaches role-related tokens and the final token, and that image information is embedded in both object-related and background tokens.

Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models

Introduction

This paper explores the internal mechanisms of Large Vision-Language Models (LVLMs), focusing on the role of attention heads in enabling image-to-text information flow. This flow is pivotal for tasks such as visual question answering, where understanding how information is transferred from visual inputs to textual outputs is crucial. The paper introduces a novel technique called head attribution, which identifies the attention heads responsible for this transfer.

Head Attribution Method

Head attribution is inspired by component attribution methods and quantifies each attention head's contribution to the final model output. Traditional approaches such as single-head ablation are inadequate for LVLMs because the information flow is distributed across many attention heads. The paper instead fits a linear regression model that estimates each head's impact on the output logit from the results of systematically ablating subsets of heads (Figure 1).

Figure 1: Example result of head attribution for LLaVA-1.5-7B.
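The core estimation step can be sketched as follows, with heavy simplifications: run_with_ablation stands in for a forward pass that ablates the heads marked 0 in a binary mask and returns the answer token's logit, and the synthetic "important heads" inside it exist only so the snippet runs without a real model; the paper's actual sampling scheme and regression details may differ.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    NUM_LAYERS, NUM_HEADS = 32, 32           # size of a 7B-scale decoder (illustrative)
    TOTAL = NUM_LAYERS * NUM_HEADS
    NUM_SAMPLES = 2000                       # number of random ablation patterns

    rng = np.random.default_rng(0)
    important = rng.choice(TOTAL, size=20, replace=False)  # simulated key heads

    def run_with_ablation(mask: np.ndarray) -> float:
        """Stand-in for a forward pass that ablates heads where mask == 0 and
        returns the logit of the correct answer token at the final position."""
        contrib = np.zeros(TOTAL)
        contrib[important] = 0.5
        return float(mask @ contrib + rng.normal(scale=0.01))

    # 1) Sample random binary masks over all heads (1 = kept, 0 = ablated).
    masks = rng.integers(0, 2, size=(NUM_SAMPLES, TOTAL))
    # 2) Record the answer logit under each ablation pattern.
    logits = np.array([run_with_ablation(m) for m in masks])
    # 3) Fit a linear surrogate; each coefficient estimates one head's contribution.
    scores = LinearRegression().fit(masks, logits).coef_.reshape(NUM_LAYERS, NUM_HEADS)

    # Heads with the largest coefficients are the candidates for carrying
    # the image-to-text information flow.
    top = np.argsort(scores.ravel())[::-1][:10]
    print([(int(i) // NUM_HEADS, int(i) % NUM_HEADS) for i in top])

The linear surrogate is what distinguishes head attribution from single-head ablation: the coefficients summarize each head's effect across many joint ablation patterns rather than from removing one head at a time.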

Experimental Setup

Experiments are conducted across ten LVLMs on a visual object identification task, which requires a model to identify the main object in a given image. This task is chosen for its simplicity and its ability to expose the mechanisms of image-to-text information flow. The experiments use images from the COCO dataset and cover multiple model families, including variants of LLaVA and InternVL.
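As a rough illustration of such a setup, the snippet below loads a LLaVA-1.5-7B checkpoint through the Hugging Face transformers API and asks for the main object of an image; the model identifier, file name, and prompt wording are illustrative assumptions rather than the paper's exact configuration, and half-precision/device placement is omitted for brevity.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    # Illustrative checkpoint; the paper's Figure 1 shows LLaVA-1.5-7B.
    model_id = "llava-hf/llava-1.5-7b-hf"
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open("coco_example.jpg")  # any COCO image with a clear main object
    # Hypothetical prompt for the visual object identification task.
    prompt = "USER: <image>\nWhat is the main object in this image? Answer with one word. ASSISTANT:"

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # The quantity tracked in the analyses below: the logit of the predicted
    # answer token at the final prompt position.
    answer_id = int(logits[0, -1].argmax())
    print(processor.tokenizer.decode([answer_id]), logits[0, -1, answer_id].item())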

Key Findings

  1. Distributed Information Flow: The study demonstrates that the image-to-text information flow is distributed across multiple attention heads. Ablating any single head does not significantly change the output logits, which motivates a comprehensive head attribution approach (see the single-head ablation sketch after this list).
  2. Head Contribution Variation: The contribution of attention heads varies across layers, with mid-to-late-layer heads playing a more significant role in transferring image information. The analysis also shows that a head's importance does not necessarily correlate with high attention weights on image tokens, contradicting a common assumption (Figure 2).

    Figure 2: Minimum number of heads required for faithfulness and completeness across models.

  3. Token-Level Analysis: At the token level, text information first propagates to role-related tokens and the final token, and image information is then transferred primarily into these positions rather than into the question tokens directly; on the image side, the relevant information is embedded in both object-related and background tokens.
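A minimal sketch of the single-head ablation baseline referenced in finding 1 is shown below, assuming a LLaVA-style model whose language decoder follows the usual LLaMA layout (concatenated head outputs fed to an o_proj output projection); the module path and attribute names are assumptions and may need adapting to the model at hand.

    import torch

    def head_ablation_pre_hook(head_idx: int, head_dim: int):
        """Zero one head's slice of the concatenated head outputs just before the
        attention output projection, removing that head's contribution."""
        def pre_hook(module, args):
            hidden = args[0].clone()
            hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
            return (hidden,) + args[1:]
        return pre_hook

    @torch.no_grad()
    def answer_logit_with_head_ablated(model, inputs, answer_id, layer_idx, head_idx):
        """Forward pass with one attention head ablated; returns the answer token's
        logit at the final position. The module path assumes a LLaVA/HF-style model."""
        attn = model.language_model.model.layers[layer_idx].self_attn
        handle = attn.o_proj.register_forward_pre_hook(
            head_ablation_pre_hook(head_idx, attn.head_dim))
        try:
            logits = model(**inputs).logits
        finally:
            handle.remove()
        return logits[0, -1, answer_id].item()

Comparing this value against the un-ablated logit for every (layer, head) pair reproduces the single-head ablation baseline; the paper reports that these per-head differences are small, which is precisely why head attribution fits contributions over many heads jointly.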

Implications

The findings have far-reaching implications for both interpretability and efficiency in LVLMs. Understanding the token-level interactions and attention-head contributions makes it feasible to develop more transparent models and potentially to reduce computational costs by focusing on the essential heads and tokens (Figure 3).

Figure 3: Logit differences relative to critical tokens across models.

Conclusion

The paper presents a robust framework for analyzing attention heads in LVLMs, providing insights into how image-to-text information is transferred. Head attribution emerges as a valuable tool for interpreting LVLMs' internal mechanisms, paving the way for more efficient and transparent AI systems. Future work could explore extending these findings to more complex tasks and refining the attribution methods to enhance scalability and practicality.
