This paper investigates how vision-language models (VLMs), focusing specifically on the LLaVA architecture, process visual information. LLaVA combines a pre-trained image encoder (CLIP), a pre-trained LLM (Vicuna), and an adapter network that maps image features into "visual tokens" that are fed to the LM. While VLMs are powerful, their internal workings, particularly how the LM handles these visual tokens, are not well understood.
The paper addresses key questions about the localization of object information within visual tokens and the mechanisms by which the LM processes this information for predictions. The visual tokens are soft prompts, meaning they don't directly correspond to words in the vocabulary and their meaning is initially unclear.
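For orientation, here is a minimal PyTorch sketch of how such an architecture assembles its input sequence. The module names are illustrative rather than the authors' code; the dimensions (576 patch tokens, 1024-dim CLIP features, 4096-dim LM embeddings) and the two-layer MLP adapter follow common LLaVA-1.5 conventions and should be treated as assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Illustrative adapter: maps frozen CLIP patch features into the LM's embedding space."""
    def __init__(self, clip_dim: int = 1024, lm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP with GELU, as in LLaVA-1.5-style adapters (assumed here).
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, 576, clip_dim) -> visual tokens: (batch, 576, lm_dim)
        return self.proj(patch_feats)

def build_lm_inputs(visual_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # The visual tokens are "soft prompts": they live in the LM's embedding space
    # but have no corresponding vocabulary entries. They are simply concatenated
    # with the embedded text prompt and processed by the LM like any other tokens.
    return torch.cat([visual_tokens, text_embeds], dim=1)
```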
The researchers employed several practical interpretability techniques:
- Ablation Studies for Localization: To determine where object information resides, they created a dataset from COCO images, filtered to focus on simpler scenes and to ensure the model relies on visual evidence rather than hallucination. They then ablated (replaced with a mean embedding) subsets of visual tokens, chosen as follows (a token-selection sketch appears after the list):
- Corresponding to the object's spatial location ("Object Tokens")
- Object tokens plus neighbors ("Object Tokens with Buffer")
- Tokens with high norms ("Register Tokens"), hypothesized to encode global features
- Random tokens (baseline)
- Tokens identified as important by Integrated Gradients (stronger baseline)
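Here is a minimal sketch of how these token subsets might be selected, assuming the visual tokens form a 24x24 patch grid (as in LLaVA-1.5) and that a COCO segmentation mask for the target object is available. The helper names and the top-k-by-norm criterion for register tokens are hypothetical illustrations, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

GRID = 24  # assumed 24x24 grid of visual tokens (576 total)

def object_token_indices(mask: torch.Tensor, buffer: int = 0) -> torch.Tensor:
    """mask: (H, W) binary segmentation mask for the target object (e.g. from COCO).
    Returns indices of visual tokens whose patch overlaps the object, optionally
    dilated by `buffer` patches ("Object Tokens with Buffer")."""
    # Downsample the pixel mask to the patch grid: a patch counts if any pixel is inside the object.
    patch_mask = F.adaptive_max_pool2d(mask[None, None].float(), (GRID, GRID))[0, 0]
    if buffer > 0:
        # Dilate the patch mask by `buffer` patches in every direction.
        patch_mask = F.max_pool2d(
            patch_mask[None, None], kernel_size=2 * buffer + 1, stride=1, padding=buffer
        )[0, 0]
    return patch_mask.flatten().nonzero(as_tuple=True)[0]

def register_token_indices(visual_tokens: torch.Tensor, k: int) -> torch.Tensor:
    """High-norm ("register") tokens. visual_tokens: (576, d)."""
    return visual_tokens.norm(dim=-1).topk(k).indices

def random_token_indices(n_tokens: int, k: int) -> torch.Tensor:
    """Random baseline: k token indices drawn without replacement."""
    return torch.randperm(n_tokens)[:k]
```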
They evaluated the impact of ablation on the model's ability to identify the target object using three tasks: generative image description, binary "Yes/No" polling, and visual question answering. The results showed that ablating object tokens, especially with a small buffer of neighboring tokens, caused a significantly larger drop in object identification accuracy (over 70% in some cases) compared to ablating register, random, or even high-gradient tokens. This strongly suggests that object-specific information is primarily localized to the visual tokens corresponding to the object's position in the image, rather than being diffused across global tokens.
The ablation creates a modified set of visual embeddings $\tilde{V} = \{\tilde{v}_1, \ldots, \tilde{v}_n\}$ from the original $V = \{v_1, \ldots, v_n\}$:

$$
\tilde{v}_i =
\begin{cases}
\mu & \text{if } i \in A,\\
v_i & \text{otherwise,}
\end{cases}
$$

where $A$ is the set of indices of tokens to ablate and $\mu$ is a mean embedding.
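A minimal sketch of this substitution, with hypothetical names; the mean embedding is assumed to be an average visual-token embedding computed over the dataset.

```python
import torch

def ablate_visual_tokens(visual_tokens: torch.Tensor,
                         ablate_idx: torch.Tensor,
                         mean_embedding: torch.Tensor) -> torch.Tensor:
    """Implements the substitution above: tokens whose index is in A are replaced by mu.

    visual_tokens: (n_tokens, d) original visual embeddings.
    ablate_idx:    1-D tensor of indices to ablate (the set A).
    mean_embedding: (d,) mean embedding (mu).
    """
    ablated = visual_tokens.clone()
    ablated[ablate_idx] = mean_embedding
    return ablated
```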
- Logit Lens for Representation Evolution: To understand how visual token representations change through the LM layers, they applied the logit lens technique. This involves projecting the hidden state of each token at each layer into the model's vocabulary space using the unembedding matrix and observing the top predicted tokens.
Surprisingly, they found that in later layers, the representations of visual tokens align with interpretable text tokens describing the content of the original image patch. This included specific details (e.g., "diam" for a diamond pattern on a sweater) and sometimes even non-English terms. This indicates that the LM, which was never pre-trained on next-token prediction over visual inputs and only encounters them during fine-tuning, refines visual information towards a language-interpretable space. However, some global features (such as object counts) sometimes appeared in background tokens, suggesting possible artifacts of the LM's text processing. This finding is significant because it suggests that the hypothesis that transformer layers iteratively refine representations towards vocabulary concepts may generalize to multimodal fine-tuning. Potential practical uses include deriving coarse segmentation maps from logit-lens activations and improving hallucination-reduction methods by directing attention.
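A minimal logit-lens sketch, assuming a HuggingFace-style model run with output_hidden_states=True that exposes a final norm and an lm_head (unembedding) module. Applying the final norm before unembedding is a common logit-lens convention and may differ from the paper's exact recipe; restricting to the visual token positions then reads off per-patch predictions.

```python
import torch

@torch.no_grad()
def logit_lens(hidden_states, final_norm, unembed, tokenizer, layer: int, top_k: int = 5):
    """Project each token's hidden state at `layer` into vocabulary space.

    hidden_states: tuple of (batch, seq, d) tensors from a forward pass with
                   output_hidden_states=True.
    final_norm:    the LM's final normalization module (e.g. RMSNorm).
    unembed:       the LM's unembedding / lm_head module.
    Returns, for every position, the top-k vocabulary tokens at that layer.
    To inspect visual tokens, slice the returned list to the visual-token positions.
    """
    h = hidden_states[layer][0]           # (seq, d), batch element 0
    logits = unembed(final_norm(h))       # (seq, vocab)
    top = logits.topk(top_k, dim=-1).indices
    return [tokenizer.convert_ids_to_tokens(row.tolist()) for row in top]
```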
- Attention Knockout for Information Flow: To trace how information flows from visual tokens to the final prediction, they used attention knockout. This technique blocks information flow by zeroing the attention weights between specific token groups at certain layers (in practice, by setting the corresponding entries of the attention mask to $-\infty$).
They blocked attention in windows of layers between different token groups:
- From object tokens (with/without buffer) to the Last Token Position (where the model generates the answer).
- From non-object tokens to the Last Token Position.
- Among visual tokens themselves (e.g., non-last row to last row of visual tokens), testing a hypothesis that information is summarized in a subset of visual tokens.
The results showed that blocking attention from object tokens to the Last Token Position in mid to late layers noticeably degraded performance. This suggests that the model directly extracts object-specific information from these localized visual tokens in later processing stages. Blocking attention from non-object tokens in early layers also impacted performance, indicating the early integration of broader contextual information. Crucially, blocking attention among visual tokens themselves had minimal impact, suggesting the model does not rely on summarizing visual information within a specific subset of visual tokens before using it for the final prediction.
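A minimal sketch of the masking step behind attention knockout. Building the additive mask is generic; injecting it into the attention modules of only the layers inside the chosen window is model-specific (e.g. via forward hooks) and is only indicated in comments. All names here are illustrative.

```python
import torch

def knockout_mask(seq_len: int, src_idx: torch.Tensor, tgt_idx: torch.Tensor,
                  dtype=torch.float32) -> torch.Tensor:
    """Additive attention mask that blocks target positions from attending to source positions.

    Returns a (seq_len, seq_len) tensor with -inf at [t, s] for every t in tgt_idx and
    s in src_idx, and 0 elsewhere. It is meant to be added to the attention scores
    (on top of the usual causal mask) only for layers inside the knockout window;
    wiring it into each layer's attention is model-specific and omitted here.
    """
    mask = torch.zeros(seq_len, seq_len, dtype=dtype)
    mask[tgt_idx.unsqueeze(-1), src_idx] = float("-inf")
    return mask

# Example (hypothetical): block the last token position from reading the object tokens.
# obj_idx = object_token_indices(seg_mask)  # from the selection sketch above
# m = knockout_mask(seq_len, src_idx=obj_idx, tgt_idx=torch.tensor([seq_len - 1]))
```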
In conclusion, the paper provides evidence that in LLaVA: object information is localized to specific visual tokens, these visual tokens become interpretable as language concepts through the layers, and the model extracts information directly from relevant visual tokens in later layers for prediction. These findings are foundational for building more interpretable, controllable, and robust multimodal systems. The techniques used (ablation, logit lens, attention knockout) are practical methods for probing VLM internals and can be applied to further research into hallucination reduction and model editing. The code for the experiments is publicly available, enabling practitioners to replicate and extend these analyses.