
Explaining cross-family variation in visual-grounding layer distributions

Determine the underlying architectural or training factors that cause the observed variation in which Transformer layers most accurately localize visual evidence (as measured by AUROC and NDCG alignment with VisualCOT evidence labels) across Vision-Language Model families such as LLaVA, Qwen2.5-VL, and Gemma3.


Background

The paper analyzes layer-wise attention dynamics in Vision-Language Models (VLMs) and finds that deeper layers tend to act as visual grounders, attending sparsely but reliably to evidence regions. Using VisualCOT, the authors quantify evidence attribution with AUROC and NDCG, showing strong alignment of deep-layer attention with ground-truth evidence even when answers are incorrect.
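To make the attribution metrics concrete, the sketch below shows one plausible way to score how well a single layer's attention over image patches aligns with a binary evidence mask, using AUROC and NDCG from scikit-learn. This is not the authors' code: the function and variable names (`layer_evidence_alignment`, `attn_to_image`, `evidence_mask`) and the 24x24 patch grid are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# score one layer's attention over image patches against a binary
# evidence mask derived from VisualCOT evidence boxes.
import numpy as np
from sklearn.metrics import roc_auc_score, ndcg_score


def layer_evidence_alignment(attn_to_image: np.ndarray,
                             evidence_mask: np.ndarray) -> dict:
    """AUROC and NDCG of per-patch attention vs. ground-truth evidence patches.

    attn_to_image: 1-D attention mass per image patch (e.g., averaged over
                   heads and answer tokens) -- hypothetical input.
    evidence_mask: 0/1 array marking patches overlapping the evidence region.
    """
    auroc = roc_auc_score(evidence_mask, attn_to_image)
    # ndcg_score expects 2-D (n_samples, n_items) arrays.
    ndcg = ndcg_score(evidence_mask[None, :], attn_to_image[None, :])
    return {"auroc": auroc, "ndcg": ndcg}


# Toy example: 576 patches (a 24x24 grid is assumed here for illustration).
rng = np.random.default_rng(0)
attn = rng.random(576)
mask = np.zeros(576, dtype=int)
mask[100:140] = 1  # patches inside the evidence region
print(layer_evidence_alignment(attn, mask))
```

Computing such scores per layer and taking the argmax per example would yield the kind of layer distributions whose cross-family differences this problem asks to explain.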

However, the distribution of layers that maximize evidence attribution differs across model families: LLaVA’s best-performing layers cluster around middle depths, Qwen2.5-VL’s peaks concentrate near the final output layers, and Gemma3 shows a periodic pattern with recurring "good attribution" layers. The authors hypothesize these differences may stem from family-specific design choices or training strategies, but emphasize that a deeper understanding of the underlying causes remains open, motivating this problem.

References

We hypothesize that such differences may stem from family-specific design choices or training strategies, though a deeper understanding of their underlying causes remains open for future investigation.

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs (2510.17771 - Liu et al., 20 Oct 2025) in Appendix, Section "Additional Results and Analysis" — Subsection "Full Layer-wise Attention Dynamics Visualization" (sec:ap-res-att)