Explaining cross-family variation in visual-grounding layer distributions
Determine which underlying architectural or training factors cause the observed variation across Vision-Language Model families (e.g., LLaVA, Qwen2.5-VL, and Gemma3) in which Transformer layers most accurately localize visual evidence, as measured by AUROC and NDCG alignment with VisualCOT evidence labels.
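To make the measurement concrete, the layer-wise comparison can be sketched as follows: for each layer, score the image patches by attention mass and check how well those scores rank the patches annotated as evidence. This is a minimal sketch, not the paper's implementation; the array shapes, the toy evidence mask, and the pure-NumPy AUROC/NDCG implementations are illustrative assumptions.

```python
import numpy as np

def layer_auroc(attn, evidence):
    """AUROC of per-patch attention scores against a binary evidence mask.

    Equals the probability that a randomly chosen evidence patch receives
    more attention than a randomly chosen non-evidence patch (ties count 0.5).
    """
    pos = attn[evidence == 1]
    neg = attn[evidence == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

def ndcg(attn, evidence):
    """NDCG of the attention-induced patch ranking with binary gains."""
    order = np.argsort(-attn)          # patches sorted by descending attention
    gains = evidence[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = (gains * discounts).sum()
    ideal = np.sort(evidence)[::-1]    # best possible ranking
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0

# Toy setup (hypothetical): 3 layers, 8 image patches, evidence at patches 2 and 5.
rng = np.random.default_rng(0)
evidence = np.array([0, 0, 1, 0, 0, 1, 0, 0], dtype=float)
attn_per_layer = rng.random((3, 8))    # stand-in for per-layer attention over patches

auroc_by_layer = [layer_auroc(a, evidence) for a in attn_per_layer]
best_layer = int(np.argmax(auroc_by_layer))
```

Repeating this over a dataset and averaging per layer yields the layer-wise grounding profile whose cross-family variation the question above asks to explain.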
References
We hypothesize that such differences may stem from family-specific design choices or training strategies, though a deeper understanding of their underlying causes remains open for future investigation.
— Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
(arXiv:2510.17771, Liu et al., 20 Oct 2025), Appendix, Section "Additional Results and Analysis", Subsection "Full Layer-wise Attention Dynamics Visualization" (sec:ap-res-att)