Scalable detection and causal evaluation of aggregation heads in modern vision–language models
Develop computationally viable methods to (i) automatically detect aggregation attention heads in large-scale vision–language models and (ii) apply causal interventions that validate the role of these heads in symbol grounding, i.e., in linking environmental tokens (e.g., visual patch embeddings) to linguistic tokens in the language backbone, across diverse architectures.
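A minimal sketch of the detection step, under assumptions not taken from the source: aggregation heads are scored by how much attention mass text-token queries place on image-token keys, and heads above a threshold are flagged. All names (`aggregation_scores`, `detect_aggregation_heads`) are hypothetical, and the attention tensor is a toy stand-in rather than any specific VLM's API.

```python
import numpy as np

def aggregation_scores(attn, image_mask, text_mask):
    """Score each head by the mean attention mass that text-token
    queries place on image-token keys.

    attn: (n_layers, n_heads, seq, seq), each row (query) sums to 1.
    image_mask / text_mask: boolean arrays of shape (seq,).
    Returns an (n_layers, n_heads) array of scores in [0, 1].
    """
    # Select text-query rows, then image-key columns.
    text_to_image = attn[:, :, text_mask][:, :, :, image_mask]
    # Total mass on image keys per query, averaged over text queries.
    return text_to_image.sum(axis=-1).mean(axis=-1)

def detect_aggregation_heads(attn, image_mask, text_mask, threshold=0.5):
    """Return (layer, head) pairs whose text-to-image mass exceeds threshold."""
    scores = aggregation_scores(attn, image_mask, text_mask)
    return [tuple(idx) for idx in np.argwhere(scores > threshold)]

# Toy example: 6 tokens, positions 0-3 are image patches, 4-5 are text.
seq = 6
image_mask = np.zeros(seq, dtype=bool)
image_mask[:4] = True
text_mask = ~image_mask

attn = np.zeros((1, 2, seq, seq))
attn[0, 0, :, 4:] = 0.5    # head 0 attends only to text keys
attn[0, 1, :, :4] = 0.25   # head 1 attends only to image keys

print(detect_aggregation_heads(attn, image_mask, text_mask))  # → [(0, 1)]
```

Because scores are computed in one vectorized pass over the attention tensor, this style of scan stays cheap enough to run over every layer and head of a large model, which is what the scalability requirement calls for.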
For these reasons, while our case study highlights promising evidence of grounding heads in modern VLMs, systematic detection and causal evaluation of such heads at scale remain an open challenge. Future work will need to develop computationally viable methods for (i) automatically detecting aggregation heads across diverse VLMs, and (ii) applying causal interventions to validate their role in grounding.
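The causal-intervention step can be sketched with a standard head-ablation probe: run the model once cleanly, once with a candidate head's output zeroed, and measure how much the output changes. The toy single-layer multi-head attention below is an illustrative stand-in, not any real VLM; all function and parameter names are hypothetical.

```python
import numpy as np

def multihead_output(x, Wq, Wk, Wv, ablate=None):
    """Toy multi-head self-attention; optionally zero one head's output.

    x: (seq, d_model); Wq, Wk, Wv: (n_heads, d_model, d_head).
    ablate: index of the head to ablate, or None for a clean run.
    Returns concatenated head outputs, shape (seq, n_heads * d_head).
    """
    outs = []
    for h in range(Wq.shape[0]):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(k.shape[-1])
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        head_out = weights @ v
        if h == ablate:
            head_out = np.zeros_like(head_out)  # causal ablation
        outs.append(head_out)
    return np.concatenate(outs, axis=-1)

def ablation_effect(x, Wq, Wk, Wv, head):
    """L2 distance between clean and head-ablated outputs: a simple
    proxy for the head's causal contribution."""
    clean = multihead_output(x, Wq, Wk, Wv)
    ablated = multihead_output(x, Wq, Wk, Wv, ablate=head)
    return float(np.linalg.norm(clean - ablated))

# Toy check: a head whose value projection is zero has no causal effect.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(2, 4, 2)) for _ in range(3))
Wv[1] = 0.0  # head 1 contributes nothing
print(ablation_effect(x, Wq, Wk, Wv, head=0) > 0)   # head 0 matters
print(ablation_effect(x, Wq, Wk, Wv, head=1))       # head 1 does not
```

In a real evaluation the distance on raw outputs would be replaced by a behavioral metric (e.g., change in grounding accuracy), and the ablation would be applied inside the model's forward pass rather than in a toy layer.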