Multi-head, multi-layer attention dynamics
Investigate and characterize the gradient-driven dynamics of attention in multi-head, multi-layer transformer architectures trained with cross-entropy loss, specifically determining how inter-head coordination and hierarchical specialization emerge and interact across heads and layers.
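One concrete entry point to this question is to track how large the cross-entropy gradient is for each head's parameters during training. The sketch below (not from the cited paper; all names, shapes, and the finite-difference probe are illustrative assumptions) builds a tiny one-layer, two-head attention classifier in NumPy and estimates the per-head gradient norm on the query weights:

```python
# Hypothetical probe: per-head gradient magnitudes in a tiny two-head
# attention model under cross-entropy, via central finite differences.
# Shapes and names are illustrative, not from the cited paper.
import numpy as np

rng = np.random.default_rng(0)
T, d, H, C = 4, 8, 2, 3          # sequence length, model dim, heads, classes
dh = d // H                      # per-head dimension

X = rng.standard_normal((T, d))  # token embeddings
y = 1                            # target class for the last token
Wq = rng.standard_normal((H, d, dh)) * 0.1
Wk = rng.standard_normal((H, d, dh)) * 0.1
Wv = rng.standard_normal((H, d, dh)) * 0.1
Wo = rng.standard_normal((d, C)) * 0.1

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def loss(Wq_all):
    """Cross-entropy of the last token's class prediction."""
    heads = []
    for h in range(H):
        Q, K, V = X @ Wq_all[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(dh))   # (T, T) attention weights
        heads.append(A @ V)
    out = np.concatenate(heads, axis=-1)     # (T, d) concatenated heads
    logits = out[-1] @ Wo                    # classify the last token
    return -np.log(softmax(logits)[y])

def head_grad_norm(h, eps=1e-5):
    """Finite-difference gradient norm of the loss w.r.t. head h's Wq."""
    g = np.zeros_like(Wq[h])
    for idx in np.ndindex(*Wq[h].shape):
        Wp = Wq.copy(); Wp[h][idx] += eps
        Wm = Wq.copy(); Wm[h][idx] -= eps
        g[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)
    return float(np.linalg.norm(g))

norms = [head_grad_norm(h) for h in range(H)]
print(norms)  # per-head gradient norms on the query weights
```

Logging such per-head norms (or gradient cosines between heads) over training steps is one way the inter-head coordination question could be made empirically concrete; extending the model to multiple layers would expose the hierarchical-specialization axis as well.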
References
Multi-head, multi-layer dynamics, including inter-head coordination and hierarchical specialization, remain open.
— Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
(2512.22473 - Aggarwal et al., 27 Dec 2025) in Section "Limitations and Future Directions"