Multi-head, multi-layer attention dynamics

Investigate and characterize the gradient-driven dynamics of attention in multi-head, multi-layer transformer architectures trained with cross-entropy, specifically determining how coordination among heads within a layer and hierarchical specialization across stacked layers arise and interact.

Background

The paper presents a complete first-order analysis of gradients in a single attention head trained with cross-entropy, deriving advantage-based routing for scores and responsibility-weighted updates for values. It shows that these coupled dynamics induce an EM-like two-timescale behavior and sculpt low-dimensional Bayesian manifolds, but all results are developed in a minimal, single-head, single-layer setting without residual connections or LayerNorm.
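
To make the single-head structure concrete, the NumPy sketch below writes out the first-order gradients for one query attending over n key/value slots: the value gradient is responsibility-weighted by the attention, and the score gradient compares each slot's utility under the upstream gradient against the attention-weighted baseline, an advantage-like term. This is standard softmax and chain-rule calculus consistent with the paper's description as summarized above; the names `utility` and `baseline`, the 1/sqrt(d) scaling, and the use of a fixed upstream gradient `g` are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal single-head setting: one query attends over n key/value slots.
# The vector g stands in for the upstream cross-entropy gradient dL/d(output).
n, d = 5, 8
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
g = rng.normal(size=d)

scores = K @ q / np.sqrt(d)          # s_j = <k_j, q> / sqrt(d)
a = np.exp(scores - scores.max())
a /= a.sum()                         # attention weights a_j = softmax(s)_j
out = a @ V                          # head output o = sum_j a_j v_j

# Responsibility-weighted value update: dL/dv_j = a_j * g,
# so each value slot moves in proportion to the attention it received.
grad_V = np.outer(a, g)

# Advantage-like routing update for the scores:
# dL/ds_j = a_j * (<g, v_j> - <g, o>), i.e. each slot's utility relative to
# the attention-weighted baseline decides whether its score rises or falls.
utility = V @ g                      # u_j = <g, v_j>
baseline = a @ utility               # attention-weighted mean utility = <g, o>
grad_scores = a * (utility - baseline)

# Finite-difference check of the score gradient, using the linearized loss
# g . o(s) with g held fixed (exactly the quantity the chain rule contracts).
def loss(s):
    w = np.exp(s - s.max())
    w /= w.sum()
    return g @ (w @ V)

eps = 1e-6
fd = np.array([
    (loss(scores + eps * np.eye(n)[j]) - loss(scores - eps * np.eye(n)[j])) / (2 * eps)
    for j in range(n)
])
assert np.allclose(fd, grad_scores, atol=1e-6)

print("score gradient (advantage-weighted):", np.round(grad_scores, 4))
print("value gradient norms (responsibility-weighted):",
      np.round(np.linalg.norm(grad_V, axis=1), 4))
```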

While the single-head analysis clarifies how routing and content co-evolve to produce specialized prototypes and focused attention, extending it to realistic transformer blocks with multiple heads and layers introduces additional interactions: heads within a layer may coordinate or compete, and stacked layers may develop hierarchical specialization. The authors explicitly state that understanding these multi-head, multi-layer dynamics, especially inter-head coordination and hierarchical specialization, remains open.
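
One way to begin probing this open question empirically is to train a tiny multi-head, multi-layer attention stack and track head-level and layer-level diagnostics over training. The sketch below is a hypothetical setup, not from the paper: it uses PyTorch's nn.MultiheadAttention (two heads, two layers, no residual connections or LayerNorm, mirroring the paper's minimal setting), a synthetic copy-style task trained with cross-entropy, and two illustrative diagnostics: cosine similarity between heads' attention maps (coordination versus redundancy) and per-layer attention entropy (sharpening of routing).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_heads, n_layers, seq_len, vocab = 32, 2, 2, 16, 50

# Attention-only stack: no residuals or LayerNorm, mirroring the minimal setting.
layers = torch.nn.ModuleList([
    torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    for _ in range(n_layers)
])
tok_emb = torch.nn.Embedding(vocab, d_model)
pos_emb = torch.nn.Embedding(seq_len, d_model)
readout = torch.nn.Linear(d_model, vocab)
params = (list(layers.parameters()) + list(tok_emb.parameters())
          + list(pos_emb.parameters()) + list(readout.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

def batch(bs=64):
    # Synthetic task: from the last position, predict the token at position 0.
    x = torch.randint(0, vocab, (bs, seq_len))
    return x, x[:, 0]

positions = torch.arange(seq_len)
for step in range(201):
    x, y = batch()
    h = tok_emb(x) + pos_emb(positions)
    attn_maps = []
    for attn in layers:
        h, w = attn(h, h, h, need_weights=True, average_attn_weights=False)
        attn_maps.append(w)                      # (batch, heads, seq, seq)
    loss = F.cross_entropy(readout(h[:, -1]), y)
    opt.zero_grad(); loss.backward(); opt.step()

    if step % 50 == 0:
        with torch.no_grad():
            # Inter-head coordination in layer 1: cosine similarity of the two
            # heads' attention maps (high = redundant heads, low = division of labor).
            maps = attn_maps[0].flatten(2)
            head_sim = F.cosine_similarity(maps[:, 0], maps[:, 1], dim=-1).mean()
            # Per-layer attention entropy (lower = sharper, more specialized routing).
            entropy = [(-(w.clamp_min(1e-9).log() * w).sum(-1)).mean().item()
                       for w in attn_maps]
            print(f"step {step:3d}  loss {loss.item():.3f}  "
                  f"layer-1 head similarity {head_sim.item():.3f}  "
                  f"entropies {[round(e, 3) for e in entropy]}")
```

Tracking these diagnostics over training would show whether heads settle into complementary routings and whether deeper layers sharpen on a different timescale than shallow ones; the open problem is to characterize such behavior analytically rather than only empirically.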

References

Multi-head, multi-layer dynamics, including inter-head coordination and hierarchical specialization, remain open.

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds (2512.22473 - Aggarwal et al., 27 Dec 2025) in Section "Limitations and Future Directions"