Representation collapse and its dependence on architectural scaling

Determine the conditions and mechanisms under which representation collapse (clustering or low-rank degeneration of token embeddings) occurs in deep multi-head self-attention Transformers at initialization, and characterize how the phenomenon depends on the joint scaling of architectural hyperparameters: depth, width, number of attention heads, residual-branch scaling, and weight-initialization temperature.

Background

The paper studies deep attention-only Transformers whose weights are drawn i.i.d. across layers and heads, modeling the network's behavior at initialization. A central concern is representation collapse (also termed oversmoothing or rank collapse): token embeddings become clustered or low-dimensional as depth increases, which harms expressivity. Prior works often analyzed tied-weight models; this paper shows that clustering can still arise even in fully heterogeneous, randomly resampled settings under appropriate scaling.
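The collapse phenomenon is easy to reproduce numerically. The sketch below is an illustration of the general setup, not code from the paper: it stacks single-head softmax self-attention layers whose Gaussian query/key/value weights are resampled i.i.d. at every layer, and tracks the mean pairwise cosine similarity of the token embeddings. Because each attention matrix is row-stochastic, repeated application averages the tokens together, and the similarity climbs toward 1 (full clustering) with depth.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, rng):
    """One single-head self-attention layer with i.i.d. N(0, 1/d) weights,
    freshly resampled on every call (untied weights across layers)."""
    n, d = X.shape
    Wq, Wk, Wv = (rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # row-stochastic attention
    return A @ (X @ Wv)

def mean_cosine_similarity(X):
    """Average pairwise cosine similarity of the rows of X (1 = full collapse)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = Xn @ Xn.T
    n = len(X)
    return (G.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
n_tokens, d, depth = 32, 64, 50
X = rng.normal(size=(n_tokens, d))          # random token embeddings
sim_init = mean_cosine_similarity(X)
for _ in range(depth):
    X = attention_layer(X, rng)
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # renormalize to avoid norm blow-up/decay
sim_final = mean_cosine_similarity(X)
print(f"mean cosine similarity: init={sim_init:.3f}, after depth {depth}: {sim_final:.3f}")
```

Note that this toy model omits residual connections and multiple heads precisely so the averaging effect is visible; the open question concerns how scaling those ingredients jointly changes the outcome.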

Within this context, the authors argue that, beyond heuristics and empirical observations, a rigorous understanding of when and how collapse occurs, and of how it depends on architectural scaling, is still lacking. Their homogenized-transformer framework derives drift–diffusion limits and provides tools for analyzing collapse, but the general dependence on joint hyperparameter scaling is explicitly identified as an open question.

References

"Understanding when and how such collapse occurs and how it depends on architectural scaling remains an open theoretical question."

Homogenized Transformers (2604.01978, Koubbi et al., 2 Apr 2026), Section 1 (Introduction)