Representation collapse and its dependence on architectural scaling
Determine the conditions and mechanisms under which representation collapse (clustering or low-rank degeneration of token embeddings) occurs in deep multi-head self-attention Transformers at initialization, and characterize how the phenomenon depends on the joint scaling of architectural hyperparameters such as depth, width, number of attention heads, residual scaling, and weight initialization temperature.
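One way to probe this question empirically is to track how the average pairwise cosine similarity of token embeddings evolves with depth in a randomly initialized attention stack. The sketch below is a minimal illustration under assumptions of my own, not the paper's setup: it uses PyTorch's `nn.MultiheadAttention` inside a Post-LN residual stack, treats the residual scaling `alpha` as a hypothetical knob, and picks arbitrary values for depth, width, head count, and sequence length.

```python
# Minimal sketch (assumed setup, not the paper's): measure token-similarity
# growth with depth in a randomly initialized Post-LN attention-only stack.
# `alpha` is a hypothetical residual-scaling knob; all sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_heads, depth, seq_len = 64, 4, 64, 32


def mean_pairwise_cosine(x: torch.Tensor) -> float:
    """Average cosine similarity between distinct tokens.

    x: (seq_len, d_model). Values near 1 indicate the token embeddings
    have clustered, i.e. representation collapse.
    """
    x = F.normalize(x, dim=-1)
    sim = x @ x.T                      # (seq_len, seq_len) similarity matrix
    n = x.shape[0]
    return ((sim.sum() - n) / (n * (n - 1))).item()  # drop the diagonal


with torch.no_grad():
    x0 = torch.randn(1, seq_len, d_model)   # one random input sequence
    for alpha in (1.0, 0.5, 0.1):           # residual-branch scaling to sweep
        h = x0.clone()
        for layer in range(depth):
            # Fresh random weights per layer: the network "at initialization".
            attn = nn.MultiheadAttention(d_model, n_heads,
                                         bias=False, batch_first=True)
            ln = nn.LayerNorm(d_model)
            out, _ = attn(h, h, h, need_weights=False)
            h = ln(h + alpha * out)         # Post-LN residual update
            if (layer + 1) % 16 == 0:
                sim = mean_pairwise_cosine(h.squeeze(0))
                print(f"alpha={alpha:.1f} depth={layer + 1:3d} "
                      f"mean cosine={sim:.3f}")
```

With these defaults the printed similarity typically drifts upward with depth, and faster for larger `alpha`; sweeping `d_model`, `n_heads`, or the weight initialization scale turns the same harness into a crude scan over the hyperparameters named in the problem statement.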
References
Understanding when and how such collapse occurs and how it depends on architectural scaling remains an open theoretical question.
— Homogenized Transformers
(2604.01978 - Koubbi et al., 2 Apr 2026) in Section 1 (Introduction)