Theoretical necessity of rescaling in transformer training
Develop a theoretical explanation for why outlier-driven rescaling is necessary for effective training and representation learning in transformer-based large language models. In this mechanism, emergent outlier features interact with normalization layers such as RMSNorm to rescale the non-outlier components of the hidden state.
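The rescaling effect itself follows directly from the standard RMSNorm formula, y = g * x / sqrt(mean(x^2) + eps). The sketch below is a minimal numerical illustration of that mechanism, not code from the paper: the outlier index and its magnitude (100.0) are arbitrary illustrative choices.

```python
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard RMSNorm: x scaled by the reciprocal root-mean-square of its entries."""
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return gain * x / rms

d = 8
gain = np.ones(d)

# A "typical" hidden state: all coordinates of comparable magnitude.
x_plain = np.ones(d)

# The same state with one emergent outlier coordinate
# (index 0 and magnitude 100.0 are arbitrary, for illustration only).
x_outlier = x_plain.copy()
x_outlier[0] = 100.0

y_plain = rms_norm(x_plain, gain)
y_outlier = rms_norm(x_outlier, gain)

# The outlier dominates the RMS, so every non-outlier coordinate is
# shrunk relative to the outlier-free case.
print("non-outlier coords without outlier:", y_plain[1:3])   # ~1.0 each
print("non-outlier coords with outlier:   ", y_outlier[1:3]) # ~0.028 each
```

Because a single large coordinate dominates the root-mean-square, all other coordinates are divided by a factor set almost entirely by the outlier's magnitude; this is the rescaling of non-outlier components that the problem asks to explain theoretically.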
References
However, we do not investigate why such rescaling is necessary for effective training or representation learning. A deeper theoretical understanding of the role of rescaling remains an open question.
— A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
(arXiv:2601.22966, Qiu et al., 30 Jan 2026), Section: Limitations