Theoretical necessity of rescaling in transformer training

Develop a theoretical explanation for why outlier-driven rescaling (in which emergent outliers interact with normalization layers such as RMSNorm to rescale the non-outlier components of the representation) is necessary for effective training and representation learning in transformer-based large language models.

Background

The paper empirically argues that emergent outliers (attention sinks and residual sinks) interact with normalization mechanisms (softmax attention and RMSNorm) to perform outlier-driven rescaling, which stabilizes training and improves performance. It shows that removing or directly suppressing these outliers degrades training stability and final performance, whereas explicit gating-based rescaling (e.g., GatedNorm, GatedAttention) removes the need for such outliers and yields smoother, more quantization-friendly activations.
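As a minimal NumPy sketch (not taken from the paper) of the rescaling mechanism described above: because RMSNorm divides every coordinate by the root-mean-square of the whole vector, a single large outlier coordinate inflates the RMS and thereby shrinks all non-outlier coordinates. The dimension, outlier magnitude, and random seed below are arbitrary illustrative choices.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)
    return x / np.sqrt(np.mean(x**2) + eps)

d = 64
rng = np.random.default_rng(0)
x = rng.normal(size=d)        # typical activations, O(1) scale
x_out = x.copy()
x_out[0] = 100.0              # inject a large "residual sink"-style outlier

y = rms_norm(x)
y_out = rms_norm(x_out)

# The outlier dominates the RMS, so every other coordinate is rescaled down:
typical = np.abs(y[1:]).mean()        # roughly O(1)
rescaled = np.abs(y_out[1:]).mean()   # an order of magnitude smaller
print(typical, rescaled)
```

This illustrates only the arithmetic of the interaction; why such data-dependent shrinking of non-outlier components should help training is exactly the open question posed here.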

Despite these empirical findings, the authors note that they do not address the underlying theoretical rationale for why such rescaling is needed in the first place. They explicitly state that a deeper theoretical understanding of the role and necessity of rescaling in effective training and representation learning remains an open question.

References

"However, we do not investigate why such rescaling is necessary for effective training or representation learning. A deeper theoretical understanding of the role of rescaling remains an open question."