Scaling behavior of gradient normalization in Transformer pretraining

Determine whether RMS gradient normalization applied before AdamW improves performance consistently as model size and compute grow in decoder-only Transformer language-model pretraining, and explain the scaling-law crossover observed at larger scales. In particular, establish whether a different learning-rate scaling or other modifications are required to keep gradient normalization robust at scale.

Background

The paper reports three cases of scaling-law crossover, in which methods that perform better at small scales are overtaken by alternatives at larger scales. In the third case, a proposed recipe that includes RMS gradient normalization (along with GeGLU activations, increased MLP width, and higher weight decay) improves performance and reduces learning-rate sensitivity at small scales, but exhibits a crossover around 2–3×10^2 exaflops, after which the baseline without gradient normalization outperforms it.
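
For concreteness, the sketch below shows one way "RMS gradient normalization applied before AdamW" can be implemented, assuming per-tensor normalization in PyTorch; the paper's exact normalization granularity (per-tensor vs. global) and epsilon are assumptions here, not details taken from the source.

```python
import torch

def rms_normalize_grads_(model: torch.nn.Module, eps: float = 1e-8) -> None:
    """Rescale each parameter's gradient to unit RMS in place, before the optimizer step."""
    for p in model.parameters():
        if p.grad is None:
            continue
        rms = p.grad.pow(2).mean().sqrt()  # root-mean-square of this tensor's gradient
        p.grad.div_(rms + eps)             # normalize so the gradient has RMS ~ 1

# Hypothetical usage inside a standard training loop:
#   loss.backward()
#   rms_normalize_grads_(model)    # gradient normalization applied before AdamW
#   optimizer.step()               # optimizer = torch.optim.AdamW(model.parameters(), ...)
#   optimizer.zero_grad(set_to_none=True)
```

Because the normalization fixes the gradient scale seen by AdamW, it interacts directly with the learning rate, which is why the open question asks whether a different learning-rate scaling is needed at larger scales.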

While the authors demonstrate the phenomenon, they do not identify the precise cause, and they question whether gradient normalization becomes detrimental at large scales or whether different learning-rate scaling is needed. They explicitly flag this third case as an open question, emphasizing the complexity of scaling behavior.

References

We present three cases of scaling-law crossover with increasing complexity, offering explanations for the first two but leaving the third as an open question that underscores the complexities of scaling.

Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling (arXiv:2409.15156, Xiao, 23 Sep 2024), Section "Scaling Law Crossover, a Curse from Scale?" (introductory paragraphs before Subsection "Warmup: Training Instability" and Subsection "Is Gradient Normalization a Good Idea?")