Scaling behavior of gradient normalization in Transformer pretraining
Ascertain whether RMS gradient normalization applied before AdamW consistently improves performance as model size and compute scale in decoder-only Transformer language-model pretraining, and explain the scaling-law crossover observed at larger scales. Determine whether learning-rate scaling or other modifications are required to make gradient normalization robust at scale.
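For concreteness, the operation under study can be sketched as follows. This is a minimal PyTorch sketch assuming per-tensor RMS normalization of gradients applied immediately before the AdamW update; the function name rms_normalize_grads, the epsilon value, the per-tensor (rather than global) granularity, and the hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import math
import torch

def rms_normalize_grads(model: torch.nn.Module, eps: float = 1e-8) -> None:
    """Rescale each parameter's gradient to unit RMS: g <- g / (RMS(g) + eps)."""
    for p in model.parameters():
        if p.grad is not None:
            rms = p.grad.norm() / math.sqrt(p.grad.numel())
            p.grad.div_(rms + eps)

# Illustrative usage inside one pretraining step (a Linear layer stands in for
# the decoder-only Transformer; the loss is a placeholder for the LM objective).
model = torch.nn.Linear(512, 512)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
rms_normalize_grads(model)   # gradient normalization applied before the AdamW step
opt.step()
opt.zero_grad(set_to_none=True)
```

Whether such a normalization interacts with the learning-rate schedule (and hence whether the learning rate must be rescaled as models grow) is exactly the robustness question posed above.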
References
We present three cases of scaling-law crossover with increasing complexity, offering explanations for the first two but leaving the third as an open question that underscores the complexities of scaling.
— Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling (arXiv:2409.15156, Xiao, 23 Sep 2024), Section "Scaling Law Crossover, a Curse from Scale?", introductory paragraphs before the subsections "Warmup: Training Instability" and "Is Gradient Normalization a Good Idea?"