Learning Rate Transfer in Normalized Transformers
- The paper presents a comprehensive analysis of how normalization techniques like LayerNorm and ScaleNorm affect gradient dynamics and stabilize learning rate transfer in Transformer models.
- It demonstrates that advanced protocols such as Decoupled Relative Learning Rate Schedules (RLRS) and νGPT enable significant speedups and robust performance across varying model widths and depths.
- The study highlights the practical need for adaptive, component-wise learning rate scaling and weight decay adjustments to maintain stability during cross-model transfers.
Learning rate transfer in normalized Transformers refers to the ability (or inability) to reuse optimal or near-optimal learning rate schedules, selected at one model scale or architecture variant, on larger models or across differing settings, without significant performance loss or instability. For Transformer architectures employing normalization techniques such as LayerNorm, ScaleNorm, or exact $\ell_2$-normalization, both the normalization placement and the hyperparameter scaling protocol can have a dramatic effect on learning rate transfer, optimization stability, and practical throughput at scale.
1. Theoretical Foundations: Gradient Dynamics and Normalization Placement
The sensitivity of Transformer optimization to learning rate schedules originates in the interaction between residual architectures, normalization, and gradient scaling at initialization. In the canonical Post-LN Transformer (layer normalization applied after the residual connection), mean-field analysis shows that the expected norm of gradients with respect to output-layer parameters is far higher than in lower layers, with a magnitude set by the hidden dimension of the feed-forward block and the model width. Applying a large learning rate in this case leads directly to instability unless the schedule includes a long warm-up phase. In contrast, moving the normalization inside each residual block (Pre-LN) regularizes gradient norms to a more uniform scale across layers, dramatically reducing early training instabilities and removing the need for elaborate warm-up (Xiong et al., 2020, Nguyen et al., 2019).
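The placement difference between the two variants can be sketched in a few lines. This is a minimal illustration, not code from the cited papers: `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN: normalize after adding the residual, LN(x + f(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN: normalize inside the residual branch, x + f(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or feed-forward: doubles its input."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)  # output is re-normalized
pre_out = pre_ln_block(x, toy_sublayer)    # residual stream passes through untouched
```

In the Pre-LN form the identity path from input to output is never rescaled, which is the structural property the gradient analysis relies on.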
Experimental results support this theoretical framing. For instance, on the IWSLT’14 De→En and WMT’14 En→De machine translation tasks trained with Adam, Pre-LN Transformers with no warm-up match the performance of Post-LN baselines while requiring 30–40% fewer epochs. BERT pretraining yields similar results, eliminating the need for the customary 10k warm-up steps and achieving a 40% reduction in steps to a fixed validation loss (Xiong et al., 2020).
2. Parameterization Scaling and Learning Rate Transfer
Scaling hyperparameters to maintain stable optimization across depth and width has spawned a taxonomy of approaches centered on parameterization and alignment. The Maximal Update Parameterization (μP) posits that, to keep the "relative representation change" of each layer constant as width increases, the learning rate for hidden and output weights should scale as $1/C$, where $C$ denotes the width of the layer. This scaling is derived under the assumption of favorable geometric alignment between weights and gradients (expressed through specific alignment exponents), which holds only in the early phases of training (Kosson et al., 21 Oct 2025).
However, empirical evidence demonstrates that these assumptions collapse after a small number of epochs in practical large-scale training (e.g., LLaMA-style Transformers). After this initial "alignment window," the stability of learning rate transfer is preserved by independent scaling of the weight decay coefficient, which keeps the product of learning rate and weight decay, $\eta\lambda$, constant across width, rather than by parameterization or learning rate scaling alone. In effect, for most of training, it is weight decay, not μP, that enables robust transfer of learning rates across width and depth (Kosson et al., 21 Oct 2025, Noci et al., 2024).
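One way to read the width scaling described above is sketched below. It assumes that the learning rate shrinks in proportion to width and that the quantity held constant is the product of learning rate and weight decay coefficient. The names `OptimizerHParams` and `scale_for_width` are hypothetical, and the numbers are illustrative rather than taken from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizerHParams:
    learning_rate: float
    weight_decay: float


def scale_for_width(base: OptimizerHParams, base_width: int, width: int) -> OptimizerHParams:
    """Transfer hidden-layer hyperparameters from base_width to width.

    Assumption: learning rate scales as 1/width and weight decay scales as
    width, so their product is unchanged by the transfer.
    """
    ratio = width / base_width
    return OptimizerHParams(
        learning_rate=base.learning_rate / ratio,
        weight_decay=base.weight_decay * ratio,
    )


base = OptimizerHParams(learning_rate=1e-3, weight_decay=0.1)
wide = scale_for_width(base, base_width=256, width=1024)
base_product = base.learning_rate * base.weight_decay
wide_product = wide.learning_rate * wide.weight_decay
```

Keeping the product fixed means the per-step shrinkage applied to the weights is the same at every width, even though the gradient step itself is smaller in the wider model.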
3. Component-wise and Per-layer Learning Rate Schedules
Uniform learning rates across all Transformer weights can be suboptimal due to heterogeneous dynamics in submodules (embeddings, LayerNorm, attention, feed-forward layers, router/expert parameters for Mixture-of-Experts). The Decoupled Relative Learning Rate Schedules (RLRS) protocol decomposes the schedule, assigning each component its own start and end multipliers, applied atop a global schedule (typically cosine). This allows targeted acceleration of components (e.g., higher embedding learning rates early, increasing router/expert rates over time in MoE). The multipliers can be tuned efficiently on small proxy models and then transferred without loss to models up to 27× larger (Ludziejewski et al., 4 Jul 2025).
Empirically, RLRS yields up to 23% wall-clock speedup for MoE and dense Transformers and eliminates instabilities (such as MoE loss spikes) by initializing critical modules with low learning rates and ramping up. The protocol is robust across normalization choices (LayerNorm or RMSNorm), improved further by grouping all normalization parameters into a dedicated RLRS component (Ludziejewski et al., 4 Jul 2025).
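The idea of per-component start and end multipliers on top of a global cosine schedule can be sketched as follows. This is an illustrative sketch only: the linear interpolation between the two multipliers, the component names, and every numeric value are assumptions, not the settings reported for RLRS.

```python
import math
from typing import Dict, Tuple


def cosine_schedule(step: int, total_steps: int, peak_lr: float,
                    final_fraction: float = 0.1) -> float:
    """Global cosine decay from peak_lr down to peak_lr * final_fraction."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    floor = peak_lr * final_fraction
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def component_lr(step: int, total_steps: int, peak_lr: float,
                 start_mult: float, end_mult: float) -> float:
    """Global schedule times a multiplier interpolated from start_mult to end_mult.

    Assumption: the multiplier moves linearly with training progress.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    multiplier = start_mult + (end_mult - start_mult) * progress
    return multiplier * cosine_schedule(step, total_steps, peak_lr)


# Hypothetical (start, end) multipliers per component.
MULTIPLIERS: Dict[str, Tuple[float, float]] = {
    "embedding": (5.0, 1.0),  # faster early, then back to the global rate
    "router": (0.2, 1.0),     # starts low, ramps up over training
    "attention": (1.0, 1.0),  # follows the global schedule unchanged
}

lr_at_start = {name: component_lr(0, 1000, 1e-3, *m) for name, m in MULTIPLIERS.items()}
lr_at_end = {name: component_lr(1000, 1000, 1e-3, *m) for name, m in MULTIPLIERS.items()}
```

Because the multipliers are relative to the global schedule, only the base learning rate needs to be retuned when the model grows, while the multipliers found on a small proxy can be reused.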
4. Scaling Laws and Alignment Exponents: νGPT and Hyperparameter Invariance
The μP approach, and its recent refinement in νGPT, identifies a key limitation: explicit width scaling in initialization and learning rate does not ensure true transfer if the alignment exponents of weight-gradient pairs deviate from the idealized μP prediction. νGPT leverages measured alignment exponents between gradient updates and activations, which for normalized Transformers empirically cluster at an intermediate value ("mid-alignment") rather than at the no-alignment or full-alignment extremes.
Accordingly, νGPT scales the learning rates of block-linear and unembedding weights with one width exponent and those of embeddings with another. Moreover, it prescribes a learning rate decay as the token horizon (number of steps) increases, and modifies the initialization of learned residual mixing parameters with depth. These exponents ensure that once a learning rate is optimized on a base model, it transfers without performance loss or instability to models with altered width, depth, or token horizon (Shigida et al., 29 Apr 2026).
Validation across width, depth (up to 128), and training steps (up to 225k tokens) confirms that νGPT collapses the validation-loss vs. learning-rate curves onto a shared optimum, which the baseline nGPT does not exhibit. This demonstrates "lossless" hyperparameter transfer on normalized Transformers, with practical guidelines for learning rate scaling along all axes (Shigida et al., 29 Apr 2026).
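Scaling rules of this kind share a power-law form, which the sketch below makes explicit. The function `transfer_lr` is hypothetical, and the exponents are caller-supplied placeholders: the values used in the example are arbitrary and are not the exponents νGPT prescribes.

```python
def transfer_lr(base_lr: float,
                base_width: int, width: int, width_exponent: float,
                base_tokens: int = 1, tokens: int = 1,
                token_exponent: float = 0.0) -> float:
    """Rescale a tuned base learning rate along the width and token-horizon axes.

    lr = base_lr * (width / base_width)^(-width_exponent)
                 * (tokens / base_tokens)^(-token_exponent)

    The exponents are inputs; this sketch does not fix their values.
    """
    width_factor = (width / base_width) ** (-width_exponent)
    token_factor = (tokens / base_tokens) ** (-token_exponent)
    return base_lr * width_factor * token_factor


# Illustrative placeholder exponents only.
lr_wide = transfer_lr(1e-2, base_width=256, width=1024, width_exponent=1.0)
lr_same = transfer_lr(1e-2, base_width=256, width=1024, width_exponent=0.0)
lr_long = transfer_lr(1e-2, 256, 256, 1.0, base_tokens=1000, tokens=4000, token_exponent=0.5)
```

Different parameter groups (for example block-linear weights versus embeddings) would call the same function with different exponents.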
5. Warm-Up Mechanisms, Effective Learning Rates, and Stability
Learning rate warm-up remains indispensable in settings where early gradient norms are poorly controlled, especially in Post-LN models or models lacking RMSNorm/LayerNorm-based scale-invariance (Xiong et al., 2020). However, normalization-centric architectures and parameterizations such as PreNorm, ScaleNorm, and νGPT obviate warm-up by regularizing both forward activations and backward gradient norms.
Mathematically, even with disparate initial layerwise effective learning rates (ELRs), normalization ensures that the ratios between layers' ELRs converge to unity quickly under a constant global learning rate (Mehmeti-Göpel et al., 2023). A hyperparameter-free "subcritical" warm-up scheme, which sets the global learning rate at each step to a critical value computed over the sublayers with maximal ELRs, aligns the ELRs of all layers (Mehmeti-Göpel et al., 2023).
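To make the notion of layerwise ELR concrete, the sketch below uses a common definition for scale-invariant weights, the global learning rate divided by the squared weight norm. That definition is an assumption here and may differ in detail from the one in the cited paper. The layer names and weight values are hypothetical.

```python
from typing import Dict, List


def effective_lr(lr: float, weights: List[float]) -> float:
    """ELR of a scale-invariant layer: lr / ||w||^2 (assumed definition)."""
    return lr / sum(w * w for w in weights)


def elr_ratios(lr: float, layers: Dict[str, List[float]]) -> Dict[str, float]:
    """ELR of each layer relative to the layer with the smallest ELR."""
    elrs = {name: effective_lr(lr, w) for name, w in layers.items()}
    smallest = min(elrs.values())
    return {name: elr / smallest for name, elr in elrs.items()}


# Two layers sharing one global learning rate but with different weight norms.
layers = {
    "small_norm": [0.6, 0.8],  # squared norm 1
    "large_norm": [3.0, 4.0],  # squared norm 25
}
ratios = elr_ratios(0.1, layers)
```

With a shared global learning rate, the layer whose weights have the smaller norm takes much larger effective steps. This disparity is what is described above as shrinking toward unity during training.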
In the μP and nGPT context, decay-coupled weight decay or explicit warm-up schedules (e.g., exponential, decay-away) can replicate—and in some cases improve upon—the stability guarantees originally attributed to width-dependent scaling (Kosson et al., 21 Oct 2025).
6. Practical Recommendations: Finetuning and Cross-domain Transfer
For cross-modal or sequential transfer, rigorous learning rate tuning is critical. Even minor misalignment between the pretraining and target task's optimal learning rate can invert empirical conclusions about model efficacy, as observed in cross-modal GPT-2 transfer (Rothermel et al., 2021). Best-practice protocols include log-scale sweeps over multiple orders of magnitude and reporting full sensitivity curves.
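A log-scale sweep of the kind recommended above can be generated with a small helper. The function name `log_sweep` and the range shown are illustrative assumptions, not values from the cited study.

```python
import math
from typing import List


def log_sweep(low: float, high: float, points_per_decade: int) -> List[float]:
    """Learning rates spaced evenly in log10 between low and high (inclusive)."""
    decades = math.log10(high / low)
    n_intervals = round(decades * points_per_decade)
    return [low * 10 ** (i / points_per_decade) for i in range(n_intervals + 1)]


# Three orders of magnitude, two candidates per decade.
grid = log_sweep(1e-5, 1e-2, points_per_decade=2)
```

Reporting the metric at every point of `grid`, rather than only at the best one, gives the full sensitivity curve.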
During sequential fine-tuning, per-layer or per-block learning rate distributions (non-monotonic, layerwise-optimized) outperform flat learning rates, reducing catastrophic forgetting. For BERT-base, Bayesian optimization over partitioned learning rate groups, followed by geometric averaging, produced learning rate distributions that generalize to improved performance on GLUE dataset-shift tasks (Kenneweg et al., 2024). In all normalized architectures, independent learning rate schedules per parameter group should be deployed, with global adjustments as task or scale dictates.
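The geometric-averaging step can be sketched as below. The sketch assumes the average is taken per parameter group across several separately optimized distributions. The group names and learning rates are hypothetical and are not results from the cited work.

```python
import math
from typing import Dict, List


def geometric_mean_distribution(distributions: List[Dict[str, float]]) -> Dict[str, float]:
    """Per-group geometric mean of several learning rate distributions.

    Every distribution must use the same group names.
    """
    groups = distributions[0].keys()
    count = len(distributions)
    return {
        group: math.exp(sum(math.log(d[group]) for d in distributions) / count)
        for group in groups
    }


# Hypothetical per-group learning rates found on two different tasks.
task_a = {"lower_layers": 1e-5, "upper_layers": 1e-4}
task_b = {"lower_layers": 1e-3, "upper_layers": 1e-4}
combined = geometric_mean_distribution([task_a, task_b])
```

A geometric mean is the natural choice for learning rates because they are compared on a log scale. Here the two lower-layer values, two orders of magnitude apart, average to the midpoint in log space.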
7. Impact of Normalization Style: ScaleNorm, FixNorm, and Beyond
Switching from LayerNorm to $\ell_2$-based normalization (ScaleNorm, FixNorm) further stabilizes gradient norms and enables larger, flat learning rates without warm-up. Empirically, PreNorm + FixNorm + ScaleNorm achieves both smoother gradient norm traces and elevated BLEU scores in low-resource settings (Nguyen et al., 2019). The relationship between depth and the learned scale parameters $g$ becomes linear and predictable, facilitating mapping of existing learning rate schedules onto the new normalization regime. Specific conversion formulas, such as adjusting the learning rate according to the observed ratio of pre- and post-normalization gradient norms, ensure preservation of effective step sizes in practice.
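For reference, ScaleNorm replaces the per-feature gain and bias of LayerNorm with a single learned scalar applied to the $\ell_2$-normalized input. The sketch below is a minimal pure-Python version. The `eps` clamp and the example values are assumptions for illustration.

```python
import math
from typing import List


def scale_norm(x: List[float], g: float, eps: float = 1e-5) -> List[float]:
    """ScaleNorm: rescale x to have l2 norm g, i.e. g * x / ||x||.

    g is a single scalar (learned in practice); eps guards against division by zero.
    """
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / max(norm, eps) for v in x]


x = [3.0, 4.0]            # l2 norm is 5
y = scale_norm(x, g=2.0)  # same direction, l2 norm 2
y_norm = math.sqrt(sum(v * v for v in y))
```

Because the output norm is fixed by the scalar alone, the activation scale at every layer is controlled by one parameter per normalization site.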
References
- "On Layer Normalization in the Transformer Architecture" (Xiong et al., 2020)
- "Decoupled Relative Learning Rate Schedules" (Ludziejewski et al., 4 Jul 2025)
- "Weight Decay may matter more than muP for Learning Rate Transfer in Practice" (Kosson et al., 21 Oct 2025)
- "On the Weight Dynamics of Deep Normalized Networks" (Mehmeti-Göpel et al., 2023)
- "Don't Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers" (Rothermel et al., 2021)
- "Super Consistency of Neural Network Landscapes and Learning Rate Transfer" (Noci et al., 2024)
- "Learning Rate Transfer in Normalized Transformers" (Shigida et al., 29 Apr 2026)
- "Intelligent Learning Rate Distribution to reduce Catastrophic Forgetting in Transformers" (Kenneweg et al., 2024)
- "Transformers without Tears