
Learning Rate Transfer in Normalized Transformers

Updated 2 May 2026
  • The paper presents a comprehensive analysis of how normalization techniques like LayerNorm and ScaleNorm affect gradient dynamics and stabilize learning rate transfer in Transformer models.
  • It demonstrates that advanced protocols such as Decoupled Relative Learning Rate Schedules (RLRS) and νGPT enable significant speedups and robust performance across varying model widths and depths.
  • The study highlights the practical need for adaptive, component-wise learning rate scaling and weight decay adjustments to maintain stability during cross-model transfers.

Learning rate transfer in normalized Transformers refers to the ability (or inability) to reuse optimal or near-optimal learning rate schedules, selected at one model scale or architecture variant, on larger models or across differing settings, without significant performance loss or instability. For Transformer architectures employing normalization techniques such as LayerNorm, ScaleNorm, or exact $\ell_2$-normalization, both the normalization placement and the hyperparameter scaling protocol can have a dramatic effect on learning rate transfer, optimization stability, and practical throughput at scale.

1. Theoretical Foundations: Gradient Dynamics and Normalization Placement

The sensitivity of Transformer optimization to learning rate schedules originates in the interaction between residual architectures, normalization, and gradient scaling at initialization. In the canonical Post-LN Transformer (layer normalization applied after the residual connection), mean-field analysis shows that the expected norm of gradients with respect to output-layer parameters is $O(d_v\,d\,\log d)$, substantially larger than in lower layers, where $d_v$ is the hidden dimension within the feed-forward block and $d$ is the model width. Applying a large learning rate in this case leads directly to instability unless the schedule includes a long warm-up phase. In contrast, modifying the architecture to Pre-LN (layer normalization inside each residual block) regularizes gradients to a uniform $O(d\,\log d / \sqrt{L})$, where $L$ is the number of layers, dramatically reducing early training instabilities and removing the necessity for elaborate warm-up (Xiong et al., 2020; Nguyen et al., 2019).
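The placement difference described above can be sketched in a few lines of plain Python. This is only an illustration: `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward block, and the normalization omits the learned gain and bias that a real LayerNorm carries.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_ln_block(x, sublayer):
    """Post-LN: normalize after adding the residual, LN(x + F(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input, x + F(LN(x))."""
    fx = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward block.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)  # output is re-normalized
pre_out = pre_ln_block(x, toy_sublayer)    # residual stream passes through untouched
```

In the Pre-LN variant the identity path from input to output is never normalized, which is the structural property the gradient analysis relies on.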

Experimental results support this theoretical framing. For instance, IWSLT’14 De→En and WMT’14 En→De machine translation tasks trained with Adam show that Pre-LN Transformers at $\eta_{\max}=5 \times 10^{-4}$ or $1.5 \times 10^{-3}$ and no warm-up match the performance of Post-LN baselines while requiring 30–40% fewer epochs. BERT pretraining yields similar results, eliminating the need for the customary 10k warm-up steps and achieving a 40% reduction in steps to a fixed validation loss (Xiong et al., 2020).

2. Parameterization Scaling and Learning Rate Transfer

Scaling hyperparameters to maintain stable optimization across depth and width has spawned a taxonomy of approaches centered on parameterization and alignment. The Maximal Update Parameterization (μP) posits that, to maintain constant "relative representation change" ($\| \Delta Y \| / \| Y \|$) as width $C$ is increased, the learning rate should scale as $1/C$ for hidden and output weights. More precisely, for a layer with weights […], […] when […]. This scaling is derived under the assumption of favorable geometric alignments of weights and gradients (i.e., alignment exponents […] and […]), which hold only in the early phases of training (Kosson et al., 21 Oct 2025).
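As a minimal sketch of the $1/C$ rule stated above, the helper below rescales a learning rate tuned at a base width to a wider model. The function name, the base learning rate `3e-3`, and the base width `256` are illustrative assumptions, not values from the cited work.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Scale a hidden/output-layer learning rate as 1/width relative to a tuned base model."""
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    return base_lr * base_width / width

# Hypothetical base configuration: lr tuned at width 256, transferred to wider models.
lrs = {w: mup_hidden_lr(3e-3, 256, w) for w in (256, 1024, 4096)}
```

Quadrupling the width divides the hidden-layer learning rate by four; embedding parameters are not covered by this helper.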

However, empirical evidence demonstrates that these assumptions collapse after a small number of epochs in practical large-scale training (e.g., LLaMA-style Transformers). After this initial "alignment window," the stability of learning rate transfer is preserved by independent scaling of the weight decay coefficient (to keep […] constant across width) rather than by parameterization or learning rate scaling alone. In effect, for most of training, it is weight decay, not μP, that enables robust transfer of learning rates across width and depth (Kosson et al., 21 Oct 2025; Noci et al., 2024).

3. Component-wise and Per-layer Learning Rate Schedules

Uniform learning rates across all Transformer weights can be suboptimal due to heterogeneous dynamics in submodules (embeddings, LayerNorm, attention, feed-forward layers, router/expert parameters for Mixture-of-Experts). The Decoupled Relative Learning Rate Schedules (RLRS) protocol decomposes the schedule, assigning each component […] a separate start and end multiplier ([…], […]) applied atop a global schedule (typically cosine). This allows targeted acceleration of components (e.g., higher embedding learning rates early, increasing router/expert rates over time in MoE). The multipliers can be tuned cheaply on small proxy models and then transferred without loss to models up to 27× larger (Ludziejewski et al., 4 Jul 2025).
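One way to picture per-component start and end multipliers on top of a global cosine schedule is the sketch below. It assumes the multiplier is interpolated linearly over training, which the text above does not specify; the component names, multiplier values, and `final_frac` are likewise hypothetical.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_frac=0.1):
    """Global cosine decay from peak_lr down to final_frac * peak_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cos)

def relative_multiplier(step, total_steps, start_mult, end_mult):
    """Assumed linear interpolation between a component's start and end multipliers."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_mult + (end_mult - start_mult) * progress

def component_lr(step, total_steps, peak_lr, start_mult, end_mult):
    """Per-component rate: global schedule times the component's relative multiplier."""
    return (cosine_lr(step, total_steps, peak_lr)
            * relative_multiplier(step, total_steps, start_mult, end_mult))

# Hypothetical (start, end) multipliers per component.
multipliers = {"embedding": (5.0, 1.0), "router": (0.2, 1.0), "attention": (1.0, 1.0)}
total = 1000
lr_at_start = {n: component_lr(0, total, 1e-3, s, e) for n, (s, e) in multipliers.items()}
lr_at_end = {n: component_lr(total, total, 1e-3, s, e) for n, (s, e) in multipliers.items()}
```

Here the embedding rate starts high and decays toward the global schedule, while the router rate starts low and ramps up, mirroring the two examples in the paragraph above.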

Empirically, RLRS yields up to 23% wall-clock speedup for MoE and dense Transformers and eliminates instabilities (such as MoE loss spikes) by initializing critical modules with low learning rates and ramping up. The protocol is robust across normalization choices (LayerNorm or RMSNorm), improved further by grouping all normalization parameters into a dedicated RLRS component (Ludziejewski et al., 4 Jul 2025).

4. Scaling Laws and Alignment Exponents: νGPT and Hyperparameter Invariance

The μP approach, and its recent refinement in νGPT, identifies a key limitation: explicit width scaling in initialization and learning rate does not ensure true transfer if the alignment exponents of weight-gradient pairs deviate from the idealized μP prediction. νGPT leverages measured alignment exponents between gradient updates and activations, which for normalized Transformers empirically cluster near […] ("mid-alignment") rather than […] (no alignment) or […] (full alignment).

Accordingly, νGPT scales block-linear and unembedding learning rates as […] and embeddings as […]. Moreover, it prescribes a […] decay when the token horizon (number of steps) increases, and modifies initialization for learned residual mixing parameters with depth. These exponents ensure that once a learning rate is optimized on a base model, it transfers without performance loss or instability to models with altered width, depth, or token horizon (Shigida et al., 29 Apr 2026).

Validation across width ([…]), depth ([…] up to 128), and training steps (up to 225k tokens) confirms that νGPT collapses the validation-loss vs learning-rate curves onto a universal optimum absent in the baseline nGPT. This demonstrates "lossless" hyperparameter transfer on normalized Transformers, with practical guidelines for learning rate scaling along all axes (Shigida et al., 29 Apr 2026).

5. Warm-Up Mechanisms, Effective Learning Rates, and Stability

Learning rate warm-up remains indispensable in settings where early gradient norms are poorly controlled, especially in Post-LN or models lacking RMSNorm/LayerNorm-based scale-invariance (Xiong et al., 2020). However, normalization-centric architectures and parameterizations such as PreNorm, ScaleNorm, and νGPT obviate warm-up by regularizing both forward activations and backward gradient norms.
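For concreteness, the difference between a warm-up schedule and a flat schedule can be written as a single function. This is a generic linear warm-up sketch, not the exact schedule used in any cited paper; the peak rate and step counts are hypothetical.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr over warmup_steps; warmup_steps <= 0 means a flat schedule."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

# Post-LN style: ramp up over 4000 steps. Pre-LN style: start at the peak rate.
post_ln_first_step = warmup_lr(0, 1e-3, 4000)
pre_ln_first_step = warmup_lr(0, 1e-3, 0)
```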

Mathematically, even with disparate initial layerwise effective learning rates (ELRs), normalization ensures that ELR ratios […] converge to unity quickly under a constant global learning rate (Mehmeti-Göpel et al., 2023). A hyperparameter-free "subcritical" warm-up scheme, which tunes the global learning rate at each step to the critical value […] (with […] over the sublayers with maximal ELRs), aligns all layers in […] steps (Mehmeti-Göpel et al., 2023).

In the μP and nGPT context, decay-coupled weight decay or explicit warm-up schedules (e.g., exponential, decay-away) can replicate—and in some cases improve upon—the stability guarantees originally attributed to width-dependent scaling (Kosson et al., 21 Oct 2025).

6. Practical Recommendations: Finetuning and Cross-domain Transfer

For cross-modal or sequential transfer, rigorous learning rate tuning is critical. Even minor misalignment between the pretraining and target task's optimal learning rate can invert empirical conclusions about model efficacy, as observed in cross-modal GPT-2 transfer (Rothermel et al., 2021). Best-practice protocols include log-scale sweeps over multiple orders of magnitude and reporting full sensitivity curves.
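A log-scale sweep of the kind recommended above might be generated as follows. The range `1e-5` to `1e-2` and the density of two points per decade are illustrative choices only.

```python
import math

def log_sweep(low, high, points_per_decade=2):
    """Return learning rates evenly spaced in log10 from low to high (inclusive)."""
    if not (0 < low < high):
        raise ValueError("require 0 < low < high")
    n = round(math.log10(high / low) * points_per_decade)
    return [low * 10 ** (i / points_per_decade) for i in range(n + 1)]

# Three orders of magnitude, two points per decade.
rates = log_sweep(1e-5, 1e-2, points_per_decade=2)
```

Running every candidate in `rates` for both the baseline and the transferred model, then reporting the full loss-versus-rate curve, is what guards against conclusions that hinge on a single learning rate.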

During sequential fine-tuning, per-layer or per-block distributions (non-monotonic, layerwise-optimized) outperform flat learning rates, reducing catastrophic forgetting. For BERT-base, Bayesian optimization over partitioned learning rate groups, followed by geometric averaging, led to BERT[…] distributions that generalize to improved performance on GLUE dataset-shift tasks (Kenneweg et al., 2024). In all normalized architectures, independent learning rate schedules per parameter group should be deployed, with projected global adjustments as task or scale dictates.
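The geometric-averaging step can be illustrated with a small helper that combines several per-group learning rate assignments in log space. It assumes the averaging is taken across separately optimized runs, which the text above does not spell out; the group names and values are hypothetical.

```python
import math

def geometric_mean_lrs(runs):
    """Average per-group learning rates across runs in log space."""
    groups = runs[0].keys()
    return {
        g: math.exp(sum(math.log(run[g]) for run in runs) / len(runs))
        for g in groups
    }

# Hypothetical per-group rates found by two separate optimization runs.
runs = [
    {"embeddings": 1e-5, "lower_layers": 1e-4, "upper_layers": 4e-4},
    {"embeddings": 1e-3, "lower_layers": 1e-4, "upper_layers": 1e-4},
]
averaged = geometric_mean_lrs(runs)
```

A geometric rather than arithmetic mean is the natural choice here because learning rates vary over orders of magnitude, so averaging in log space keeps one large value from dominating.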

7. Impact of Normalization Style: ScaleNorm, FixNorm, and Beyond

Switching from LayerNorm to $\ell_2$-based normalization (ScaleNorm, FixNorm) further stabilizes gradient norms and enables larger, flat learning rates without warm-up. Empirically, PreNorm + FixNorm + ScaleNorm achieves both smoother gradient norm traces and elevated BLEU scores in low-resource settings (Nguyen et al., 2019). The relationship between depth and learned scale parameters […] becomes linear and predictable, facilitating mapping of existing learning rate schedules onto the new normalization regime. Specific conversion formulas, such as adjusting the learning rate according to the observed ratio of pre- and post-normalization gradient norms, ensure preservation of effective step sizes in practice.
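A minimal sketch of $\ell_2$-based normalization with a single learned scale, in the spirit of ScaleNorm, is shown below. The parameter name `g` and the `eps` guard are illustrative assumptions; FixNorm and the learning rate conversion formulas are not covered.

```python
import math

def scale_norm(x, g, eps=1e-5):
    """Rescale x to L2 norm g; a single scalar g replaces per-feature gain and bias."""
    norm = math.sqrt(sum(v * v for v in x))
    return [g * v / max(norm, eps) for v in x]

out = scale_norm([3.0, 4.0], g=2.0)
out_norm = math.sqrt(sum(v * v for v in out))
```

Because every output vector has length `g` regardless of the input magnitude, the activation scale entering each sublayer is controlled by one scalar per normalization site.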
