Learning Rate Warm-up in Neural Networks
- Learning Rate Warm-up is a scheduling strategy that gradually increases the learning rate from a low initial value to a target value over a warm-up period, stabilizing early optimization and harnessing model plasticity.
- Various methods such as linear, piecewise-linear, exponential, and adaptive warm-ups are used to safely navigate high-curvature regions and improve convergence, as supported by theoretical frameworks like (H0,H1)-smoothness.
- Empirical studies show that warm-up improves early-stage training speed, enhances final generalization, and reduces training instability across modern deep learning architectures.
Learning rate warm-up is a widely deployed scheduling strategy in deep neural network optimization, designed to increase the learning rate from a low initial value to a target value over an initial phase of training. The goal is to stabilize early optimization dynamics, exploit model plasticity, and facilitate faster convergence—particularly in regimes involving large batch sizes, sharp loss landscapes, or continual pre-training. This article provides a comprehensive overview grounded in recent and foundational research on arXiv, covering theoretical rationales, algorithmic instantiations, empirical findings, and practical guidelines.
1. Mathematical Formulation and Schedule Variants
Warm-up strategies are characterized by the gradual ramping of the learning rate $\eta(t)$ from an initial value $\eta_{\text{init}}$ (often 0) to a target value $\eta_{\text{target}}$, typically via a linear or sub-exponential schedule over a fixed "warm-up length" $T_{\text{wu}}$ (measured in steps, epochs, or data units):
- Linear warm-up: $\eta(t) = \eta_{\text{target}} \cdot t / T_{\text{wu}}$ for $t \le T_{\text{wu}}$, then held constant or followed by decay (e.g., cosine, inverse-square-root) (Kalra et al., 13 Jun 2024, Gupta et al., 2023, Ma et al., 2019, Alimisis et al., 3 Oct 2025); a minimal sketch of these ramp shapes appears after this list.
- Piecewise-linear / double linear: two-phase linear ramps with an intermediate plateau (e.g., ramping to an intermediate rate $\eta_{\text{mid}}$ over $T_1$ steps, holding, then ramping up to $\eta_{\text{target}}$ by $T_{\text{wu}}$ steps), used in deep speech-to-text models (Gaido et al., 29 May 2025).
- Polynomial / exponential warm-up: $\eta(t) \propto (t/T_{\text{wu}})^{p}$ (polynomial) or an exponential function of $t/T_{\text{wu}}$, providing finer control over how quickly the rate rises early in training, which can be critical for extremely deep architectures (Gaido et al., 29 May 2025).
- Adaptive warm-up (theoretical): $\eta_t$ set as an explicit function of the instantaneous loss or curvature, e.g. $\eta_t \propto 1 / \bigl(H_0 + H_1 (f(\theta_t) - f^{*})\bigr)$ under $(H_0, H_1)$-smoothness (Alimisis et al., 3 Oct 2025), or more generally $\eta_t \propto 1 / \bigl(L_0 + L_1 \lVert \nabla f(\theta_t) \rVert\bigr)$ under $(L_0, L_1)$-smoothness assumptions (Liu et al., 9 Sep 2025).
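To make the ramp shapes above concrete, below is a minimal Python sketch of linear, double-linear, polynomial, and exponential (geometric) warm-up written as pure functions of the step count. The exponent `power`, the phase boundaries `t1`/`t2`/`t3`, and the floor `init_lr` are illustrative placeholders rather than values taken from the cited papers.

```python
def linear_warmup(step, warmup_steps, peak_lr):
    """Linear ramp from 0 to peak_lr over warmup_steps, then constant."""
    return peak_lr * min(1.0, step / warmup_steps)

def polynomial_warmup(step, warmup_steps, peak_lr, power=2.0):
    """Polynomial ramp; the exponent controls how fast the rate rises early on."""
    return peak_lr * min(1.0, (step / warmup_steps) ** power)

def exponential_warmup(step, warmup_steps, peak_lr, init_lr=1e-7):
    """Geometric ramp from init_lr to peak_lr (constant multiplicative growth)."""
    if step >= warmup_steps:
        return peak_lr
    return init_lr * (peak_lr / init_lr) ** (step / warmup_steps)

def double_linear_warmup(step, t1, t2, t3, mid_lr, peak_lr):
    """Two-phase linear ramp: 0 -> mid_lr over [0, t1], hold over [t1, t2],
    then mid_lr -> peak_lr over [t2, t3]."""
    if step <= t1:
        return mid_lr * step / t1
    if step <= t2:
        return mid_lr
    if step <= t3:
        return mid_lr + (peak_lr - mid_lr) * (step - t2) / (t3 - t2)
    return peak_lr
```

For example, `linear_warmup(500, 1000, 3e-4)` returns half the target rate.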
After the warm-up phase, most schedules proceed with a plateau/stable phase and ultimately a decay phase ("warmup-stable-decay" or WSD), per best practices observed in large-scale LLM training (Li et al., 23 Sep 2025, Liu et al., 6 Jul 2025).
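A compact sketch of such a warmup-stable-decay (WSD) schedule, composing a linear ramp with a constant plateau and a cosine decay; the 2% warm-up and 20% decay fractions are illustrative defaults, not prescriptions from the cited works.

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.02, decay_frac=0.2, min_lr=0.0):
    """Warmup-stable-decay (WSD): linear ramp, constant plateau, cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warm-up
    if step < decay_start:
        return peak_lr                                # stable plateau
    progress = min(1.0, (step - decay_start) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```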
2. Theoretical Mechanisms and Smoothness Conditions
Recent theory links the efficacy of warm-up to generalized notions of local curvature smoothness:
- $(H_0, H_1)$-smoothness: For a loss $f$, the local spectral norm of the Hessian is bounded as $\lVert \nabla^2 f(\theta) \rVert \le H_0 + H_1 \bigl(f(\theta) - f^{*}\bigr)$, which ensures curvature decreases reliably as the loss approaches its minimum (Alimisis et al., 3 Oct 2025). Under such a condition, step sizes can be safely increased as the model moves into flatter regions.
- $(L_0, L_1)$-smoothness: The Hessian satisfies $\lVert \nabla^2 f(x) \rVert \le L_0 + L_1 \lVert \nabla f(x) \rVert$, generalizing classical $L$-smoothness (Liu et al., 9 Sep 2025). This enables convergence-rate improvements: deterministic GD with warm-up achieves a provably faster decrease of the gradient norm than GD without warm-up, and faster convergence by a problem-dependent factor in certain convex cases.
- Valley–river loss landscapes & Mpemba effect: Analysis of two-timescale training (sharp "valley" directions vs. slow "river" directions) reveals a dynamical advantage to warming up into a higher LR plateau followed by decay, paralleling thermodynamic pre-heating phenomena (Liu et al., 6 Jul 2025).
These technical perspectives justify warm-up as a principled mechanism for safely crossing regions of high curvature and transient instability, allowing larger learning rates and more robust convergence.
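As a rough illustration of how these smoothness conditions translate into a warm-up-like step-size rule, the sketch below sets the learning rate inversely to the local curvature bound, so it grows automatically as the loss gap or gradient norm shrinks; the constants `h0`, `h1`, `l0`, `l1` are user-supplied assumptions, and the precise rules analyzed in the cited papers may include additional factors.

```python
def adaptive_lr_from_loss(loss_gap, h0, h1):
    """Step size ~ 1 / (H0 + H1 * (f(theta) - f*)) under (H0, H1)-smoothness:
    a large loss gap gives a small step; near the minimum the step approaches 1/H0."""
    return 1.0 / (h0 + h1 * loss_gap)

def adaptive_lr_from_grad(grad_norm, l0, l1):
    """Step size ~ 1 / (L0 + L1 * ||grad f(theta)||) under (L0, L1)-smoothness."""
    return 1.0 / (l0 + l1 * grad_norm)
```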
3. Mechanisms of Stabilization and Optimization
Warm-up predominantly stabilizes optimization via several tightly coupled mechanisms:
- Sharpness reduction and curvature adaptation: Warm-up steers models from unstable, high-curvature regions towards flatter domains where higher learning rates are tolerable. The "catapult mechanism" formally denotes loss spikes triggered when the learning rate exceeds the stability threshold $\eta > 2 / \lambda_{\max}(\nabla^2 L)$ (for SGD), followed by sharpness-reduction events (Kalra et al., 13 Jun 2024); a toy illustration of this threshold appears after this list.
- Controlling update magnitudes: For Adam-family optimizers, warm-up limits the magnitude of early parameter updates, counteracting bias-correction instabilities (Ma et al., 2019, Kosson et al., 31 Oct 2024).
- Mitigating gradient variance and SNR effects: Early high gradient SNR and limited "critical batch size" can result in large angular steps and drastic shifts in network representations; warm-up provides an implicit control mechanism by pacing the learning rate until SNR decays (Kosson et al., 31 Oct 2024).
- Avoiding early divergence: Both theory and empirical observations show that, without warm-up, an aggressively high $\eta_{\text{target}}$ leads to training breakdown, especially in architectures with deep nonlinearity or post-layer-norm configurations (Kalra et al., 13 Jun 2024).
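The stability threshold underlying the catapult picture can be illustrated on a toy quadratic: gradient descent is stable only while the learning rate stays below $2/\lambda_{\max}$ of the Hessian, and power iteration gives a cheap sharpness estimate (on a real network the same quantity can be estimated via Hessian-vector products). The synthetic Hessian below is purely didactic and not drawn from any of the cited experiments.

```python
import numpy as np

def top_hessian_eigenvalue(hessian, iters=100, seed=0):
    """Estimate lambda_max(H) (the 'sharpness') by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = hessian @ v
        v /= np.linalg.norm(v)
    return float(v @ hessian @ v)   # Rayleigh quotient at convergence

# Synthetic positive semi-definite "Hessian" for illustration only.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
H = A @ A.T / 50.0

sharpness = top_hessian_eigenvalue(H)
eta_critical = 2.0 / sharpness      # GD on a quadratic diverges for eta > 2 / lambda_max
print(f"sharpness ~ {sharpness:.3f}, critical learning rate ~ {eta_critical:.3f}")
```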
These mechanisms explain the near-universal adoption of warm-up across large-batch, LLM, speech, and vision applications.
4. Empirical Behavior and Quantitative Outcomes
Empirical studies consistently find that learning rate warm-up accelerates early-stage training, enables safer use of a larger peak $\eta_{\text{target}}$, and enhances final generalization at constant compute budgets:
- Continual pre-training in LLMs: Re-warming with a short linear ramp to $\eta_{\text{target}}$ outperforms both from-scratch training and naïve decay: final loss and perplexity are lower, adaptation to new domains is faster, and the process is notably robust to warm-up length (Gupta et al., 2023).
- AdamW and adaptive optimizers: Roughly 2000 steps of linear warm-up (about $2/(1-\beta_2)$ iterations for the default $\beta_2 = 0.999$) align with the predicted stationary update magnitude, matching RAdam's performance and eliminating the need for elaborate variance rectification (Ma et al., 2019); a one-line check of this rule follows the table below.
- Deep S2T convergence: Piecewise-linear and exponential schedules outperform linear and polynomial schedules in gradient stability and word error rate, especially for Conformer/Branchformer architectures near 1B parameters (see table below) (Gaido et al., 29 May 2025).
- Batch size and representation dynamics: Warm-up closes the gap between fast early progress and final loss in GPT-style models, eliminating ~0.1–0.2 validation loss disadvantage otherwise present with high batch size (Kosson et al., 31 Oct 2024).
- Lyapunov-based SGDM analysis: Warm-up combined with an increasing batch size achieves faster decay of the training loss and gradient norm than non-warm-up schedules, empirically converging 60–80 epochs earlier on ResNet-18/CIFAR-100 (Kondo et al., 5 Aug 2025).
| Schedule | Early Stability | Convergence Speed | Final Generalization |
|---|---|---|---|
| Linear warm-up | High | Fast | Robust |
| Piecewise-linear | Highest | Slightly slower | Marginally better |
| Exponential | Moderate | Fastest initial | Robust |
| No warm-up | Low | Divergent/unreliable | Poor |
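The ~2000-step figure quoted above for AdamW is consistent with an untuned warm-up length of roughly $2/(1-\beta_2)$ steps; a one-line check, assuming the default $\beta_2 = 0.999$:

```python
beta2 = 0.999
untuned_warmup_steps = 2.0 / (1.0 - beta2)   # ~ 2 / (1 - beta2) iterations
print(int(round(untuned_warmup_steps)))      # -> 2000 for beta2 = 0.999
```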
5. Warm-up Length, Shape, and Elimination
Multiple studies demonstrate that both the duration and detailed shape of warm-up exert limited influence on final performance once basic stability is assured:
- Warm-up length insensitivity: In LLM continual pre-training and kernel regression scaling law analysis, the specific length and ramp shape of the warm-up phase make little difference to final perplexity after several tens of billions of tokens (Gupta et al., 2023, Li et al., 23 Sep 2025).
- Sub-exponential schedules: Aggressive polynomial warm-ups can destabilize very deep S2T encoders; sub-exponential (exponential or two-phase linear) schedules should be preferred for critical stability (Gaido et al., 29 May 2025).
- Theoretical minimization: By estimating initial curvature (sharpness), one may select an initial learning rate close to the critical threshold $2/\lambda_{\max}$ and reduce or even eliminate warm-up (see GI-Adam and parameterization-dependent recommendations) (Kalra et al., 13 Jun 2024, Kosson et al., 31 Oct 2024).
- Adaptive criteria: Online estimation of sharpness, SNR, or gradient magnitude, as well as layerwise normalization (CLARS), can supplant the need for fixed warm-up phases (2002.01576, Kosson et al., 31 Oct 2024).
In practice, a short warm-up (1–5% of total steps) is sufficient for most settings, with adaptive schedules further enabling efficient training.
6. Practical Guidelines and Algorithmic Recipes
Synthesized from the empirical and theoretical literature, the following best practices are recommended (a minimal schedule-plus-optimizer sketch follows this list):
- Initiate a short warm-up (1–5% of total steps) to ramp from a low (or critical) initial $\eta$ up to the target $\eta_{\text{target}}$; longer ramps confer marginal additional benefit except in high-curvature regimes (Gupta et al., 2023, Li et al., 23 Sep 2025, Kosson et al., 31 Oct 2024).
- Follow warm-up by a plateau and then decay phase—cosine or inverse-square-root decay is common; total schedule should fit within compute or data budget constraints (Liu et al., 6 Jul 2025, Li et al., 23 Sep 2025).
- For Adam-family optimizers: default to linear warm-up over roughly $2/(1-\beta_2)$ iterations, or initialize with GI-Adam (the second-moment estimate seeded from the initial gradient) for automatic stabilization (Ma et al., 2019, Kalra et al., 13 Jun 2024).
- Adjust the peak learning rate as a control knob: a higher $\eta_{\text{target}}$ accelerates adaptation but increases forgetting; tune according to upstream/downstream performance priorities (Gupta et al., 2023).
- Monitor sharpness, SNR, and gradient-norm metrics: if instability is low, warm-up may be further shortened or omitted; in large-batch or high-SNR regimes, delay reaching the full $\eta_{\text{target}}$ accordingly (Kosson et al., 31 Oct 2024, 2002.01576).
- Advanced layer-wise or angular-step normalization methods can obviate warm-up in deeper, wider, or large-batch models (e.g., CLARS, LionAR, RRC correction) (2002.01576, Kosson et al., 31 Oct 2024).
- In continual pre-training/re-warm settings, always reset the LR schedule and use the final converged checkpoint; rewinding or skipping warm-up is generally suboptimal (Gupta et al., 2023).
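A minimal PyTorch-flavored sketch tying these recommendations together: AdamW with a short linear warm-up (here 2% of total steps, an illustrative choice) into a cosine decay, driven by a `LambdaLR` multiplier. The model, step counts, and peak learning rate are placeholder assumptions, and the actual forward/backward pass is omitted.

```python
import math
import torch

model = torch.nn.Linear(128, 10)          # placeholder model for illustration
total_steps = 100_000
warmup_steps = int(0.02 * total_steps)    # short warm-up, ~1-5% of total steps
peak_lr = 3e-4

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, betas=(0.9, 0.999))

def lr_lambda(step):
    """Multiplier on peak_lr: linear warm-up, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # forward pass and loss.backward() would go here
    optimizer.step()                      # no-op in this sketch: gradients are unset
    scheduler.step()
```

Swapping the post-warm-up branch of `lr_lambda` for an inverse-square-root or warmup-stable-decay shape recovers the other schedules discussed above.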
7. Open Questions and Future Directions
While warm-up is both theoretically justified and empirically essential in current practice, key ongoing areas of investigation include:
- Scaling laws for optimal warm-up length: The Functional Scaling Law (FSL) provides closed-form expressions for noise–time trade-offs in LLMs; optimal warm-up fractions are found to be vanishingly small for large-scale and hard tasks (Li et al., 23 Sep 2025).
- Curvature and catapult-triggered adaptive schedules: Ongoing research into real-time sharpness estimation, loss catapult frequency, and optimizer-initialization strategies may further minimize manual warm-up tuning (Kalra et al., 13 Jun 2024, Alimisis et al., 3 Oct 2025).
- Integration with batch size and optimizer dynamics: Coupling warm-up with dynamic batch-size schedules enhances both theoretical convergence rates and practical training efficiency (Kondo et al., 5 Aug 2025, 2002.01576).
The consensus of recent research is that learning rate warm-up, when correctly parameterized and adaptively applied, remains a robust solution for stabilizing and accelerating deep neural training across diverse regimes, with flexible implementations now supported by both theory and large-scale empirical evidence.