Learning Rate Warm-Up
- Learning rate warm-up is a scheduling approach that incrementally increases the learning rate from a low initial value to safely navigate high-curvature regions in the loss landscape.
- It stabilizes training by mitigating large early updates and enabling aggressive later-stage learning rates for both stochastic and adaptive optimizers.
- Variants such as linear, exponential, and piecewise schedules are widely used across language, vision, and large-scale pretrained models to improve convergence.
Learning rate warm-up is a fundamental scheduling strategy in modern deep neural network training, in which the learning rate is initialized at a small value and then gradually increased—often linearly or according to another schedule—over an initial phase of optimization before transitioning to a plateau or decay regime. This practice, now ubiquitous in large-scale supervised learning, language modeling, and vision applications, is motivated by both empirical stability and convergence considerations and has, over the past decade, become theoretically better understood through advances in nonconvex optimization and loss landscape analysis. The warm-up phase directly exploits non-uniform curvature and conditioning in deep network objectives, allowing the optimizer to traverse sharp, high-curvature regions near random initialization before adopting aggressive learning rates that accelerate progress toward flatter, better-conditioned solutions. A substantial body of research characterizes the mechanistic, empirical, and mathematical rationales for warm-up, and explains its variants, tradeoffs, and extensions.
1. Curvature, Conditioning, and the Mechanisms of Warm-Up
During early training, deep neural networks typically begin in regions of high loss and large Hessian eigenvalues—i.e., high curvature or "sharpness." Classical analysis, through either uniform $L$-smoothness or more general local curvature models, dictates that the maximal stable step size is limited by the largest Hessian eigenvalue $\lambda_{\max}$: under a quadratic approximation, instability (catastrophic divergence or parameter blowup) occurs if $\eta > 2/\lambda_{\max}$, and the effective threshold for Adam or preconditioned methods shifts accordingly (e.g., see (Kalra et al., 13 Jun 2024)). Warm-up mitigates this by gradually increasing the learning rate from near zero, so that the update size remains below the instability threshold while the optimizer navigates and (through so-called "catapult" events) actively reduces sharpness (the "sharpness reduction phase").
This mechanism is supported by phase diagrams that show improved reliability and safe use of higher target learning rates with warm-up, observed in both SGD and Adam settings. The transition can involve progressive sharpening, sharpness reduction, or near-constant sharpness, determined by model initialization and parameterization. In all cases, warm-up ensures the optimizer safely enters regimes where large learning rates promote exploration and accelerate loss reduction without causing divergence or instability. Notably, in adaptive optimizers (e.g., Adam), improper bias correction and initially high update magnitudes make warm-up even more critical for avoiding large, destabilizing steps (Ma et al., 2019, Kosson et al., 31 Oct 2024).
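The instability threshold is easy to verify on a quadratic, where $\lambda_{\max}$ is known exactly. A minimal sketch (the dimensions, eigenvalues, and schedule lengths are arbitrary illustrations, not taken from the cited work) showing fixed-step divergence above $2/\lambda_{\max}$ and stability under a linear warm-up that peaks below it:

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * w^T H w with one sharp direction.
# The largest Hessian eigenvalue sets the stability threshold eta < 2/lambda_max.
H = np.diag([100.0, 1.0])                        # lambda_max = 100 -> threshold 0.02
lam_max = np.linalg.eigvalsh(H).max()

def run_gd(schedule, steps=200, w0=(1.0, 1.0)):
    """Run gradient descent with a per-step learning rate schedule."""
    w = np.array(w0)
    for t in range(steps):
        w = w - schedule(t) * (H @ w)
        if np.linalg.norm(w) > 1e6:
            return t, float("inf")               # diverged
    return steps, 0.5 * w @ H @ w                # final loss

# A fixed step above 2/lambda_max = 0.02 blows up within a few dozen steps.
print(run_gd(lambda t: 0.025))

# A linear warm-up to a sub-threshold peak remains stable throughout.
eta_max, warmup_steps = 0.019, 50
print(run_gd(lambda t: eta_max * min(1.0, (t + 1) / warmup_steps)))
```

On a pure quadratic the threshold is static, so warm-up only avoids crossing it; the catapult and sharpness-reduction dynamics described above require the non-quadratic structure of real networks.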
2. Mathematical Foundations and Generalized Smoothness
Recent theoretical advances have established that the benefits of warm-up are tightly linked to a relaxation of classical smoothness assumptions. Instead of requiring global $L$-smoothness, several works postulate that the local curvature grows with the current loss suboptimality. For example, (Liu et al., 9 Sep 2025) introduces a generalized smoothness model in which the Hessian norm is bounded by a nondecreasing function $\ell$ of the suboptimality,
$$\|\nabla^2 f(x)\| \le \ell\big(f(x) - f^*\big),$$
while (Alimisis et al., 3 Oct 2025) uses a linear-in-loss form:
$$\|\nabla^2 f(x)\| \le L_0 + L_1\big(f(x) - f^*\big).$$
Empirical estimates support this model for both vision models and LLMs: the local Hessian norm decays roughly linearly with the loss during training, except for an initial transient.
The optimal learning rate thus increases as training progresses and the objective becomes smoother—a fact exploited directly by warm-up. In this context, warm-up can be understood as a dynamically scheduled step size that adapts to locally decreasing curvature. Complexity results show that, under these assumptions, gradient descent with warm-up provably converges faster than with any fixed step size, especially when starting far from optimality (Liu et al., 9 Sep 2025, Alimisis et al., 3 Oct 2025).
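To make this concrete, consider a one-dimensional toy objective whose second derivative grows linearly with its suboptimality, so that the curvature-matched step size $\eta_t = 1/\big(L_0 + L_1(f(w_t) - f^*)\big)$ traces out a warm-up on its own. The objective and constants below are illustrative choices, not drawn from the cited papers:

```python
import math

# Hypothetical objective: f(w) = cosh(w) - 1 has f''(w) = cosh(w) = f(w) + 1,
# i.e., curvature grows linearly with suboptimality (f* = 0), matching the
# linear-in-loss smoothness model with L0 = L1 = 1.
def f(w):
    return math.cosh(w) - 1.0

def grad(w):
    return math.sinh(w)

L0, L1 = 1.0, 1.0
w = 6.0                                  # far from optimum -> very sharp region
for t in range(40):
    eta = 1.0 / (L0 + L1 * f(w))         # step size adapts to local smoothness
    w -= eta * grad(w)
    if t % 10 == 0:
        print(f"t={t:3d}  loss={f(w):10.4f}  eta={eta:.4f}")
# eta starts near 1/cosh(6) ~ 0.005 and rises toward 1.0 as the loss falls:
# a warm-up emerges from the curvature model rather than from a hand schedule.
```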
3. Design and Empirical Comparison of Warm-Up Schedules
Warm-up schedules appear in many variants:
- Linear warm-up: $\eta_t = \eta_{\max}\, t / T_w$ for $t \le T_w$, where $T_w$ is the warm-up length.
- Exponential warm-up: $\eta_t = \eta_{\max}\big(1 - e^{-(1-\beta_2)\,t}\big)$, with the timescale matched to the decay horizon $(1-\beta_2)^{-1}$ of Adam's second-moment estimate (Ma et al., 2019).
- Double or piecewise linear warm-up: Increasing the learning rate in two linear phases, from zero to an intermediate value and then up to the peak (Gaido et al., 29 May 2025).
- Sub-exponential warm-up: an exponential-style ramp with a shape parameter controlling the rate of increase (Gaido et al., 29 May 2025).
- Warm restart cycles: As in SGDR, repeatedly annealing the learning rate from a high maximum to a minimum via cosine annealing and then restarting, supporting exploration and enabling snapshot ensembling (Loshchilov et al., 2016).
- Warmup–stable–decay (WSD): Linear or exponential ramp-up, constant plateau, and scheduled decay (often cosine or exponential), prevalent in large-scale pretraining (Liu et al., 6 Jul 2025, Li et al., 23 Sep 2025).
A representative table of implementations:
| Schedule Type | Formula | Typical Usage Context |
|---|---|---|
| Linear warm-up | $\eta_t = \eta_{\max}\, t / T_w$ | Adam, LLMs, large-scale vision models |
| Exponential warm-up | $\eta_t = \eta_{\max}\big(1 - e^{-(1-\beta_2)\,t}\big)$ | Adaptive optimizers; theory-matching |
| Piecewise linear warm-up | Two linear ramps to an intermediate, then peak rate (Gaido et al., 29 May 2025) | Speech-to-text, deep encoder models |
| Cosine annealing | $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi\, t / T)\big)$ | SGDR, vision models |
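These variants reduce to step-indexed multipliers on the peak rate $\eta_{\max}$. A minimal, framework-agnostic sketch (function and parameter names are ours, not from the cited papers):

```python
import math

def linear_warmup(t, warmup_steps):
    """Linear ramp to 1.0 over warmup_steps, then constant."""
    return min(1.0, (t + 1) / warmup_steps)

def exponential_warmup(t, beta2=0.999):
    """Exponential warm-up in the style of Ma et al. (2019): ramps on the
    timescale 1/(1 - beta2) of Adam's second-moment moving average."""
    return 1.0 - math.exp(-(1.0 - beta2) * (t + 1))

def piecewise_linear_warmup(t, t1, t2, mid_frac=0.5):
    """Two linear phases: 0 -> mid_frac over [0, t1], mid_frac -> 1 over [t1, t2]."""
    if t < t1:
        return mid_frac * (t + 1) / t1
    if t < t2:
        return mid_frac + (1.0 - mid_frac) * (t - t1) / (t2 - t1)
    return 1.0

def cosine_annealing_with_restarts(t, period, eta_min_frac=0.0):
    """SGDR-style cosine annealing, restarting from the peak every `period` steps."""
    phase = (t % period) / period
    return eta_min_frac + 0.5 * (1.0 - eta_min_frac) * (1.0 + math.cos(math.pi * phase))

# Usage: multiply the peak learning rate by the schedule value at each step.
eta_max = 3e-4
for t in (0, 100, 500, 999, 5000):
    print(t, eta_max * linear_warmup(t, warmup_steps=1000))
```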
Empirical results consistently show that properly tuned warm-up schedules stabilize training (preventing divergence or gradient explosion), enable the safe application of high peak learning rates, accelerate initial loss reduction, and, on large datasets, yield better anytime performance. However, final performance is usually determined mostly by the post-warmup (plateau/decay) schedule and by longer training (Gaido et al., 29 May 2025).
4. Theoretical and Empirical Impact on Optimization Dynamics
Warm-up directly impacts several intertwined aspects of optimization in deep learning:
- Transition to Flatter Solutions: By inducing a loss "catapult" or sharpness reduction, warm-up prevents the optimizer from being trapped in sharp, poorly conditioned minima. This mechanism is seen in phase diagrams, and is mathematically captured by the adaptation of the stability threshold with warm-up (Kalra et al., 13 Jun 2024).
- Mitigation of Large Early Updates: Warm-up limits the effective update size in parameter space, counteracting effects such as momentum bias corrections, high initial gradient SNR, and unstable changes in representations or weights. Directly controlling the $\ell_2$-norm of updates, their angular rotation, and the relative representation change (RRC) reduces the need for ad hoc warm-up (Kosson et al., 31 Oct 2024); a sketch of these diagnostics follows this list.
- Ensemble Formation & Exploration: Cyclic or restart-based schedules (e.g., SGDR) facilitate escape from local minima and enhance ensemble diversity by repeatedly traversing different regions of the loss surface (Loshchilov et al., 2016).
- Optimization in Ill-Conditioned Landscapes: The presence of sharp "valley/river" structures (directional separation of fast/slow curvature) explains, through analogy to the Mpemba effect, why warming up and plateauing at a high learning rate prior to decay can accelerate convergence of the slowest modes (Liu et al., 6 Jul 2025).
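The update-size quantities above can be monitored directly during training. A sketch of such per-step diagnostics for a single flattened weight tensor (metric names are ours; the precise definitions in (Kosson et al., 31 Oct 2024) may differ):

```python
import numpy as np

def update_diagnostics(w_prev, w_new):
    """Per-step update-size metrics for one flattened weight tensor.

    Warm-up implicitly keeps these quantities small early in training;
    controlling them directly can reduce the need for an explicit warm-up."""
    dw = w_new - w_prev
    rel_update = np.linalg.norm(dw) / (np.linalg.norm(w_prev) + 1e-12)
    cos = np.dot(w_prev, w_new) / (
        np.linalg.norm(w_prev) * np.linalg.norm(w_new) + 1e-12)
    rotation_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return {"rel_update": rel_update, "rotation_deg": rotation_deg}

# Example: a large step both rescales and rotates the weights substantially.
rng = np.random.default_rng(0)
w = rng.normal(size=1024)
print(update_diagnostics(w, w - 0.5 * rng.normal(size=1024)))
```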
5. Extensions: Warm-Up in Adaptive, Layerwise, and Continual Training
In adaptive optimizers such as Adam, warm-up is foremost a means to counteract initially large, highly irregular update magnitudes stemming from bias correction and adaptive scaling—requiring longer or specifically tuned warm-up periods relative to SGD (Ma et al., 2019). In large batch settings, alternative approaches such as Complete Layer-wise Adaptive Rate Scaling (CLARS) use extra normalization terms to adaptively control per-layer learning rates, thereby avoiding instability without explicit warm-up (2002.01576).
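CLARS extends the layer-wise adaptive rate scaling of LARS, in which each layer's step is rescaled by a trust ratio between its weight and gradient norms so that no single layer takes an outsized update. The sketch below shows only the base LARS-style ratio, not the additional normalization that defines CLARS:

```python
import numpy as np

def lars_trust_ratio(w, g, weight_decay=0.0, trust_coef=0.001, eps=1e-9):
    """LARS-style per-layer scaling factor (a sketch of the general idea;
    CLARS (2002.01576) adds a further normalization not shown here)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g + weight_decay * w)
    if w_norm > 0 and g_norm > 0:
        return trust_coef * w_norm / (g_norm + eps)
    return 1.0

def layerwise_sgd_step(params, grads, base_lr=0.1):
    """Scale the global rate per layer, stabilizing large-batch training
    without an explicit warm-up phase."""
    return [w - base_lr * lars_trust_ratio(w, g) * g
            for w, g in zip(params, grads)]
```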
In continual pre-training, "re-warming" the learning rate is essential when starting from a converged checkpoint, as adaptation to new data may otherwise stagnate. However, the exact length of re-warmup is less critical compared with the adjustment of the maximum learning rate—a larger warm-up/peak learning rate boosts plasticity and downstream adaptation but can also induce catastrophic forgetting of upstream data (Gupta et al., 2023).
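Operationally, re-warming amounts to attaching a fresh ramp-and-decay schedule when resuming from the checkpoint. A minimal sketch with purely illustrative step counts and rates:

```python
def rewarm_then_decay(t, warmup_steps=1000, total_steps=100_000,
                      eta_peak=1e-4, eta_end=1e-5):
    """Learning rate for continual pre-training from a converged checkpoint:
    re-warm from ~0 to a new peak (restoring plasticity), then decay linearly.
    A higher eta_peak aids adaptation to new data but risks forgetting."""
    if t < warmup_steps:
        return eta_peak * (t + 1) / warmup_steps
    frac = (t - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_peak + frac * (eta_end - eta_peak)
```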
6. Alternative and Emerging Theoretical Frameworks
Several recent works offer alternative mechanistic explanations and new theoretical paradigms:
- Functional Scaling Laws (FSL): In the analysis of SGD dynamics under LLM-relevant regimes, the FSL framework decomposes the population risk into a deterministic bias term, a decaying gradient term, and a learning rate–dependent noise convolution. A WSD schedule (long plateau before sharp decay) is theoretically justified as it optimally balances rapid risk reduction and noise forgetting, outperforming direct decay and constant-rate schedules (Li et al., 23 Sep 2025); the WSD shape is sketched after this list.
- Optimal Decay without Warm-up: In some regimes (notably, large compute-optimal LLM pretraining), a simple linear decay-to-zero schedule (D2Z), possibly with a short or no warm-up, outperforms traditional warm-up plus cosine decay, explained via an EMA interpretation of AdamW balancing early bias reduction and late-stage variance averaging (Bergsma et al., 21 Feb 2025).
- Self-regulating Adapters: Algorithms such as D-Adaptation inherently embed a “warm-up” by adaptively scaling the effective learning rate based on dual-averaged gradient information, eliminating the need for externally designed warm-up phases (Defazio et al., 2023).
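For reference, the WSD shape favored by the FSL analysis (a long plateau followed by a short decay) can be written directly as a schedule; all step counts and rates below are illustrative placeholders:

```python
import math

def wsd_schedule(t, warmup=2000, stable_end=90_000, total=100_000,
                 eta_peak=3e-4, eta_end=3e-5):
    """Warmup-stable-decay: linear ramp, long constant plateau, short cosine decay."""
    if t < warmup:                                   # linear ramp
        return eta_peak * (t + 1) / warmup
    if t < stable_end:                               # constant plateau
        return eta_peak
    frac = (t - stable_end) / max(1, total - stable_end)
    return eta_end + 0.5 * (eta_peak - eta_end) * (1.0 + math.cos(math.pi * frac))
```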
7. Limitations, Practical Recommendations, and Open Challenges
Current evidence indicates that the primary benefit of learning rate warm-up is robustly enabling larger and more aggressive learning rates by steering the model through high-curvature regimes and stabilizing updates, rather than providing direct gains in final error or test accuracy. The design and tuning of warm-up schedules can be simplified: empirical studies show that schedule type (linear vs. exponential vs. piecewise linear) and precise length are often secondary in importance to the control of update size and safe traversal of sharp loss regions (Gupta et al., 2023, Gaido et al., 29 May 2025). Layerwise or curvature-aware tuning can sometimes eliminate the need for warm-up altogether (2002.01576, Roulet et al., 8 Jul 2024).
A plausible implication is that improved optimizer initialization (e.g., GI-Adam), adaptive, curvature-matched scheduling, or built-in mechanisms controlling the effective representation change may reduce or replace the need for explicit warm-up (Kalra et al., 13 Jun 2024, Kosson et al., 31 Oct 2024). Nonetheless, when training extremely overparameterized or ill-conditioned models, or in high-batch, high-momentum regimes, warm-up remains practically indispensable.
Further research directions involve (i) automating schedule adaptation based on real-time curvature or stability estimates, (ii) extending theoretical complexity bounds to realistic nonconvex, stochastic regimes matching deep learning practice, and (iii) reconciling warm-up scheduling with large-scale scaling laws and compute optimality frameworks in LLMs and multimodal models.
In summary, learning rate warm-up is now solidly grounded as a principled—rather than ad hoc—component in the optimization of deep neural networks. Its core function is to adapt update magnitudes to the non-uniform curvature of the loss surface, safeguarding against instability and unlocking the efficiency of high learning rates. Theoretical, empirical, and practical evidence converge on its necessity for robust and efficient training in both conventional and frontier model architectures.