Learning Rate Decay in LLM Pre-training
- Learning rate decay in LLM pre-training is a strategy that gradually reduces the optimizer’s step size to balance rapid initial improvements with stable convergence.
- Empirical studies reveal that schedules like Linear D2Z and Warmup-Stable-Decay can lower perplexity and achieve up to 60% compute savings under known training budgets.
- Adaptive methods and model averaging allow flexible emulation of decay benefits, enhancing downstream performance without traditional decay limitations.
Learning rate decay is a central technique for managing optimization dynamics during LLM pre-training. It aims to balance the need for rapid initial progress and robust convergence by manipulating the step size of the optimizer over the course of training. Multiple theoretical frameworks, empirical studies, and recent algorithmic advances have refined the understanding and practice of learning rate decay in LLM pre-training. This article synthesizes the main paradigms, theoretical justifications, modern alternatives—including adaptive and decay-free methods—their impact on training and downstream tasks, and best practices grounded in recent arXiv literature.
1. Paradigms and Schedule Taxonomy
The dominant learning rate decay paradigms in LLM pre-training reflect attempts to approximate optimal bias-variance tradeoffs for stochastic optimization and to manage the nonconvex structure of neural loss landscapes:
- Cosine and Step Decay: The learning rate is annealed along a fixed curve (e.g., cosine, linear, multi-step), typically after a brief linear warmup phase. For cosine, with peak LR $\eta_{\max}$, floor $\eta_{\min}$, and post-warmup budget $T$: $\eta(t) = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\bigl(1 + \cos(\pi t / T)\bigr)$.
- Linear Decay-to-Zero (D2Z): The learning rate is decreased linearly from its peak value $\eta_{\max}$ to zero post-warmup. D2Z consistently outperforms 10× cosine decay (cosine decay to 10% of the peak) and step decay in large-scale empirical studies, yielding lower pre-training loss and substantial compute savings when the pre-training budget is known (Bergsma et al., 21 Feb 2025).
- Warmup-Stable-Decay (WSD): Training proceeds through warmup, a long constant plateau (“stable”) at the peak LR $\eta_{\max}$, and a fast decay near the end. WSD and its variants match or outperform budget-tuned cosine schedules and allow checkpointing at flexible positions (Wen et al., 2024).
- Power-Law and Intrinsic Optimal Schedules: Theoretically optimal decay is either a sharp power-law tail or a WSD profile depending on task and model exponents, as formalized via the Functional Scaling Law (FSL) framework (Li et al., 23 Sep 2025, Li et al., 6 Feb 2026).
- Constant LR with Model Averaging: Instead of decaying the learning rate, checkpoint merging (uniform or weighted) simulates the effect of decay in a “decay-free” regime, providing flexibility and often improving downstream/fine-tuned task performance (Tian et al., 23 Jul 2025, Luo et al., 24 Nov 2025).
- Infinite/Indefinite Schedules: “Infinite” cosine or inverse-sqrt schedules eschew a fixed training budget, enabling seamless continuation or continual pre-training without losses due to LR re-warming (Singh et al., 4 Mar 2025, Ibrahim et al., 2024).
A summary table of canonical schedule formulas:
| Schedule | Formula (post-warmup) | Key Properties |
|---|---|---|
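| Cosine | $\eta(t) = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\bigl(1 + \cos(\pi t / T)\bigr)$ | Smooth decay; fixed budget |
| Linear D2Z | $\eta(t) = \eta_{\max}(1 - t/T)$ | Outperforms cosine when the budget $T$ is known |
| WSD | Stable then decay near end | Budget-agnostic branching |
| Constant+Merge | $\eta(t) = \eta_{\max}$; merge last checkpoints | Decay-free; maximizes adaptability |
| Infinite | Warmup, cooldown, constant plateau | Flexible continual training |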
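The first three rows of the table can be sketched as plain step-to-LR functions. This is a minimal illustration, not any cited paper's reference implementation: the function names, the linear warmup rule, and the linear shape of the WSD decay phase are assumptions made here for concreteness.

```python
import math


def _warmup(step: int, peak_lr: float, warmup_steps: int) -> float:
    # Assumed linear warmup: reaches peak_lr on the last warmup step.
    return peak_lr * (step + 1) / warmup_steps


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over the post-warmup budget."""
    if step < warmup_steps:
        return _warmup(step, peak_lr, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def linear_d2z_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear decay-to-zero: peak_lr at end of warmup, 0 at total_steps."""
    if step < warmup_steps:
        return _warmup(step, peak_lr, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - progress)


def wsd_lr(step, total_steps, peak_lr, warmup_steps=0, decay_frac=0.1):
    """Warmup, constant plateau, then a (here: linear) decay over the
    final decay_frac of training."""
    decay_steps = max(1, int(round(decay_frac * total_steps)))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return _warmup(step, peak_lr, warmup_steps)
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / decay_steps
```

Each function can be wrapped in a framework scheduler that accepts a per-step multiplier by dividing the returned value by `peak_lr`. Note that the cosine and D2Z variants need `total_steps` up front, whereas the WSD plateau does not depend on it until the decay phase begins.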
2. Theoretical Foundations
The mathematical rationale for decaying learning rates is articulated through continuous-time and SDE treatments of stochastic gradient descent:
- Functional Scaling Laws (FSL): Under general kernel-regression and SGD models, FSL expresses the evolution of risk/loss as a sum of approximation, full-batch, and stochastic-noise terms, where each term’s exponent is controlled by model capacity and task hardness. The stochastic-noise term encodes the effective gradient noise, coupling learning rate and batch size (Li et al., 23 Sep 2025, Li et al., 6 Feb 2026).
- Phase Transition: WSD vs. Power Decay: In the easy-task regime, the optimal LR schedule is a sharp power-law decay. For harder tasks, the minimax-optimal solution is a WSD profile: rapid warmup to the largest stable LR, a constant hold for almost all of training, then decay over a small final fraction of steps (Li et al., 6 Feb 2026).
- Loss-Landscape Perspective: WSD’s behavior is explained by the river-valley analogy: in a highly anisotropic (valley-river) loss landscape, a high stable LR enables rapid “down the river” progress (low-curvature), while final decay eliminates loss due to oscillations in “hill” directions (high-curvature). The Mpemba effect justifies using a high plateau, ensuring that convergence in flat directions accelerates during the decay phase (Liu et al., 6 Jul 2025, Wen et al., 2024).
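The river-valley intuition can be illustrated with a toy calculation. The sketch below is not the analysis from the cited papers: it assumes a two-direction quadratic loss (one low-curvature "river" direction, one high-curvature "hill" direction) with additive gradient noise, for which the expected squared coordinate follows the exact recursion $m_{t+1} = (1 - \eta_t h)^2 m_t + \eta_t^2 \sigma^2$. All curvatures, noise levels, and learning rates are illustrative values chosen here.

```python
def expected_loss(lrs, curvatures=(0.01, 1.0), x0_sq=(100.0, 1.0), noise_var=0.01):
    """Expected final loss of noisy SGD on sum_i 0.5 * h_i * x_i^2.

    For each direction, m tracks E[x^2] under the update
    x <- x - lr * (h * x + noise), with noise variance noise_var.
    """
    total = 0.0
    for h, m in zip(curvatures, x0_sq):
        for lr in lrs:
            m = (1.0 - lr * h) ** 2 * m + lr ** 2 * noise_var
        total += 0.5 * h * m
    return total


T = 1000            # total steps (illustrative)
decay_steps = 100   # final 10% of training

const_high = [0.5] * T
const_low = [0.05] * T
# WSD-like: hold the high LR, then decay linearly toward zero at the end.
wsd = [0.5] * (T - decay_steps) + [
    0.5 * (decay_steps - k) / decay_steps for k in range(decay_steps)
]

loss_high = expected_loss(const_high)
loss_low = expected_loss(const_low)
loss_wsd = expected_loss(wsd)

print(f"constant high LR: {loss_high:.2e}")
print(f"constant low LR:  {loss_low:.2e}")
print(f"WSD-like:         {loss_wsd:.2e}")
```

In this toy setting the low constant LR makes little progress along the river direction, the high constant LR progresses quickly but retains a noise floor in the hill direction, and the WSD-like schedule keeps the fast progress while the final decay removes most of that noise floor.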
3. Adaptive and Decay-Free Methods
The rigidity of traditional decay schedules can be mitigated or replaced by dynamic adaptation or checkpoint merging:
- Adaptive Learning Rate Search (AdaLRS): AdaLRS tunes the LR online by optimizing loss descent velocity, estimated as a sliding-window least-squares slope of the loss. An upscaling factor and a downscaling factor multiply or divide the current LR depending on whether the observed slope improves. The approach provably converges to the optimal LR and corrects mis-specified LRs within 10–40% of training, saving training steps relative to standard schedules (Dong et al., 16 Jun 2025).
- Checkpoint Merging (WSM): Any monotonically decreasing decay profile can be emulated by forming a weighted average of recent model checkpoints. Merge duration is the dominant hyperparameter. Mean or “1-sqrt” weighting outperforms exponential moving average, and WSM matches or exceeds WSD and cosine decay on multiple LLM benchmarks (Tian et al., 23 Jul 2025).
- Model Averaging in Curriculum Schedules: In curriculum-based LLM pre-training, aggressive LR decay nullifies the value of placing high-quality data late. Averaging the last few checkpoints under a constant LR recovers significant downstream benchmark gains (+0.74%–1.64%) (Luo et al., 24 Nov 2025).
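Checkpoint merging as described in the bullets above amounts to a weighted average of parameter tensors saved under a constant LR. The following is a minimal sketch under simplifying assumptions: checkpoints are represented as plain dictionaries of flat float lists, and `merge_checkpoints` is a hypothetical helper name, not an API from the cited work. Uniform weights give the mean merge; other weightings can be passed explicitly, though the exact form of the “1-sqrt” weighting is not reproduced here.

```python
from typing import Dict, List, Optional, Sequence

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def merge_checkpoints(
    checkpoints: Sequence[Checkpoint],
    weights: Optional[Sequence[float]] = None,
) -> Checkpoint:
    """Return the weighted average of the given checkpoints.

    With weights=None every checkpoint counts equally (mean merge).
    Weights are normalized to sum to 1 before averaging.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = float(sum(weights))
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(reference))
        ]
    return merged


# Usage: average the last three checkpoints of a constant-LR run.
recent = [
    {"layer.weight": [1.0, 2.0]},
    {"layer.weight": [3.0, 4.0]},
    {"layer.weight": [5.0, 6.0]},
]
mean_merge = merge_checkpoints(recent)
```

The number of checkpoints passed in plays the role of the merge duration, which the text above identifies as the dominant hyperparameter.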
4. Empirical Findings and Schedule Comparisons
Recent large-scale studies and benchmarks delineate the regimes where each schedule excels:
- Linear D2Z: Outperforms 10× cosine decay at scale, yielding lower perplexities and up to 60% compute savings at compute-optimal token-per-parameter (TPP) budgets (Bergsma et al., 21 Feb 2025).
- WSD and WSD-S: WSD consistently matches “oracle” cosine decay tuned to each pre-training budget, while WSD-S, which reuses branches, further improves loss over WSD in multi-budget runs (Wen et al., 2024).
- WSM: On math, code, and professional benchmarks, constant LR with merge-based annealing (WSM) achieves +2–5% gains over WSD, and post-fine-tuning improvements transfer downstream (Tian et al., 23 Jul 2025).
- No-Decay Schedules for Fine-Tuning: Recent work shows that omitting decay during pre-training (WSO: Warmup-Stable-Only schedule) yields flatter minima in the loss landscape, leading to superior downstream supervised fine-tuning (SFT) performance, even when pre-training validation loss is slightly worse (Yano et al., 17 Mar 2026).
- Continual and Curriculum Regimes: For continual pre-training, infinite schedules or quick re-warm/re-decay procedures (followed by replay) enable Chinchilla-scale LLMs to match full re-training using a fraction of the compute (Ibrahim et al., 2024, Singh et al., 4 Mar 2025). For data curricula, moderate decay or model averaging unlocks the benefit of ascending-quality ordering (Luo et al., 24 Nov 2025).
Performance comparison (selected results):
| Study | Schedule(s) Compared | Metric / Result |
|---|---|---|
| (Bergsma et al., 21 Feb 2025) | Linear D2Z vs. 10× cosine | D2Z achieves 1% lower loss, 60% less compute at scale |
| (Dong et al., 16 Jun 2025) | AdaLRS vs. tuned Cosine | AdaLRS matches/betters PPL, 30–50% fewer steps if LR mis-set |
| (Yano et al., 17 Mar 2026) | WSO (no decay) vs. others | WSO yields highest SFT scores despite higher pre-training loss |
| (Tian et al., 23 Jul 2025) | WSM (merge) vs. WSD | +3.5% MATH, +5.5% MMLU-Pro eval gains |
| (Luo et al., 24 Nov 2025) | Const+merge vs. decay | +0.74–1.64% benchmark gain with SMA/EMA merging |
5. Special Topics: Batch Size, Curriculum, and Continual Training
- Batch Size Scheduling (Seesaw): In variance-dominated regimes (Adam, large-scale LLMs), halving the LR is equivalent to increasing the batch size by a fixed multiplicative factor (a factor of 2 for SGD). Seesaw schedules combine batch ramps with LR decay, achieving the same loss curve with reduced wall-clock time (Meterez et al., 16 Oct 2025).
- Curriculum & Data Quality: Aggressively decaying LR interferes with quality-driven curricula by shrinking update magnitude precisely when the best data arrives. Solutions include moderating LR decay, model averaging, or constant LR (Luo et al., 24 Nov 2025).
- Continual Pre-Training & Infinite Schedules: Repeated LR re-warming upon data shifts leads to forgetting “spikes.” Infinite schedules—warmup → cooldown → plateau—avoid sharp re-adaptation and support seamless extension (Singh et al., 4 Mar 2025, Ibrahim et al., 2024, Wang et al., 2024).
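The LR/batch-size trade-off in the Seesaw bullet can be made concrete for the SGD case, where the text states that halving the LR corresponds to doubling the batch size. The sketch below assumes the common heuristic that SGD gradient-noise effects scale with the ratio LR / batch size; it does not implement the Seesaw algorithm itself, and `stage_schedules` is a hypothetical helper introduced only for illustration.

```python
import math


def stage_schedules(base_lr: float, base_batch: int, num_stages: int):
    """Build two per-stage (lr, batch_size) schedules for SGD.

    'decay' halves the LR at each stage with a fixed batch size;
    'ramp' keeps the LR fixed and doubles the batch size instead.
    Under the assumed lr / batch heuristic both have the same noise scale.
    """
    decay = [(base_lr / 2 ** k, base_batch) for k in range(num_stages)]
    ramp = [(base_lr, base_batch * 2 ** k) for k in range(num_stages)]
    return decay, ramp


def noise_scale(lr: float, batch_size: int) -> float:
    # Heuristic proxy for SGD gradient-noise strength (an assumption here).
    return lr / batch_size


decay, ramp = stage_schedules(base_lr=0.1, base_batch=256, num_stages=4)
for (lr_d, b_d), (lr_r, b_r) in zip(decay, ramp):
    print(f"decay: lr={lr_d:.4f} B={b_d:5d} | ramp: lr={lr_r:.4f} B={b_r:5d} "
          f"| noise {noise_scale(lr_d, b_d):.2e} vs {noise_scale(lr_r, b_r):.2e}")
```

Because the ramp variant processes more tokens per step in later stages, it needs fewer serial steps for the same token budget, which is the source of the wall-clock benefit that batch-size scheduling targets.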
6. Practical Guidelines and Future Directions
- Estimate the task-hardness and model-capacity exponents via power-law fitting on small runs. Use power decay in the easy-task regime and WSD in the hard-task regime (Li et al., 6 Feb 2026).
- Prefer D2Z where the training budget is known and over-training is planned (Bergsma et al., 21 Feb 2025).
- For maximum adaptability (fine-tuning/SFT), consider Warmup-Stable-Only (constant LR after warmup) (Yano et al., 17 Mar 2026).
- For curriculum or high-quality data late in training, use moderate decay or constant LR plus checkpoint/model averaging (Luo et al., 24 Nov 2025, Tian et al., 23 Jul 2025).
- For continual or versioned LLM training, apply learning rate path switching: high constant LR for checkpointing, decayed LR for new data “branches” (Wang et al., 2024).
- Tune the decay phase to 5–15% of training for WSD, exploit flexible branching (WSD-S), or abandon decay in favor of model merging/averaging where architecture and application permit (Wen et al., 2024, Tian et al., 23 Jul 2025).
Emerging trends suggest further theoretical refinement of optimal schedule shapes under nonconvex and adversarial pre-training, as well as the adoption of plug-in adaptive algorithms (e.g., AdaLRS) robust to mis-specified initial hyperparameters. The interplay between learning rate decay, batch size schedules, data quality curriculum, and model averaging remains an active area of optimization theory and LLM engineering.