Cosine Learning Rate Decay
- Cosine learning rate decay is a scheduling method that follows half a period of a cosine curve to smoothly reduce the learning rate from a high initial value to a lower final value.
- The approach achieves a balance between rapid early training progress and gradual convergence, with its behavior accurately modeled by frameworks like the Multi-Power Law.
- Practical implementation involves tuning parameters such as warmup duration and decay horizon, while its fixed-budget nature often spurs comparisons with more flexible alternative schedules.
Cosine learning rate decay is a widely used learning rate schedule in large-scale pretraining of deep neural networks. The schedule decays the learning rate along half a period of a cosine curve, providing a smooth transition from a high learning rate at the beginning of training to a lower value at the end. Its design balances rapid initial progress with slow, stable convergence, but it also introduces constraints and limitations that have driven the development of alternative schedules and analytical frameworks.
1. Definition and Functional Form
Cosine learning rate decay is defined mathematically as:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$
where:
- $\eta_{\max}$ is the initial (peak) learning rate,
- $\eta_{\min}$ is the final (floor) learning rate,
- $t$ denotes the current training step,
- $T$ is the total number of steps over which the decay is applied.
Typical practice is to precede this decay by a linear warmup period, during which the learning rate increases linearly from zero to $\eta_{\max}$ over the initial $T_{\text{warmup}}$ steps. At $t = T$, $\eta_t = \eta_{\min}$, generating a monotonic, smooth decay curve from peak to floor learning rate (Wen et al., 2024, Attia et al., 12 Mar 2025, Meterez et al., 3 Feb 2026).
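A minimal Python sketch of linear warmup followed by cosine decay may make the functional form concrete. The function name `cosine_lr` and its parameter names are illustrative, not taken from the cited works, and the sketch assumes that the decay step counter starts at the end of warmup.

```python
import math


def cosine_lr(step, decay_steps, peak_lr, floor_lr=0.0, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumption: the cosine decay begins once warmup ends, so the schedule
    spans warmup_steps + decay_steps steps in total.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from zero up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped so the rate stays
    # at the floor if training continues past the planned horizon.
    progress = min(1.0, (step - warmup_steps) / max(1, decay_steps))
    # Half a cosine period maps progress 0 -> peak and progress 1 -> floor.
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


# Example values (illustrative only): 100 warmup steps, 1000 decay steps.
schedule = [cosine_lr(s, 1000, 1e-3, 1e-4, warmup_steps=100) for s in range(1101)]
```

Clamping `progress` is a design choice of this sketch: it keeps the learning rate at the floor rather than letting the cosine rise again beyond the horizon.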
2. Theoretical and Empirical Properties
Cosine decay is motivated by the goal of enabling aggressive early progress (via high learning rates) while ensuring convergence and fine-tuning (via gradual annealing). Key theoretical results associated with cosine annealing include:
- Convergence Rates: For convex objectives, the final suboptimality satisfies $O\left(\rho^{1/(2p+1)}/\sqrt{T}\right)$, where $\rho$ quantifies learning rate misspecification and $T$ is the training step count. The sublinear $\rho^{1/(2p+1)}$ scaling (for degree-$p$ polynomial decay, as in cosine) reflects robustness absent from fixed-step methods, which degrade linearly with $\rho$ (Attia et al., 12 Mar 2025).
- Loss Curve Characterization: Empirical studies show that cosine decay yields loss curves consistent with a power law in the cumulative learning rate, with extra reductions induced by the schedule's continuous decrease (Luo et al., 17 Mar 2025). Predictive models such as the Multi-Power Law (MPL) can fit and forecast the loss trajectory under cosine decay with high accuracy.
Empirically, cosine decay matches or slightly underperforms recent alternatives (e.g., Warmup-Stable-Decay, constant-plus-cooldown) for fixed training budgets, especially when compared at their optimal hyperparameters (Wen et al., 2024, Bergsma et al., 21 Feb 2025, Meterez et al., 3 Feb 2026).
3. Comparison with Alternative Schedules
Recent literature contextualizes cosine decay within a broader ecosystem of learning rate schedules. Direct competitors include:
| Schedule Type | Regime | Horizon Dependency | Practical Remarks |
|---|---|---|---|
| Cosine Decay | Fixed-budget | Horizon-fixed | Requires the total step count $T$ in advance; not "anytime"; strong for single endpoint (Wen et al., 2024, Meterez et al., 3 Feb 2026) |
| Linear Decay-to-Zero (D2Z) | Fixed-budget | Horizon-fixed | Outperforms cosine for high-TPP (tokens-per-parameter) pretraining, robust to batch/scale (Bergsma et al., 21 Feb 2025) |
| Warmup-Stable-Decay (WSD) | Compute-agnostic | Horizon-free | Enables branching to any budget, reusable runs, marginal loss improvement (Wen et al., 2024) |
| Checkpoint Averaging (e.g., WSM, SWA) | Horizon-free | Horizon-free | Emulates or surpasses cosine via merging, flexible, less compute (Tian et al., 23 Jul 2025, Hägele et al., 2024) |
| $1/\sqrt{t}$, Const+EMA | Horizon-free | Horizon-free | Matches tuned-cosine across budgets with one hyperparameter set (Meterez et al., 3 Feb 2026) |
Cosine decay is suboptimal when repeated runs for scaling law studies or multi-budget checkpointing are desired. Schedules supporting post-hoc merging or anytime training alleviate these limitations, deliver similar final losses, and dramatically reduce compute requirements (Hägele et al., 2024, Meterez et al., 3 Feb 2026, Tian et al., 23 Jul 2025).
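To illustrate the horizon dependency column of the table, the following sketch contrasts a linear decay-to-zero schedule, which needs the total step count up front, with a Warmup-Stable-Decay schedule, whose stable phase does not depend on the final budget. The function names and the linear shape of the WSD decay phase are assumptions of this sketch; the cited works consider several decay shapes.

```python
def linear_d2z_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup, then linear decay to zero at `total_steps` (horizon-fixed)."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


def wsd_lr(step, peak_lr, warmup_steps, decay_start, decay_steps, floor_lr=0.0):
    """Warmup-Stable-Decay: the stable phase is independent of the final budget.

    Assumption: the decay phase is linear; `decay_start` can be chosen late,
    e.g. by branching from a checkpoint taken during the stable phase.
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: constant peak learning rate for any budget.
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (floor_lr - peak_lr) * progress


# Two budgets share the same stable phase; only the decay branch differs.
short_branch = [wsd_lr(s, 1e-3, 100, 800, 200) for s in range(1001)]
long_branch = [wsd_lr(s, 1e-3, 100, 1800, 200) for s in range(2001)]
```

In this sketch the first 800 steps of both branches are identical, which is what allows a single stable-phase run to be reused across budgets.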
4. Limitations and Trade-Offs
Cosine decay exhibits several structural limitations:
- Horizon Dependence and Non-Anytime Character: Cosine schedules require pre-specifying the total step count $T$ and perform optimally only at this horizon. Intermediate checkpoints underperform compared to separate, appropriately-sized cosine runs (Hägele et al., 2024, Meterez et al., 3 Feb 2026).
- Lack of Flexibility in Continual/Long-Horizon Regimes: In continual pretraining, repeated cosine restarts induce instability and catastrophic forgetting by injecting large gradient steps. Infinite-style schedules that forgo restarts and decays mitigate these effects (Singh et al., 4 Mar 2025).
- Computational Redundancy: Scaling law and checkpointing experiments under cosine decay require redundant full runs; constant+cooldown or weight averaging can reach equivalent minima using one run and post-hoc analysis (Hägele et al., 2024, Tian et al., 23 Jul 2025).
- Sensitivity to Learning Rate and Tuning: While cosine annealing offers robustness to coarse base learning rate tuning (Attia et al., 12 Mar 2025), optimal performance still requires grid search and schedule-dependent tuning, especially outside the high-TPP regime (Bergsma et al., 21 Feb 2025).
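The horizon dependence and the effect of restarts can be seen directly in the learning rate values. The sketch below uses illustrative peak and floor values and a simplified periodic restart without rewarmup; it only inspects the schedule and does not model training loss.

```python
import math


def cosine_lr(step, total_steps, peak_lr, floor_lr=0.0):
    """Cosine decay from peak to floor over `total_steps` (no warmup)."""
    progress = min(1.0, step / total_steps)
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


PEAK, FLOOR = 1e-3, 1e-4  # illustrative values only

# Horizon dependence: at step 5000, a run planned for 10000 steps is only
# halfway through its decay, while a run planned for 5000 steps has
# already annealed to the floor.
mid_of_long_run = cosine_lr(5_000, 10_000, PEAK, FLOOR)
end_of_short_run = cosine_lr(5_000, 5_000, PEAK, FLOOR)


def cosine_with_restarts(step, cycle_steps, peak_lr, floor_lr=0.0):
    """Simplified periodic restart: the schedule repeats every `cycle_steps`."""
    return cosine_lr(step % cycle_steps, cycle_steps, peak_lr, floor_lr)


# Restart jump: the rate goes from near the floor back to the peak in one step.
before_restart = cosine_with_restarts(4_999, 5_000, PEAK, FLOOR)
after_restart = cosine_with_restarts(5_000, 5_000, PEAK, FLOOR)
```

With these values the intermediate checkpoint of the longer run is still training at several times the floor learning rate, and the restart multiplies the step size by roughly the peak-to-floor ratio.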
5. Analytical Frameworks and Loss Curve Modeling
Predictive modeling of pretraining loss under cosine decay has advanced understanding and schedule optimization:
- Multi-Power Law (MPL): Loss at step $t$ is described as $L(t) = L_0 + A\,(S_1(t) + S_W)^{-\alpha} - LD(t)$, with $S_1(t)$ the sum of learning rates and $LD(t)$ additional reduction from the schedule's decay (Luo et al., 17 Mar 2025).
- Optimization in Schedule Space: Differentiating the MPL surrogate with respect to the full learning rate schedule allows direct search for schedules outperforming pure cosine. Empirically, the optimal schedule closely resembles a prolonged plateau (stable phase) followed by a power-law decay, slightly different from vanilla cosine (Luo et al., 17 Mar 2025).
- River-Valley Loss Landscape: Cosine and WSD performance are interpretable via a geometric picture where progress along the "river" direction is achieved with high learning rate (oscillatory), while decay serves to concentrate iterates near the loss basin (reducing the "hill" component) (Wen et al., 2024).
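As a rough illustration of why a power law in the cumulative learning rate needs an extra decay-induced term, the sketch below computes the running sum of learning rates for a cosine and a constant schedule and feeds it into a bare power law. The constants `L0`, `A`, and `ALPHA` are hypothetical placeholders rather than fitted values, and the decay-induced reduction described above is deliberately omitted.

```python
import math


def cosine_lr(step, total_steps, peak_lr, floor_lr=0.0):
    """Cosine decay from peak to floor over `total_steps` (no warmup)."""
    progress = min(1.0, step / total_steps)
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


def cumulative_lr(schedule, num_steps):
    """Running sum of learning rates; entry i sums steps 0 through i."""
    total, sums = 0.0, []
    for step in range(num_steps):
        total += schedule(step)
        sums.append(total)
    return sums


def power_law_term(cum_lr, l0, a, alpha):
    """Bare power law in the cumulative learning rate (no decay-induced term)."""
    return l0 + a * cum_lr ** (-alpha)


T, PEAK = 1_000, 1e-3
cos_sums = cumulative_lr(lambda s: cosine_lr(s, T, PEAK), T)
const_sums = cumulative_lr(lambda s: PEAK, T)

L0, A, ALPHA = 2.0, 0.5, 0.5  # hypothetical constants, not fitted to any data
cos_pred = power_law_term(cos_sums[-1], L0, A, ALPHA)
const_pred = power_law_term(const_sums[-1], L0, A, ALPHA)
```

Because the cosine schedule accumulates roughly half the learning rate sum of the constant schedule, the bare power law alone would rank it worse; the additional reduction attributed to the schedule's decay is what accounts for the annealing benefit in the full model.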
6. Practical Recommendations and Guidelines
Best practices for cosine decay and its competitors are established across works:
- Cosine Decay: Remains a strong choice for fixed-length, production-grade pretraining, provided the total step count $T$ (token budget) is pre-specified and both warmup and decay durations are tuned; commonly used settings are reported in (Wen et al., 2024, Hägele et al., 2024).
- Anytime or Multi-Checkpoint Regimes: Prefer WSD-S, WSM, or averaging methods (e.g., constant + cooldown, SWA, model merging) for efficiency and flexibility (Wen et al., 2024, Tian et al., 23 Jul 2025, Hägele et al., 2024, Meterez et al., 3 Feb 2026).
- Avoiding Cosine in Continual Learning: For continual (multi-phase) pretraining, avoid cosine restarts; use infinite schedules without rewarm or checkpoint pre-decay (Singh et al., 4 Mar 2025).
- Schedule Selection: For high-TPP LLM regimes, linear warmup plus decay-to-zero is empirically superior. For low-TPP or when the training horizon $T$ is unknown, inverse-sqrt or anytime (horizon-free) schedules are preferable (Bergsma et al., 21 Feb 2025, Meterez et al., 3 Feb 2026).
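The averaging methods recommended above replace learning rate annealing with an average over iterates. A generic exponential moving average (EMA) of parameters can be sketched as follows; this is a toy illustration with a hypothetical `ema_update` helper and a hand-made oscillating iterate, not the specific procedure of any cited work.

```python
def ema_update(avg_params, params, decay=0.999):
    """One EMA step: avg <- decay * avg + (1 - decay) * params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg_params, params)]


# Toy stand-in for training at a constant learning rate: the raw iterate
# keeps oscillating between -1 and +1 around an optimum at 0.
avg = [1.0]
for step in range(1, 201):
    raw = [1.0 if step % 2 == 0 else -1.0]
    avg = ema_update(avg, raw, decay=0.9)

# `avg` ends up much closer to 0 than the raw iterate, which mimics the
# noise reduction that a learning rate decay phase would otherwise provide.
```

Because the average can be read out at any step, this style of method does not require fixing the training horizon in advance.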
7. Historical and Methodological Context
The cosine learning rate schedule, originally formalized in SGDR (Loshchilov & Hutter, 2016), rose to prominence as the default for LLM pretraining. Its adoption is widespread in major model releases (e.g., LLaMA 3), driven by its empirical effectiveness and simplicity (Meterez et al., 3 Feb 2026). Subsequent research has focused on analytical characterizations, empirical benchmarking, and alternatives designed to address its intrinsic horizon dependency and inefficiency in multi-budget, continual, or scaling-law contexts (Wen et al., 2024, Tian et al., 23 Jul 2025, Hägele et al., 2024, Attia et al., 12 Mar 2025, Bergsma et al., 21 Feb 2025, Luo et al., 17 Mar 2025, Singh et al., 4 Mar 2025, Meterez et al., 3 Feb 2026).
A plausible implication is that, while cosine decay remains central to current practice, its monopoly is being eroded by flexible, analytically guided alternatives that permit compute re-use, schedule-free adaptation, and theoretically grounded robustness. The trend in state-of-the-art pretraining increasingly favors schedules compatible with horizon-agnostic training and post-hoc evaluation, especially in resource-constrained or scaling-oriented workflows.