Schedule-Free AdamW: Adaptive Optimization
- Schedule-Free AdamW is an adaptive optimization framework that eliminates explicit learning rate scheduling using iterate averaging, proximal methods, and generalized magnitude scaling.
- It provides rigorous convergence guarantees and robustness in high-dimensional settings, as evidenced by theoretical analyses and large-scale empirical benchmarks.
- The approach simplifies training pipelines by reducing hyperparameter tuning, enhancing memory efficiency, and offering a drop-in replacement for classical AdamW.
Schedule-Free AdamW is an umbrella term for adaptive optimization schemes derived from AdamW that eliminate the dependency on explicit learning rate scheduling. These methods replace, augment, or reinterpret learning rate schedules using theoretical advances (iterate averaging, proximal methods, masking, structure-aware adaptation), which provides robustness across problem scales and architectures—particularly for deep neural networks and LLMs. Schedule-Free AdamW encompasses innovations in algorithm design, convergence analysis, memory efficiency, and scaling laws, with empirical support from large-scale optimization benchmarks and theoretical connections to accelerated and averaged SGD variants.
1. Algorithmic Foundations of Schedule-Free AdamW
Classical AdamW is formulated as an adaptive gradient optimizer with decoupled weight decay, requiring a learning rate schedule that decays during training for stability and convergence. Schedule-Free AdamW algorithms forgo predefined decay phases or stopping times, instead integrating theoretical constructs such as:
- Online iterate averaging: As in Schedule-Free AdamW, the central updates are
$$y_t = (1-\beta)\, z_t + \beta\, x_t, \qquad z_{t+1} = z_t - \gamma\, g_t(y_t), \qquad x_{t+1} = (1 - c_{t+1})\, x_t + c_{t+1}\, z_{t+1},$$
where $z_t$ is the fast base iterate, $x_t$ the averaged iterate used at evaluation time, $y_t$ the gradient-evaluation point, $g_t(y_t)$ the (stochastic, adaptively preconditioned) gradient step at $y_t$, and $c_{t+1}$ is set, e.g., to $1/(t+1)$ or proportional to the squared learning rate, subsuming both momentum and averaging into a unified update. This removes the need for schedule design (Defazio et al., 24 May 2024), allowing the optimizer to perform robustly without foreknowledge of training duration; a runnable sketch follows this list.
- Proximal interpretation and scale-freeness: AdamW replicates the behavior of a first-order approximation to the proximal map for decoupled $\ell_2$ regularization; the weight decay is applied outside the adaptive gradient step. This ensures invariance to coordinate-wise gradient scaling, yielding updates of the form
$$x_{t+1} = (1 - \eta_t \lambda)\, x_t - \eta_t\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$
with $\hat m_t$, $\hat v_t$ the bias-corrected first and second moments and $\lambda$ the decoupled weight decay (Zhuang et al., 2022). The scale-freeness removes sensitivity to gradient magnitude and schedule-induced instability, improving robustness in layered models.
- Generalized magnitude scaling (Aida): By extending AdamW to track higher moments of the gradient (a generalized $p$-th moment in place of the usual second moment) and modifying the numerator with a $q$-powered first moment, the resulting update yields local stability for suitable choices of $(p, q)$ under a nonzero weight decay ($\lambda > 0$), thus freeing the optimizer from scheduled decay (Zhang et al., 2021).
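For concreteness, the following is a minimal NumPy sketch of the iterate-averaging update above wrapped around a simplified AdamW-style adaptive step. It is a sketch under simplifying assumptions (no warmup, weight decay applied to the fast iterate, plain bias correction) rather than the reference implementation; `adamw_schedule_free`, `grad_fn`, and the default hyperparameters are illustrative names and values.

```python
import numpy as np

def adamw_schedule_free(grad_fn, x0, lr=1e-3, beta=0.9, beta2=0.999,
                        weight_decay=1e-2, eps=1e-8, steps=1000):
    """Sketch of a schedule-free AdamW-style loop: gradients are taken at an
    interpolation y_t of the fast iterate z_t and the running average x_t,
    and no learning-rate schedule appears anywhere."""
    x = x0.astype(float).copy()   # averaged (evaluation) iterate
    z = x0.astype(float).copy()   # fast (base) iterate
    v = np.zeros_like(x)          # second-moment accumulator
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x                 # gradient-evaluation point
        g = grad_fn(y)
        v = beta2 * v + (1.0 - beta2) * g * g           # adaptive preconditioner
        step = g / (np.sqrt(v / (1.0 - beta2**t)) + eps)
        z = (1.0 - lr * weight_decay) * z - lr * step   # decoupled weight decay
        c = 1.0 / (t + 1)                               # averaging weight, e.g. 1/(t+1)
        x = (1.0 - c) * x + c * z                       # online iterate averaging
    return x

# Example: minimize a simple quadratic; grad_fn returns the gradient 2(w - 3).
x_final = adamw_schedule_free(lambda w: 2.0 * (w - 3.0), np.zeros(5), steps=2000)
```

The averaged iterate $x_t$ is the point evaluated at test time, which is why the loop returns $x$ rather than $z$.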
2. Convergence Theory and Stability Analysis
Schedule-Free AdamW variants leverage theoretical advances that relax or redefine learning rate constraints:
- Implicit norm constraints and KKT analysis: If AdamW runs with any non-increasing learning rate whose sum diverges, every convergent subsequence must converge to a KKT point of the constrained problem
$$\min_{x}\; f(x) \quad \text{s.t.} \quad \|x\|_\infty \le \tfrac{1}{\lambda}$$
(Xie et al., 5 Apr 2024). The optimizer's implicit bias leads to solutions within an $\ell_\infty$ ball, justified by the Frank-Wolfe connection and sign-GD asymptotics.
- Averaging equivalence and optimal rates: Schedule-Free optimizers unify averaging and scheduling, reproducing worst-case optimal rates by online-to-batch conversion (Defazio et al., 24 May 2024, Morwani et al., 4 Feb 2025). For convex objectives, this yields
$$f(x_T) - f(x_\star) \;\le\; O\!\left(\frac{D\,G}{\sqrt{T}}\right),$$
with $D$ the diameter of the domain and $G$ the Lipschitz constant, matching Polyak-Ruppert theory; a numerical check of the averaging equivalence follows this list.
- Dimension and iteration scaling: For high-dimensional networks, AdamW achieves a convergence guarantee on the $\ell_1$ norm of the gradient whose dimension dependence enters only through a $\sqrt{dC}$ factor (Li et al., 17 May 2025), with $d$ the parameter dimension, $C$ a problem-dependent constant, and $T$ the number of iterations. This bound is analogous to SGD's optimal scaling in the $\ell_2$ norm, up to the natural $\sqrt{d}$ factor from converting between the $\ell_1$ and $\ell_2$ norms, validating the schedule-free paradigm even as $d \to \infty$.
- Loss landscape tracking ("river" flow): Schedule-Free AdamW is shown empirically to follow the low-curvature "river" of the landscape, operating at the "Edge of Stability" without explicit decay phases or separate averaging copies (Song et al., 14 Jul 2025). The method's update implicitly averages iterates in the most important direction, automatically stabilizing training in large-scale scenarios.
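The averaging-equivalence point above can be checked numerically: with $c_{t+1} = 1/(t+1)$, the online update for $x_t$ reproduces the uniform Polyak-Ruppert average of the fast iterates. The stand-in iterates below are random and purely illustrative.

```python
import numpy as np

# With c_{t+1} = 1/(t+1), the online rule
#   x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}
# equals the uniform Polyak-Ruppert average of z_1, ..., z_T.
rng = np.random.default_rng(0)
z_history = rng.normal(size=(50, 3))    # stand-in "fast" iterates z_t

x = z_history[0].copy()
for t, z in enumerate(z_history[1:], start=1):
    c = 1.0 / (t + 1)
    x = (1.0 - c) * x + c * z

assert np.allclose(x, z_history.mean(axis=0))   # matches the batch average
```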
3. Empirical Results and Benchmarking
Experimental evidence across schedule-free designs demonstrates strong performance on vision, language, and recommendation models:
- Deep learning tasks: In (Zhang et al., 2021), Aida (with its generalized moment parameters suitably chosen) outperforms classical AdamW on Transformer-based translation and Swin-Transformer vision tasks, delivering higher validation accuracy and improved generalization. This is achieved without the need for fine-grained learning rate schedules.
- Memory-efficient scaling: APOLLO (Zhu et al., 6 Dec 2024) introduces a structured, low-rank approximation for AdamW's adaptive scaling, reducing optimizer state memory from $2mn$ to $2nr$ for an $m \times n$ weight matrix and rank $r \ll m$; a back-of-the-envelope comparison follows this list. APOLLO-Mini (rank-one) supports training LLaMA-7B models on a single 12GB GPU without schedule tuning, matching or exceeding AdamW's perplexity and convergence.
- Optimizers with enhanced convergence: AlphaAdam (Chang et al., 30 Jan 2025) uses intra-layer asynchronous masking and dynamic compensation (alpha scaling) to adapt update strength without learning rate scheduling. The method demonstrates improved convergence speed and lower computation on GPT-2 and RoBERTa fine-tuning.
- Algorithmic competitions: Schedule-Free AdamW forms the core of the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge winning entry, delivering state-of-the-art accuracy and anytime optimization (Defazio et al., 24 May 2024), with open-source implementation supporting integration into existing pipelines.
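To make the APOLLO memory claim concrete, here is a back-of-the-envelope comparison of the $2mn$ versus $2nr$ optimizer-state counts noted above; the layer shape and fp32 state dtype are illustrative assumptions, not figures from the paper.

```python
# Optimizer-state memory for one m x n weight matrix: full AdamW keeps first
# and second moments (2*m*n values), while a rank-r structured approximation
# in the spirit of APOLLO keeps roughly 2*n*r values.
def state_mib(m, n, r=None, bytes_per_val=4):
    vals = 2 * m * n if r is None else 2 * n * r
    return vals * bytes_per_val / 2**20

m, n = 4096, 11008                      # illustrative LLaMA-7B-like MLP shape
print(f"full AdamW : {state_mib(m, n):8.2f} MiB")
print(f"rank r=256 : {state_mib(m, n, r=256):8.2f} MiB")
print(f"rank r=1   : {state_mib(m, n, r=1):8.2f} MiB   # APOLLO-Mini-style")
```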
4. Structural Innovations and Variants
Several algorithmic architectures instantiate the schedule-free principle:
- Iterate averaging and momentum coupling: Vanilla Schedule-Free AdamW blends rapid updates with stability by interpolating between the fast iterate and the running average via a momentum-like mixing parameter, and is shown to recover optimal rates with a constant learning rate (Defazio et al., 24 May 2024).
- Refined variants for batch scaling: Decoupling the momentum coefficient from the averaging window by introducing an additional control parameter allows the averaging weights to be shaped independently of the momentum, yielding robustness to batch-size variation and better performance in large-batch regimes (Song et al., 14 Jul 2025); an illustrative weighting scheme is sketched after this list.
- Accelerated SGD connections: Schedule-Free AdamW and AdEMAMix are theoretically shown to resemble accelerated SGD variants under noise, with explicit connections between momentum and weight averaging coefficients (Morwani et al., 4 Feb 2025). The empirical superiority of AdEMAMix (and its simplified single-momentum version) in noisy regimes further validates the schedule-free approach.
- Nesterov momentum and curvature-adaptive steps: AdaPlus (Guan, 2023) integrates Nesterov-style momentum and AdaBelief-inspired curvature detection, producing learning rate adaptation that does not depend on schedules and does not add extra hyperparameters.
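As a sketch of the decoupling idea in the batch-scaling variant, the snippet below forms an online weighted average of fast iterates whose effective window is set by an exponent `kappa` that is separate from any momentum coefficient. The polynomial weighting and the name `kappa` are illustrative assumptions, not the exact scheme of (Song et al., 14 Jul 2025).

```python
import numpy as np

def weighted_average(z_history, kappa=0.0):
    """Online weighted average with weights w_t = t**kappa: kappa = 0 gives the
    uniform average, while larger kappa concentrates the average on recent
    iterates, independently of how the fast iterates were produced."""
    x = z_history[0].copy()
    wsum = 1.0
    for t, z in enumerate(z_history[1:], start=2):
        w = float(t) ** kappa
        wsum += w
        c = w / wsum                    # online form of the weighted mean
        x = (1.0 - c) * x + c * z
    return x

z = np.cumsum(np.random.default_rng(1).normal(size=(100, 2)), axis=0)
print(weighted_average(z, kappa=0.0))   # uniform (Polyak-Ruppert) average
print(weighted_average(z, kappa=4.0))   # leans toward recent iterates
```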
5. Scale-Freeness, Implicit Bias, and Regularization
Schedule-Free AdamW preserves, and in some cases enhances, AdamW's scale-invariance and implicit-bias properties:
- Scale-freeness: The AdamW update is shown to be invariant under coordinate-wise gradient rescaling, which stabilizes updates across layers with disparate scales (Zhuang et al., 2022); a numerical check follows this list. Empirically, methods applying explicit $\ell_2$ regularization inside the adaptive step (Adam-$\ell_2$) degrade under rescaling, whereas AdamW and its schedule-free variants maintain performance.
- Implicit bias toward sparsity and stability: Decoupled weight decay results in optimization within an $\ell_\infty$ norm ball, favoring solutions confined to controlled coordinate magnitudes (Xie et al., 5 Apr 2024). This mechanism can guard against overfitting and instability, especially in LLMs with massive parameter counts.
- Memory and computationally efficient regularization: Structured adaptation (APOLLO, AdaPlus, AlphaAdam) yields state-of-the-art performance at low memory cost, supporting larger batches, improved throughput, and broader access to experimentation on GPU-limited hardware (Zhu et al., 6 Dec 2024).
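The scale-freeness property in the first bullet can be verified directly: rescaling every gradient coordinate by a fixed positive factor leaves the AdamW-style step $m_t/(\sqrt{v_t} + \epsilon)$ essentially unchanged (exactly unchanged as $\epsilon \to 0$), since the first and second moments rescale together. The gradient stream below is random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=(200, 4))
s = np.array([1e-3, 1.0, 10.0, 1e3])    # per-coordinate rescaling factors

def adaptive_step(gs, eps=1e-12, beta1=0.9, beta2=0.999):
    """Final AdamW-style step direction m / (sqrt(v) + eps) after a stream of
    gradients; weight decay is applied outside this step, so it is unaffected."""
    m = np.zeros(gs.shape[1])
    v = np.zeros(gs.shape[1])
    for g in gs:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    return m / (np.sqrt(v) + eps)

print(adaptive_step(grads))             # original gradient stream
print(adaptive_step(grads * s))         # rescaled stream: same step direction
```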
6. Practical Considerations and Pipeline Integration
Schedule-Free AdamW techniques offer actionable advantages for model training workflows:
- Hyperparameter simplification: Removal of learning rate decay schedules and tuning reduces experiment complexity. The schedule-free approach works with a single base learning rate and standard AdamW hyperparameters (Defazio et al., 24 May 2024).
- Anytime optimization and Pareto tracking: Schedule-Free AdamW automatically tracks a Pareto frontier between loss and iteration, so models can be deployed at different training stages without re-running for alternate schedules.
- Compatibility: Algorithms are implemented as drop-in replacements for standard AdamW, requiring minimal pipeline modification, and benefit from open-source codebases for Schedule-Free, APOLLO, AlphaAdam, and AdaPlus; a usage outline follows this list.
- Robustness and generalization: Theoretical and empirical evidence demonstrates stability and generalization properties under constant learning rates, with empirical superiority in tasks lacking normalization layers, and scalability to extremely large batch and dataset sizes (Zhuang et al., 2022, Song et al., 14 Jul 2025).
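As a usage outline for the drop-in integration described above, the sketch below assumes the interface of the open-source `schedulefree` package (the `AdamWScheduleFree` class and its `train()`/`eval()` mode switches); argument names and defaults may differ across releases, so check the version you install. The model and data are dummies.

```python
import torch
import schedulefree  # pip install schedulefree (assumed interface)

model = torch.nn.Linear(128, 10)
data = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]

# Single base learning rate; no scheduler object anywhere in the loop.
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3,
                                           weight_decay=1e-2)

optimizer.train()                       # parameters set to the training point
for x, y in data:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()                    # no scheduler.step() call

optimizer.eval()                        # switch to the averaged point before
with torch.no_grad():                   # evaluation or checkpointing
    val_loss = torch.nn.functional.cross_entropy(model(data[0][0]), data[0][1])
```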
7. Limitations and Open Challenges
While schedule-free methods provide theoretical and practical benefits, some caveats exist:
- Weight decay dependence: Certain variants (e.g., Aida with its generalized moment settings) require nonzero weight decay for local stability, potentially limiting their use in settings where $\lambda = 0$ is preferred (Zhang et al., 2021).
- Batch-size sensitivity: Some forms couple momentum and averaging windows, resulting in suboptimal performance in large-batch or high-noise settings unless parameters are decoupled (Morwani et al., 4 Feb 2025, Song et al., 14 Jul 2025).
- Edge cases for convergence: Under extremely small noise variance, faster convergence rates than the standard schedule-free guarantees may be achievable, since those bounds are worst-case optimal only under noisy conditions (Li et al., 17 May 2025).
- Empirical tuning in absence of theoretical guidance: While default parameters provide robust performance, certain regimes (e.g., very large models or datasets) may still require empirical adjustment of base learning rates, momentum, or mixing parameters (Defazio et al., 24 May 2024, Song et al., 14 Jul 2025).
In summary, Schedule-Free AdamW integrates innovations in theoretical analysis, algorithmic structure, and empirical benchmarking to realize optimization schemes that are robust to schedule design, scalable across model and batch sizes, and compatible with modern deep learning pipelines. The approach subsumes advances in proximal interpretation, scale-freeness, implicit bias, and accelerated averaging, and is supported by open-source releases and comprehensive experimental evidence across vision, language, and recommendation tasks.