ScheduleFree+: Learning-Rate-Free Optimization

Updated 22 May 2026
  • ScheduleFree+ is a schedule-free, learning-rate-free optimization framework that simplifies training of large-scale language models and deep neural networks by eliminating manual tuning of learning rates.
  • It leverages online-to-batch conversion and Polyak averaging to deliver predictable, anytime convergence, outperforming traditional schedules in both speed and stability.
  • Empirical evaluations reveal that ScheduleFree+ achieves faster convergence and robust performance across diverse scales, from LLM pretraining to physical modeling tasks.

ScheduleFree+ is a schedule-free, learning-rate-free optimization framework designed for efficient, robust, and minimal-tuning training of LLMs and deep neural networks at scale. Rooted in the theoretical framework of online-to-batch conversion and Polyak averaging, ScheduleFree+ extends the original Schedule-Free methodology to address the specific challenges of LLM-scale training, including large batch sizes, extreme parameter counts, and the requirement for predictable, “anytime” convergence. Empirical evidence indicates that ScheduleFree+ converges faster and more stably than canonical learning-rate schedules, particularly in long-duration, high-budget LLM training scenarios (Defazio, 18 May 2026).

1. Theoretical Basis and Algorithmic Structure

ScheduleFree+ builds on the unification of iterate averaging and step-size scheduling originally formalized in the Schedule-Free framework (Defazio et al., 2024), leveraging a general weighted average of “fast” iterates $z_t$ to produce a stable, best-so-far output $x_t$. In the base variant, the AdamW update is cast as

$$z_{t+1} = z_t - \eta \cdot G_t(y_t),$$

with $G_t(y_t)$ the stochastic gradient evaluated at an interpolation point $y_t = (1-\beta) z_t + \beta x_t$. The output $x_{t+1}$ is formed via incremental averaging:

$$x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1},$$

with averaging weight $c_{t+1}$ determined by Polyak-based statistics or preset functional forms.
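The recursion above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: it uses the raw gradient rather than an AdamW-preconditioned one, it fixes the averaging weight to the uniform choice $c_{t+1} = 1/(t+2)$ (one possible "preset functional form"), and the names `schedule_free_sgd` and `grad_fn` are hypothetical.

```python
from typing import Callable, List, Tuple


def schedule_free_sgd(
    grad_fn: Callable[[List[float]], List[float]],
    z0: List[float],
    eta: float = 0.1,
    beta: float = 0.9,
    steps: int = 1000,
) -> Tuple[List[float], List[float]]:
    """Run the base z/y/x recursion and return (x, z).

    Assumptions (not from the source): plain gradient steps instead of
    AdamW preconditioning, and uniform averaging weights.
    """
    z = list(z0)
    x = list(z0)  # the averaged output starts at the initial point
    for t in range(steps):
        # Interpolation point: y_t = (1 - beta) * z_t + beta * x_t.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad_fn(y)
        # Fast iterate: z_{t+1} = z_t - eta * G_t(y_t).
        z = [zi - eta * gi for zi, gi in zip(z, g)]
        # Assumed uniform weight c_{t+1} = 1 / (t + 2), so that x is the
        # running mean of z_0, ..., z_{t+1}.
        c = 1.0 / (t + 2)
        # Incremental average: x_{t+1} = (1 - c) * x_t + c * z_{t+1}.
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x, z


if __name__ == "__main__":
    # Toy objective f(p) = 0.5 * (p - 3)^2, whose gradient is p - 3.
    x, z = schedule_free_sgd(lambda p: [pi - 3.0 for pi in p], [0.0], steps=3000)
    print(x, z)
```

In this sketch only `x` would be used for evaluation, while gradients are always taken at the interpolation point `y`.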

ScheduleFree+ extends this by introducing the following key components:

  • Inner Adam-style momentum ($\beta_1$) for added stability in high-batch regimes.
  • Polyak-style adaptive step-size ($\eta_t$) based on real-time estimates of the objective gap and gradient norm via exponential moving averages.
  • Averaging buffer warm-start, holding […] for an initial phase to prevent norm collapse in early iterations.
  • Annealed outer momentum $\beta$: begins with a low $\beta$ for fast early progress, then gradually increases to […] to prioritize smoother convergence in long runs.

The step size $\eta_t$ is computed per iteration following

[…]

where […] is the Polyak denominator, estimated as the corrected EMA of the […]-gradient norm and converted to an […] estimate via a […] factor.

This structure eliminates all learning-rate and decay hyperparameters except for the moments and decay constants inherited from AdamW (Defazio, 18 May 2026).
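To make the idea of a Polyak-style step size driven by exponential moving averages concrete, here is a generic sketch. It is not the ScheduleFree+ formula: it implements the classical Polyak ratio (objective gap divided by squared gradient norm) with bias-corrected EMAs, and it assumes the objective gap is available, which in practice requires an estimate of the optimal value. The class name `PolyakEmaStep` and the parameters `rho` and `eps` are hypothetical.

```python
class PolyakEmaStep:
    """Generic Polyak-style step size with EMA smoothing (illustrative only).

    Assumption: the caller can supply an estimate of the objective gap
    f(x_t) - f*, which is not generally known in practice.
    """

    def __init__(self, rho: float = 0.9, eps: float = 1e-12) -> None:
        self.rho = rho            # EMA coefficient (hypothetical default)
        self.eps = eps            # guards against division by zero
        self.gap_ema = 0.0        # EMA of the objective gap
        self.grad_sq_ema = 0.0    # EMA of the squared gradient norm
        self.count = 0

    def update(self, gap: float, grad_sq_norm: float) -> float:
        """Fold in one observation and return the current step size."""
        self.count += 1
        self.gap_ema = self.rho * self.gap_ema + (1.0 - self.rho) * gap
        self.grad_sq_ema = (
            self.rho * self.grad_sq_ema + (1.0 - self.rho) * grad_sq_norm
        )
        # Bias correction, as in Adam, so early estimates are not shrunk
        # toward the zero initialisation.
        correction = 1.0 - self.rho ** self.count
        gap_hat = self.gap_ema / correction
        grad_sq_hat = self.grad_sq_ema / correction
        return gap_hat / (grad_sq_hat + self.eps)


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * x^2 with f* = 0 and gradient x.
    stepper = PolyakEmaStep()
    x = 4.0
    for _ in range(5):
        eta = stepper.update(gap=0.5 * x * x, grad_sq_norm=x * x)
        x -= eta * x
    print(x)
```

On this toy quadratic the gap is always half the squared gradient norm, so the ratio stays at 0.5 regardless of the smoothing, which makes the behaviour easy to check by hand.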

2. Foundations in Averaging and Regret Analysis

ScheduleFree+ arises directly from minimax optimal weighted-regret bounds in stochastic optimization. For a convex objective with a minimizer, the method guarantees

[…]

with optimality (in the canonical SGD setting) achieved when […]. As the true gradient norm is typically unavailable in nonconvex regimes, ScheduleFree+ employs an EMA proxy, as in the Polyak step-size heuristic (Defazio, 18 May 2026). Unlike fixed-schedule step methods, the averaging-driven contraction in the output $x_t$ ensures […] rates without horizon-dependent tuning.

On the practical side, ScheduleFree+ implements continuous interpolation between momenta and streaming average checkpoints, which empirically yields “anytime” optimization: $x_t$ is the best available iterate at every step, which addresses the performance oscillations common in schedule-based methods.

3. Scaling and Robustness Enhancements

Despite strong small- and mid-scale performance, the original Schedule-Free AdamW exhibited loss divergence and instability at the scale of LLM training (e.g., batch sizes of […]M tokens and models of […]B parameters). ScheduleFree+ addresses these by:

  1. Restoring inner momentum $\beta_1$ ([…]–[…]) to suppress large-batch noise and cliff-like instabilities.
  2. Switching to “fully decoupled” AdamC: Weight decay is applied via […] (not just […]), enforcing stable weight and gradient norm evolution even as the step size $\eta_t$ adapts.
  3. Warm-starting the averaging buffer: […] for […] steps (typically […]–[…]), after which averaging proceeds, preventing early norm collapse in the averaging buffer and improving initial loss behavior.
  4. Annealing outer momentum $\beta$: Interpolated from […] to […] over the first 10–20% of epochs, balancing fast initial adaptation and later-stage smoothness.
  5. Time-weighted averaging […] with […] for long runs (tokens-per-parameter […]B) and […] for short runs, empirically matched to the scale of LLM pretraining.
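The outer-momentum annealing in item 4 can be sketched as a simple schedule function. The source states only that the momentum is interpolated over the first 10–20% of training; the linear form, the function name `annealed_beta`, and the endpoint values used in the usage lines are assumptions made for illustration, not values taken from the source.

```python
def annealed_beta(
    step: int,
    total_steps: int,
    beta_start: float,
    beta_end: float,
    anneal_frac: float = 0.1,
) -> float:
    """Linearly interpolate the outer momentum over the first part of training.

    Assumptions: linear interpolation and a 10% default annealing window;
    the endpoints are supplied by the caller.
    """
    anneal_steps = max(1, int(anneal_frac * total_steps))
    # Fraction of the annealing window completed, capped at 1.
    progress = min(1.0, step / anneal_steps)
    return beta_start + progress * (beta_end - beta_start)


if __name__ == "__main__":
    # Hypothetical endpoints, chosen only to show the shape of the schedule.
    for step in (0, 50, 100, 500):
        print(step, annealed_beta(step, 1000, beta_start=0.5, beta_end=0.9))
```

After the annealing window the function returns the end value for every remaining step, so the schedule only affects the early phase of training.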

These changes collectively enable near-constant gradient norms, predictable effective step sizes, and robust convergence across a range of batch and model sizes, as demonstrated empirically (Defazio, 18 May 2026).

4. Empirical Evaluation and Comparative Performance

ScheduleFree+ has been benchmarked on scaling ladders comprising 120M to 1B parameter Llama-3 style transformers, with batch sizes up to 4M tokens and sequence lengths of 2K tokens. The optimizer is consistently competitive with, and often superior to, state-of-the-art Linear-Decay and Warmup-Stable-Decay (WSD) schedules.

Key empirical findings include:

  • Long runs (1000 tokens/parameter): ScheduleFree+ reaches the same validation loss as tuned Linear-Decay in 31% fewer tokens; Linear-Decay required 45% more tokens to reach parity at 120M scale.
  • Medium/short runs (100/20 tokens/parameter): ScheduleFree+ outperforms or matches WSD and is generally comparable to Linear-Decay, except for very short runs on the largest models, where prolonged drift in early steps is limiting.
  • Convergence predictability: The loss fits the form […] outside the initial 5% burn-in, enabling accurate stopping/horizon planning by early-stage curve fitting.
  • Loss curve behavior: The adaptive $\eta_t$ schedule provides smooth, monotonic optimization progress with no imposed decay, in contrast to the stepwise fluctuations and plateaux seen in canonical schedules (Defazio, 18 May 2026).
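The two long-run figures in the first bullet appear to be two views of the same ratio. A quick check, where $N_{\mathrm{SF+}}$ and $N_{\mathrm{LD}}$ are labels introduced here (not in the source) for the number of tokens each method needs to reach a given validation loss:

```latex
\frac{N_{\mathrm{SF+}}}{N_{\mathrm{LD}}} = 1 - 0.31 = 0.69
\quad\Longrightarrow\quad
\frac{N_{\mathrm{LD}}}{N_{\mathrm{SF+}}} = \frac{1}{0.69} \approx 1.45
```

That is, 31% fewer tokens for ScheduleFree+ corresponds to roughly 45% more tokens for Linear-Decay.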

5. Integration into Foundation Model Fine-Tuning and Physical Modeling

ScheduleFree-style methods have shown practical benefits beyond LLM pretraining:

  • In atomistic foundation modeling, ScheduleFree with AdamW-style preconditioning and automatic global scaling achieves superior force root mean-squared error (RMSE) and robust molecular dynamics (MD) stability, on par with or better than AdamW and LAMB, and strictly superior to SGD, RAdam, and Ranger, as benchmarked on energy/force accuracy and physical observable fidelity (Liu et al., 5 Dec 2025).
  • The preconditioning framework endows ScheduleFree with nearly perfect Hessian spectrum flattening ([…]), yielding improved convergence, lower friction along flat modes, and improved stability in strongly anisotropic landscapes, including solid–liquid interface dynamics and phonon spectra recovery.
  • ScheduleFree’s adaptation of Polyak averaging renders model averaging and checkpoint merging a theoretically principled procedure rather than a heuristic, supporting empirical gains found in recent “model soups” research (Defazio, 18 May 2026).

6. Hyperparameterization and Best Practices

ScheduleFree+ eliminates the need for hand-tuned learning-rate schedules, grid searches for decay strategies, or epoch-dependent scheduling:

  • Adam/AdamW parameters: Retains the AdamW moment and decay constants; no schedule hyperparameters.
  • Polyak EMA coefficient: […] in the Polyak denominator suffices.
  • Average warmup: […] warmup steps (empirically […]).
  • Weight decay: Used at […]–[…] standard values, always applied as […].
  • No learning-rate grid search: Step size $\eta_t$ is computed online per iteration; users set only optimizer defaults (Defazio, 18 May 2026).
  • Separate $x_t$ for evaluation: $x_t$ yields the best-so-far model; […] is not used in evaluation.

In fine-tuning workflows, ScheduleFree can be used as a drop-in replacement for AdamW with minimal changes. For strongly heterogeneous or highly anisotropic domains, a brief second-order refinement (e.g., L-BFGS post-processing) after ScheduleFree optimization is recommended but not required for homogeneous tasks (Liu et al., 5 Dec 2025).

7. Broader Impact and Methodological Significance

ScheduleFree+ exemplifies a principled shift away from heuristic, horizon-dependent schedule design toward theoretically grounded, uniformly convergent, and robust optimizer construction. By unifying Polyak heuristic steps, online-to-batch averaging, and advanced momentum design in a learning-rate-free framework, ScheduleFree+ simplifies practical optimization, reduces parameter-sensitivity, and enhances convergence predictability in both deep learning and scientific modeling contexts.

The approach is especially well-suited to scenarios with unpredictable compute budgets, massive model scales, or environments demanding anytime stop-ability and stability across diverse tasks, including but not limited to LLM pretraining and foundation model fine-tuning across scientific domains (Defazio, 18 May 2026, Liu et al., 5 Dec 2025).
