ScheduleFree+: Learning-Rate-Free Optimization

Updated 22 May 2026

ScheduleFree+ is a schedule-free, learning-rate-free optimization framework that simplifies training of large-scale language models and deep neural networks by eliminating manual tuning of learning rates.
It leverages online-to-batch conversion and Polyak averaging to deliver predictable, anytime convergence, outperforming traditional schedules in both speed and stability.
Empirical evaluations reveal that ScheduleFree+ achieves faster convergence and robust performance across diverse scales, from LLM pretraining to physical modeling tasks.

ScheduleFree+ is a schedule-free, learning-rate-free optimization framework designed for efficient, robust, and minimal-tuning training of LLMs and deep neural networks at scale. Rooted in the theoretical framework of online-to-batch conversion and Polyak averaging, ScheduleFree+ extends the original Schedule-Free methodology to deal with the unique challenges of LLM-scale training, including large batch sizes, extreme parameter counts, and the requirement for predictable, “anytime” convergence. Empirical evidence demonstrates that ScheduleFree+ achieves faster convergence and greater stability compared to canonical learning-rate schedules, particularly in long-duration, high-budget LLM training scenarios (Defazio, 18 May 2026).

1. Theoretical Basis and Algorithmic Structure

ScheduleFree+ builds on the unification of iterate averaging and step-size scheduling originally formalized in the Schedule-Free framework (Defazio et al., 2024), leveraging a general weighted average of “fast” iterates $z_t$ to produce a stable, best-so-far output $x_t$. In the base variant, the AdamW update is cast as

$z_{t+1} = z_t - \eta \cdot G_t(y_t),$

with $G_t(y_t)$ the stochastic gradient evaluated at an interpolation point $y_t = (1-\beta)z_t + \beta x_t$. The output $x_{t+1}$ is formed via incremental averaging:

$x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1},$

with averaging weight $c_{t+1}$ determined by Polyak-based statistics or preset functional forms.
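
The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the ScheduleFree+ implementation: it uses a plain SGD step without Adam preconditioning, assumes the uniform averaging weight $c_{t+1} = 1/(t+1)$, and the function name `schedule_free_sgd`, the quadratic test objective, and the values of `eta` and `beta` are all chosen here for illustration.

```python
import numpy as np

def schedule_free_sgd(grad_fn, w0, eta=0.5, beta=0.9, steps=2000):
    """Base z/y/x recursion with a plain (unpreconditioned) gradient step."""
    z = np.array(w0, dtype=float)  # fast iterate z_t
    x = z.copy()                   # averaged output x_t
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # interpolation point y_t
        z = z - eta * grad_fn(y)         # z_{t+1} = z_t - eta * G_t(y_t)
        c = 1.0 / (t + 1)                # assumed uniform weight c_{t+1}
        x = (1.0 - c) * x + c * z        # x_{t+1} = (1 - c) x_t + c z_{t+1}
    return x, z

# Toy objective f(w) = 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([1.0, -2.0, 3.0])
x_final, z_final = schedule_free_sgd(lambda w: w - target, np.zeros(3))
print(np.linalg.norm(x_final - target))
```

Note that the gradient is always evaluated at $y_t$, while $x_t$ is the quantity one would report or evaluate.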

ScheduleFree+ extends this by introducing the following key components:

Inner Adam-style momentum $(\beta_1)$ for added stability in high-batch regimes.
Polyak-style adaptive step-size $(\eta_t)$ based on real-time estimates of the objective gap and gradient norm via exponential moving averages.
Averaging buffer warm-start, deferring averaging for an initial phase to prevent norm collapse in early iterations.
Annealed outer momentum $\beta$: begins with a low value for fast early progress and is gradually increased to prioritize smoother convergence in long runs.

The step size $\eta_t$ is computed at every iteration as a Polyak-style ratio of the estimated objective gap to a denominator term. The Polyak denominator is estimated as the corrected EMA of a gradient norm, with a scaling factor converting the estimate from one norm to another.

This structure eliminates all learning-rate and decay hyperparameters except for the moments and decay constants inherited from AdamW (Defazio, 18 May 2026).
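
To make the idea of a Polyak-style step size driven by exponential moving averages concrete, the sketch below uses the classical Polyak ratio (objective gap divided by squared gradient norm) with Adam-style bias correction. The exact ScheduleFree+ expression is not reproduced here: the class name `PolyakEmaStep`, the EMA coefficient `rho`, the lower bound `f_star`, and the use of the squared Euclidean norm are all assumptions made for illustration.

```python
import numpy as np

class PolyakEmaStep:
    """Hypothetical helper: Polyak-style step size from bias-corrected EMAs."""

    def __init__(self, rho=0.9, f_star=0.0, eps=1e-12):
        self.rho = rho        # EMA coefficient (assumed value)
        self.f_star = f_star  # assumed lower bound on the objective
        self.eps = eps        # guards against division by zero
        self.gap_ema = 0.0    # EMA of the objective gap
        self.gsq_ema = 0.0    # EMA of the squared gradient norm
        self.t = 0

    def update(self, loss, grad):
        self.t += 1
        gap = max(loss - self.f_star, 0.0)
        gsq = float(np.dot(grad, grad))
        self.gap_ema = self.rho * self.gap_ema + (1.0 - self.rho) * gap
        self.gsq_ema = self.rho * self.gsq_ema + (1.0 - self.rho) * gsq
        corr = 1.0 - self.rho ** self.t  # Adam-style bias correction
        return (self.gap_ema / corr) / (self.gsq_ema / corr + self.eps)

# With a constant loss of 2.0 and a gradient of squared norm 25, the
# bias-corrected ratio stays at 2 / 25 from the first step onward.
stepper = PolyakEmaStep()
etas = [stepper.update(2.0, np.array([3.0, 4.0])) for _ in range(5)]
print(etas)
```

The bias correction matters in the first few iterations, where an uncorrected EMA initialized at zero would underestimate both numerator and denominator.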

2. Foundations in Averaging and Regret Analysis

ScheduleFree+ arises directly from minimax optimal weighted-regret bounds in stochastic optimization. For a convex objective with a minimizer, the method guarantees a bound on the suboptimality of the averaged output, with optimality (in the canonical SGD setting) achieved for a step size that depends on the true gradient norm. As the true gradient norm is typically unavailable in nonconvex regimes, ScheduleFree+ employs an EMA proxy, as in the Polyak step-size heuristic (Defazio, 18 May 2026). Unlike fixed-schedule step methods, the averaging-driven contraction in the output $x_t$ ensures these rates without horizon-dependent tuning.

On the practical side, ScheduleFree+ implements continuous interpolation between momenta and streaming average checkpoints, which empirically yields “anytime” optimization: $x_t$ is the best available iterate at every step, addressing the performance oscillations common in schedule-based methods.

3. Scaling and Robustness Enhancements

Despite strong small- and mid-scale performance, the original Schedule-Free AdamW exhibited loss divergence and instability at the scale of LLM training (e.g., batch sizes on the order of millions of tokens and parameter counts on the order of billions). ScheduleFree+ addresses these by:

Restoring inner momentum $\beta_1$ to suppress large-batch noise and cliff-like instabilities.
Switching to “fully decoupled” AdamC: Weight decay $\lambda$ is applied in fully decoupled form, enforcing stable weight and gradient norm evolution even as the step size $\eta_t$ adapts.
Warm-starting the averaging buffer: Averaging is deferred for an initial warm-up period, after which averaging proceeds, preventing early norm collapse and improving initial loss behavior.
Annealing outer momentum $\beta$: Interpolated from a low initial value to a higher final value over the first 10–20% of epochs, balancing fast initial adaptation and later-stage smoothness.
Time-weighted averaging: The averaging weights are chosen separately for long and short runs (as measured in tokens per parameter), empirically matched to the scale of LLM pretraining.

These changes collectively enable near-constant gradient norms, predictable effective step sizes, and robust convergence across a range of batch and model sizes, as demonstrated empirically (Defazio, 18 May 2026).
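
As an illustration of the outer-momentum annealing described in this section, the following sketch linearly interpolates $\beta$ over an initial fraction of training and then holds it constant. The function name `outer_momentum`, the endpoint values 0.5 and 0.9, the linear shape of the ramp, and the use of steps rather than epochs are assumptions for the sake of the example, not values taken from the source.

```python
def outer_momentum(step, total_steps, beta_start=0.5, beta_end=0.9, anneal_frac=0.1):
    """Linearly anneal the outer momentum, then hold it at beta_end.

    All default values are illustrative placeholders.
    """
    anneal_steps = max(1, int(anneal_frac * total_steps))
    frac = min(step / anneal_steps, 1.0)  # progress through the ramp, capped at 1
    return beta_start + frac * (beta_end - beta_start)

# Sample the schedule at the start, mid-ramp, end of ramp, and late in training.
schedule = [outer_momentum(s, 1000) for s in (0, 50, 100, 500)]
print(schedule)
```

A low early $\beta$ keeps the gradient evaluation point close to the fast iterate, while a higher later $\beta$ moves it toward the smoother average.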

4. Empirical Evaluation and Comparative Performance

ScheduleFree+ has been benchmarked on scaling ladders comprising 120M to 1B parameter Llama-3 style transformers, with batch sizes up to 4M tokens and sequence lengths of 2K tokens. The optimizer is consistently competitive with, and often superior to, state-of-the-art Linear-Decay and Warmup-Stable-Decay (WSD) schedules.

Key empirical findings include:

Long runs (1000 tokens/parameter): ScheduleFree+ reaches the same validation loss as tuned Linear-Decay in 31% fewer tokens; Linear-Decay required 45% more tokens to reach parity at 120M scale.
Medium/short runs (100/20 tokens per parameter): ScheduleFree+ outperforms or matches WSD and is generally comparable to Linear-Decay, except for very short runs on the largest models, where prolonged drift in early steps is limiting.
Convergence predictability: The loss curve follows a simple parametric form outside the initial 5% burn-in, enabling accurate stopping/horizon planning by early-stage curve fitting.
Loss curve behavior: The adaptive $\eta_t$ schedule provides smooth, monotonic optimization progress with no imposed decay, in contrast to the stepwise fluctuations and plateaux seen in canonical schedules (Defazio, 18 May 2026).

5. Integration into Foundation Model Fine-Tuning and Physical Modeling

ScheduleFree-style methods have shown practical benefits beyond LLM pretraining:

In atomistic foundation modeling, ScheduleFree with AdamW-style preconditioning and automatic global scaling achieves superior force root mean-squared error (RMSE) and robust molecular dynamics (MD) stability, on par with or better than AdamW and LAMB, and strictly superior to SGD, RAdam, and Ranger, as benchmarked in energy/force accuracy and physical observable fidelity (Liu et al., 5 Dec 2025).
The preconditioning framework endows ScheduleFree with nearly perfect Hessian spectrum flattening, yielding improved convergence, lower friction along flat modes, and improved stability in strongly anisotropic landscapes, including solid–liquid interface dynamics and phonon spectra recovery.
ScheduleFree’s adaptation of Polyak averaging renders model averaging and checkpoint merging a theoretically principled procedure rather than a heuristic, supporting empirical gains found in recent “model soups” research (Defazio, 18 May 2026).

6. Hyperparameterization and Best Practices

ScheduleFree+ eliminates the need for hand-tuned learning-rate schedules, grid searches for decay strategies, or epoch-dependent scheduling:

Adam/AdamW parameters: Retains the standard AdamW moment and decay parameters; no schedule hyperparameters.
Polyak EMA coefficient: A single EMA coefficient in the Polyak denominator suffices.
Averaging warm-up: A fixed number of warm-up steps, set empirically.
Weight decay $\lambda$: Used at standard values, always applied in fully decoupled form.
No learning-rate grid search: Step size $\eta_t$ is computed online per iteration; users set only optimizer defaults (Defazio, 18 May 2026).
Separate $x_t$ for evaluation: The averaged iterate $x_t$ yields the best-so-far model; the training-time iterates are not used in evaluation.

In fine-tuning workflows, ScheduleFree can be used as a drop-in replacement for AdamW with minimal changes. For strongly heterogeneous or highly anisotropic domains, a brief second-order refinement (e.g., L-BFGS post-processing) after ScheduleFree optimization is recommended but not required for homogeneous tasks (Liu et al., 5 Dec 2025).

7. Broader Impact and Methodological Significance

ScheduleFree+ exemplifies a principled shift away from heuristic, horizon-dependent schedule design toward theoretically grounded, uniformly convergent, and robust optimizer construction. By unifying Polyak heuristic steps, online-to-batch averaging, and advanced momentum design in a learning-rate-free framework, ScheduleFree+ simplifies practical optimization, reduces parameter-sensitivity, and enhances convergence predictability in both deep learning and scientific modeling contexts.

The approach is especially well-suited for scenarios with unpredictable compute budgets, massive model scales, or environments demanding anytime stop-ability and stability across diverse tasks—including but not limited to LLM pretraining and foundation model fine-tuning across scientific domains (Defazio, 18 May 2026, Liu et al., 5 Dec 2025).


References (3)

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models (2026)

The Road Less Scheduled (2024)

Beyond Adam: Disentangling Optimizer Effects in the Fine-Tuning of Atomistic Foundation Models (2025)
