
Horizon-Free Learning-Rate Schedules

Updated 18 February 2026
  • Horizon-free learning-rate schedules are methods that compute the learning rate using only past and current training data without needing a pre-specified total number of iterations.
  • They improve upon traditional cosine or exponential decay schedules by enabling anytime convergence and near-minimax performance in overparameterized and dynamic training environments.
  • Applications span large-scale language models, computer vision, and reinforcement learning, where these adaptable schedules simplify hyperparameter tuning and manage unknown training durations effectively.

Horizon-free learning-rate schedules comprise a family of approaches to learning rate adaptation in stochastic optimization that do not require advance specification of the total training horizon. These methods contrast with conventional horizon-aware schedules (e.g., cosine annealing, exponential decay), which critically depend on knowledge of the final iteration or epoch count. Horizon-free schedules have attracted significant attention due to the rise of continual, open-ended, and large-scale training paradigms, where the training horizon is unbounded, unknown, or subject to dynamic extension. This article provides a rigorous overview of the mathematical foundations, principal algorithmic strategies, theoretical guarantees, limitations, and empirical performance of horizon-free learning-rate schedules.

1. Mathematical Foundations of Horizon-Free Schedules

The core mathematical property of a horizon-free schedule is that the learning rate at each step t, denoted η_t, is computable using only information available up to t, without reference to a pre-specified T or N (the total planned steps/epochs). Classical horizon-aware schedules are typically parametric functions of t/T (e.g., cosine annealing η_t = (η_0/2)(1 + cos(πt/T))), and fail to generalize when T is misestimated or extended.
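The distinction can be made concrete by sketching the two families side by side; the constants below are illustrative placeholders, not values from the cited papers:

```python
import math

def cosine_lr(t, T, lr0=1e-2):
    # Horizon-aware: cannot be evaluated without the total step count T.
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * t / T))

def poly_lr(t, c=1e-2, alpha=0.5):
    # Horizon-free: depends only on the current step t (1-indexed here).
    return c * t ** (-alpha)
```

Extending training past T breaks `cosine_lr` (it has already annealed to zero), whereas `poly_lr` keeps producing valid step sizes for any t.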

Horizon-free schedules may be characterized as:

  • Deterministic decay families: polynomial decays, e.g., η_t = C t^{-α} with α ∈ (0,1), shown to be minimax-optimal in the tail-averaged sense for overparameterized linear regression when paired with weight averaging (Meterez et al., 3 Feb 2026).
  • Adaptive state-based rules: schedulers where η_t is a function of observable metrics (instantaneous loss, weight norm, validation error, etc.), such as the GreedyLR scheduler, which adjusts η_t up or down in response to minibatch loss changes (Subramanian et al., 16 Dec 2025), or ABEL, which decays the learning rate when "bounces" in the total weight norm are detected (Lewkowycz, 2021).
  • Meta-learned context-aware mappings: parameterized predictors (e.g., LSTM-based meta-schedulers such as MLR-SNet (Shu et al., 2020)) or ODE-driven models that select η_t based on learned representations of recent training dynamics (Sampson et al., 27 Sep 2025).

The theoretical frameworks addressing horizon-freeness clarify the distinction between guarantees for tail-averaged iterates (Polyak–Juditsky averaging) and final-iterate performance, with the former admitting truly anytime minimax rates under horizon-free polynomial decay (Meterez et al., 3 Feb 2026, Bordelon et al., 4 Feb 2026) but fundamental barriers for the latter (Ge et al., 2019).

2. Principal Algorithmic Strategies

Several classes of horizon-free schedules have been systematically instantiated:

(a) Anytime Polynomial Decay Plus Averaging:

SGD with polynomially decaying step sizes η_t = C t^{-α}, combined with tail-weight averaging:

  • Achieves near-minimax rates in overparameterized settings if the decay exponent α is tuned to problem regularity (source and capacity exponents) (Meterez et al., 3 Feb 2026).
  • In large-scale LLM pretraining, constant step size plus EMA averaging, or polynomial decay plus EMA, matches optimized cosine schedules across 1×–32× Chinchilla scale.
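A minimal sketch of this recipe (polynomial decay plus tail averaging, run on a toy noisy quadratic; the constants and the toy objective are illustrative assumptions, not the cited papers' settings):

```python
import numpy as np

def tail_averaged_sgd(grad_fn, w0, steps, c=0.5, alpha=0.5, tail_frac=0.5):
    """SGD with horizon-free polynomial decay eta_t = c * t^(-alpha),
    returning the average of the most recent `tail_frac` fraction of
    iterates. The averaging window grows with t, so the estimate can be
    read out at any step without ever fixing a horizon T."""
    rng = np.random.default_rng(0)
    w = np.asarray(w0, dtype=float)
    iterates = []
    for t in range(1, steps + 1):
        eta = c * t ** (-alpha)          # depends only on the current step
        w = w - eta * grad_fn(w, rng)
        iterates.append(w.copy())
    tail = iterates[int(len(iterates) * (1.0 - tail_frac)):]
    return np.mean(tail, axis=0)

# Noisy gradient oracle for f(w) = 0.5 * ||w||^2 (optimum at w = 0).
def noisy_grad(w, rng):
    return w + 0.1 * rng.standard_normal(w.shape)

w_bar = tail_averaged_sgd(noisy_grad, w0=np.ones(3), steps=2000)
```

Because both the decay and the averaging window are defined in terms of the elapsed step count alone, training can be stopped or extended at any point without retuning.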

(b) Greedy Loss-Adaptive Scheduling (GreedyLR):

Dynamically multiplies (divides) the learning rate by a fixed factor when the loss decreases (increases) relative to the previous step:

  • Requires only per-step loss comparison and one multiplication or division.
  • Provably converges on smooth convex objectives, with rates that are optimal under standard smoothness assumptions (Subramanian et al., 16 Dec 2025).
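A single step of the greedy rule can be sketched as follows; the update factor and clipping bounds are illustrative assumptions, and the published scheduler additionally uses smoothing and patience windows, which this sketch omits:

```python
def greedy_lr_step(lr, prev_loss, loss, factor=1.1, lr_min=1e-8, lr_max=1.0):
    """One GreedyLR-style update: multiply the learning rate by `factor`
    when the loss improved and divide by it when the loss got worse.
    Only past observations are used, so no horizon T is needed."""
    if loss < prev_loss:          # improvement: be greedier
        return min(lr * factor, lr_max)
    if loss > prev_loss:          # regression: back off
        return max(lr / factor, lr_min)
    return lr
```

The per-step cost is one comparison and one multiplication or division, matching the bullet above.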

(c) Weight-Norm–Triggered Decay (ABEL):

Monitors the evolution of the total squared weight-norm and executes learning rate decay precisely when a minimum (“bounce”) is detected:

  • Fully data-driven; never requires knowledge of the horizon T.
  • In regimes where a bounce occurs (e.g., vision models with L2 regularization), it matches or exceeds finely tuned schedules; otherwise it defaults to a single decay at the end of training (Lewkowycz, 2021).
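The bounce-triggered decay can be sketched as a pass over a recorded weight-norm trace; the three-point local-minimum test and the decay factor below are simplifying assumptions, not the paper's exact criterion:

```python
def abel_lr_trace(weight_norms, lr0=0.1, decay=0.1):
    """Sketch of ABEL's trigger: decay the learning rate each time the
    squared weight norm passes through a local minimum (a 'bounce').
    Entirely data-driven; the training horizon never appears."""
    lr, trace = lr0, []
    for i in range(len(weight_norms)):
        # A bounce: the norm decreased into step i-1, then increased into step i.
        bounce = (i >= 2 and
                  weight_norms[i - 2] > weight_norms[i - 1] < weight_norms[i])
        if bounce:
            lr *= decay
        trace.append(lr)
    return trace
```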

(d) Meta-Learned and Online Hypergradient Schedules:

Meta-learned parametric maps or ODE models predict η_t based on the current loss trajectory and internal recurrent state:

  • MLR-SNet uses a single-layer LSTM mapping the recent loss and its hidden state to η_t (Shu et al., 2020).
  • Latent ODE schedulers encode recent metrics into latent space, solve a neural ODE, and decode both immediate and long-range schedule recommendations; such models are entirely agnostic to horizon and optimize for long-term generalization rather than pre-specified endpoints (Sampson et al., 27 Sep 2025).

(e) Hyperbolic and Epoch-Insensitive Schedules:

HyperbolicLR and ExpHyperbolicLR use hyperbolic curve parametrizations to produce schedules whose early and mid-phase decay slopes are asymptotically independent of the epoch budget N (Kim, 2024):

  • Fitting {init, min, slope} parameters on a small budget and then scaling up N gives consistent performance without retuning, which is crucial for deployments with unknown or frequently changing horizons.
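An illustrative epoch-insensitive hyperbolic decay (hedged: this is a generic hyperbola with {init, min, slope} parameters, not the exact HyperbolicLR parametrization of Kim, 2024):

```python
def hyperbolic_lr(t, lr_init=1e-2, lr_min=1e-4, tau=100.0):
    """Hyperbolic decay fixed by {init, min, slope-scale tau}. Because
    the epoch budget N does not appear, the early- and mid-phase values
    are unchanged when N is scaled up."""
    return lr_min + (lr_init - lr_min) / (1.0 + t / tau)
```

Scaling the budget from N to 4N leaves η_t identical for every step t already taken, which is the epoch-insensitivity property described above.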

3. Theoretical Properties and Limitations

The minimax convergence of horizon-free schedules is contingent on how performance is measured:

  • Tail-Averaged Guarantees:

For overparameterized linear models, polynomial decay with tail or EMA averaging achieves horizon-free minimax excess-risk rates when the source exponent dominates the capacity exponent (Meterez et al., 3 Feb 2026). The optimal decay exponent is known in closed form, yielding a simple explicit schedule.

  • Final-Iterate Barriers:

For the canonical least-squares problem, it is shown that, without explicit horizon-dependent tuning, no horizon-free polynomial decay achieves the minimax final-iterate rate for all horizons: at infinitely many values of T the excess risk is sub-optimal by a multiplicative factor (in the strongly convex case), and only geometric (step-decay) schedules with tuned steps per epoch can close the gap, but these require knowledge of T (Ge et al., 2019).

  • Adaptive and Meta-Learned Schedules:

When schedules are learned (via bilevel optimization or meta-learning) without any prior on the horizon, uniform generalization bounds can be established for the learned schedule vector over tasks, with explicit sample-complexity bounds for piecewise-polynomial and Pfaffian objectives. This analysis holds even in non-convex, non-smooth settings and does not require the horizon to enter the optimization as an argument (Sharma, 4 Dec 2025).

  • Loss-Adaptive and State-Based Approaches:

GreedyLR and ABEL operate independently of the horizon and can adapt instantly to unexpected schedule extensions, restarts, or early stopping; these properties are not available to any horizon-parametric scheduler (Subramanian et al., 16 Dec 2025; Lewkowycz, 2021).

4. Empirical Performance and Applications

Empirical studies spanning language modeling, computer vision, time series, operator learning, and reinforcement learning indicate the following trends:

  • Anytime Polynomial/EWA Schedules:

On LLM pretraining (OLMo 150M/300M, C4), constant step size plus EMA and polynomial decay plus EMA track the best-tuned cosine decay envelope to within a small validation-loss margin across 1×–32× Chinchilla scale, and also outperform it at very large batch sizes (Meterez et al., 3 Feb 2026).

  • GreedyLR:

Surpasses or matches classic cosine and exponential decay on NLP, vision, and LLM tasks (up to 7B params), is robust to noise, and recovers from loss spikes more quickly (Subramanian et al., 16 Dec 2025).

  • ABEL:

In weight-norm-bouncing regimes, matches or slightly outperforms step-wise and cosine schedules on ImageNet, CIFAR-10, and large-scale NLP. In "non-bounce" regimes, reduces to an optimal one-shot decay at the end of training (Lewkowycz, 2021).

  • Meta-Learned/Learned Schedules:

MLR-SNet demonstrates strong transfer to unseen horizons, architectures, and data domains with test performance comparable or superior to the best tuned static schedules, including in robustness to corrupted data scenarios (Shu et al., 2020). Latent ODE schedulers consistently improve over baseline and hypergradient approaches in test accuracy and find flatter minima on diverse datasets (Sampson et al., 27 Sep 2025).

  • Hyperbolic/Epoch-Insensitive Schedules:

HyperbolicLR and ExpHyperbolicLR deliver near-constant initial decay and stable performance as the epoch budget increases fourfold, reducing the need for retuning and maintaining curve shape across deployments (Kim, 2024).

5. Relationship to Horizon-Aware and Averaging Strategies

A major technical insight is the centrality of iterate averaging for achieving minimax rates in the absence of a horizon.

  • Polyak–Juditsky averaging makes polynomial decays anytime-minimax for overparameterized or interpolating models (Meterez et al., 3 Feb 2026, Bordelon et al., 4 Feb 2026).
  • Final-iterate schemes cannot, in general, replicate this property: achieving minimax final-iterate rates requires step decay with epoch boundaries tuned to the horizon, which is no longer anytime (Ge et al., 2019).
  • Hypergradients, bilevel, and ODE-based methods can adapt on-the-fly but often trade off interpretability or computational simplicity for flexibility and transfer.
  • MLR-SNet and related meta-learners generalize across both horizon and task, parameterizing the schedule as a small recurrent map of local observables, and can be meta-trained to be plug-and-play out-of-the-box for novel deployments (Shu et al., 2020).

6. Practical Recommendations and Deployment Guidelines

For open-ended, continual, or plug-and-play machine learning deployments, the literature recommends:

  • Avoid horizon-tuned cosine, polynomial, or geometric decays when the horizon T is unknown or variable.
    • For standard SGD: use polynomial decay η_t = C t^{-α} or a constant step size (lightly grid-tuned), always combined with EMA or tail averaging whose half-life grows in proportion to elapsed training time (Meterez et al., 3 Feb 2026).
    • For loss- or state-adaptive approaches: GreedyLR (loss-triggered), ABEL (bounce-triggered), or MLR-SNet (meta-learned) provide robust defaults.
    • For tasks requiring rapid scaling of epochs or unpredictable compute budgets: HyperbolicLR/ExpHyperbolicLR, tuned on a small epoch budget and transferred without retuning (Kim, 2024).
  • The overhead of the additional statistics (EMA, loss smoothing, internal RNN state) is O(1) per step, negligible relative to compute- and memory-bound workloads.
  • Horizon-free schemes reduce the need for checkpointing, retuning, and early stopping heuristics, and integrate cleanly with contemporary ML pipeline infrastructure.
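The growing-half-life averaging recommendation above can be sketched as an EMA whose decay coefficient depends on the elapsed step count; the constant c is an assumption:

```python
def anytime_ema(values, c=2.0):
    """EMA of iterates with step-dependent decay beta_t = 1 - c/t, so the
    effective averaging half-life grows in proportion to elapsed steps.
    This makes the average 'anytime': no horizon is needed, and later
    iterates are weighted more heavily than a plain running mean."""
    avg = None
    for t, v in enumerate(values, start=1):
        beta = max(0.0, 1.0 - c / t)
        avg = v if avg is None else beta * avg + (1.0 - beta) * v
    return avg
```

Maintaining this average costs one extra buffer and one fused multiply-add per step, consistent with the O(1) overhead noted above.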

7. Limitations, Open Problems, and Frontiers

  • Final-iterate minimaxity is unattainable in full generality for horizon-free scalar schedules without averaging: the suboptimality gap cannot be eliminated in strongly convex streaming regression (Ge et al., 2019).
  • Extensions to other optimizers, e.g. AdamW or momentum SGD, are empirically validated, but precise universal theoretical characterization beyond SGD is partially open (Meterez et al., 3 Feb 2026).
  • Meta-learned and state-based approaches depend on training data diversity: while robust in benchmark evaluations, their horizon-agnostic property is ultimately empirical in generalization to new domains (Sampson et al., 27 Sep 2025, Shu et al., 2020).
  • Bilevel and ODE-based algorithms have higher per-iteration cost, though recent advances suggest these are manageable in practice (Sampson et al., 27 Sep 2025).
  • Theoretical guarantees for online hypergradient-based schedulers (e.g., MARTHE) exist for stability under natural conditions but lack formal rate-optimality proofs (Donini et al., 2019).

A plausible implication is that the future of horizon-free scheduling will increasingly blend lightweight, minimax polynomial/EMA protocols for core optimization with task-adaptive or meta-learned controllers for long-horizon, heterogeneous, or nonstationary training regimes. Emerging open questions include:

  • Can one close the minimax final-iterate gap for horizon-free schemes without averaging, possibly via new algorithmic primitives (e.g., selective checkpointing, hybrid geometric–state triggers)?
  • How robust are ODE/meta-learned schedulers to adversarial or regime-shifting data, and what are the lower bounds on their sample complexity to attain generalization across both model and horizon?
