Horizon-Free Learning-Rate Schedules
- Horizon-free learning-rate schedules are methods that compute the learning rate using only past and current training data without needing a pre-specified total number of iterations.
- They improve upon traditional cosine or exponential decay schedules by enabling anytime convergence and near-minimax performance in overparameterized and dynamic training environments.
- Applications span large-scale language models, computer vision, and reinforcement learning, where these adaptable schedules simplify hyperparameter tuning and manage unknown training durations effectively.
Horizon-free learning-rate schedules comprise a family of approaches to learning rate adaptation in stochastic optimization that do not require advance specification of the total training horizon. These methods contrast with conventional horizon-aware schedules (e.g., cosine annealing, exponential decay), which critically depend on knowledge of the final iteration or epoch count. Horizon-free schedules have attracted significant attention due to the rise of continual, open-ended, and large-scale training paradigms, where the training horizon is unbounded, unknown, or subject to dynamic extension. This article provides a rigorous overview of the mathematical foundations, principal algorithmic strategies, theoretical guarantees, limitations, and empirical performance of horizon-free learning-rate schedules.
1. Mathematical Foundations of Horizon-Free Schedules
The core mathematical property of a horizon-free schedule is that the learning rate at each step $t$, denoted $\eta_t$, is computable using only information available up to step $t$, without reference to a pre-specified horizon $T$ (the total planned steps/epochs). Classical horizon-aware schedules are typically parametric functions of $t/T$ (e.g., cosine annealing $\eta_t = \tfrac{\eta_0}{2}\bigl(1 + \cos(\pi t / T)\bigr)$), and fail to generalize when $T$ is misestimated or extended.
Horizon-free schedules may be characterized as:
- Deterministic decay families: Polynomial decays, e.g., $\eta_t = \eta_0\, t^{-\alpha}$ with $\alpha \in (0,1)$, as shown to be minimax-optimal in the tail-averaged sense for overparameterized linear regression when paired with weight averaging (Meterez et al., 3 Feb 2026).
- Adaptive state-based rules: Schedulers where $\eta_t$ is a function of observable metrics (instantaneous loss, weight norm, validation error, etc.), such as the GreedyLR scheduler, which adjusts $\eta_t$ up or down in response to minibatch loss changes (Subramanian et al., 16 Dec 2025), or ABEL, which decays the learning rate when "bounces" in the total weight norm are detected (Lewkowycz, 2021).
- Meta-learned context-aware mappings: Parameterized predictors (e.g., LSTM-based meta-schedulers as in MLR-SNet (Shu et al., 2020)) or ODE-driven models which select $\eta_t$ based on learned representations of recent training dynamics (Sampson et al., 27 Sep 2025).
The theoretical frameworks addressing horizon-freeness clarify the distinction between guarantees for tail-averaged iterates (Polyak–Juditsky averaging) and final-iterate performance, with the former admitting truly anytime minimax rates under horizon-free polynomial decay (Meterez et al., 3 Feb 2026, Bordelon et al., 4 Feb 2026) but fundamental barriers for the latter (Ge et al., 2019).
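The distinction can be made concrete by writing both families as plain functions of the step index. In the sketch below the constants and function names are illustrative, not taken from any of the cited papers:

```python
import math

def cosine_lr(t, T, eta0=0.1):
    """Horizon-aware: needs the total step budget T in advance."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))

def poly_lr(t, eta0=0.1, alpha=0.5):
    """Horizon-free: depends only on the current step t."""
    return eta0 * (t + 1) ** (-alpha)

# If training runs past the planned horizon, the cosine schedule turns
# around and starts increasing again, while the polynomial decay remains
# a valid, monotonically decreasing schedule at any t.
T_planned = 1000
assert cosine_lr(2 * T_planned, T_planned) > cosine_lr(T_planned, T_planned)
assert poly_lr(2 * T_planned) < poly_lr(T_planned)
```

Extending or shortening the run requires no change to `poly_lr`, whereas `cosine_lr` must be re-planned whenever $T$ changes.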
2. Principal Algorithmic Strategies
Several classes of horizon-free schedules have been systematically instantiated:
(a) Anytime Polynomial Decay Plus Averaging:
SGD with step sizes $\eta_t = \eta_0\, t^{-\alpha}$, $\alpha \in (0,1)$, combined with tail-weight averaging:
- Achieves near-minimax rates in overparameterized settings if $\alpha$ is tuned to the problem's regularity (source and capacity exponents) (Meterez et al., 3 Feb 2026).
- In large LLM pretraining, a constant step size plus EMA averaging, or $t^{-\alpha}$ decay plus EMA, matches optimized cosine schedules across 1×–32× Chinchilla scale.
- Implementation requires only the standard SGD update plus a running tail (or EMA) average of the iterates.
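A minimal sketch of tail-averaged SGD with polynomial decay, using a noisy 1-D quadratic as a stand-in objective (the objective, constants, and function names are illustrative):

```python
import random

def tail_averaged_sgd(grad, w0, steps, eta0=0.1, alpha=0.5, tail_frac=0.5):
    """SGD with polynomial decay eta_t = eta0 * (t + 1)**(-alpha),
    returning the average of the last tail_frac fraction of iterates."""
    w = w0
    tail_start = int(steps * (1.0 - tail_frac))
    avg_sum, avg_n = 0.0, 0
    for t in range(steps):
        w -= eta0 * (t + 1) ** (-alpha) * grad(w)
        if t >= tail_start:
            avg_sum += w
            avg_n += 1
    return avg_sum / avg_n

# Stochastic gradients of f(w) = (w - 1)^2 / 2 with additive noise:
random.seed(0)
w_bar = tail_averaged_sgd(lambda w: (w - 1.0) + random.gauss(0.0, 0.1),
                          w0=5.0, steps=2000)
```

Note that no total step count enters the step-size rule itself; `steps` only determines when the loop stops, so the same run can be extended at any time.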
(b) Greedy Loss-Adaptive Scheduling (GreedyLR):
Dynamically multiplies or divides the learning rate by a constant factor according to whether the loss decreased or increased relative to the previous step:
- Requires only per-step loss comparison and one multiplication or division.
- Provably convergent on smooth convex objectives, with rates that are order-optimal under the smoothness assumptions stated in the paper (Subramanian et al., 16 Dec 2025).
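The rule fits in a few lines. The factor, bounds, and pairing direction (grow on improvement, shrink on regression) below are a plausible reading of the description above, not the paper's exact implementation:

```python
def greedy_lr_step(lr, prev_loss, curr_loss, factor=1.1,
                   lr_min=1e-6, lr_max=1.0):
    """One GreedyLR-style update: grow lr while the loss improves,
    shrink it when the loss gets worse."""
    if curr_loss < prev_loss:
        return min(lr * factor, lr_max)
    return max(lr / factor, lr_min)

# Demo on f(w) = w^2 / 2 with exact gradients:
w, lr, prev = 5.0, 0.01, float("inf")
for _ in range(200):
    loss = 0.5 * w * w
    lr = greedy_lr_step(lr, prev, loss)
    prev = loss
    w -= lr * w
```

The per-step cost is one comparison and one multiplication or division, matching the bullet above.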
(c) Weight-Norm–Triggered Decay (ABEL):
Monitors the evolution of the total squared weight-norm and executes learning rate decay precisely when a minimum (“bounce”) is detected:
- Fully data-driven; never requires knowledge of $T$.
- In regimes where the bounce is present (e.g., vision models trained with $L_2$ regularization), matches or exceeds finely tuned schedules; otherwise defaults to a single decay at the end of training (Lewkowycz, 2021).
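A schematic version of the bounce trigger; the three-point local-minimum test and the decay factor are illustrative simplifications of ABEL's actual rule:

```python
def abel_trigger(lr, norm_history, decay=0.1):
    """Decay the learning rate when the squared weight norm 'bounces',
    i.e. the previous recorded value is a local minimum."""
    if len(norm_history) >= 3:
        a, b, c = norm_history[-3:]
        if b < a and b < c:  # fell, then rose: a bounce
            return lr * decay
    return lr

# Synthetic weight-norm trace that dips and recovers once:
lr, history = 0.1, []
for norm in [5.0, 4.0, 3.0, 2.0, 3.0, 4.0]:
    history.append(norm)
    lr = abel_trigger(lr, history)
```

A production version would also guard against re-triggering on the same bounce and fall back to a single end-of-training decay when no bounce appears, as the paper describes.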
(d) Meta-Learned and Online Hypergradient Schedules:
Meta-learned parametric maps or ODE models predict $\eta_t$ based on the current loss trajectory and internal recurrent state:
- MLR-SNet uses a single-layer LSTM mapping recent losses and its hidden state to $\eta_t$ (Shu et al., 2020).
- Latent ODE schedulers encode recent metrics into latent space, solve a neural ODE, and decode both immediate and long-range schedule recommendations; such models are entirely agnostic to horizon and optimize for long-term generalization rather than pre-specified endpoints (Sampson et al., 27 Sep 2025).
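MLR-SNet itself is an LSTM with meta-learned weights; as a schematic of the general pattern (a small stateful map from recent observables to $\eta_t$), a hand-set stand-in might look like the following. Everything here is illustrative, not the learned model:

```python
class StatefulScheduler:
    """Toy stand-in for a meta-learned scheduler: keeps an internal
    state (a smoothed loss) and maps it to a learning rate. In
    MLR-SNet this map is an LSTM with meta-learned parameters."""

    def __init__(self, eta_base=0.1, beta=0.9):
        self.eta_base = eta_base
        self.beta = beta
        self.smooth = None  # internal recurrent state

    def step(self, loss):
        """Observe one loss value and emit the next learning rate."""
        self.smooth = loss if self.smooth is None else (
            self.beta * self.smooth + (1.0 - self.beta) * loss)
        # Squash the state into (0, eta_base): high loss -> high lr.
        return self.eta_base * self.smooth / (1.0 + self.smooth)

sched = StatefulScheduler()
lrs = [sched.step(loss) for loss in [4.0, 3.0, 2.0, 1.0, 0.5]]
```

The point of the sketch is the interface: the scheduler consumes only observables available at step $t$, so the horizon never appears.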
(e) Hyperbolic and Epoch-Insensitive Schedules:
HyperbolicLR and ExpHyperbolicLR use hyperbolic curve parametrizations to produce schedules whose early- and mid-phase decay slopes are asymptotically independent of the epoch budget $T$ (Kim, 2024):
- Fitting the {init, min, slope} parameters on a small budget and then scaling $T$ up gives consistent performance without retuning, which is crucial for deployments with unknown or frequently changing horizons.
3. Theoretical Properties and Limitations
The minimax convergence of horizon-free schedules is contingent on how performance is measured:
- Tail-Averaged Guarantees:
For overparameterized linear models, polynomial decay with tail or EMA averaging achieves horizon-free minimax excess-risk rates when the source exponent dominates the capacity exponent (Meterez et al., 3 Feb 2026). The optimal decay exponent $\alpha$ is determined by these regularity parameters and admits a simple closed-form schedule.
- Final-Iterate Barriers:
For the canonical least-squares problem, it is shown that, without explicit horizon-dependent tuning, no horizon-free polynomial decay achieves the minimax final-iterate rate for all horizons $T$: at infinitely many values of $T$ the excess risk is sub-optimal by a condition-number-dependent factor (in the strongly convex case). Only geometric (step-decay) schedules with tuned steps per epoch close the gap, and these require knowledge of $T$ (Ge et al., 2019).
- Adaptive and Meta-Learned Schedules:
When schedules are learned (via bilevel optimization or meta-learning) without any prior on $T$, uniform generalization bounds can be established for the learned schedule vector across tasks, with explicit sample-complexity bounds for piecewise-polynomial and Pfaffian objectives. This analysis holds even in non-convex, non-smooth settings and does not require $T$ to enter the optimization as an argument (Sharma, 4 Dec 2025).
- Loss-Adaptive and State-Based Approaches:
GreedyLR and ABEL operate independently of the horizon and can adapt instantly to unexpected schedule extensions, restarts, or early stopping, properties not available to any $T$-parametric scheduler (Subramanian et al., 16 Dec 2025, Lewkowycz, 2021).
4. Empirical Performance and Applications
Empirical studies spanning language modeling, computer vision, time series, operator learning, and reinforcement learning indicate the following trends:
- Anytime Polynomial/EWA Schedules:
On LLM pretraining (OLMo 150M/300M, C4), constant step size plus EMA and polynomial decay plus EMA track the best-tuned cosine-decay envelope closely in validation loss across 1×–32× Chinchilla scale, and outperform it at very large batch sizes (Meterez et al., 3 Feb 2026).
- GreedyLR:
Surpasses or matches classic cosine and exponential decay on NLP, vision, and LLM tasks (up to 7B params), is robust to noise, and recovers from loss spikes more quickly (Subramanian et al., 16 Dec 2025).
- ABEL:
In weight-norm bouncing regimes, matches or slightly outperforms step-wise and cosine schedules on ImageNet, CIFAR-10, and large-scale NLP. In "non-bounce" regimes, reduces to a single well-timed decay at the end of training (Lewkowycz, 2021).
- Meta-Learned/Learned Schedules:
MLR-SNet demonstrates strong transfer to unseen horizons, architectures, and data domains with test performance comparable or superior to the best tuned static schedules, including in robustness to corrupted data scenarios (Shu et al., 2020). Latent ODE schedulers consistently improve over baseline and hypergradient approaches in test accuracy and find flatter minima on diverse datasets (Sampson et al., 27 Sep 2025).
- Hyperbolic/Epoch-Insensitive Schedules:
HyperbolicLR and ExpHyperbolicLR deliver near-constant initial decay and stable performance as the epoch budget increases fourfold, reducing the need for retuning and maintaining curve shape across deployments (Kim, 2024).
5. Relationship to Horizon-Aware and Averaging Strategies
A major technical insight is the centrality of iterate averaging for achieving minimax rates in the absence of a horizon.
- Polyak–Juditsky averaging makes polynomial decays anytime-minimax for overparameterized or interpolating models (Meterez et al., 3 Feb 2026, Bordelon et al., 4 Feb 2026).
- Final-iterate schemes cannot, in general, replicate this property: horizon-freeness under final-iterate metrics requires step decay with epoch boundaries tuned to $T$, which is no longer anytime (Ge et al., 2019).
- Hypergradients, bilevel, and ODE-based methods can adapt on-the-fly but often trade off interpretability or computational simplicity for flexibility and transfer.
- MLR-SNet and related meta-learners generalize across both horizon and task, parameterizing the schedule as a small recurrent map of local observables, and can be meta-trained to be plug-and-play out-of-the-box for novel deployments (Shu et al., 2020).
6. Practical Recommendations and Deployment Guidelines
For open-ended, continual, or plug-and-play machine learning deployments, the literature recommends:
- Avoid horizon-tuned cosine, polynomial, or geometric decays when $T$ is unknown or variable.
- For standard SGD: use polynomial decay $\eta_t = \eta_0\, t^{-\alpha}$ or a constant step size (lightly grid-tuned), always combined with EMA or tail averaging whose effective half-life grows with the elapsed step count (Meterez et al., 3 Feb 2026).
- For loss- or state-adaptive approaches: GreedyLR (loss-change trigger), ABEL (weight-norm bounce trigger), or MLR-SNet (meta-learned) provide robust defaults.
- For tasks requiring rapid scaling of epochs or unpredictable computation budgets: HyperbolicLR/ExpHyperbolicLR, tuned on a small epoch budget and transferred without retuning (Kim, 2024).
- The overhead of the additional statistics (EMA, loss smoothing, internal RNN state) is constant per step and negligible relative to compute- and memory-bound workloads.
- Horizon-free schemes reduce the need for checkpointing, retuning, and early stopping heuristics, and integrate cleanly with contemporary ML pipeline infrastructure.
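The EMA recommendation above costs one extra parameter copy and a constant amount of work per step. A minimal, framework-agnostic helper (names and constants illustrative):

```python
class WeightEMA:
    """Exponential moving average of model parameters, usable at any
    step as the 'averaged' model without a pre-set horizon."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = list(params)  # one extra copy of the weights

    def update(self, params):
        """Blend the current weights into the shadow copy."""
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * p
                       for s, p in zip(self.shadow, params)]
        return self.shadow

# Toy usage: the shadow weights track whatever the optimizer produces.
ema = WeightEMA([0.0, 0.0], decay=0.9)
for step in range(100):
    ema.update([1.0, 2.0])  # stand-in for the optimizer's current weights
```

Letting the decay approach 1 as training proceeds makes the averaging window scale with the run length, which is what keeps the scheme anytime.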
7. Limitations, Open Problems, and Frontiers
- Final-iterate minimaxity is unattainable in full generality for horizon-free scalar schedules without averaging: the resulting suboptimality gap cannot be eliminated in strongly convex streaming regression (Ge et al., 2019).
- Extensions to other optimizers, e.g. AdamW or momentum SGD, are empirically validated, but a precise universal theoretical characterization beyond SGD remains partially open (Meterez et al., 3 Feb 2026).
- Meta-learned and state-based approaches depend on training data diversity: while robust in benchmark evaluations, their horizon-agnostic property is ultimately empirical in generalization to new domains (Sampson et al., 27 Sep 2025, Shu et al., 2020).
- Bilevel and ODE-based algorithms have higher per-iteration cost, though recent advances suggest these are manageable in practice (Sampson et al., 27 Sep 2025).
- Theoretical guarantees for online hypergradient-based schedulers (e.g., MARTHE) exist for stability under natural conditions but lack formal rate-optimality proofs (Donini et al., 2019).
A plausible implication is that the future of horizon-free scheduling will increasingly blend lightweight, minimax polynomial/EMA protocols for core optimization with task-adaptive or meta-learned controllers for long-horizon, heterogeneous, or nonstationary training regimes. Emerging open questions include:
- Can one close the minimax final-iterate gap for horizon-free schemes without averaging, possibly via new algorithmic primitives (e.g., selective checkpointing, hybrid geometric–state triggers)?
- How robust are ODE/meta-learned schedulers to adversarial or regime-shifting data, and what are the lower bounds on their sample complexity to attain generalization across both model and horizon?
References:
- "Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging" (Meterez et al., 3 Feb 2026)
- "Dynamic Learning Rate Scheduling based on Loss Changes Leads to Faster Convergence" (Subramanian et al., 16 Dec 2025)
- "How to decay your learning rate" (ABEL) (Lewkowycz, 2021)
- "MARTHE: Scheduling the Learning Rate Via Online Hypergradients" (Donini et al., 2019)
- "HyperbolicLR: Epoch insensitive learning rate scheduler" (Kim, 2024)
- "Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model" (Bordelon et al., 4 Feb 2026)
- "Dynamics of Learning: Generative Schedules from Latent ODEs" (Sampson et al., 27 Sep 2025)
- "Gradient Descent with Provably Tuned Learning-rate Schedules" (Sharma, 4 Dec 2025)
- "MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks" (Shu et al., 2020)
- "The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares" (Ge et al., 2019)