ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
Abstract: Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training LLMs which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
Explain it Like I'm 14
ScheduleFree+: Training Big LLMs without Learning-Rate Schedules — A Simple Guide
What’s this paper about? (Overview)
This paper introduces ScheduleFree+, a new way to train LLMs without using the usual “learning-rate schedules.” Instead of slowly turning the training “speed knob” up and down over time, the method relies on smart averaging and automatic step-size rules so training stays smooth, stable, and easy to control. The goal is to make training more predictable and faster to reach good performance—especially for long trainings.
What questions are the researchers trying to answer? (Objectives)
The paper asks:
- Can we train big LLMs well without carefully tuned learning-rate schedules?
- How do we make schedule-free training stable for very large batches of data and very long training runs?
- Can we design a method that needs little to no learning-rate tuning and still beats strong baselines?
- Can we make progress predictable so we can plan training time better?
How does the method work? (Methods, explained simply)
Think of training a model like learning to ride a bike on a bumpy road:
- Traditional training uses a “speed schedule” (the learning rate) that starts high, then slowly lowers—like carefully changing your biking speed over time.
- ScheduleFree+ says: instead of constantly adjusting speed, ride at a steady pace but keep a smooth “average of your recent paths” to guide where to go next. That average helps avoid zig-zagging.
Here are the main ingredients, translated into everyday language:
- Averaging your progress (Schedule-Free Learning): The method keeps two versions of the model, a “latest step” model (fast, noisy, reacts quickly) and a “running average” model (slow, stable, smooth). It learns using a mix of both, so training stays steady and predictable.
- Inner momentum: Like giving your motion a gentle push that smooths bumps, momentum helps especially when you process large batches of data at once.
- Automatic step-size (Polyak step size): Instead of picking a learning rate by hand, the method estimates how big each step should be from simple signals like the current loss and the size of the gradients (think: how steep the hill is). It uses smoothed measurements so it doesn’t get fooled by noise.
- Weight decay that plays nicely with adaptivity (AdamC): Weight decay is a gentle pull that keeps model weights from growing too large. The paper adjusts it so it still works well when the step-size changes automatically.
- Warm starts and warmups: Early in training, models change a lot. ScheduleFree+ waits a bit before averaging kicks in and slowly ramps some settings (“warmup”), so the start is stable and fast.
- Beta annealing (tuning the averaging over time): Early on, the method leans more on fresh steps; later, it leans more on the stable average. This balances fast early progress with strong long-term results.
- Smart weighting of checkpoints (r = 1): When averaging over the training history, it gives a bit more weight to later checkpoints, which helps on long runs.
In short: ScheduleFree+ replaces complex learning-rate schedules with a smarter way to mix “fresh” and “stable” model versions, adds momentum for robustness, uses automatic step sizes, and tweaks weight decay so all of this plays well together.
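The interplay of the two model copies can be sketched in a few lines. The snippet below follows the basic Schedule-Free SGD update from the original Schedule-Free Learning work (gradient taken at a mix y of the latest iterate z and the running average x, with uniform averaging). It is not the full ScheduleFree+ method: inner momentum, Polyak steps, AdamC, warmup, and beta annealing are all left out. The toy quadratic objective, the function name, and the constants are illustrative assumptions.

```python
def schedule_free_sgd(grad, z0, lr=1.0, beta=0.9, steps=200):
    """Basic Schedule-Free SGD on a parameter vector stored as a list of floats."""
    z = list(z0)  # "latest step" iterate (fast, reacts quickly)
    x = list(z0)  # running average (slow, stable); this is the model you evaluate
    for t in range(1, steps + 1):
        # The gradient is computed at a mix of the average and the latest iterate.
        y = [(1 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        # Plain SGD step on z with a constant learning rate (no schedule).
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Uniform running average of the z sequence.
        c = 1.0 / (t + 1)
        x = [(1 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x, z


# Toy objective (assumption): f(w) = 0.5 * sum((w - target)^2)
target = [3.0, -2.0]
grad = lambda w: [wi - ti for wi, ti in zip(w, target)]

x, z = schedule_free_sgd(grad, [0.0, 0.0])
print("averaged model x:", x)
print("latest iterate z:", z)
```

Note that the learning rate stays constant throughout; the averaging, rather than a decay phase, is what settles the returned model x.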
What did they test and what did they find? (Main results)
The authors trained LLMs of different sizes (from ~120 million to ~1 billion parameters and beyond) across different training lengths. They compared ScheduleFree+ to popular baselines like:
- Linear Decay schedules (a strong, well-tuned classic)
- WSD (Warmup–Stable–Decay), which warms up, stays flat for most of the run, then decays at the end
Key findings:
- Better for long training: For very long runs (about 1000 “tokens per parameter”—a modern, large budget), ScheduleFree+ reached the same quality 31% faster than a strong schedule. That’s a big saving in time and compute.
- Scales to large batches: By bringing back inner momentum, the method stays stable even when using very large batches (lots of data per step), which is important for fast training on big clusters.
- Learning-rate-free in practice: Using the Polyak step size and inverse-gradient norms, the method picks step sizes automatically and works across scales without hand-tuning.
- Stable weights and gradients: With the AdamC-style weight decay and step-size rules, ScheduleFree+ avoids “drift” where gradients or weights get too big or too small over time. This stability helps long runs and makes models easier to handle later.
- Predictable progress: After a short “burn-in” at the start, the validation loss follows a very smooth, predictable curve (it drops roughly like 1/sqrt(time)). This predictability lets you forecast how much longer training will need to reach a target quality.
- Shorter runs: On shorter trainings (like 20–100 tokens per parameter), results are mixed but still strong. ScheduleFree+ often beats WSD and is competitive with tuned linear decay; its biggest advantage shows up as runs get longer.
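To make the "tokens per parameter" budgets concrete, the short calculation below converts them into total training tokens for the approximate model sizes named above. The helper name is hypothetical and the sizes are rounded, so treat the numbers as orders of magnitude rather than the paper's exact token counts.

```python
def token_budget(num_params, tokens_per_param):
    """Total training tokens implied by a tokens-per-parameter budget."""
    return num_params * tokens_per_param


# Approximate model sizes and budgets mentioned in the text above.
for params in (120_000_000, 1_000_000_000):
    for tpp in (20, 100, 1000):
        tokens = token_budget(params, tpp)
        print(f"{params / 1e6:>6.0f}M params x {tpp:>4} tpp = {tokens / 1e9:>7.1f}B tokens")
```

At 1000 tokens per parameter, a roughly 1-billion-parameter model sees on the order of a trillion tokens, which is why this regime counts as long-duration training.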
Why does this matter? (Implications and impact)
- Less tuning, more reliability: Not having to painstakingly tune learning-rate schedules saves time and reduces the risk of failed runs. That’s huge for large, expensive LLM trainings.
- Faster to target quality in long runs: A 31% time reduction to hit the same loss means real cost and energy savings at scale.
- Scales to modern setups: The method works with large batch sizes and large models—what big labs use today.
- “Anytime” training: Because averaging keeps things smooth, you can stop the run at any time and still have a good, stable model—no need to wait for a special “decay phase.”
- Supports model merging/averaging: The theory behind Schedule-Free Learning provides a foundation for combining checkpoints or averaged models during pretraining, which can help build better models or recover from interruptions more easily.
- Predictable planning: If you can fit and forecast loss early, you can better plan how much compute you need, compare strategies sooner, and avoid wasted training.
In everyday terms: ScheduleFree+ is like switching from carefully twisting a speed dial the whole time to setting a steady pace with a smart autopilot that smooths your path. It gets you to your destination more reliably, often faster, and with less guesswork—especially on long journeys.
Knowledge Gaps
Below is a single, focused list of concrete knowledge gaps, limitations, and open questions that remain unresolved in the paper; each point highlights what is missing or uncertain and suggests where follow‑up work could act.
- Generalization beyond one data/architecture setting is untested: results are almost entirely on Llama‑style transformers trained on FineWeb‑EDU; performance on other domains (code, multilingual, math), datasets, and model families (MoE, normalization‑free architectures, convolutional/vision transformers, diffusion models) is unknown.
- Scaling to frontier model sizes is not demonstrated: aside from a few figures up to 2B and scaling ladders up to ~1B parameters, there is no evidence for ≥7B/13B/70B+ models or for multi‑trillion token runs common in current practice.
- Throughput and systems overhead are not quantified: maintaining the x‑buffer, extra global reductions (loss and global L1 norm), and Polyak EMAs may reduce tokens/sec; wall‑clock efficiency vs. baselines and compatibility with ZeRO, pipeline/tensor parallelism, and gradient checkpointing are not reported.
- Memory footprint trade‑offs are unreported: re‑introducing inner momentum and keeping an averaged copy x increases state; memory overhead at large scales and interaction with optimizer state sharding is not measured.
- Dependence on warm starts is underexplored: large‑batch experiments warm‑start from a 2B‑token checkpoint; cold‑start behavior at the same batch sizes and fairness of comparisons without warm‑starts are not evaluated.
- Baseline coverage is incomplete: comparisons are limited to AdamW with Linear Decay and WSD; no comparisons with strong adaptive or large‑batch methods (Adafactor, LAMB/LARS, Shampoo, D‑Adaptation, Prodigy, DoG/DoWG, Muon), or modern cosine variants with restarts.
- Short‑horizon underperformance is unresolved: ScheduleFree+ lags Linear Decay at 20 tokens‑per‑parameter for larger models; proposed fixes (weight projection, better initialization to steady‑state norms) are suggested but not implemented or evaluated.
- Downstream/task‑level evaluation is missing: results focus on validation loss; effects on standard LLM benchmarks (e.g., MMLU, HellaSwag, GSM8K, multilingual suites), calibration, and generation quality are unknown.
- Robustness across data non‑stationarity is untested: schedule‑free weighting and Polyak step rely on stable global gradients; behavior under curriculum, domain shifts, and mixed‑domain sampling is not studied.
- Stability across seeds and statistical significance are not reported: variance across runs and confidence intervals for loss improvements are absent.
- Very large batch regimes remain unclear: improvements are shown up to 8M tokens/batch; behavior with even larger effective batches and with gradient accumulation/pipeline bubble effects is unreported.
- Interaction with gradient clipping is not discussed: whether ScheduleFree+ requires, benefits from, or is hindered by gradient clipping (global or per‑layer) is not evaluated.
- Per‑layer vs. global adaptivity is unexplored: the method uses a single global L1‑norm; potential gains from per‑layer/per‑block inverse‑norm weighting or Polyak denominators (to handle heterogeneous layer scales) are not studied.
- Heavy‑tailed gradient effects are unexamined: the L1‑norm approximation relies on a normality assumption; robustness to heavy‑tailed or skewed gradient distributions common in LLMs is not analyzed.
- Statistical independence requirement is relaxed without guarantees: Schedule‑Free theory requires weights independent of current noise; using EMA‑based gradient norms breaks this assumption, but the impact on convergence in non‑convex, stochastic settings is not theoretically clarified.
- Non‑convex theory is missing: claims and intuitions rely on convex analyses; there are no convergence/stability guarantees for the proposed AdamC + Schedule‑Free + Polyak combination in non‑convex LLM training.
- AdamC with very large weight decay (γ ≈ 5–50) raises open questions: how to set γ automatically, sensitivity across architectures/datasets, and effects on generalization, layerwise norms, embeddings, and tied weights are unstudied.
- Interaction with normalization types is not probed: many arguments hinge on normalization layers; behavior with RMSNorm vs. LayerNorm, normalization‑free transformers, or alternative normalizers is not tested.
- Checkpoint merging/model averaging claims lack empirical validation: although a theoretical foundation is asserted, no experiments quantify benefits, merging schedules, or risks (e.g., catastrophic interference) during pretraining.
- Predictability (1/√t loss fit) is demonstrated on a few runs only: generality across datasets, scales, seeds, and training regimes—and its utility for budget allocation/model selection—is not systematically validated.
- Hyperparameter automation is incomplete: ScheduleFree+ still requires choices for warmup length, c‑warmup length, βinitial/βfinal, anneal schedule, r (averaging power), Polyak EMA βp, and weight decay γ; there is no sensitivity analysis or automatic adaptation strategy.
- Numerical/precision issues are not addressed: behavior under BF16/FP8 with dynamic loss scaling, susceptibility to NaNs/denominator blow‑ups, and numerical stability on very deep pipelines are not discussed.
- Impact on training instabilities and safety checks is unclear: how the method interacts with common safeguards (loss spikes, gradient anomalies, watchdog resets) in production training is unreported.
- Fairness of horizon tuning is ambiguous: Linear Decay is tuned for the target horizon while ScheduleFree+ is “anytime”; how results change under mismatched/unknown horizons or early stopping is not systematically explored.
- Compute‑efficiency vs. quality trade‑offs are not quantified: reported gains are in tokens to reach a loss; energy, cost, and wall‑clock reductions (or overheads) are not provided.
- Layerwise norm dynamics are only partially characterized: while global weight/gradient norms are analyzed, per‑layer norm evolution, especially in embeddings and output heads, and consequences for quantization and compression are not measured.
- Applicability to fine‑tuning and instruction/RLHF phases is unknown: all experiments are on pretraining; behavior during SFT, DPO/RLHF, and continual learning is untested.
- Integration with regularization/augmentation is untested: interactions with dropout, label smoothing, data augmentation, and token masking strategies are not evaluated.
- Sensitivity to β‑annealing schedule shape is not explored: the paper uses a log‑linear anneal from 0.8/0.9 to 0.965; alternative schedules and their robustness across horizons/datasets are not ablated.
- C‑refinement vs. inner momentum trade‑offs are not fully mapped: only a couple of C values are tested; combined use with momentum, dependence on batch size/model scale, and principled selection of C vs. r remain open.
- Failure modes and diagnostics are under‑specified: criteria to detect when ScheduleFree+ is mis‑configured (e.g., gradient norm drift not corrected, unstable Polyak steps) and recommended remedies are not provided.
Practical Applications
Immediate Applications
- Learning-rate-free LLM pretraining with ScheduleFree+ to cut tuning and time-to-target loss
- Deploy the provided AdamC + Schedule-Free + Polyak optimizer to replace cosine/WSD/linear schedules in long-horizon pretraining; empirical results show up to ~31% less training time to reach the same loss at ~1000 tokens-per-parameter.
- Sectors: software/AI, cloud, chips/HPC, energy.
- Tools/Workflows: plug-in optimizer module for PyTorch/DeepSpeed/FSDP; adopt the reference implementation; use “x” as the served/evaluated model while computing gradients at “y”.
- Assumptions/Dependencies: largest gains for long runs (≥100–1000 tpp); requires normalization layers; needs distributed all-reduce for global loss and L1-gradient norms; extra model buffer for x; modify the training loop to compute gradients at y and to evaluate and return x.
- Elimination of learning-rate grid searches in LLM pretraining
- Use the Polyak step-size (with the paper’s stable L1-gradient approximation) to set steps automatically across batch sizes and horizons, avoiding expensive LR sweeps.
- Sectors: software/AI, finance (budget control), energy (carbon/cost reduction).
- Tools/Workflows: bake Polyak into your optimizer wrapper; remove LR sweeps from AutoML pipelines.
- Assumptions/Dependencies: requires stable loss estimates; Polyak numerator uses f(y) with EMA smoothing; still use a brief warmup.
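For orientation, here is a minimal sketch of the textbook Polyak step size, which sets the step to the current suboptimality divided by the squared gradient norm. It assumes the optimal loss f* is known (0 for the toy problem) and uses the exact squared L2 norm. The paper's variant instead relies on EMA-smoothed loss estimates and an L1-gradient approximation, which are not reproduced here. All function names are hypothetical.

```python
def polyak_step_size(loss, grad, f_star=0.0, eps=1e-12):
    """Textbook Polyak step: (f(w) - f*) / ||g||_2^2, with eps to avoid division by zero."""
    sq_norm = sum(g * g for g in grad)
    return (loss - f_star) / (sq_norm + eps)


def f(w, target):
    # Toy convex loss (assumption): 0.5 * squared distance to target, so f* = 0.
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad_f(w, target):
    return [wi - ti for wi, ti in zip(w, target)]


target = [1.0, -4.0]
w = [0.0, 0.0]
for _ in range(30):
    g = grad_f(w, target)
    step = polyak_step_size(f(w, target), g)  # no hand-picked learning rate
    w = [wi - step * gi for wi, gi in zip(w, g)]

print("final parameters:", w)
```

On this quadratic the rule picks a step of about 0.5 at every iteration, halving the distance to the optimum without any learning-rate sweep.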
- Stable large-batch training via inner momentum
- Reintroduce “inner” momentum (e.g., β1≈0.75–0.9) to prevent scaling cliffs and enable larger global batches without divergence.
- Sectors: cloud/HPC, chip vendors.
- Tools/Workflows: turn momentum back on for Adam-like inner loop; validate batch scaling ladders up to your memory limits.
- Assumptions/Dependencies: momentum adds memory/compute; retune momentum slightly per setup.
- “Anytime” training, pause/resume, and continuous pretraining
- ScheduleFree+ models are usable at any point without anneal tails; this enables seamless pause/resume, checkpointing, and continuous data streaming regimes.
- Sectors: software/AI (foundation model teams), MLOps.
- Tools/Workflows: remove end-of-run annealing; integrate frequent checkpointing; adopt streaming data pipelines.
- Assumptions/Dependencies: adjust infra to always evaluate/serve the x-average model.
- Predictable compute-to-loss planning and early-stop decisions
- Fit the observed loss to a c + a/√(t + b) form after burn-in to forecast final loss and tokens needed; use to terminate underperforming runs earlier and budget compute/carbon.
- Sectors: finance (FP&A), energy/sustainability, cloud capacity planning, academia.
- Tools/Workflows: add an early-curve-fit stage (e.g., 5–15% of run) to gate further spend; dashboards that show predicted loss vs. budget.
- Assumptions/Dependencies: works best after warmup; assumes stationary data distribution and stable gradient norms.
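The curve-fitting step described above could look roughly like the following sketch. For a fixed offset b, the model c + a/√(t + b) is linear in c and a, so the code grid-searches b and solves for c and a in closed form. The synthetic noise-free data, the grid of b values, and the function names are assumptions for illustration; real loss curves are noisy and would need a finer grid or a proper nonlinear fit.

```python
import math


def fit_inv_sqrt(ts, losses, b_grid):
    """Fit loss ~ c + a / sqrt(t + b): grid search over b, least squares for (c, a)."""
    best = None
    n = len(ts)
    for b in b_grid:
        u = [1.0 / math.sqrt(t + b) for t in ts]
        mean_u, mean_l = sum(u) / n, sum(losses) / n
        var = sum((ui - mean_u) ** 2 for ui in u)
        cov = sum((ui - mean_u) * (li - mean_l) for ui, li in zip(u, losses))
        a = cov / var
        c = mean_l - a * mean_u
        sse = sum((c + a * ui - li) ** 2 for ui, li in zip(u, losses))
        if best is None or sse < best[0]:
            best = (sse, c, a, b)
    _, c, a, b = best
    return c, a, b


def predict(t, c, a, b):
    return c + a / math.sqrt(t + b)


# Synthetic curve (assumption) observed only during the first 15% of a run.
true_c, true_a, true_b = 2.5, 40.0, 500.0
total_steps = 100_000
ts = list(range(1_000, 15_001, 500))
losses = [predict(t, true_c, true_a, true_b) for t in ts]

c, a, b = fit_inv_sqrt(ts, losses, b_grid=[0.0, 250.0, 500.0, 1000.0, 2000.0])
forecast = predict(total_steps, c, a, b)
print(f"fitted c={c:.3f}, a={a:.3f}, b={b:.0f}; forecast final loss={forecast:.4f}")
```

The fitted constant c is the predicted loss floor, and the forecast at the final step is the number a gating policy would compare against the target loss.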
- Robust weight-decay with adaptive LR using AdamC
- Replace standard AdamW WD with decoupled, LR-corrected AdamC to prevent LR–WD feedback loops and stabilize weight/gradient norms (improves quantization readiness).
- Sectors: software/AI, edge inference (quantization), robotics (resource-constrained models).
- Tools/Workflows: switch to AdamC with larger WD values (often 5–50); monitor weight and gradient norms.
- Assumptions/Dependencies: rescale WD hyperparameter; verify interactions with other regularizers.
- Model averaging and checkpoint merging during pretraining
- Leverage the theoretical grounding of averaging: average x across time or runs to create more robust pretraining checkpoints (“checkpoint soups”).
- Sectors: software/AI, federated/collaborative R&D, academia.
- Tools/Workflows: periodic average of x-trajectory; weighted averaging with r=1 for long runs.
- Assumptions/Dependencies: consistent tokenizer/architectures; licensing/IP alignment if merging across orgs.
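As a rough illustration of weighted checkpoint averaging, the sketch below keeps an online average in which checkpoint t receives weight t^r, so r = 0 gives a uniform average and r = 1 favors later checkpoints. This is the straightforward polynomial weighting, stated here as an assumption; the paper mentions using a different, approximate implementation of r-weighting. Checkpoints are plain lists of floats and the function name is hypothetical.

```python
def r_weighted_average(checkpoints, r=1.0):
    """Online average where checkpoint t (1-indexed) gets weight t**r."""
    avg = None
    weight_sum = 0.0
    for t, z in enumerate(checkpoints, start=1):
        w = float(t) ** r
        weight_sum += w
        c = w / weight_sum  # mixing coefficient for the newest checkpoint
        if avg is None:
            avg = list(z)
        else:
            avg = [(1 - c) * ai + c * zi for ai, zi in zip(avg, z)]
    return avg


# Three toy "checkpoints" with a single parameter each (assumption).
checkpoints = [[0.0], [3.0], [6.0]]
uniform = r_weighted_average(checkpoints, r=0.0)  # weights 1, 1, 1
linear = r_weighted_average(checkpoints, r=1.0)   # weights 1, 2, 3
print("uniform:", uniform, "linear:", linear)
```

Here the uniform average lands at about 3.0 while the r = 1 average lands at about 4.0, closer to the most recent checkpoint. Only one extra buffer is needed because the average is updated in place.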
- Faster early convergence via warm-start and c-warmup
- Use warm-started checkpoints and delay the introduction of averaging (c-warmup ≈ 2× LR warmup steps) to accelerate early loss drops.
- Sectors: software/AI, academia.
- Tools/Workflows: standardize c-warmup; maintain a “steady-norm” warm-start checkpoint library.
- Assumptions/Dependencies: requires available warm-start checkpoints; benefit diminishes on very long runs.
- Beta annealing and r=1 weighted averaging for long runs
- Anneal β from ~0.8–0.9 to ~0.965 and use r=1 weighting to improve both early and late-stage performance in long-horizon training.
- Sectors: software/AI.
- Tools/Workflows: implement log-linear β schedule; switch averaging weights to r=1 for runs >~30B tokens.
- Assumptions/Dependencies: less useful for short runs (≤20 tpp).
- Monitoring/telemetry for gradient L1 norms and weight norms
- Add global L1-gradient norm and weight-norm dashboards and alerts to detect drift, confirm steady states, and validate effective LR behavior.
- Sectors: MLOps across industries (healthcare/finance/education models).
- Tools/Workflows: distributed reduction for L1 norms; EMA smoothing; integrate into observability stack.
- Assumptions/Dependencies: minimal comms overhead; careful treatment of outliers.
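A minimal telemetry sketch for this kind of monitoring is shown below: it computes a gradient L1 norm and smooths it with a bias-corrected exponential moving average. The class name, decay value, and toy gradients are assumptions. In a distributed run the per-shard L1 norms would first be summed across workers before being fed to the tracker.

```python
class NormTracker:
    """EMA of a scalar statistic with Adam-style bias correction."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.raw = 0.0
        self.steps = 0

    def update(self, value):
        self.steps += 1
        self.raw = self.decay * self.raw + (1 - self.decay) * value
        # Bias correction keeps early estimates from being pulled toward zero.
        return self.raw / (1 - self.decay ** self.steps)


def l1_norm(grad):
    """Sum of absolute values of the gradient entries."""
    return sum(abs(g) for g in grad)


tracker = NormTracker(decay=0.9)
toy_grads = ([1.0, -1.0], [2.0, -2.0], [1.0, -3.0])
smoothed = [tracker.update(l1_norm(g)) for g in toy_grads]
print("smoothed L1 norms:", smoothed)
```

A dashboard could plot the smoothed value over time and alert when it drifts persistently up or down instead of settling to a steady level.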
- Education and curriculum use in optimization courses
- Teach schedule-free averaging, Polyak step, and the 1/√t predictability as a modern bridge from convex theory to deep learning practice.
- Sectors: education, academia.
- Tools/Workflows: reference implementation and ablation scripts for labs.
- Assumptions/Dependencies: access to modest compute for demonstrations.
Long-Term Applications
- Default optimizer for frontier-scale (10B+ parameters) and multimodal training
- Mature ScheduleFree+ into the standard for very large language/multimodal models, replacing hand-tuned schedules.
- Sectors: software/AI, media, robotics, healthcare (domain-specific LLMs).
- Tools/Workflows: integrate into large-scale frameworks (Megatron, Alpa, DeepSpeed-MoE); extend to speech/vision/diffusion/RL.
- Assumptions/Dependencies: further validation on ultra-large models and diverse tasks.
- Hardware/firmware co-design for schedule-free optimization
- Fuse y-point evaluation, x-averaging, Polyak statistics, and L1-norm reductions into GPU/TPU/ASIC kernels to reduce overhead and improve throughput.
- Sectors: semiconductors, cloud providers.
- Tools/Workflows: NCCL/collective optimizations; compiler-level fusion; on-device EMA units.
- Assumptions/Dependencies: vendor adoption; standardization of optimizer APIs.
- Autonomous training controllers using predictable loss curves
- Closed-loop systems that auto-allocate compute, energy, and budget based on forecasted loss trajectories; dynamic cluster scheduling and power capping.
- Sectors: cloud/HPC, energy-aware data centers, finance (cost governance).
- Tools/Workflows: controllers that monitor c + a/√(t+b) fit and adjust job priority, checkpoints, and batch sizes.
- Assumptions/Dependencies: reliable early-curve forecasts; integration with schedulers (K8s/Slurm).
- Continuous pretraining and streaming data products
- Always-on model refresh pipelines where schedule-free “anytime” snapshots enable frequent, safe deployments and evaluation without anneal delays.
- Sectors: software/AI platforms, consumer apps, enterprise AI.
- Tools/Workflows: rolling checkpoints, canary deployment of x; streaming data validation tied to predictable improvement curves.
- Assumptions/Dependencies: robust data governance; evaluation gating.
- Federated and cross-organization pretraining via averaging
- Use averaging-friendly theory to merge pretraining progress from multiple parties (or silos) without sharing raw data.
- Sectors: healthcare, finance, public sector, cross-lab collaborations.
- Tools/Workflows: secure checkpoint exchange; weighted x-averaging; privacy-preserving auditing.
- Assumptions/Dependencies: alignment on architectures/tokenizers; legal/IP frameworks; robustness to data heterogeneity.
- Quantization- and compression-aware training with stable norms
- Co-train with ScheduleFree+ to keep weight/gradient norms steady, simplifying post-training quantization and pruning, enabling better on-device models.
- Sectors: edge AI, robotics, mobile.
- Tools/Workflows: integrate with QAT/pruning toolchains; tighter calibration windows enabled by stable norms.
- Assumptions/Dependencies: more experiments across compression regimes; task-dependent tuning.
- Domain extension beyond LLMs with minimal tuning
- Apply LR-free, schedule-free training to vision, speech, time-series, and reinforcement learning to reduce tuning burden in sectors with scarce expertise.
- Sectors: healthcare imaging, autonomous systems/robotics, finance (sequence models), education (edtech models).
- Tools/Workflows: task adapters for Polyak numerator estimation; domain-specific warmups and β schedules.
- Assumptions/Dependencies: verify behavior without normalization layers or with different normalizers.
- Compute-to-loss SLAs and procurement standards
- Create contracts and internal policies that peg spend to forecastable loss improvements, improving transparency and sustainability reporting.
- Sectors: policy/corporate governance, finance, energy/sustainability.
- Tools/Workflows: standardized loss-forecast metrics in RFPs; carbon-per-loss-point metrics.
- Assumptions/Dependencies: industry acceptance of predictability metrics; standardized benchmarks.
- Power- and grid-aware training profiles
- Exploit stable step-size behavior to schedule training around renewable availability and demand response, reducing carbon intensity without destabilizing learning.
- Sectors: energy, cloud/HPC.
- Tools/Workflows: integrate optimizer telemetry with energy orchestration; flexible batch sizing and pause/resume windows.
- Assumptions/Dependencies: grid integration; predictable behavior under variable power caps.
Cross-cutting Assumptions and Caveats
- Advantages are largest for long training horizons; on very short runs (e.g., ~20 tokens-per-parameter) classical schedules can be competitive for larger models.
- Requires accurate, low-noise global statistics (loss and L1-gradient norms) with EMA smoothing; distributed aggregation overhead must be managed.
- Works best with models using normalization layers; behavior may differ otherwise and require adaptation.
- AdamC weight decay needs retuning (often larger values); interaction with other regularization (e.g., dropout) should be revalidated.
- Extra memory for x and momentum buffers; slight API changes (compute gradients at y, evaluate and serve x).
- Predictability assumes relatively stationary data; data shifts may reduce forecasting accuracy.
Glossary
- AdamC: An Adam variant with modified weight decay to stabilize gradient norms. "AdamC, a small modification to AdamW designed to produce flatter gradient norm sequences (Defazio, 2025), has nearly constant gradient norm."
- AdamW: An optimizer that decouples weight decay from the gradient-based update. "Schedule-Free AdamW with inner momentum (here β1 = 0.75) performs better at small batch-sizes"
- annealing (learning rate anneal): Gradually reducing the learning rate, often near the end of training. "Training can be early stopped by introducing a linear learning rate anneal to 0"
- anytime training: A training regime that can be stopped at any time and yields a good model. "Schedule-Free Learning has shown promise as a practical anytime training method for machine learning"
- average iterate: The running average of parameter vectors used for evaluation in Schedule-Free methods. "introducing an average iterate buffer x_t"
- C-Refinement: A modification to averaging weights for Schedule-Free to improve large-batch stability. "Using C-Refinement does enable training at larger batch sizes"
- checkpoint merging: Combining multiple saved model states (checkpoints) to produce a better model. "provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining."
- Chinchilla: A scaling-law guideline relating model size and data tokens for optimal training efficiency. "Chinchilla recommended 20 tokens per parameter"
- cosine schedule: A learning-rate schedule that follows a cosine decay pattern. "it significantly outperforms Cosine and WSD schedules at medium batch-sizes."
- D-Adaptation: An optimizer family that adapts step sizes without decreasing them, avoiding cyclic instabilities. "such as D-Adaptation (Defazio and Mishchenko, 2023)"
- decoupled weight decay: Applying weight decay as a separate operation from the gradient update. "Using fully-decoupled AdamC is necessary when using the Polyak step size."
- DoG: An adaptive learning-rate method designed to avoid decreasing rate cycles. "DoG (Ivgi et al., 2023)"
- DoWG: A related adaptive method targeting stable, non-decreasing learning-rate sequences. "DoWG (Khaled et al., 2023)"
- effective learning rate: The actual per-layer step size after considering normalization and parameter norms. "and so the effective learning rate is proportional to the gradient norm squared:"
- EMA (exponential moving average): A time-weighted smoothing of a quantity (e.g., gradient norms). "An exponential-moving average (EMA) of the gradient norm"
- G-Lipschitz: A property of functions whose gradients are bounded by a constant G, relevant to convergence rates. "non-smooth G-Lipschitz optimization"
- gradient norm drift: The tendency of gradient magnitudes to systematically increase or decrease during training. "compensating for gradient norm drift"
- hyper-sphere: A high-dimensional sphere on which weight norms can be constrained by decay dynamics. "weight norms to lie on a hyper-sphere"
- inner momentum: The momentum applied within the base optimizer (e.g., AdamW) updates. "reintroducing inner momentum into Schedule-Free Learning"
- inverse-gradient norm weighting: Scaling the step size inversely with gradient magnitude to stabilize progress. "Inverse-Gradient Norm Weighting is Highly Beneficial"
- inverse-square-root (1/sqrt) decay: A convergence pattern where error decreases proportional to 1/√t. "Schedule-Free loss curves follow a predictable 1/sqrt decay"
- iterate averaging: Averaging model parameters over iterations to improve stability and predictability. "applied iterate averaging from the beginning of optimization."
- last-iterate convergence: Guarantees about the final iterate’s performance rather than averages. "provides a (anytime) last-iterate convergence guarantee"
- learning-rate-free optimizer: An optimizer that sets step sizes automatically without manual learning-rate tuning. "gives a practical learning-rate-free optimizer for LLM training."
- Linear Decay schedule: A learning-rate schedule that linearly decreases the step size over time. "The Linear Decay schedule arises from optimizing the right-hand-side of Equation 19"
- L1 norm: The sum of absolute values of a vector’s components; used here to scale step sizes. "scaled inversely proportional to the L1 norm of the gradient:"
- log-linear interpolation: Interpolating parameters linearly in log-space over time. "we use a time-varying β given by a log-linear interpolation"
- model averaging: Combining parameters from multiple training steps or checkpoints to produce a better model. "provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining."
- normalization layers: Layers (e.g., LayerNorm) that standardize activations and alter effective step sizes. "networks trained with normalization layers"
- outer momentum (beta): The mixing factor between the current iterate and its average in Schedule-Free methods. "Using increasing outer-momentum in Schedule-Free gives large improvements for long training runs."
- Polyak denominator: The squared gradient term in the Polyak step size; difficult to estimate stably in practice. "Direct estimation of the Polyak denominator is too noisy to be used in practice."
- Polyak step size: A formula that sets step size based on the current function suboptimality and gradient norm. "Polyak step sizes outperform a tuned grid search over L1-adjusted step-sizes"
- Prodigy: An adaptive learning-rate method that avoids decreasing learning-rate sequences. "such as Prodigy (Mishchenko and Defazio, 2023)"
- r-weighting: Weighting scheme for averages that raises time indices to a power r to emphasize later iterates. "a different, approximate implementation of r-weighting"
- river-valley view: A conceptualization of training trajectories as moving along valleys, with progress revealed by annealing. "motivated by a river-valley view of the loss landscape"
- scaling cliff: A sudden loss of stability or performance when increasing batch size. "hits a scaling "cliff" at 2M tokens per batch"
- scaling ladder: A methodical sequence of experiments across increasing model scales. "We ran a series of scaling ladder experiments"
- schedule-dependent learning rate bounds: Theoretical bounds that explain loss curves under specific schedules. "predicted by the use of schedule-dependent learning rate bounds"
- Schedule-Free Learning: An approach that replaces learning-rate schedules with aggressive iterate averaging. "Schedule-Free Learning modifies the learning process by introducing an average iterate buffer x_t"
- ScheduleFree+: The combination of techniques (e.g., Polyak step, momentum tweaks) proposed in the paper. "We call the combination of these approaches ScheduleFree+."
- square-root scaling law: Rule that optimal learning rate scales with the square root of batch size. "square-root scaling law for learning rate given batch-size is well-established"
- stochastic gradient: A gradient estimate computed from a random mini-batch of data. "computed from the stochastic gradient at z_t"
- tail-averaging: Averaging only the final portion of iterates to improve last-iterate performance. "or use some sort of tail-averaging."
- Taylor expansion: Using a first-order expansion of the loss function around a point to approximate values. "using the Taylor expansion of f around the point y_t"
- tokens per parameter (TPP): Training budget measured as tokens seen per model parameter. "1000 tokens per parameter"
- warm-starting: Initializing training from a previously trained checkpoint to accelerate convergence. "Warm-starting from the same 2B token checkpoint is used for all runs"
- warmup: An initial phase with gradually increasing learning rate to stabilize early training. "however a learning rate warmup is still needed for best performance."
- Warmup-Stable-Decay (WSD) schedule: A schedule with warmup, a constant plateau, and optional final anneal. "WSD schedules start with a standard learning-rate warmup"
- weight decay: A regularization technique that shrinks weights to control their norms. "the weight-decay hyper-parameter"
- weight projection: Constraining parameter norms during training by projecting back to a target norm. "Weight projection can be used, so that weight norms do not change during training."
- weight quantization: Representing weights with low-precision numbers, sensitive to large norm drift. "cause issues with weight quantization."
- weighted regret: A regret measure weighted by step sizes, used in convergence analysis of averages. "depends on the weighted regret of the base optimizer."