Linear Learning Rate Scaling

Updated 26 December 2025
  • Linear Learning Rate Scaling sets the learning rate as a linearly decaying function of training steps, a choice with worst-case optimality guarantees, and extends to principled transfer of the base rate across model widths.
  • The base schedule can be refined with gradient-norm adaptive scheduling, enabling task-adaptive improvements and stable performance across optimization scenarios.
  • Empirical evaluations demonstrate that linear decay matches or outperforms alternatives such as cosine or stepwise schedules in tasks ranging from logistic regression to large language models.

Linear learning rate scaling is an approach to learning rate schedule design that assigns the step size $\eta_t$ as a linear function of training time or model width, ensuring optimal convergence properties and transferability in both convex optimization and parameterized neural network training. It possesses rigorous worst-case optimality, supports task-adaptive refinements, and has empirically outperformed a variety of commonly used alternatives across logistic regression, deep learning, and LLM applications (Defazio et al., 2023), with theoretical results extending to infinite-width neural networks under the $\mu$P parametrization (Hayou, 3 Nov 2025).

1. Formal Derivation and Optimality in Convex Optimization

Consider stochastic convex optimization where the goal is to minimize $f: \mathbb{R}^d \to \mathbb{R}$ over iterates $x_1, \dots, x_T$ with updates $x_{t+1} = x_t - \eta_t g_t$ and bounded subgradient norms $\|g_t\|^2 \leq G^2$. Standard theory provides regret bounds for averaged iterates, while practitioners typically use the last iterate $x_T$. A tight reduction from regret to last-iterate performance establishes that for step size $\eta_t \propto (1 - t/T)$, the last-iterate excess loss matches the minimax rate:

$$\mathbb{E}[f(x_T) - f_*] \leq \frac{DG}{\sqrt{T}}$$

where $D = \|x_1 - u\|$ is the distance from the initial point to the comparator $u$ (e.g., a minimizer) and $f_*$ is the global minimum. Explicitly, the schedule is

$$\eta_t = \frac{D}{G\sqrt{T}} \left( 1 - \frac{t}{T} \right)$$

This result holds for any low-regret base optimizer (SGD, AdaGrad, AMSGrad, and empirically Adam) and is worst-case optimal for the last iterate of SGD (Defazio et al., 2023).
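
As a concrete illustration, the sketch below evaluates this schedule for given problem constants; in practice $D$ and $G$ are rarely known, which is why Section 2 replaces $D/(G\sqrt{T})$ with a tuned base rate $\alpha_0$. The function name and numbers are illustrative, not prescribed by the source.

```python
import math

def optimal_linear_schedule(D, G, T):
    """Worst-case-optimal step sizes eta_t = D / (G * sqrt(T)) * (1 - t / T) for t = 1..T."""
    base = D / (G * math.sqrt(T))
    return [base * (1.0 - t / T) for t in range(1, T + 1)]

# Example: D = 10 (distance to the comparator), G = 1 (gradient bound), T = 1000 steps.
etas = optimal_linear_schedule(D=10.0, G=1.0, T=1000)
bound = 10.0 * 1.0 / math.sqrt(1000)  # the DG / sqrt(T) excess-loss guarantee
```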

2. Algorithmic Specification and Scheduling

Implementation employs a single hyperparameter, the base step size $\alpha_0$, and decays it linearly over the total number of steps $T$:

```python
def sgd_linear_decay(x, alpha_0, T, stochastic_gradient):
    # Linear decay: the step size shrinks from alpha_0 toward 0 over T updates.
    for t in range(1, T + 1):
        alpha = alpha_0 * (1 - t / T)
        g = stochastic_gradient(x)  # unbiased (sub)gradient estimate at x
        x = x - alpha * g
    return x  # return the last iterate
```

Typically $T$ is the planned number of optimizer updates, and $\alpha_0$ is swept over $1$-$2$ orders of magnitude around the best constant rate. A linear warm-up ramping $\alpha$ from $0$ to $\alpha_0$ over the first $10$-$20\%$ of the schedule stabilizes training.
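
A minimal sketch combining the warm-up ramp with linear decay (the function name and the 10% warm-up fraction are illustrative choices, not prescribed by the source):

```python
def warmup_linear_decay(t, T, warmup_frac=0.1):
    """Multiplier on alpha_0 at step t (1-indexed) of a T-step schedule."""
    warmup_steps = max(1, int(warmup_frac * T))
    if t <= warmup_steps:
        return t / warmup_steps  # linear ramp from 0 up to 1
    return 1.0 - (t - warmup_steps) / (T - warmup_steps)  # linear decay back to 0

# The step size at iteration t is then alpha_0 * warmup_linear_decay(t, T).
```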

3. Data-Driven Refinement: Gradient-Norm Adaptive Scheduling

Beyond worst-case universality, one can refine the linear schedule by leveraging observed gradient norms:

  1. Run a pilot training with the base linear schedule, logging $\|g_t\|$.
  2. Assign weights $w_t \propto 1/\|g_t\|^2$ (or, coordinate-wise for Adam, $w_t \propto 1/\|g_t\|_1$).
  3. Construct the refined step size via

$$\eta_t = w_t \sum_{p=t+1}^{T} w_p$$

  4. Normalize so that $\max_t \eta_t = 1$, scale by $\alpha_0$, and retrain.
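
A minimal sketch of steps 2-4, assuming the pilot-run gradient norms have been collected into an array; NumPy and the function name are implementation choices, not prescribed by the source:

```python
import numpy as np

def refined_schedule(grad_norms, alpha_0):
    """Build refined step sizes from pilot-run gradient norms ||g_t||, t = 1..T."""
    w = 1.0 / np.asarray(grad_norms, dtype=float) ** 2  # w_t proportional to 1 / ||g_t||^2
    # eta_t = w_t * sum_{p = t+1}^{T} w_p  (suffix sums that exclude w_t itself)
    suffix = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])
    eta = w * suffix
    return alpha_0 * eta / eta.max()  # normalize so max_t eta_t = 1
```

Note that $\eta_T = 0$ by construction, matching the endpoint of the linear schedule.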

Features of this approach:

  • Warm-up emerges if early gradient norms are small.
  • Rapid annealing occurs when gradients collapse late in training.
  • Directly optimizes a data-dependent performance bound for the last iterate.
  • Applicable to any optimizer, including per-coordinate Adam variants (Defazio et al., 2023).

4. Empirical Evaluation and Comparative Performance

A comprehensive suite of evaluations demonstrates consistent superiority of linear decay and its refinement:

  • Convex logistic regression (LIBSVM) and deep learning tasks (CIFAR-10/100, ImageNet, LSTM, ViT, GPT/LLaMA, Faster-RCNN) show that linear decay matches or exceeds constant, $1/\sqrt{t}$, $1/t$, and stepwise schedules.
  • Against the popular cosine annealing schedule, linear decay ties or wins in 9 of 10 deep learning tasks.
  • Refined schedules provide additional small gains, especially for NLP and regression.
  • Stable across short runs ($<10$ epochs), unlike cosine.
  • Sample result (CIFAR-10 test error):

| Schedule | Flat | $1/t$ | $1/\sqrt{t}$ | Step | Cosine | Linear | Refined |
|----------|------|-------|--------------|------|--------|--------|---------|
| CIFAR-10 | 8.04 | 5.42  | 6.37         | 4.78 | 4.27   | 4.35   | 4.31    |

LLM training runs (C4 dataset) show perplexity improvements as well:

| Model size | Cosine | Linear | Refined |
|------------|--------|--------|---------|
| 117 M      | 3.089  | 3.087  | 3.075   |
| 1 B        | 2.729  | 2.725  | 2.722   |
| 3.5 B      | 2.631  | 2.625  | 2.634   |

Linear decay remains robust even as model size, task, and optimizer vary (Defazio et al., 2023).

5. Width-Scale Transferability in Deep Networks

Transfer of the optimal learning rate across neural network widths is rigorously established under $\mu$P ("Maximal Update Parametrization"). For a linear MLP

$$f(x) = V^\top W_L W_{L-1} \cdots W_1 W_0\, x$$

initialized and updated by $\mu$P rules, the optimal constant learning rate $\eta^*_n$ for width $n$ provably converges to a strictly positive limit as $n \to \infty$:

$$\eta_n^* \to \eta_\infty^* > 0$$

Alternative schemes, Standard Parametrization (SP) and Neural Tangent Parametrization (NTP), do not possess this property: under SP, $\eta_n^* \to 0$, while under NTP optimal rates diverge with width. For nonlinear networks (e.g., ReLU) trained with Adam, the $\mu$P extension (with Adam step $\eta \to \eta/n$) achieves near width-independence empirically. The optimal rate also decreases approximately as $1/L$ with depth $L$ (Hayou, 3 Nov 2025).
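
A minimal sketch of the width-transfer recipe implied by the $\eta \to \eta/n$ rule above, assuming `eta_proxy` denotes the hidden-layer Adam step size actually used at the proxy width; the function name and numbers are illustrative:

```python
def transfer_hidden_lr(eta_proxy, n_proxy, n_target):
    """Reuse an Adam hidden-layer step size tuned at a small proxy width.

    Under the muP rule eta -> eta / n, the width-independent base rate is
    eta_base = eta_proxy * n_proxy, so the target-width step is eta_base / n_target.
    """
    return eta_proxy * n_proxy / n_target

# Tune at width 256, reuse at width 4096 (illustrative numbers).
eta_wide = transfer_hidden_lr(eta_proxy=1e-3, n_proxy=256, n_target=4096)
```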

6. Practical Recommendations and Limitations

  • Default to linear decay (with a small warm-up) for most tasks; it requires only tuning the base rate (see the sketch after this list).
  • Exploit task-adaptive refinement using gradient-norm logs when available, particularly beneficial for NLP and regression.
  • In models whose gradient norms collapse rapidly at the end of training, avoid refinement, as the refined step size may diverge.
  • For neural networks, use the $\mu$P parametrization to enable wide-model learning rate transfer; tune on small proxies and reuse for large widths.
  • For Adam, employ $\eta \to \eta/n$ scaling under $\mu$P.
  • Traditional schemes require re-tuning as width increases; μ\muP obviates this necessity.
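
A hedged sketch of the default recommendation wired into a standard training setup, assuming PyTorch; the model, step counts, and warm-up fraction are illustrative, not prescribed by the source:

```python
import torch

T, warmup = 10_000, 1_000                            # illustrative step counts
model = torch.nn.Linear(32, 1)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # lr plays the role of alpha_0

def lr_mult(step):
    """Multiplier on the base rate: linear warm-up, then linear decay to 0."""
    if step < warmup:
        return (step + 1) / warmup
    return max(0.0, 1.0 - (step - warmup) / (T - warmup))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_mult)
# In the training loop, call sched.step() after each opt.step().
```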

7. Context and Implications

Linear learning rate scaling offers worst-case optimality in last-iterate convex optimization and robust empirical performance across model families and tasks. Adaptive refinement supplies additional gains without ad hoc heuristics, and width-independence under $\mu$P enhances the scalability of learning rate selection for modern deep architectures. A plausible implication is that widespread non-linear or ad hoc schedules (e.g., cosine) may underperform linear decay on practical high-dimensional problems, especially for large-scale deep networks and LLMs, unless justified by specific empirical properties.

References: Defazio et al. (2023); Hayou (3 Nov 2025).
