Linear Learning Rate Scaling
- Linear Learning Rate Scaling is a technique that sets the learning rate as a linearly decaying function of the training step, ensuring worst-case optimality for the last iterate; companion results characterize how the optimal rate behaves as model width grows.
- It can be combined with gradient-norm adaptive scheduling to refine the schedule, enabling task-adaptive improvements and stable performance across various optimization scenarios.
- Empirical evaluations demonstrate that linear decay matches or outperforms alternatives such as cosine or stepwise schedules in tasks ranging from logistic regression to large language models.
Linear learning rate scaling is an approach to learning rate schedule design that assigns the step-size as a linearly decaying function of training time, providing optimal last-iterate convergence in convex optimization and, under appropriate parametrization, learning-rate transferability across widths in neural network training. It possesses rigorous worst-case optimality, supports task-adaptive refinements, and has empirically matched or outperformed a variety of commonly used alternatives across logistic regression, deep learning, and LLM applications (Defazio et al., 2023), with theoretical results on width transfer extending to infinite-width neural networks under the μP parametrization (Hayou, 3 Nov 2025).
1. Formal Derivation and Optimality in Convex Optimization
Consider stochastic convex optimization where the goal is to minimize a convex function $f$ over iterates $x_t$ with updates $x_{t+1} = x_t - \alpha_t g_t$ and bounded subgradient norms $\|g_t\| \le G$. Standard theory provides regret-based guarantees for the averaged iterate $\bar{x}_T$, while practitioners often use the last iterate $x_T$. A tight reduction from regret to last-iterate performance establishes that for a linearly decaying step-size, the last-iterate excess loss matches the minimax rate:

$$\mathbb{E}\,f(x_T) - f_* \;\le\; O\!\left(\frac{DG}{\sqrt{T}}\right),$$

where $D = \|x_1 - x_*\|$ is the distance from the initial point to a minimizer $x_*$ and $f_* = f(x_*)$ is the global minimum. Explicitly, the schedule is

$$\alpha_t = \alpha_0\left(1 - \frac{t}{T}\right), \qquad t = 1, \dots, T,$$

with $\alpha_0$ taken proportional to $D/(G\sqrt{T})$ in the worst-case analysis.
This result holds for any base optimizer (SGD, AdaGrad, AMSGrad, empirically Adam) with low regret and is worst-case optimal for last-iterate SGD (Defazio et al., 2023).
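For concreteness, a direct computation from the schedule above (an illustrative consequence, not an additional result from the paper) shows that the average step size over the horizon is roughly half the base rate:

$$\frac{1}{T}\sum_{t=1}^{T} \alpha_0\left(1 - \frac{t}{T}\right) \;=\; \alpha_0\,\frac{T-1}{2T} \;\approx\; \frac{\alpha_0}{2},$$

which is consistent with sweeping $\alpha_0$ in a range around the best constant rate (Section 2) rather than reusing it directly.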
2. Algorithmic Specification and Scheduling
Implementation employs a single hyperparameter, the base step-size $\alpha_0$, and decays it linearly over the total number of steps $T$:
```python
def sgd_linear_decay(x, alpha_0, T):
    """SGD with a linearly decaying step size: alpha_0 at the start, 0 at step T."""
    for t in range(1, T + 1):
        alpha = alpha_0 * (1 - t / T)
        g = stochastic_gradient(x)  # stochastic (sub)gradient oracle at x
        x = x - alpha * g
    return x
```
Typical values for $T$ equal the planned number of optimizer updates, with $\alpha_0$ swept over $1$-$2$ orders of magnitude around the best constant rate. A linear warm-up ramping from $0$ to $\alpha_0$ over roughly the first $10\%$ of the schedule stabilizes training.
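A minimal sketch of this warm-up plus linear-decay schedule as a learning-rate multiplier, assuming PyTorch's `LambdaLR` wrapper and a warm-up fraction of about 10%; `warmup_linear_decay` and its argument names are illustrative choices, not from the paper:

```python
def warmup_linear_decay(step, total_steps, warmup_steps):
    """Multiplier on the base rate alpha_0: linear ramp 0 -> 1, then linear decay 1 -> 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps))

# Usage with any optimizer (SGD, Adam, ...); call scheduler.step() once per optimizer update.
# optimizer = torch.optim.SGD(model.parameters(), lr=alpha_0)
# scheduler = torch.optim.lr_scheduler.LambdaLR(
#     optimizer, lr_lambda=lambda step: warmup_linear_decay(step, T, int(0.1 * T)))
```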
3. Data-Driven Refinement: Gradient-Norm Adaptive Scheduling
Beyond worst-case universality, one can refine the linear schedule by leveraging observed gradient norms:
- Run a pilot training with the base linear schedule and log the per-step gradient norms $\|g_t\|$ (see the logging sketch at the end of this section).
- Assign each step a weight derived from its logged gradient norm (coordinate-wise gradient magnitudes in the Adam variant).
- Construct the refined step-size sequence from these weights by minimizing the resulting data-dependent last-iterate bound (closed form given in Defazio et al., 2023).
- Normalize the resulting sequence, scale it by the tuned base rate $\alpha_0$, and retrain.
Features of this approach:
- Warm-up emerges if early gradient norms are small.
- Rapid annealing occurs when gradients collapse late in training.
- Directly solves a data-dependent performance bound for the last iterate.
- Applicable to any optimizer, including per-coordinate Adam variants (Defazio et al., 2023).
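As a companion to the procedure above, here is a minimal sketch of the pilot-run logging step only, assuming a standard PyTorch training loop; the bound-minimizing construction of the refined schedule from these logs is given in Defazio et al. (2023) and is not reproduced here. The function name and arguments are illustrative.

```python
import torch

def pilot_run_log_grad_norms(model, loss_fn, loader, optimizer, scheduler=None):
    """Train under the base linear schedule while logging the global gradient norm at every step."""
    grad_norms = []
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        # Global L2 norm of the stochastic gradient at this step (per-coordinate
        # magnitudes would be logged instead for the Adam variant).
        sq_sum = sum(p.grad.detach().pow(2).sum() for p in model.parameters() if p.grad is not None)
        grad_norms.append(sq_sum.sqrt().item())
        optimizer.step()
        if scheduler is not None:
            scheduler.step()
    return grad_norms
```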
4. Empirical Evaluation and Comparative Performance
A comprehensive suite of evaluations demonstrates consistent superiority of linear decay and its refinement:
- Convex logistic regression (LIBSVM) and deep learning tasks (CIFAR-10/100, ImageNet, LSTM, ViT, GPT/LLaMA, Faster-RCNN) show linear decay matches or exceeds constant, $1/\sqrt{t}$, $1/t$, and stepwise schedules.
- Against the popular cosine annealing schedule, linear decay ties or wins in 9 of 10 deep learning tasks.
- Refined schedules provide additional small gains, especially for NLP and regression.
- Stable across short runs (10 epochs), unlike cosine.
- Sample result (CIFAR-10 test error, %):
| Task | Flat | 1/t | 1/√t | Step | Cosine | Linear | Refined |
|---|---|---|---|---|---|---|---|
| CIFAR-10 | 8.04 | 5.42 | 6.37 | 4.78 | 4.27 | 4.35 | 4.31 |
LLM pretraining on the C4 dataset shows perplexity improvements as well:
| Model size | Cosine | Linear | Refined |
|---|---|---|---|
| 117 M | 3.089 | 3.087 | 3.075 |
| 1 B | 2.729 | 2.725 | 2.722 |
| 3.5 B | 2.631 | 2.625 | 2.634 |
Linear decay remains robust even as model size, task, and optimizer vary (Defazio et al., 2023).
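The following self-contained toy script (not the paper's benchmark code; the problem size, seeds, step count, and base rate are arbitrary illustrative choices) compares constant, cosine, and linear-decay SGD on a synthetic logistic-regression instance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, alpha_0 = 2000, 20, 2000, 0.5

# Synthetic logistic-regression data (illustrative only).
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (rng.random(n) < 0.5 * (1.0 + np.tanh(0.5 * (X @ w_true)))).astype(float)

def full_loss(w):
    z = X @ w
    return np.mean(np.logaddexp(0.0, -z) + (1.0 - y) * z)  # mean logistic loss

def sgd_last_iterate(schedule, seed=1):
    w = np.zeros(d)
    local = np.random.default_rng(seed)  # same sample sequence for every schedule
    for t in range(1, T + 1):
        i = local.integers(n)                                 # single-sample stochastic gradient
        p = 0.5 * (1.0 + np.tanh(0.5 * (X[i] @ w)))           # numerically stable sigmoid
        w = w - alpha_0 * schedule(t) * (p - y[i]) * X[i]
    return full_loss(w)

schedules = {
    "constant": lambda t: 1.0,
    "cosine":   lambda t: 0.5 * (1.0 + np.cos(np.pi * t / T)),
    "linear":   lambda t: 1.0 - t / T,
}
for name, sched in schedules.items():
    print(f"{name:8s} final loss: {sgd_last_iterate(sched):.4f}")
```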
5. Width-Scale Transferability in Deep Networks
Transfer of the optimal learning rate across neural network widths is rigorously established under μP ("Maximal Update Parametrization"). For a linear MLP initialized and updated by μP rules, the optimal constant learning rate $\eta^*(n)$ at width $n$ provably converges to a strictly positive limit as $n \to \infty$:

$$\lim_{n \to \infty} \eta^*(n) = \eta^*_\infty > 0.$$

Alternative schemes, Standard Parametrization (SP) and Neural Tangent Parametrization (NTP), do not possess this property: under SP the optimal rate shrinks toward zero as width grows, while under NTP optimal rates diverge with width. For nonlinear networks (ReLU) trained with Adam, the μP extension (with the Adam step scaled as $\eta/n$) achieves near width-independence empirically. The optimal rate also decreases approximately as $1/L$ with depth $L$ (Hayou, 3 Nov 2025).
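A minimal sketch of width-scaled per-layer Adam learning rates in the spirit of μP, assuming hidden and output weight matrices receive a learning rate scaled by `base_width / width` while the input layer and vector-like parameters keep the base rate; the layer-classification heuristic and all names here are illustrative assumptions, and the initialization scaling that μP also prescribes is omitted:

```python
import torch
import torch.nn as nn

def mup_style_param_groups(model, base_lr, width, base_width, input_module="0"):
    """Per-layer Adam groups: matrix-like params (except the input layer) get lr * base_width/width."""
    scaled, unscaled = [], []
    for name, p in model.named_parameters():
        is_matrix = p.ndim >= 2
        is_input = name.startswith(input_module)  # heuristic: the first module holds the input layer
        (scaled if is_matrix and not is_input else unscaled).append(p)
    return [
        {"params": unscaled, "lr": base_lr},
        {"params": scaled,   "lr": base_lr * base_width / width},
    ]

# Usage: tune base_lr on a narrow proxy (width == base_width), then reuse it at larger widths.
width, base_width, base_lr = 1024, 256, 3e-4
model = nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                      nn.Linear(width, width), nn.ReLU(),
                      nn.Linear(width, 10))
optimizer = torch.optim.Adam(mup_style_param_groups(model, base_lr, width, base_width))
```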
6. Practical Recommendations and Limitations
- Default to linear decay (with small warm-up) for most tasks; it requires only tuning the base rate.
- Exploit task-adaptive refinement using gradient-norm logs when available, particularly beneficial for NLP and regression.
- In models with rapid gradient-norm collapse at the end, avoid refinement as step-size may diverge.
- For neural networks, use the μP parametrization to enable wide-model learning rate transfer; tune on small proxies, reuse for large widths.
- For Adam, employ $\eta/n$ step-size scaling under μP.
- Traditional schemes require re-tuning as width increases; μP obviates this necessity.
7. Context and Implications
Linear learning rate scaling offers worst-case optimality in last-iterate convex optimization and robust empirical performance across model families and tasks. Adaptive refinement supplies additional gains without ad hoc heuristics, and width-independence under μP enhances the scalability of learning rate selection for modern deep architectures. A plausible implication is that widely used non-linear or ad hoc schedules (e.g., cosine) may underperform linear decay in practical high-dimensional problems, especially for large-scale deep networks and LLMs, unless rigorously justified by specific empirical properties.
References: Defazio et al. (2023); Hayou (3 Nov 2025).