Linear Learning Rate Scaling

Updated 26 December 2025
  • Linear Learning Rate Scaling sets the learning rate as a linearly decaying function of training steps, a choice with worst-case optimality guarantees, and extends to principled transfer of the base rate across model widths.
  • The base schedule can be refined with gradient-norm adaptive scheduling, enabling task-adaptive improvements and stable performance across optimization scenarios.
  • Empirical evaluations demonstrate that linear decay matches or outperforms alternatives such as cosine or stepwise schedules in tasks ranging from logistic regression to large language models.

Linear learning rate scaling is an approach to learning rate schedule design that assigns the step size $\eta_t$ as a linear function of training time or model width, ensuring optimal convergence properties and transferability in both convex optimization and parameterized neural network training. It possesses rigorous worst-case optimality, supports task-adaptive refinements, and has empirically outperformed a variety of commonly used alternatives across logistic regression, deep learning, and LLM applications (Defazio et al., 2023), with theoretical results extending to infinite-width neural networks under the $\mu$P parametrization (Hayou, 3 Nov 2025).

1. Formal Derivation and Optimality in Convex Optimization

Consider stochastic convex optimization where the goal is to minimize $f: \mathbb{R}^d \to \mathbb{R}$ over iterates $x_1, \dots, x_T$ with updates $x_{t+1} = x_t - \eta_t g_t$ and bounded subgradient norms $\|g_t\|^2 \leq G^2$. Standard theory provides regret bounds for averaged iterates, while practitioners typically use the last iterate $x_T$. A tight reduction from regret to last-iterate performance establishes that for step size $\eta_t \propto (1 - t/T)$, the last-iterate excess loss matches the minimax rate:

$$\mathbb{E}[f(x_T) - f_*] \leq \frac{DG}{\sqrt{T}}$$

where $D = \|x_1 - u\|$ is the distance from the initial point to the comparator $u$ (e.g., a minimizer) and $f_*$ is the global minimum. Explicitly, the schedule is

$$\eta_t = \frac{D}{G\sqrt{T}} \left( 1 - \frac{t}{T} \right)$$

This result holds for any low-regret base optimizer (SGD, AdaGrad, AMSGrad, and empirically Adam) and is worst-case optimal for the last iterate of SGD (Defazio et al., 2023).
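
As a concrete illustration, the sketch below evaluates this schedule for given problem constants; in practice $D$ and $G$ are rarely known, which is why Section 2 replaces $D/(G\sqrt{T})$ with a tuned base rate $\alpha_0$. The function name and numbers are illustrative, not prescribed by the source.

```python
import math

def optimal_linear_schedule(D, G, T):
    """Worst-case-optimal step sizes eta_t = D / (G * sqrt(T)) * (1 - t / T) for t = 1..T."""
    base = D / (G * math.sqrt(T))
    return [base * (1.0 - t / T) for t in range(1, T + 1)]

# Example: D = 10 (distance to the comparator), G = 1 (gradient bound), T = 1000 steps.
etas = optimal_linear_schedule(D=10.0, G=1.0, T=1000)
bound = 10.0 * 1.0 / math.sqrt(1000)  # the DG / sqrt(T) excess-loss guarantee
```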

2. Algorithmic Specification and Scheduling

Implementation employs a single hyperparameter, the base step size $\alpha_0$, and decays it linearly over the total number of steps $T$:

```python
def sgd_linear_decay(x, alpha_0, T, stochastic_gradient):
    # Linear decay: the step size shrinks from alpha_0 toward 0 over T updates.
    for t in range(1, T + 1):
        alpha = alpha_0 * (1 - t / T)
        g = stochastic_gradient(x)  # unbiased (sub)gradient estimate at x
        x = x - alpha * g
    return x  # return the last iterate
```

Typically $T$ is the planned number of optimizer updates, and $\alpha_0$ is swept over $1$-$2$ orders of magnitude around the best constant rate. A linear warm-up ramping $\alpha$ from $0$ to $\alpha_0$ over the first $10$-$20\%$ of the schedule stabilizes training.
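
A minimal sketch combining the warm-up ramp with linear decay (the function name and the 10% warm-up fraction are illustrative choices, not prescribed by the source):

```python
def warmup_linear_decay(t, T, warmup_frac=0.1):
    """Multiplier on alpha_0 at step t (1-indexed) of a T-step schedule."""
    warmup_steps = max(1, int(warmup_frac * T))
    if t <= warmup_steps:
        return t / warmup_steps  # linear ramp from 0 up to 1
    return 1.0 - (t - warmup_steps) / (T - warmup_steps)  # linear decay back to 0

# The step size at iteration t is then alpha_0 * warmup_linear_decay(t, T).
```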

3. Data-Driven Refinement: Gradient-Norm Adaptive Scheduling

Beyond worst-case universality, one can refine the linear schedule by leveraging observed gradient norms:

  1. Run a pilot training with the base linear schedule, logging $\|g_t\|$.
  2. Assign weights $w_t \propto 1/\|g_t\|^2$ (or, coordinate-wise for Adam, $w_t \propto 1/\|g_t\|_1$).
  3. Construct the refined step size via

$$\eta_t = w_t \sum_{p=t+1}^{T} w_p$$

  4. Normalize so that $\max_t \eta_t = 1$, scale by $\alpha_0$, and retrain.
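
A minimal sketch of steps 2-4, assuming the pilot-run gradient norms have been collected into an array; NumPy and the function name are implementation choices, not prescribed by the source:

```python
import numpy as np

def refined_schedule(grad_norms, alpha_0):
    """Build refined step sizes from pilot-run gradient norms ||g_t||, t = 1..T."""
    w = 1.0 / np.asarray(grad_norms, dtype=float) ** 2  # w_t proportional to 1 / ||g_t||^2
    # eta_t = w_t * sum_{p = t+1}^{T} w_p  (suffix sums that exclude w_t itself)
    suffix = np.concatenate([np.cumsum(w[::-1])[::-1][1:], [0.0]])
    eta = w * suffix
    return alpha_0 * eta / eta.max()  # normalize so max_t eta_t = 1
```

Note that $\eta_T = 0$ by construction, matching the endpoint of the linear schedule.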

Features of this approach:

  • Warm-up emerges if early gradient norms are small.
  • Rapid annealing occurs when gradients collapse late in training.
  • Directly optimizes a data-dependent performance bound for the last iterate.
  • Applicable to any optimizer, including per-coordinate Adam variants (Defazio et al., 2023).

4. Empirical Evaluation and Comparative Performance

A comprehensive suite of evaluations demonstrates consistent superiority of linear decay and its refinement:

  • Convex logistic regression (LIBSVM) and deep learning tasks (CIFAR-10/100, ImageNet, LSTM, ViT, GPT/LLaMA, Faster-RCNN) show that linear decay matches or exceeds constant, $1/\sqrt{t}$, $1/t$, and stepwise schedules.
  • Against the popular cosine annealing schedule, linear decay ties or wins in 9 of 10 deep learning tasks.
  • Refined schedules provide additional small gains, especially for NLP and regression.
  • Stable across short runs ($<10$ epochs), unlike cosine.
  • Sample result (CIFAR-10 test error):

| Schedule | Flat | $1/t$ | $1/\sqrt{t}$ | Step | Cosine | Linear | Refined |
|----------|------|-------|--------------|------|--------|--------|---------|
| CIFAR-10 | 8.04 | 5.42  | 6.37         | 4.78 | 4.27   | 4.35   | 4.31    |

LLM training runs (C4 dataset) show perplexity improvements as well:

| Model size | Cosine | Linear | Refined |
|------------|--------|--------|---------|
| 117 M      | 3.089  | 3.087  | 3.075   |
| 1 B        | 2.729  | 2.725  | 2.722   |
| 3.5 B      | 2.631  | 2.625  | 2.634   |

Linear decay remains robust even as model size, task, and optimizer vary (Defazio et al., 2023).

5. Width-Scale Transferability in Deep Networks

Transfer of the optimal learning rate across neural network widths is rigorously established under $\mu$P ("Maximal Update Parametrization"). For a linear MLP

$$f(x) = V^\top W_L W_{L-1} \cdots W_1 W_0\, x$$

initialized and updated by $\mu$P rules, the optimal constant learning rate $\eta^*_n$ for width $n$ provably converges to a strictly positive limit as $n \to \infty$:

$$\eta_n^* \to \eta_\infty^* > 0$$

Alternative schemes, Standard Parametrization (SP) and Neural Tangent Parametrization (NTP), do not possess this property: under SP, $\eta_n^* \to 0$, while under NTP optimal rates diverge with width. For nonlinear networks (e.g., ReLU) trained with Adam, the $\mu$P extension (with Adam step $\eta \to \eta/n$) achieves near width-independence empirically. The optimal rate also decreases approximately as $1/L$ with depth $L$ (Hayou, 3 Nov 2025).
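
A minimal sketch of the width-transfer recipe implied by the $\eta \to \eta/n$ rule above, assuming `eta_proxy` denotes the hidden-layer Adam step size actually used at the proxy width; the function name and numbers are illustrative:

```python
def transfer_hidden_lr(eta_proxy, n_proxy, n_target):
    """Reuse an Adam hidden-layer step size tuned at a small proxy width.

    Under the muP rule eta -> eta / n, the width-independent base rate is
    eta_base = eta_proxy * n_proxy, so the target-width step is eta_base / n_target.
    """
    return eta_proxy * n_proxy / n_target

# Tune at width 256, reuse at width 4096 (illustrative numbers).
eta_wide = transfer_hidden_lr(eta_proxy=1e-3, n_proxy=256, n_target=4096)
```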

6. Practical Recommendations and Limitations

  • Default to linear decay (with a small warm-up) for most tasks; it requires only tuning the base rate (see the sketch after this list).
  • Exploit task-adaptive refinement using gradient-norm logs when available, particularly beneficial for NLP and regression.
  • In models whose gradient norms collapse rapidly at the end of training, avoid refinement, as the refined step size may diverge.
  • For neural networks, use the $\mu$P parametrization to enable wide-model learning rate transfer; tune on small proxies and reuse for large widths.
  • For Adam, employ $\eta \to \eta/n$ scaling under $\mu$P.
  • Traditional schemes require re-tuning as width increases; μ\muP obviates this necessity.
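
A hedged sketch of the default recommendation wired into a standard training setup, assuming PyTorch; the model, step counts, and warm-up fraction are illustrative, not prescribed by the source:

```python
import torch

T, warmup = 10_000, 1_000                            # illustrative step counts
model = torch.nn.Linear(32, 1)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # lr plays the role of alpha_0

def lr_mult(step):
    """Multiplier on the base rate: linear warm-up, then linear decay to 0."""
    if step < warmup:
        return (step + 1) / warmup
    return max(0.0, 1.0 - (step - warmup) / (T - warmup))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_mult)
# In the training loop, call sched.step() after each opt.step().
```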

7. Context and Implications

Linear learning rate scaling offers worst-case optimality in last-iterate convex optimization and robust empirical performance across model families and tasks. Adaptive refinement supplies additional gains without ad hoc heuristics, and width-independence under $\mu$P enhances the scalability of learning rate selection for modern deep architectures. A plausible implication is that widespread non-linear or ad hoc schedules (e.g., cosine) may underperform linear decay on practical high-dimensional problems, especially for large-scale deep networks and LLMs, unless justified by specific empirical properties.

References: Defazio et al. (2023); Hayou (3 Nov 2025).
