Learning Rate Transfer with Width
- Learning rate transfer with width refers to carrying an optimal learning rate tuned on a narrow model over to a much wider one while preserving stable training dynamics, typically via maximal-update parameterization (μP).
- Empirical and theoretical studies show that under μP scaling, loss landscape properties remain width-invariant, allowing zero-shot hyperparameter transfer across models.
- Practical implementations require appropriate weight decay scaling and warmup schedules to maintain feature learning and effective optimizer stability in wide networks.
Learning rate transfer with width refers to the phenomenon wherein the optimal or near-optimal learning rate for neural network training can be determined using a small, narrow model and then reliably transferred to a much wider model. The precise conditions under which this transferability holds, and the parameterization and algorithmic adjustments necessary to preserve it, have been the subject of extensive theoretical and empirical research. This property is tightly linked to the underlying feature learning regime, loss landscape geometry, and the scaling rules adopted for both network initialization and learning rate adjustment.
1. Mathematical Foundations and Parameterization Regimes
The behavior of learning rate transfer with respect to width is determined by the interplay between parameterization schemes and width scaling limits. In neural network theory, three major parameterization regimes are distinguished:
- Standard Parameterization (SP): Weights are initialized as $W_{ij} \sim \mathcal{N}(0, 1/n)$ (variance inversely proportional to the fan-in $n$), and the learning rate typically must decrease inversely with width to ensure stability. In the infinite-width limit, networks enter a kernel regime where features become static during training and the effective learning rate vanishes as $n \to \infty$ (Yang et al., 2020).
- Neural Tangent Kernel (NTK) Parameterization: Weights are scaled such that the network converges to a fixed kernel (the NTK) as width grows. The optimal learning rate often remains nominally constant with width, but feature learning disappears and hyperparameter transfer fails: learning rates from finite-width networks do not yield similar dynamics in the infinite-width limit (Yang et al., 2020, Bordelon et al., 4 Feb 2025).
- Maximal-Update Parameterization (μP): A specific parameterization in which both initialization and learning rates are tuned such that the per-parameter update magnitude remains constant with width, maximizing feature learning at all scales. Under μP, both theoretical and empirical results show that the optimal learning rate converges to a positive constant as $n \to \infty$, and the learning rate discovered in a small-width model is directly transferable to much wider networks (Hayou, 3 Nov 2025, Noci et al., 27 Feb 2024, Bordelon et al., 4 Feb 2025, Vyas et al., 2023).
The following table summarizes learning rate transfer as a function of parameterization:
| Parameterization | Optimal LR as $n \to \infty$ | LR Transfer with Width |
|---|---|---|
| μP | Converges to a positive constant | Yes |
| Standard (SP) | Vanishes ($\to 0$) | No |
| NTK | Nominally constant, but feature learning vanishes | No |
μP parameterization is thus necessary for robust learning rate transfer as width increases (Hayou, 3 Nov 2025, Noci et al., 27 Feb 2024).
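To make the prescription concrete, the minimal sketch below sets up one commonly used μP-style variant for Adam-type optimizers: hidden and readout weight matrices get learning rates scaled by base_width/width relative to a tuned proxy model, the readout is initialized near zero, and input weights and biases keep the base learning rate. The helper name `build_mlp_with_mup_groups`, the widths, and the base learning rate are illustrative assumptions; the μP references should be consulted for the exact per-layer tables.

```python
import torch
import torch.nn as nn

def build_mlp_with_mup_groups(width, base_width=256, base_lr=1e-3, d_in=32, d_out=10):
    """Build a 3-layer MLP plus Adam parameter groups following a muP-style
    prescription for Adam-type optimizers (sketch): hidden/readout matrix
    learning rates shrink as base_width / width, the readout is initialized
    near zero, and input/bias parameters keep the base learning rate."""
    f_in = nn.Linear(d_in, width)
    hidden = nn.Linear(width, width)
    f_out = nn.Linear(width, d_out)

    # Initialization: ~1/fan_in variance for input and hidden weights,
    # zeros for the readout (a common practical choice in muP codebases).
    nn.init.normal_(f_in.weight, std=d_in ** -0.5)
    nn.init.normal_(hidden.weight, std=width ** -0.5)
    nn.init.zeros_(f_out.weight)

    model = nn.Sequential(f_in, nn.ReLU(), hidden, nn.ReLU(), f_out)

    ratio = base_width / width  # < 1 when the model is wider than the proxy
    param_groups = [
        # Input weights and all biases: width-independent learning rate.
        {"params": list(f_in.parameters()) + [hidden.bias, f_out.bias],
         "lr": base_lr},
        # "Matrix-like" hidden/readout weights: learning rate ~ 1/width.
        {"params": [hidden.weight, f_out.weight], "lr": base_lr * ratio},
    ]
    return model, torch.optim.Adam(param_groups)

# The base learning rate tuned on the narrow proxy is reused, unchanged,
# at larger widths; only the per-layer scaling factors differ.
model_proxy, opt_proxy = build_mlp_with_mup_groups(width=256)
model_wide, opt_wide = build_mlp_with_mup_groups(width=4096)
```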
2. Mechanistic Explanations and Loss Landscape Geometry
Key insights into learning rate transferability with width arise from the spectral properties of the loss landscape and their dependence, or independence, on width. Specifically:
- Super Consistency: Under μP scaling, after an initial rapid regime, the largest eigenvalue (sharpness) of the loss Hessian stabilizes at a value independent of width (and, with proper scaling, depth). This "super consistency" means that the local geometry traversed by optimizers (e.g., SGD, Adam) is nearly invariant as width increases, and thus the stability constraints on the learning rate do not change with width (Noci et al., 27 Feb 2024).
- Edge of Stability (EoS): In μP, training tends to drive the sharpness to the EoS threshold $2/\eta$, where $\eta$ is the learning rate. This property holds across width (and depth, with depth-appropriate scaling), providing a direct correspondence between width-invariant optimizer stability and learning rate transfer (Noci et al., 27 Feb 2024, Vyas et al., 2023); a sharpness-estimation sketch follows this list.
- Contrast with NTK/Standard Regimes: In NTK or standard parameterization, sharpness decreases with width, so stability and the optimal learning rate require width-dependent tuning; learning rate transfer fails (Hayou, 3 Nov 2025, Bordelon et al., 4 Feb 2025).
- Feature Learning vs. Lazy Regime: Learning rate transfer is only observed when feature learning persists with width (as in μP); in the "lazy" regime (NTK, SP), the functional changes induced by parameter updates vanish with width, and learning rate transfer is impossible (Yang et al., 2020, Vyas et al., 2023).
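A practical way to probe these claims is to track sharpness during training. The sketch below estimates the top Hessian eigenvalue by power iteration on Hessian-vector products; the function name and iteration count are illustrative, and the commented usage shows how the estimate can be compared with the EoS value $2/\eta$ at matched training times for narrow and wide models.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, n_iters=20):
    """Estimate the largest Hessian eigenvalue (sharpness) of loss_fn with
    respect to params via power iteration on Hessian-vector products.
    A rough sketch: one random start, no convergence checks."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit vector with the same shapes as the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]
    eig = 0.0
    for _ in range(n_iters):
        # Hessian-vector product: differentiate <grad, v> w.r.t. the parameters.
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient
        norm = torch.sqrt(sum((h * h).sum() for h in hv)) + 1e-12
        v = [h / norm for h in hv]
    return eig

# Edge-of-stability check (illustrative): at matched training times, the
# estimated sharpness should hover near 2 / lr for both narrow and wide models.
# sharpness = top_hessian_eigenvalue(lambda: criterion(model(x), y),
#                                    list(model.parameters()))
# print(sharpness, 2.0 / lr)
```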
3. Empirical and Theoretical Results in Learning Rate Transfer
Rigorous theoretical results and empirical demonstrations confirm the transferability of learning rates in the μP regime:
- Linear and Deep Networks: The optimal learning rate of linear MLPs under μP converges, as width $n \to \infty$, to a deterministic positive constant, uniformly over training steps and random initializations, with a quantitative convergence rate for any fixed finite number of steps (Hayou, 3 Nov 2025). These results empirically extend to deep nonlinear architectures and to adaptive optimizers such as Adam; a width-sweep sketch follows this list.
- Residual and Convolutional Architectures: For deep residual networks, learning rate transfer with width (and depth) holds under appropriate residual-branch scaling (e.g., down-weighting branch outputs as depth grows). Arithmetic-Mean μP (AM-μP) parameterization enforces a network-wide update energy budget under which the maximal stable learning rate scales predictably with model size, enabling depth and width transfer in modern CNNs and ResNets (Zhang et al., 5 Oct 2025).
- Improved Standard Parameterization: An improved variant of standard parameterization preserves both a well-behaved NTK and the correct width scaling of the optimal learning rate, with predictions that match finite-width networks empirically (Sohl-Dickstein et al., 2020).
- Batch Size and SGD Noise: In Stochastic Gradient Descent, the "optimal normalized noise scale" for SGD (which depends on learning rate, batch size, and initialization variance) scales linearly with width; thus either learning rate must decrease, or batch size must decrease as width increases, depending on parameterization (Park et al., 2019).
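The transfer claim itself can be checked with a small width sweep: trace the final loss as a function of the base learning rate at several widths and verify that the best-performing rate stays put. The sketch below does this on a synthetic regression task, reusing the hypothetical `build_mlp_with_mup_groups` helper from the earlier sketch; the widths, learning-rate grid, and step count are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# Synthetic full-batch regression task; dimensions are arbitrary.
torch.manual_seed(0)
X = torch.randn(512, 32)
y = X @ torch.randn(32, 1) + 0.1 * torch.randn(512, 1)

def final_loss(width, lr, steps=200):
    """Train a small muP-parameterized MLP (via the earlier hypothetical
    helper) for a fixed number of full-batch steps and return the final loss."""
    model, opt = build_mlp_with_mup_groups(width, base_width=64, base_lr=lr,
                                           d_in=32, d_out=1)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Under muP the best-performing base learning rate should sit in roughly the
# same column for every width; under SP it drifts toward smaller values.
for width in (64, 256, 1024):
    losses = [final_loss(width, lr) for lr in (1e-3, 3e-3, 1e-2, 3e-2)]
    print(width, ["%.4f" % v for v in losses])
```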
4. Nuances, Limitations, and Practical Adjustments
Recent large-scale empirical work has challenged the completeness of μP's explanation for learning rate transfer in practice:
- Role of Weight Decay: In practical scale-invariant architectures (e.g., Transformers with LayerNorm) and when using AdamW, it is weight decay, rather than μP scaling, that primarily stabilizes the relative update size of internal representations across widths beyond the initial phase of training. The crucial quantity is the product $\eta\lambda$ of learning rate and weight decay: for stable transfer, as width grows by a factor $k$, the learning rate and weight decay are rescaled in opposite directions so that $\eta\lambda$ is preserved, ensuring width-invariant training dynamics in steady state (Kosson et al., 21 Oct 2025, Fan et al., 17 Oct 2025). A sketch of this prescription appears after the table below.
- Breakdown of Alignment Assumptions: The theoretical alignment assumptions of μP (about correlations of weights, inputs, and updates) hold only at initialization and rapidly break down after the warmup stage, especially when the batch size far exceeds the width (Kosson et al., 21 Oct 2025).
- Learning Rate Warmup: In practice, μP acts as a kind of implicit learning rate warmup. If strong exponential warmup schedules are adopted and independent (decoupled) weight decay is used, learning rate transfer can be achieved without μP scaling (Kosson et al., 21 Oct 2025).
- Function-Space Learning Rate (FLeRM): Layerwise matching of output changes in function space enables robust width-wise learning rate transfer even in arbitrary architectures and optimizer settings, empirically matching training dynamics across scale (Milsom et al., 24 Feb 2025).
The following table summarizes key scaling prescriptions for matrix-like parameters (AdamW):
| Parameter | Scaling with width |
|---|---|
| Learning rate ($\eta$) | Scaled down as width grows |
| Weight decay ($\lambda$) | Scaled up as width grows, preserving the product $\eta\lambda$ |
This combination keeps the RMS norm and top singular value of parameter matrices width-invariant (Fan et al., 17 Oct 2025).
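One possible instantiation of this prescription with AdamW is sketched below. It assumes the common 1/width learning-rate rule for matrix-like parameters, which then requires weight decay to grow proportionally with width to keep the product $\eta\lambda$ fixed; the helper name, the rank-based parameter split, and the proxy/target widths are illustrative assumptions rather than the cited works' exact recipes.

```python
import torch
import torch.nn as nn

def adamw_width_scaled(model, width, base_width, base_lr, base_wd):
    """AdamW with a width-scaled prescription (sketch): matrix-like parameters
    get lr ~ 1/width and weight_decay ~ width relative to the tuned proxy, so
    the product lr * weight_decay stays width-invariant. Vector-like parameters
    (biases, norm gains) keep the base learning rate, undecayed. The rank-based
    split below is a crude stand-in for a real module-aware classification."""
    k = width / base_width  # width expansion factor relative to the proxy
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    vector_like = [p for p in model.parameters() if p.ndim < 2]
    groups = [
        {"params": matrix_like, "lr": base_lr / k, "weight_decay": base_wd * k},
        {"params": vector_like, "lr": base_lr, "weight_decay": 0.0},
    ]
    # AdamW applies decoupled weight decay, so steady-state update sizes are
    # governed by the preserved product lr * weight_decay.
    return torch.optim.AdamW(groups)

# Usage: tune (base_lr, base_wd) on a width-256 proxy, then reuse them as-is.
proxy = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
wide = nn.Sequential(nn.Linear(32, 4096), nn.ReLU(), nn.Linear(4096, 10))
opt_proxy = adamw_width_scaled(proxy, 256, 256, base_lr=1e-3, base_wd=0.1)
opt_wide = adamw_width_scaled(wide, 4096, 256, base_lr=1e-3, base_wd=0.1)
```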
5. Transfer Learning and Feature Quality with Increasing Width
Wider networks not only support learning rate transfer, but also improve the transferability of learned features:
- Distributed Representations: Wide networks encode input information more redundantly, in distributed groupings, leading to richer and more transferable feature spaces (Gilboa et al., 2019).
- Fine-tuning and Downstream Tasks: When only the last layer is fine-tuned for new tasks, performance on transfer tasks increases sharply with width, even when accuracy on the original supervised task saturates (Gilboa et al., 2019); a linear-probe sketch of this protocol follows this list.
- Preservation of Nuanced Structure: Wide networks’ hidden states preserve expressive cluster structure corresponding to subtle input variations (e.g., digit style or translation) that are absent in bottlenecked (narrow) networks (Gilboa et al., 2019).
- Theoretical Treatments: Transfer learning in infinite-width networks under mean-field/μP scaling requires both correct learning rate scaling and sufficient feature learning strength; the effectiveness of transfer further depends on task similarity and explicit regularization (e.g., elastic weight coupling) (Lauditi et al., 6 Jul 2025).
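The last-layer fine-tuning protocol referenced above amounts to linear probing; a generic sketch, with placeholder backbone, dimensions, and data loader, is:

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feature_dim, num_classes, loader, epochs=5, lr=1e-3):
    """Freeze a pretrained backbone and train only a fresh linear head on a
    downstream task: the 'fine-tune only the last layer' protocol under which
    wider backbones were observed to transfer better. Backbone, dimensions,
    and the data loader are placeholders."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    head = nn.Linear(feature_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = backbone(x)  # frozen, width-dependent features
            loss = nn.functional.cross_entropy(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```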
6. Layerwise Learning Rate Schemes, Task Complexity, and Adaptive Width
- Layerwise Learning Rate Adaptation: In transfer learning and knowledge distillation, especially when compressing wide teachers into narrow students, layerwise adaptation of learning rates using metrics such as attention-map or Jacobian divergence between corresponding layers yields more stable training and larger performance gains on complex tasks. The benefit grows as the width gap between teacher and student increases, underscoring the need for adaptive mechanisms beyond a single globally transferred learning rate (Kokane et al., 5 Jul 2024); a loose sketch of the idea follows this list.
- Adaptive Width Methods: Techniques that allow the network to learn its effective width during training eliminate the necessity of width-specific hyperparameter tuning, indirectly enabling robust learning rate schedules across dynamically varying widths (Errica et al., 27 Jan 2025).
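As a loose illustration of the layerwise idea mentioned above, the sketch below scales each student layer's learning rate with a width-agnostic Gram-matrix divergence from the corresponding teacher layer; the divergence metric and the linear scaling rule are assumptions for illustration, not the exact scheme (attention-map or Jacobian divergence) of the cited work.

```python
import torch
import torch.nn.functional as F

def gram(feats):
    """Batch-by-batch similarity (Gram) matrix of flattened, L2-normalized
    features; it has the same shape for teacher and student even when their
    layer widths differ."""
    flat = F.normalize(feats.flatten(1), dim=-1)
    return flat @ flat.T

def layerwise_lrs(base_lr, teacher_feats, student_feats):
    """Assign each student layer a learning rate that grows with the mismatch
    between its features and the teacher's (illustrative rule only)."""
    lrs = []
    for t, s in zip(teacher_feats, student_feats):
        divergence = (gram(t) - gram(s)).pow(2).mean()   # per-layer mismatch
        lrs.append(base_lr * (1.0 + divergence.item()))  # larger gap, larger LR
    return lrs

# Usage: recompute layerwise_lrs periodically during distillation and write the
# values into the corresponding optimizer parameter groups.
```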
7. Practical Recommendations and Contemporary Challenges
- Pilot Tuning and "Zero-shot" Transfer: For architectures using μP or AM-μP parameterization, tune the learning rate (and weight decay as appropriate) on a proxy (narrow/small) model; the same value transfers to any width with no loss of stability or speed (Zhang et al., 5 Oct 2025).
- Inclusion of Weight Decay Scaling: In modern architectures, always scale weight decay ($\lambda$, for AdamW) together with the learning rate ($\eta$) to ensure sublayer gain invariance and width-robust hyperparameter transfer (Fan et al., 17 Oct 2025, Kosson et al., 21 Oct 2025).
- Learning Rate Schedules and Initialization: For effective width transfer, use strong learning rate warmup schedules and residual-aware initialization, particularly in the context of Transformer or ResNet training at scale (Zhang et al., 5 Oct 2025); a simple warmup schedule sketch follows this list.
- Parameterization Awareness: The principle of learning rate transfer is fundamentally parameterization-dependent: without μP, improved-standard, or function-space-matched scaling, learning rates discovered at one width are not predictive of optimality at another width (Bordelon et al., 4 Feb 2025, Yang et al., 2020).
- Empirical Tuning: Despite theoretical transferability under P, practical factors (e.g., activation normalization, optimizer-specific dynamics, non-ideal alignment, high batch sizes, hardware-induced noise) may demand empirical checks, especially in regimes far from initialization or in very deep or heterogeneous models (Kosson et al., 21 Oct 2025, Fan et al., 17 Oct 2025).
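For the warmup recommendation above, a simple schedule built on PyTorch's standard LambdaLR scheduler is sketched below; the linear-warmup-plus-cosine shape and the step counts are illustrative choices, not the specific schedules advocated in the cited works.

```python
import math
import torch

def warmup_then_cosine(optimizer, warmup_steps, total_steps):
    """Linear warmup to the base learning rate followed by cosine decay,
    expressed as a multiplicative factor for LambdaLR."""
    def schedule(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                    # linear ramp-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=schedule)

# Usage: call scheduler.step() once per optimizer step.
# scheduler = warmup_then_cosine(optimizer, warmup_steps=1000, total_steps=100_000)
```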
In summary, learning rate transfer across widths is robustly achieved in the maximal-update (μP) or mean-field scaling regime, where the loss landscape sharpness and feature learning dynamics are width-invariant. Both spectral analysis of the loss landscape ("super consistency") and rigorous proofs in the linear case establish that the optimal learning rate approaches a fixed limit as width grows, enabling principled "zero-shot" hyperparameter transfer. In practice, the correct scaling of weight decay in conjunction with the learning rate, the use of independent (decoupled) weight decay, and, where applicable, warmup scheduling and function-space calibration are crucial to preserving this invariance beyond the early stages of training. For model families in which μP (and its generalizations) do not hold, such as standard parameterization or NTK, the learning rate must be re-tuned for each width, and transferability collapses. These principles fundamentally inform efficient training, scaling, and transfer learning in contemporary deep learning.