Maximal Update Parametrization (μP)
- μP is a framework that scales weight updates to ensure rich feature learning across network layers even in the infinite-width limit.
- It prescribes specific scaling rules for weights, initialization, and learning rates, enabling hyperparameter settings to transfer from small proxy models to larger ones.
- μP underpins robust training dynamics in architectures like Transformers, ResNets, and MoE, providing both theoretical guarantees and practical efficiency.
Maximal Update Parametrization (μP) is a parameterization framework for deep neural networks designed to ensure nontrivial feature learning and scalable hyperparameter transfer across model sizes. Unlike standard or NTK parameterizations, μP prescribes scaling of weights, initialization, and learning rates such that every parameter group in a network updates with maximal allowed magnitude consistent with stable training dynamics. This design enables both provable feature evolution in the infinite-width limit and the empirical property that hyperparameters tuned on small “proxy” models transfer to much larger ones without adjustment.
1. Foundational Principles and the abc-Parametrization Framework
The core of μP is the abc-parametrization, which assigns each weight tensor of a deep network a pair of scaling exponents $(a_l, b_l)$ and specifies a global learning-rate exponent $c$. For the weight matrix of layer $l$, this is expressed as $W^l = n^{-a_l} w^l$, where $w^l$ has entries initialized as independent $\mathcal{N}(0, n^{-2b_l})$, and the scalar learning rate is $\eta\, n^{-c}$ ($n$ is typically the hidden width).
- Standard Parameterization (SP): $a_l = 0$ for all $l$, with $b_1 = 0$ and $b_l = 1/2$ for $l \ge 2$; the learning rate must scale as $1/n$ (i.e., $c = 1$) for stability but loses feature learning in the infinite-width limit.
- NTK Parametrization: $a_1 = 0$, $a_l = 1/2$ for $l \ge 2$; $b_l = 0$ for all $l$; $c = 0$; results in “lazy” training, i.e., features remain fixed near initialization.
- μP: Input layer $a_1 = -1/2$; hidden layers $a_l = 0$; output $a_{L+1} = 1/2$; set all $b_l = 1/2$ and $c = 0$ (learning rate $\eta\, n^{-c} = \eta$). This achieves order-1 updates (“maximal” without instability) and supports nontrivial feature evolution at all depths (Yang et al., 2020).
A critical result is that any parametrization in this natural space either admits feature learning ($r = 0$, the μP limit) or reduces to a kernel regime ($r > 0$) with frozen features, but not both. This dichotomy is precisely characterized in terms of the “r-value,” a function of the exponents $(a_l, b_l, c)$. Uniform Parameterizations interpolate between the NTK ($r > 0$) and μP ($r = 0$) limits (Yang et al., 2020).
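To make these exponent choices concrete, the following minimal sketch (the layer ordering input/hidden/output and the helper name `effective_scales` are illustrative, not from any library) computes, for a given width $n$, the weight multiplier $n^{-a}$, initialization standard deviation $n^{-b}$, and learning rate $\eta\, n^{-c}$ implied by SP, NTK, and μP for a three-layer network.

```python
# Illustrative sketch: effective scales implied by the abc-parametrization for a
# 3-layer network (input, hidden, output) under SP, NTK, and muP, as quoted above.
ABC = {
    # (a_input, a_hidden, a_output), (b_input, b_hidden, b_output), c
    "SP":  {"a": (0.0, 0.0, 0.0),  "b": (0.0, 0.5, 0.5), "c": 1.0},
    "NTK": {"a": (0.0, 0.5, 0.5),  "b": (0.0, 0.0, 0.0), "c": 0.0},
    "muP": {"a": (-0.5, 0.0, 0.5), "b": (0.5, 0.5, 0.5), "c": 0.0},
}

def effective_scales(parametrization: str, width: int, base_lr: float = 1.0):
    """Weight multipliers n^{-a}, init stds n^{-b}, and learning rate eta * n^{-c}."""
    p = ABC[parametrization]
    return {
        "weight_multiplier": tuple(width ** -a for a in p["a"]),
        "init_std":          tuple(width ** -b for b in p["b"]),
        "learning_rate":     base_lr * width ** -p["c"],
    }

if __name__ == "__main__":
    for name in ABC:
        print(name, effective_scales(name, width=1024))
```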
2. Training Dynamics and Infinite-Width Feature Learning
Under μP, the signal propagation and parameter updates in an L-layer network remain order-1 in all layers as width $n \to \infty$. The Tensor Programs technique rigorously computes deterministic limits for the evolution of activations, gradients, and loss derivatives, showing that for any pseudo-Lipschitz test function, empirical means over coordinates converge almost surely to analytic expectations. This machinery confirms:
- Nontrivial feature evolution: Hidden representations change substantially during training at every layer, not only in shallow networks, and remain linearly independent at any layer and training time (Chen et al., 12 Mar 2025).
- Global convergence: Under mild assumptions on the activation and dataset, the only convergent points of SGD under μP are global minima, since residual errors necessarily drive further change in a dynamically evolving feature space (Chen et al., 12 Mar 2025).
- Covariance control: Despite maximal updates, the second-order statistics (e.g., covariance across data points) remain stably controlled at all times.
This stands in contrast to NTK/SP, in which features vary by only $O(1/\sqrt{n})$ and the network acts functionally as a fixed kernel.
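As a toy numerical illustration of this contrast (a sketch under simplifying assumptions: a two-layer tanh network, squared loss, a single SGD step, and the abc exponents quoted in Section 1), one can measure how much the hidden representation of a fixed batch moves after one update; under μP the per-coordinate shift stays roughly constant as width grows, while under the NTK parametrization it shrinks like $1/\sqrt{n}$.

```python
import torch

def feature_shift(width, parametrization, d_in=16, batch=8, lr=0.5, seed=0):
    """RMS change of the hidden representation after one SGD step (toy check)."""
    torch.manual_seed(seed)
    a1, a2 = {"NTK": (0.0, 0.5), "muP": (-0.5, 0.5)}[parametrization]
    b1, b2 = {"NTK": (0.0, 0.0), "muP": (0.5, 0.5)}[parametrization]
    W1 = (torch.randn(width, d_in) * width ** -b1).requires_grad_()
    w2 = (torch.randn(1, width) * width ** -b2).requires_grad_()
    x = torch.randn(batch, d_in) / d_in ** 0.5    # keep preactivations O(1)
    y = torch.randn(batch, 1)

    def hidden(W):                                # effective first layer: n^{-a1} * W
        return torch.tanh(x @ (width ** -a1 * W).t())

    h0 = hidden(W1).detach()
    loss = ((hidden(W1) @ (width ** -a2 * w2).t() - y) ** 2).mean()
    g1, _ = torch.autograd.grad(loss, (W1, w2))
    h1 = hidden(W1 - lr * g1).detach()            # c = 0 for both parametrizations
    return (h1 - h0).pow(2).mean().sqrt().item()

for n in (256, 1024, 4096, 16384):
    print(n, {p: round(feature_shift(n, p), 4) for p in ("NTK", "muP")})
```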
3. Hyperparameter Transfer, μTransfer, and Zero-Shot Scaling
An empirically and practically significant property of μP is hyperparameter stability: when μP scalings are adopted, the optimal learning rates, initialization multipliers, and related hyperparameters remain approximately invariant as model width is varied. This enables the μTransfer paradigm (Yang et al., 2022), in which:
- Hyperparameters are tuned on a small proxy model (reduced width, depth) using μP-prescribed scalings.
- These settings transfer “zero-shot” to a much larger target model without further tuning. In benchmarks (e.g., BERT-large from a 13M-parameter proxy, GPT-3 from a 40M-parameter proxy), transferred HPs outperformed published defaults, and total tuning cost was reduced to roughly one full training run.
- μP is compatible with architectures such as Transformers, ResNets, and, via recent extensions, even Fourier Neural Operators (Li et al., 24 Jun 2025) and Mixture-of-Experts (Małaśnicki et al., 13 Aug 2025).
In ablation studies on transformers with up to 10B parameters and 190B tokens, μP enabled direct transfer of the optimal learning rate from 2M-parameter proxies to full-scale models, even with architectural variants (e.g., multi-query attention, nonlinearities, batch size changes) (Lingle, 8 Apr 2024).
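A schematic of the resulting workflow is sketched below; the helpers `build_model` and `train_and_eval` are hypothetical placeholders for project-specific code, and only the structure of the procedure is being illustrated: tune on the μP proxy, then reuse the base hyperparameters unchanged at the target width.

```python
import numpy as np

def mu_transfer_lr(build_model, train_and_eval, proxy_width, target_width,
                   lrs=tuple(np.logspace(-4, -1, 7))):
    """Tune the base learning rate on a narrow muP proxy, reuse it at full width."""
    # 1. Cheap sweep on the proxy model (both models must use muP scalings).
    proxy_losses = {lr: train_and_eval(build_model(proxy_width), lr) for lr in lrs}
    best_lr = min(proxy_losses, key=proxy_losses.get)
    # 2. Under muP the optimum is approximately width-invariant, so the proxy's
    #    best base learning rate is applied to the target model without re-tuning.
    return train_and_eval(build_model(target_width), best_lr), best_lr
```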
4. Practical Scaling Rules and Implementation
μP prescribes concrete rules for scaling initialization and learning rates for each parameter group, based on their “role” in the model:
| Parameter Type | Initialization Variance | Learning Rate | Notes |
|---|---|---|---|
| Embeddings (WE) | Θ(1) | base rate α | Invariant with width |
| Hidden weights (W) | Θ(1/n) | α·P/n or α | P = proxy width, n = target width |
| Output weights | Θ(1/n²) | α/n | Collapses the n-dimensional hidden state to the output |
| FNO kernel params (R) | Θ(1/(d log K)) | α/√(d log K) | K: Fourier modes, d: PDE dimension |
| MoE expert weights | Θ(1/n) | Θ(1/n) | As for hidden layers |
| MoE router weights | Θ(1/n²) | Θ(1) | As for output weights |
See the respective works for detailed derivations for each module (Yang et al., 2020, Yang et al., 2022, Lingle, 8 Apr 2024, Li et al., 24 Jun 2025, Małaśnicki et al., 13 Aug 2025).
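As a minimal, manual illustration of the table's first three rows (a sketch under assumed conventions: a toy MLP, Adam, and a proxy width P used as the reference for the hidden-layer rule; the class and function names are illustrative):

```python
import torch
import torch.nn as nn

class MuMLP(nn.Module):
    """Toy MLP with initialization variances following the table above."""
    def __init__(self, d_in, width, d_out):
        super().__init__()
        self.embed = nn.Linear(d_in, width, bias=False)
        self.hidden = nn.Linear(width, width, bias=False)
        self.readout = nn.Linear(width, d_out, bias=False)
        nn.init.normal_(self.embed.weight, std=1.0)              # Theta(1) variance
        nn.init.normal_(self.hidden.weight, std=width ** -0.5)   # Theta(1/n) variance
        nn.init.normal_(self.readout.weight, std=1.0 / width)    # Theta(1/n^2) variance

    def forward(self, x):
        return self.readout(torch.relu(self.hidden(torch.relu(self.embed(x)))))

def mu_adam(model, base_lr, width, proxy_width):
    """Per-group learning rates following the table: alpha, alpha*P/n, alpha/n."""
    groups = [
        {"params": model.embed.parameters(),   "lr": base_lr},
        {"params": model.hidden.parameters(),  "lr": base_lr * proxy_width / width},
        {"params": model.readout.parameters(), "lr": base_lr / width},
    ]
    return torch.optim.Adam(groups)

model = MuMLP(d_in=32, width=4096, d_out=10)
opt = mu_adam(model, base_lr=3e-3, width=4096, proxy_width=256)
```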
Implementation requires careful assignment of scaling rules to each parameter group. PyTorch and Jax/Flax packages automate this (notably github.com/microsoft/mup). For transformers, attention scaling is modified from $1/\sqrt{D}$ (standard) to $1/D$ to ensure proper width scaling (Lingle, 8 Apr 2024).
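A sketch of the automated route via the mup package follows (based on its documented usage pattern; the toy model is illustrative, and details such as re-initializing parameters with the package's init helpers are omitted):

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam  # pip install mup

class MLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                                  nn.Linear(width, width), nn.ReLU())
        self.readout = MuReadout(width, d_out)   # muP-aware output layer

    def forward(self, x):
        return self.readout(self.body(x))

base, delta, model = MLP(width=64), MLP(width=128), MLP(width=4096)
set_base_shapes(model, base, delta=delta)   # record which dimensions scale with width
opt = MuAdam(model.parameters(), lr=1e-3)   # base lr tuned once on a narrow proxy
```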
Adoption of μP presents some challenges: the formulas are more complex than standard initialization and require diligence in scaling per-parameter learning rates. However, this complexity is offset by hyperparameter transferability and scaling efficiency.
5. Empirical Characterization and Model Scaling
μP-driven scaling has been empirically validated in a range of settings:
- Few-shot learning (Omniglot via MAML): the μP limit achieves meta-test accuracy of 66–69%, while the NTK/GP baselines perform near chance (Yang et al., 2020).
- Word2Vec pretraining: μP-parameterized networks learn evolving embeddings that support analogies and cluster semantically similar words, in contrast to NTK/GP where embeddings remain essentially random.
- Transformer and ResNet benchmarks: Models parametrized with μP showed that the optimal learning rate and related HPs are stable, and wider models always performed at least as well as narrower ones (Yang et al., 2022).
- PDEs with FNOs: μTransfer-FNO secured learning rate stability and test performance across large K, reducing tuning compute to 30% of the cost of a full sweep, while maintaining accuracy (Li et al., 24 Jun 2025).
- Mixture-of-Experts LLMs: μP ensures constant optimal learning rate with increasing model width and number of experts, with empirical transfer holding unless fine-grained (“granular”) expert scaling is introduced (Małaśnicki et al., 13 Aug 2025).
6. Extensions, Modern Variants, and Low-Precision Training
μP has been extended and refined in several newer directions:
- Unit-Scaled μP (u-μP) (Blake et al., 24 Jul 2024): Combines μP with unit scaling, ensuring that all activations, weights, and gradients are initialized to unit variance and remain so during training. This adjustment supports efficient low-precision training (such as FP8) and substantially reduces hyperparameter interdependencies, enabling nearly one-dimensional hyperparameter sweeps and superior transfer at scale (a minimal sketch of the underlying unit-scaling idea appears after this list).
- Fourier Neural Operators and PDE solvers: μP scaling analyses now account for the spectral (Fourier-mode) parameters, deriving how initialization and learning rates should scale with the number of retained modes K and the PDE dimension d (cf. the Θ(1/(d log K)) rule in the table above) (Li et al., 24 Jun 2025).
- MoE architectures: μP theory prescribes distinct scaling for expert and router components, maintaining feature learning and HP invariance with respect to expert number (Małaśnicki et al., 13 Aug 2025).
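The following is a minimal sketch of the unit-scaling idea underlying u-μP (a generic illustration, not the full u-μP recipe): weights are initialized to unit variance and the usual $1/\sqrt{\text{fan\_in}}$ factor is moved into the forward computation, keeping stored tensors near unit scale for low-precision formats.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitScaledLinear(nn.Module):
    """Unit-variance weights; the 1/sqrt(fan_in) factor is applied in the forward pass."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in))   # N(0, 1) initialization
        self.scale = d_in ** -0.5                               # static forward scale

    def forward(self, x):
        # Output has roughly unit variance when x does, and the stored weight
        # stays at unit scale, which is friendly to FP8/low-precision storage.
        return F.linear(x, self.weight) * self.scale
```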
Research points to persistent challenges, including compatibility with all optimizers (e.g., some issues with Lion compared to Adam (Lingle, 8 Apr 2024)), ensuring transferability with changing granularity in MoE, and best practices under strong regularization or non-standard normalization.
7. Theoretical and Practical Significance
The Maximal Update Parametrization provides a rigorous theoretical foundation and a set of scalable engineering prescriptions for deep learning at scale.
- Theoretical implications: μP provides (via Tensor Programs) a mathematical theory for nontrivial, globally convergent feature learning in the infinite-width limit, including provable linear independence and richness of representations at all layers.
- Practical impact: μP and its descendants enable “tune once, transfer everywhere” for hyperparameters, drastically lowering the cost and complexity of scaling models, with demonstrated success in LLMs, computer vision, PDE surrogates, and more.
- Design philosophy: Properly chosen scaling rules unlock both convergence and expressivity in neural network training, where parameterization is not merely a technical detail but a key determinant of trainability and generalization.
Ongoing research seeks to unify μP with new optimization paradigms, extend to emergent architectures, and clarify how parametrization and data interact to shape large-scale learning. As neural network width and complexity continue their rapid ascent, μP’s centrality in robust, efficient, and theoretically grounded deep learning practice appears established (Yang et al., 2020, Yang et al., 2022, Lingle, 8 Apr 2024, Blake et al., 24 Jul 2024, Chen et al., 12 Mar 2025, Li et al., 24 Jun 2025, Małaśnicki et al., 13 Aug 2025).