Maximal Update Parametrization (μP)

Updated 24 December 2025
  • Maximal Update Parametrization (μP) is a framework that scales network initializations and learning updates to maintain constant signal norms regardless of layer width.
  • It enables direct hyperparameter transfer from smaller proxy models to larger architectures, overcoming the limitations of standard Xavier/NTK scaling methods.
  • μP adapts to various models—including MLPs, transformers, and Fourier Neural Operators—ensuring robust feature learning, convergence, and applicability to sparse and low-precision settings.

Maximal Update Parametrization (μP) is a framework for scaling neural network initialization and update rules such that the key signals—activations, gradients, and weight updates—remain independent of layer width or, more generally, system size. This ensures maximal feature-learning at any network width while enabling hyperparameters (HPs) discovered on proxy (small-width) models to transfer directly to arbitrarily large models without retuning. μP and its variants thus overcome the limitations of prevailing parameterizations, such as standard Xavier/He initialization or Neural Tangent Kernel (NTK) scaling, both of which either induce vanishing updates (lazy learning) or necessitate model-size-specific hyperparameter sweeps. μP plays a fundamental role in the design, analysis, and scaling of modern deep, wide, and sparse networks, as well as neural operators for scientific computing.

1. Theoretical Motivation and Regime Distinction

Traditional parameterizations such as Xavier or NTK scaling initialize weights as $W^\ell \sim \mathcal{N}(0, \sigma^2/n^\ell)$ and employ update rules $\Delta W^\ell \approx \eta\,\nabla_{W^\ell}\mathcal{L}$ that cause failure modes as the layer width $n^\ell$ increases: feature maps may remain near initialization (the "lazy" NTK regime), or feature-learning signal propagation is suppressed ($O(1/n^\ell)$ updates) or diverges. In both cases, the optimal learning rate $\eta^*$ becomes width-dependent, often requiring laborious layerwise retuning.

μP is formulated to ensure that, as $n \to \infty$, every layer's parameter update induces an $O(1)$ change in the pre-activations, preserving non-trivial feature learning. Specifically, μP employs an "abc-parametrization" for every weight tensor $W$:

  • $w \sim \mathcal{N}(0, B_W^2)$ (initialization)
  • $W = A_W w$ (scaling for the forward pass)
  • $w \leftarrow w + C_W\,\Phi(\nabla \ell)$ (update, e.g., SGD/Adam step)

The scaling constants $A_W$, $B_W$, $C_W$ are chosen to depend on layer width such that all signal and update norms remain $O(1)$ as the network grows. For standard fully connected networks, the μP regime for inner layers takes $A_W = 1$, $B_W \propto 1/\sqrt{n}$, $C_W \propto 1/n$ to ensure both activations and update effects (on activations) are width-invariant (Blake et al., 24 Jul 2024, Ishikawa et al., 4 Nov 2024, Chen et al., 12 Mar 2025).
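As a concrete illustration, the following minimal NumPy sketch instantiates these hidden-layer choices ($A_W = 1$, $B_W \propto 1/\sqrt{n}$, $C_W \propto 1/n$) and checks that the pre-activation RMS stays near 1 as width grows. The helper names and the toy demo are illustrative, not taken from the cited papers.

```python
import numpy as np

def mup_hidden_init(n_in, n_out, rng):
    # B_W ∝ 1/sqrt(fan_in): pre-activations are O(1) at initialization.
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))

def mup_hidden_update(W, grad, eta_base, n_in):
    # C_W ∝ 1/fan_in: one step changes pre-activations by O(1), independent of width.
    return W - (eta_base / n_in) * grad

rng = np.random.default_rng(0)
for n in (128, 1024, 8192):
    W = mup_hidden_init(n, n, rng)   # A_W = 1, so W is used directly in the forward pass
    x = rng.normal(size=n)           # O(1)-coordinate input activations
    print(n, float(np.sqrt(np.mean((x @ W) ** 2))))  # RMS of pre-activations stays ~1
```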

2. Canonical Scaling Laws and Hyperparameter Transfer

μP enables width-invariant scaling for initialization variance, learning rate, and (for second-order optimizers such as K-FAC and Shampoo) layerwise damping, ensuring that hyperparameters optimized on small (proxy) models transfer to models of arbitrary width. The transfer recipe entails:

  • Weight initialization variance and the optimizer learning rate are scaled inversely (or as otherwise prescribed by the μP calculation) with the relevant architectural parameter, such as the width or, for FNOs, the number of retained Fourier modes (Li et al., 24 Jun 2025).
  • For dense networks, standard μP sets $W^\ell_{ij} \sim \mathcal{N}(0, \sigma_{\rm base}^2/m_d)$ and learning rate $\eta^\ell = \eta_{\rm base}/m_d$ for width multiplier $m_d = d^\ell/d^\ell_{\rm base}$ (Dey et al., 24 May 2024).
  • For Fourier Neural Operators, operator-norm control leads to initialization scaling $b(K) = \Theta(1/\sqrt{d\log K})$ with learning rate $c(K) = \Theta(1/\sqrt{d\log K})$ to maintain bounded spectral norm and update magnitude as the number of retained Fourier modes $K$ grows (Li et al., 24 Jun 2025).

This width-invariant scaling is formalized in μ-transfer theorems: when μP-prescribed hyperparameters are used, the loss landscape with respect to HPs (LR, batch size, beta, etc.) stabilizes as model size increases, so the optimizer’s argmin on the proxy carries directly over to the target model.
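A minimal sketch of this transfer recipe under the scalings quoted above; the function names, the example values, and the assumption that $d$ is held fixed while $K$ grows in the FNO case are illustrative.

```python
import math

def transfer_dense(sigma_base, eta_base, d_base, d_target):
    # Dense muP transfer: variance scales as sigma_base^2 / m_d, learning rate as eta_base / m_d.
    m_d = d_target / d_base                      # width multiplier
    return sigma_base / math.sqrt(m_d), eta_base / m_d

def transfer_fno_modes(sigma_base, eta_base, K_base, K_target):
    # muTransfer-FNO: b(K), c(K) = Theta(1/sqrt(d*log K)); with d fixed, rescale proxy HPs
    # by sqrt(log K_base / log K_target) when growing the number of retained modes.
    r = math.sqrt(math.log(K_base) / math.log(K_target))
    return sigma_base * r, eta_base * r

# Example: hyperparameters tuned at width 256 reused at width 4096.
print(transfer_dense(sigma_base=0.02, eta_base=3e-3, d_base=256, d_target=4096))
```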

3. Extensions: Unit-Scaled μP, Sparsity, and Local Learning

Unit-Scaled μP (u-μP): To simplify training across precision formats and further standardize scales, u-μP combines μP with Unit Scaling so that all tensors (activations, weights, gradients) have RMS ≈ 1 at initialization. For standard layers in transformers:

  • For a hidden layer $W^\ell \in \mathbb{R}^{n^{\ell-1} \times n^\ell}$: $A_W = 1/\sqrt{n^{\ell-1}}$, $B_W = 1$, $C_W = \eta/\sqrt{n^{\ell-1}}$.
  • For input/output embeddings, variant rules are chosen to keep all signal RMS at or near unity (Blake et al., 24 Jul 2024). This framework allows instant transition to low-precision (FP8) training without auxiliary scaling, as all relevant tensors are FP8-centered at initialization.
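A minimal sketch of a u-μP-style hidden layer under the rule above, with the $1/\sqrt{n^{\ell-1}}$ factor moved into the forward pass so the stored weights keep unit RMS. The class and method names are illustrative, and $\Phi$ is taken to be the identity (plain SGD).

```python
import numpy as np

class UnitScaledLinear:
    def __init__(self, n_in, n_out, rng):
        self.w = rng.normal(0.0, 1.0, size=(n_in, n_out))  # B_W = 1: unit-RMS weights
        self.scale = 1.0 / np.sqrt(n_in)                    # A_W = 1/sqrt(n^{l-1})

    def forward(self, x):
        # Applying the width factor here (not in the init) keeps weights, activations,
        # and gradients near RMS 1, which is what makes low-precision casting simple.
        return (x @ self.w) * self.scale

    def sgd_step(self, grad_w, eta):
        # C_W = eta / sqrt(n^{l-1}); for Adam, grad_w would be the Adam update direction.
        self.w -= (eta * self.scale) * grad_w
```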

Sparse μP (SμPar): For unstructured sparse networks, naïve application of μP leads to signal scales that vanish with the sparsity level (density $\rho$). SμPar incorporates $\rho$ into both the initialization and the learning rate: $W^\ell_{ij} \sim \mathcal{N}(0, \sigma_{\rm base}^2 / (m_d m_\rho))$ and $\eta^\ell = \eta_{\rm base} / (m_d m_\rho)$ with $m_\rho = \rho / \rho_{\rm base}$. This ensures all signals remain $O(1)$ regardless of sparsity, facilitating hyperparameter transfer across both width and sparsity (Dey et al., 24 May 2024).
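A sketch of how SμPar folds the density ratio into the dense μP rule above; it is a direct transcription of the stated formulas, with illustrative function and argument names.

```python
import math

def supar_hparams(sigma_base, eta_base, d_base, d_target, rho_base, rho_target):
    # m_d: width multiplier; m_rho: density multiplier (rho = fraction of nonzero weights).
    m_d = d_target / d_base
    m_rho = rho_target / rho_base
    std = sigma_base / math.sqrt(m_d * m_rho)   # variance ∝ 1 / (m_d * m_rho)
    lr = eta_base / (m_d * m_rho)               # learning rate ∝ 1 / (m_d * m_rho)
    return std, lr

# Higher sparsity (smaller rho_target) raises both init scale and LR, so the surviving
# weights carry O(1) signal despite most connections being masked out.
print(supar_hparams(0.02, 3e-3, d_base=256, d_target=256, rho_base=1.0, rho_target=0.125))
```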

Local Learning Paradigms: μP provides theoretically principled scaling laws even for alternatives to backpropagation such as predictive coding and target propagation. For instance, in predictive coding, μP can yield gradient updates that interpolate smoothly between first-order and Gauss-Newton-like regimes, while for target propagation, μP eliminates the possibility of falling into the NTK kernel regime, enforcing feature learning in the infinite-width limit (Ishikawa et al., 4 Nov 2024).

4. Second-Order Optimization and Generalization

For second-order optimizers (K-FAC, Shampoo), a maximal update parameterization ensures non-lazy feature learning even at infinite width, overcoming limitations where second-order dynamics otherwise degenerate to purely kernel (NTK) behavior. Scaling exponents for initialization, learning rates, and damping must be adjusted per optimizer:

  • For K-FAC: $c_1 = e_B - 1$, $c_\ell = e_B - e_A$ for $1 < \ell < L$, and $c_L = 1 - e_A$, with proper layerwise damping scaling (Ishikawa et al., 2023).
  • For Shampoo, the analogous rules employ $e_A = e_B = 1/2$. These parameterizations yield improved generalization metrics as model width increases and maintain the invariance of feature learning and convergence properties under increased scale.
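The layerwise exponent rule quoted above can be tabulated directly; the small helper below simply evaluates it. How the exponents enter the learning-rate and damping schedules follows (Ishikawa et al., 2023), and the function name is illustrative.

```python
def layerwise_exponents(num_layers, e_A, e_B):
    # c_1 = e_B - 1, c_l = e_B - e_A for 1 < l < L, c_L = 1 - e_A (rule quoted above for K-FAC).
    exps = {}
    for l in range(1, num_layers + 1):
        if l == 1:
            exps[l] = e_B - 1
        elif l == num_layers:
            exps[l] = 1 - e_A
        else:
            exps[l] = e_B - e_A
    return exps

# Shampoo's analogous rule uses e_A = e_B = 1/2.
print(layerwise_exponents(num_layers=4, e_A=0.5, e_B=0.5))
```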

5. Empirical Validation and Theoretical Guarantees

μP has received rigorous theoretical grounding and extensive empirical validation:

  • Feature evolution and non-degeneracy: Under μP, features at all layers remain linearly independent and deviate by $O(1)$ from their initialization; Gram matrices do not lose rank, yielding rich representation learning (Chen et al., 12 Mar 2025).
  • Global convergence: For wide networks, μP guarantees that if SGD converges, it finds global minima (i.e., vanishing error signals correspond to optimal solutions) (Chen et al., 12 Mar 2025).
  • Hyperparameter transfer: Across dense, sparse, and operator-based architectures, μP enables "zero-shot" HP transfer from proxies to large models without accuracy loss (Dey et al., 24 May 2024, Li et al., 24 Jun 2025, Blake et al., 24 Jul 2024).
  • Low-precision stability: Empirical results demonstrate that u-μP remains robust under FP8 arithmetic, whereas standard μP and other parameterizations may suffer gradient underflow (Blake et al., 24 Jul 2024).
  • Sparse modeling: At extreme sparsities, SμPar outperforms standard and dense-optimal HPs, remaining on the Pareto frontier for tradeoffs between validation loss and sparsity (Dey et al., 24 May 2024).

6. Methodologies and Generalization Across Architectures

μP has been adapted for a range of architectures, including fully-connected MLPs, transformers, Fourier Neural Operators (μTransfer-FNO), and various optimizer families (SGD, Adam, K-FAC, Shampoo). The abc-parametrization principle and associated scaling rules are instantiated differently for each, but the core desideratum—invariant $O(1)$ signal and update norms—remains unchanged.

A table summarizing μP scaling for selected architectures appears below:

| Model/Layer Type | Initialization Variance | Learning Rate Scaling |
|---|---|---|
| MLP, hidden layer | $\propto 1/n$ | $\propto 1/n$ |
| FNO, Fourier modes | $\propto 1/(d\log K)$ | $\propto 1/\sqrt{d\log K}$ |
| Sparse (SμPar) | $\propto 1/(m_d m_\rho)$ | $\propto 1/(m_d m_\rho)$ |
| u-μP (unit scale) | $=1$ (all tensors) | $\propto 1/\sqrt{n}$ or $1$ |
| K-FAC (SGD) | layer-dependent | layer-dependent |

In each context, μP scaling is mathematically justified as necessary and sufficient for both stable initialization and maximal feature evolution.
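A toy "coordinate check" of this desideratum, in the spirit of (but not taken from) the cited works: one SGD step on a single hidden weight matrix, comparing how far the pre-activations move under a μP learning rate versus a width-independent one. All names and constants are illustrative.

```python
import numpy as np

def preact_shift_rms(n, use_mup, eta_base=0.5, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))   # 1/sqrt(n) init in both cases
    x = rng.normal(size=n)
    h0 = x @ W
    delta = rng.normal(size=n)        # stand-in for an O(1)-coordinate backward signal
    grad = np.outer(x, delta)         # dL/dW for h = x @ W with dL/dh = delta
    eta = eta_base / n if use_mup else eta_base
    h1 = x @ (W - eta * grad)
    return float(np.sqrt(np.mean((h1 - h0) ** 2)))

for n in (256, 1024, 4096):
    # muP: shift stays ~eta_base regardless of n; fixed LR: shift grows linearly with n.
    print(n, "muP", round(preact_shift_rms(n, True), 3),
             "fixed-LR", round(preact_shift_rms(n, False), 1))
```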

7. Limitations, Open Directions, and Impact

Current μP theory is rigorous in the infinite-width limit but relies on empirical extension for deep, narrow, or dynamically sparse regimes. Limitations include:

  • Dynamic sparsity: Existing SμPar theory handles static masks; dynamic mask evolution requires an analysis that accounts for weight-dependent structure (Dey et al., 24 May 2024).
  • Hardware maturity: SμPar's utility for unstructured sparsity depends on advances in hardware support.
  • Fine-grained depth scaling and finite-width corrections remain active research questions (Ishikawa et al., 2023).
  • Proofs beyond first-step updates for certain optimizers remain open (Ishikawa et al., 2023).

The practical impact of μP is substantial in reducing the computational burden of hyperparameter search, enabling rapid scaling of new architectures, and improving the viability of sparse, wide, or low-precision neural models for scientific, industrial, and foundational AI applications (Blake et al., 24 Jul 2024, Li et al., 24 Jun 2025, Dey et al., 24 May 2024).
