
Maximal Update Parameterization

Updated 16 January 2026
  • Maximal Update Parameterization is a method that reparameterizes neural network training so that weight updates remain O(1) even as model size increases.
  • Its abc-parameterization framework ensures that tuning hyperparameters on small proxy models directly transfers to larger architectures with minimal optimality loss.
  • Extensions like u-μP and SμPar adapt μP for diverse scenarios including operator networks and sparse regimes, yielding significant computational and performance gains.

Maximal Update Parameterization

Maximal Update Parameterization (μP) is a systematic scheme for reparameterizing neural network training such that model optimization hyperparameters—including learning rates and initialization scales—can be tuned on a small proxy model and directly transferred to arbitrarily larger, deeper, or sparser models with minimal loss of optimality. The defining principle is that, in the infinite-width (and more generally, infinite size) limit, weight updates in every layer remain order-one (O(1)), guaranteeing nontrivial and stable feature learning across layers while avoiding regimes where signals either vanish (“lazy/NTK”) or explode. μP is formulated through the so-called abc-parametrization of weights and updates, providing a unified foundation for width-, depth-, or modality-scaling across dense, sparse, operator-based, and local learning regimes (Blake et al., 2024, Li et al., 24 Jun 2025, Yaida, 2022, Ishikawa et al., 2023, Ishikawa et al., 2024, Dey et al., 2024).

1. Foundations and abc-Parameterization

At the core of μP is the abc-parametrization, whereby every weight tensor $W$ in layer $\ell$ is represented as:

  • Raw parameterization: $w_0 \sim \mathcal{N}(0, B_W^2)$
  • Weight scaling: $W_t = A_W \cdot w_t$
  • Gradient update scaling: $w_{t+1} = w_t + C_W \cdot \Phi_t(\nabla \mathcal{L}_0, \ldots, \nabla \mathcal{L}_t)$

The triple $(A_W, B_W, C_W)$ is specified with explicit dependence on the width, depth, or kernel size, and the associated scaling rules $a_W$, $b_W$, and $c_W$ are selected to ensure two properties:

  • Feature-learning maximality: Each layer’s updates have the same order in model size, preventing “feature freezing”
  • Hyperparameter transferability: Optimal $\eta$, $\sigma$, etc., identified on a small proxy reliably extend to large-scale models

In typical Transformer-style blocks, these scalings take the form:

| Tensor Type     | $a_W$ (param)                         | $b_W$ (init)                 | $c_W$ (LR)            |
|-----------------|---------------------------------------|------------------------------|-----------------------|
| Input weights   | 1                                     | 1                            | 1                     |
| Hidden weights  | 1                                     | $1/\sqrt{\mathrm{fan_{in}}}$ | $1/\mathrm{fan_{in}}$ |
| Output weights  | $1/\mathrm{fan_{in}}$                 | 1                            | 1                     |
| Residual branch | $\sqrt{\mathrm{base}/\mathrm{depth}}$ | —                            | —                     |

This guarantees that $\|\Delta W\|_2 = O(1)$ regardless of width or depth (Blake et al., 2024, Yaida, 2022).
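
To make the table concrete, the following PyTorch-style sketch wires these per-tensor scalings into an Adam optimizer for a toy three-layer network. The helper name `mup_group`, the base values `base_std` and `base_lr`, and the decision to lump biases in with their weights are illustrative simplifications, not the official μP reference implementation; the intent is only to show where the $b_W$ (init) and $c_W$ (LR) factors enter, and where the output multiplier $a_W = 1/\mathrm{fan_{in}}$ is applied in the forward pass.

```python
import math
import torch
import torch.nn as nn

def mup_group(layer: nn.Linear, kind: str, base_std: float = 0.02, base_lr: float = 1e-3):
    """Initialize one linear layer per the table above and return its Adam param group.

    kind is 'input', 'hidden', or 'output'; biases are lumped with weights for brevity.
    """
    fan_in = layer.weight.shape[1]
    if kind == "input":      # b_W = 1, c_W = 1
        std, lr = base_std, base_lr
    elif kind == "hidden":   # b_W = 1/sqrt(fan_in), c_W = 1/fan_in
        std, lr = base_std / math.sqrt(fan_in), base_lr / fan_in
    elif kind == "output":   # b_W = 1, c_W = 1 (the 1/fan_in lives in a_W; see forward below)
        std, lr = base_std, base_lr
    else:
        raise ValueError(kind)
    nn.init.normal_(layer.weight, std=std)
    nn.init.zeros_(layer.bias)
    return {"params": layer.parameters(), "lr": lr}

width = 1024
inp, hid, out = nn.Linear(32, width), nn.Linear(width, width), nn.Linear(width, 10)
opt = torch.optim.Adam([mup_group(inp, "input"),
                        mup_group(hid, "hidden"),
                        mup_group(out, "output")])

def forward(x: torch.Tensor) -> torch.Tensor:
    h = torch.relu(hid(torch.relu(inp(x))))
    return out(h) / out.weight.shape[1]   # a_W = 1/fan_in multiplier on the output weights
```

Tuning `base_std` and `base_lr` on a narrow proxy and reusing them at larger `width` is then exactly the transfer recipe described below.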

2. Mathematical Formulation and Infinite-Width Regime

μP formalizes these rules by setting scaling exponents $(a_l, b_l, c_l)$ per layer (often fixing $a_l = 0$) in such a way that:

  • Initializations and LRs obey

$$W_l = \frac{w_l}{M^{a_l}}, \qquad w_l \sim \mathcal{N}\!\left(0,\ \sigma'^2 / M^{2 b_l}\right)$$

$$\eta_l = \frac{\eta'_l}{M^{c_l}}$$

  • “Maximal Update” ensures

$\Delta W_l\, h_{l-1} = \Theta(1)$ for all $l$

  • The readout layer satisfies an analogous condition for the output features

For standard backpropagation, the canonical choices are: $b_1 = 0$, $b_{2 \ldots L-1} = 1/2$, $b_L = 1$; $c_1 = -1$, $c_{2 \ldots L-1} = 0$, $c_L = 1$.
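
As a worked instance (following the formulas above, with $a_l = 0$ so that $W_l = w_l$, and writing a single base learning rate $\eta'$ for all layers), these exponents give, for a depth-$L$ network of width $M$,

$$\sigma_1 = \sigma', \qquad \sigma_{2\ldots L-1} = \frac{\sigma'}{\sqrt{M}}, \qquad \sigma_L = \frac{\sigma'}{M}; \qquad \eta_1 = \eta' M, \qquad \eta_{2\ldots L-1} = \eta', \qquad \eta_L = \frac{\eta'}{M},$$

so the input layer's learning rate grows with width while the readout's shrinks, and the width-independent base values $(\sigma', \eta')$ are the quantities that transfer across scales.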

This structure enforces $\Delta W_l\, h_{l-1} = O(1)$ in every layer, achieving maximal feature learning and stable forward/backward signal propagation even as $M \to \infty$ (Yaida, 2022, Ishikawa et al., 2024).

3. Extensions: Unit Scaling and u-μP

Unit Scaling is an independent scheme whereby all tensors—activations, weights, gradients—are initialized to unit variance, keeping their values near 1.0 for efficient floating-point representation. u-μP is the fusion of μP and Unit Scaling:

  • All initializations $B_W = 1$ (unit variance)
  • All matmuls use a $1/\sqrt{\mathrm{fan_{in}}}$ scaling factor
  • Adam learning-rate multipliers $C_W$ are reduced to a single global $\eta$; i.e., no separate per-tensor $\hat{\eta}$ is needed

Typical default hyperparameters in u-μP (each defaulting to 1 unless noted) include the global $\eta$, $\alpha_\mathrm{attn}$, $\alpha_\mathrm{act}$, $\alpha_\mathrm{res}$, and $\alpha_\mathrm{loss}$. In empirical studies, setting all $\alpha = 1$ yields losses within 1% of the best achievable, and hyperparameter sweeps become independent and trivially efficient (Blake et al., 2024).
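
A minimal sketch of the unit-scaled layer this implies (assuming a plain linear layer; the class name `UnitLinear` and its details are illustrative, not the implementation from the u-μP paper):

```python
import math
import torch
import torch.nn as nn

class UnitLinear(nn.Module):
    """Unit-variance weights (B_W = 1) with the 1/sqrt(fan_in) factor moved into the matmul."""

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))  # unit-variance init
        self.scale = 1.0 / math.sqrt(fan_in)                      # scales the matmul, not the init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.weight) * self.scale

# A single global learning rate is shared by every tensor, which is what makes
# the u-muP hyperparameter sweep effectively one-dimensional.
model = nn.Sequential(UnitLinear(256, 1024), nn.GELU(), UnitLinear(1024, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
```

Because every tensor starts near unit scale, activations and gradients sit comfortably inside low-precision formats such as FP8/BF16, which is the practical motivation discussed below.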

4. Algorithmic and Practical Consequences

μP and its variants enable robust hyperparameter transfer:

  1. Tune $\{\eta, \sigma, \alpha_i\}$ on a small proxy (width/density/size reduced by $4$–$8\times$)
  2. For full-scale or production deployment, reuse the tuned $\{\eta^*, \alpha^*\}$ without retuning
  3. For u-μP, a single 1D sweep over $\eta$ suffices for near-optimality even at extreme width/depth/batch scaling (see the sketch below)
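
A hedged sketch of steps 1–3 follows; the helpers `build_mup_model` and `val_loss` are placeholders for whatever model constructor and evaluation loop are in use, not a specific API:

```python
import numpy as np

def tune_then_transfer(build_mup_model, val_loss, proxy_width=256, full_width=8192):
    """Sweep the global LR on a narrow proxy, then reuse the winner at full width.

    Under muP the per-tensor width scalings (c_W) are already built into model and
    optimizer construction, so eta* itself is not rescaled when the width grows.
    """
    etas = np.logspace(-4, -1, num=8)   # the 1D sweep from step 3
    best_eta = min(etas, key=lambda e: val_loss(build_mup_model(proxy_width, e)))
    return build_mup_model(full_width, best_eta)
```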

In sparse regimes, SμPar generalizes μP to keep activation, gradient, and update norms invariant to both width and sparsity. Initialization standard deviations and learning rates are reparameterized by width and density multipliers $(m_d, m_\rho)$ so that optimal hyperparameters can be tuned once and transferred to settings with arbitrary sparsity. SμPar traces the Pareto frontier of validation loss across sparsity levels with no per-sparsity retuning (Dey et al., 2024).

5. Applications: Operator Networks, Local Learning, and Second-Order Optimization

μP has been rigorously extended beyond standard dense networks:

  • Operator networks: For Fourier Neural Operators (FNOs), μP's scaling for the number of modes $K$ is $b(K) = c(K) = \Theta(1/\sqrt{d \log K})$, providing zero-shot hyperparameter transfer for billion-parameter models (Li et al., 24 Jun 2025).
  • Local learning: In predictive coding (PC) and target propagation (TP), μP ensures feature learning in the infinite-width limit even where local gradient rules deviate from standard BP; PC interpolates between first-order and Gauss-Newton gradients depending on parameterization, while TP eliminates the lazy kernel regime entirely (Ishikawa et al., 2024).
  • Second-order methods: μP generalizations for K-FAC and Shampoo derive layerwise initializations, learning-rate scalings, and damping rules for high-width regimes. Hyperparameter transfer and feature learning are stabilized compared to PyTorch defaults, with increased generalization performance as model scale grows (Ishikawa et al., 2023).
  • SAM and perturbation-based optimization: μP$^2$ defines the unique layerwise SAM radius scaling that maintains effective perturbation and learning in all layers as width increases, in contrast with standard SAM's last-layer-only effect. The joint $(\eta^*, \rho^*)$ optimum for learning rate and SAM radius is width-invariant (Haas et al., 2024).
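
To make the operator-network rule concrete, the sketch below applies the $b(K) = c(K) = \Theta(1/\sqrt{d \log K})$ relation as a relative transfer factor between a proxy with $K_0$ modes and a target with $K$ modes; the dimension $d$ cancels in the ratio and the constants hidden by $\Theta(\cdot)$ are ignored, so `fno_mode_rescale` is an illustrative scaling sketch rather than the exact μTransfer-FNO prescription:

```python
import math

def fno_mode_rescale(k_proxy: int, k_target: int) -> float:
    """Multiplier for the spectral weights' init std and LR when moving K0 -> K modes."""
    # b(K) = c(K) ~ 1/sqrt(d * log K), so the K0 -> K ratio reduces to sqrt(log K0 / log K).
    return math.sqrt(math.log(k_proxy) / math.log(k_target))

sigma_target = 0.02 * fno_mode_rescale(16, 256)   # proxy-tuned init std, rescaled
lr_target = 3e-4 * fno_mode_rescale(16, 256)      # proxy-tuned LR, rescaled
```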

6. Training Dynamics, Hyperparameter Transfer, and Empirical Outcomes

Under μP, feature learning remains active: signals neither vanish nor explode, and every layer moves at O(1) per update—a sharp contrast to NTK/lazy regimes, where only the output layer evolves. Further, in u-μP, mutual independence of hyperparameters means that individual sweeps can be combined without retuning—a property not shared by vanilla μP. Empirical studies show stable losses, improved generalization, and low-precision training compatibility (FP8/BF16) out-of-the-box in u-μP, where traditional μP may underflow (Blake et al., 2024).

On large language modeling tasks, SμPar yields up to $11.9\%$ relative loss improvement at extreme sparsity (99.2%) and up to $4.1\times$ compute efficiency gain over standard parameterizations. For operator networks, μTransfer-FNO maintains test error while reducing compute cost to $0.30\times$ that of direct tuning on the full-scale model. In second-order optimization, μP's prescription ensures joint transferability of learning rate and damping parameters and avoids pathological dynamics observed in SP or uniform parametrizations (Dey et al., 2024, Li et al., 24 Jun 2025, Ishikawa et al., 2023).

7. Generalizations, Limitations, and Theoretical Insights

μP forms the “maximal update/mean-field” endpoint ($s = 1$) in a one-parameter meta-family of infinite-width scaling strategies, interpolating between the NTK limit ($s = 0$) and coupled width–depth limits ($0 < s < 1$, with depth scaling as $L \sim n^{1-s}$), with robust representation learning recovered at $s = 1$ (Yaida, 2022). The abc-symmetry provides a mathematical invariance, allowing redistribution of scale among $A_W$, $B_W$, and $C_W$ without affecting optimization or forward activations.
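
As a concrete illustration of this invariance (stated here for plain SGD-type updates, where $\Phi_t$ is linear in the gradient; adaptive optimizers such as Adam change only the rescaling rule for $C_W$), one admissible transformation is

$$(A_W,\ B_W,\ C_W) \;\longmapsto\; \left(\gamma A_W,\ \tfrac{B_W}{\gamma},\ \tfrac{C_W}{\gamma^2}\right), \qquad \gamma > 0,$$

which leaves the effective weights $W_t = A_W w_t$, their initial distribution, and their updates $\Delta W_t$ unchanged.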

The μP paradigm is generic:

  • It applies wherever model size (width, sparsity, number of modes/operators) increases systematically
  • The only caveat is delicate handling of residual branches and operator layers, which may require nontrivial spectral-norm scaling analysis (Li et al., 24 Jun 2025)
  • For very large architectures or alternative update rules (e.g., local or operator-based learning), explicit verification of O(1) update and signal norms may be required, following the detailed scaling laws for each structure

In sum, Maximal Update Parameterization provides a principled, scalable framework for hyperparameter tuning and transfer, ensuring robust feature-learning and stable dynamics across the entire spectrum of modern neural network families (Blake et al., 2024, Li et al., 24 Jun 2025, Dey et al., 2024, Yaida, 2022, Ishikawa et al., 2024, Ishikawa et al., 2023).
