
Maximal Update Parameterization

Updated 16 January 2026
  • Maximal Update Parameterization is a method that reparameterizes neural network training so that weight updates remain O(1) even as model size increases.
  • Its abc-parameterization framework ensures that tuning hyperparameters on small proxy models directly transfers to larger architectures with minimal optimality loss.
  • Extensions like u-μP and SμPar adapt μP for diverse scenarios including operator networks and sparse regimes, yielding significant computational and performance gains.

Maximal Update Parameterization

Maximal Update Parameterization (μP) is a systematic scheme for reparameterizing neural network training such that model optimization hyperparameters—including learning rates and initialization scales—can be tuned on a small proxy model and directly transferred to arbitrarily larger, deeper, or sparser models with minimal loss of optimality. The defining principle is that, in the infinite-width (and more generally, infinite size) limit, weight updates in every layer remain order-one (O(1)), guaranteeing nontrivial and stable feature learning across layers while avoiding regimes where signals either vanish (“lazy/NTK”) or explode. μP is formulated through the so-called abc-parametrization of weights and updates, providing a unified foundation for width-, depth-, or modality-scaling across dense, sparse, operator-based, and local learning regimes (Blake et al., 2024, Li et al., 24 Jun 2025, Yaida, 2022, Ishikawa et al., 2023, Ishikawa et al., 2024, Dey et al., 2024).

1. Foundations and abc-Parameterization

At the core of μP is the abc-parametrization, whereby every weight tensor $W$ in layer $\ell$ is represented as:

  • Raw parameterization: $w_0 \sim \mathcal{N}(0, B_W^2)$
  • Weight scaling: $W_t = A_W \cdot w_t$
  • Gradient update scaling: $w_{t+1} = w_t + C_W \cdot \Phi_t(\nabla \mathcal{L}_0, \ldots, \nabla \mathcal{L}_t)$

The triple $(A_W, B_W, C_W)$ is specified with explicit dependence on the width, depth, or kernel size, and the associated scaling rules $a_W$, $b_W$, and $c_W$ are selected to ensure two properties:

  • Feature-learning maximality: Each layer’s updates have the same order in model size, preventing “feature freezing”
  • Hyperparameter transferability: Optimal $\eta$, $\sigma$, etc., identified on a small proxy reliably extend to large-scale models

In typical Transformer-style blocks, these scalings take the form:

| Tensor Type     | $a_W$ (param)                         | $b_W$ (init)                 | $c_W$ (LR)            |
|-----------------|---------------------------------------|------------------------------|-----------------------|
| Input weights   | 1                                     | 1                            | 1                     |
| Hidden weights  | 1                                     | $1/\sqrt{\mathrm{fan_{in}}}$ | $1/\mathrm{fan_{in}}$ |
| Output weights  | $1/\mathrm{fan_{in}}$                 | 1                            | 1                     |
| Residual branch | $\sqrt{\mathrm{base}/\mathrm{depth}}$ | —                            | —                     |

This guarantees that $\|\Delta W\|_2 = O(1)$ regardless of width or depth (Blake et al., 2024, Yaida, 2022).
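
To make the table concrete, the following PyTorch-style sketch wires these per-tensor scalings into an Adam optimizer for a toy three-layer network. The helper name `mup_group`, the base values `base_std` and `base_lr`, and the decision to lump biases in with their weights are illustrative simplifications, not the official μP reference implementation; the intent is only to show where the $b_W$ (init) and $c_W$ (LR) factors enter, and where the output multiplier $a_W = 1/\mathrm{fan_{in}}$ is applied in the forward pass.

```python
import math
import torch
import torch.nn as nn

def mup_group(layer: nn.Linear, kind: str, base_std: float = 0.02, base_lr: float = 1e-3):
    """Initialize one linear layer per the table above and return its Adam param group.

    kind is 'input', 'hidden', or 'output'; biases are lumped with weights for brevity.
    """
    fan_in = layer.weight.shape[1]
    if kind == "input":      # b_W = 1, c_W = 1
        std, lr = base_std, base_lr
    elif kind == "hidden":   # b_W = 1/sqrt(fan_in), c_W = 1/fan_in
        std, lr = base_std / math.sqrt(fan_in), base_lr / fan_in
    elif kind == "output":   # b_W = 1, c_W = 1 (the 1/fan_in lives in a_W; see forward below)
        std, lr = base_std, base_lr
    else:
        raise ValueError(kind)
    nn.init.normal_(layer.weight, std=std)
    nn.init.zeros_(layer.bias)
    return {"params": layer.parameters(), "lr": lr}

width = 1024
inp, hid, out = nn.Linear(32, width), nn.Linear(width, width), nn.Linear(width, 10)
opt = torch.optim.Adam([mup_group(inp, "input"),
                        mup_group(hid, "hidden"),
                        mup_group(out, "output")])

def forward(x: torch.Tensor) -> torch.Tensor:
    h = torch.relu(hid(torch.relu(inp(x))))
    return out(h) / out.weight.shape[1]   # a_W = 1/fan_in multiplier on the output weights
```

Tuning `base_std` and `base_lr` on a narrow proxy and reusing them at larger `width` is then exactly the transfer recipe described below.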

2. Mathematical Formulation and Infinite-Width Regime

μP formalizes these rules by setting scaling exponents $(a_l, b_l, c_l)$ per layer (often fixing $a_l = 0$) in such a way that:

  • Initializations and LRs obey

$$W_l = \frac{w_l}{M^{a_l}}, \qquad w_l \sim \mathcal{N}\!\left(0,\ \sigma'^2 / M^{2 b_l}\right)$$

$$\eta_l = \frac{\eta'_l}{M^{c_l}}$$

  • “Maximal Update” ensures

$\Delta W_l\, h_{l-1} = \Theta(1)$ for all $l$

  • The readout layer satisfies an analogous condition for the output features

For standard backpropagation, the canonical choices are: $b_1 = 0$, $b_{2 \ldots L-1} = 1/2$, $b_L = 1$; $c_1 = -1$, $c_{2 \ldots L-1} = 0$, $c_L = 1$.
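
As a worked instance (following the formulas above, with $a_l = 0$ so that $W_l = w_l$, and writing a single base learning rate $\eta'$ for all layers), these exponents give, for a depth-$L$ network of width $M$,

$$\sigma_1 = \sigma', \qquad \sigma_{2\ldots L-1} = \frac{\sigma'}{\sqrt{M}}, \qquad \sigma_L = \frac{\sigma'}{M}; \qquad \eta_1 = \eta' M, \qquad \eta_{2\ldots L-1} = \eta', \qquad \eta_L = \frac{\eta'}{M},$$

so the input layer's learning rate grows with width while the readout's shrinks, and the width-independent base values $(\sigma', \eta')$ are the quantities that transfer across scales.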

This structure enforces $\Delta W_l\, h_{l-1} = O(1)$ in every layer, achieving maximal feature learning and stable forward/backward signal propagation even as $M \to \infty$ (Yaida, 2022, Ishikawa et al., 2024).

3. Extensions: Unit Scaling and u-μP

Unit Scaling is an independent scheme whereby all tensors—activations, weights, gradients—are initialized to unit variance, keeping their values near 1.0 for efficient floating-point representation. u-μP is the fusion of μP and Unit Scaling:

  • All initializations $B_W = 1$ (unit variance)
  • All matmuls use a $1/\sqrt{\mathrm{fan_{in}}}$ scaling factor
  • Adam learning-rate multipliers $C_W$ are reduced to a single global $\eta$; i.e., no separate per-tensor $\hat{\eta}$ is needed

Typical default hyperparameters in u-μP (each defaulting to 1 unless noted) include the global $\eta$, $\alpha_\mathrm{attn}$, $\alpha_\mathrm{act}$, $\alpha_\mathrm{res}$, and $\alpha_\mathrm{loss}$. In empirical studies, setting all $\alpha = 1$ yields losses within 1% of the best achievable, and hyperparameter sweeps become independent and trivially efficient (Blake et al., 2024).
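
A minimal sketch of the unit-scaled layer this implies (assuming a plain linear layer; the class name `UnitLinear` and its details are illustrative, not the implementation from the u-μP paper):

```python
import math
import torch
import torch.nn as nn

class UnitLinear(nn.Module):
    """Unit-variance weights (B_W = 1) with the 1/sqrt(fan_in) factor moved into the matmul."""

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))  # unit-variance init
        self.scale = 1.0 / math.sqrt(fan_in)                      # scales the matmul, not the init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.weight) * self.scale

# A single global learning rate is shared by every tensor, which is what makes
# the u-muP hyperparameter sweep effectively one-dimensional.
model = nn.Sequential(UnitLinear(256, 1024), nn.GELU(), UnitLinear(1024, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
```

Because every tensor starts near unit scale, activations and gradients sit comfortably inside low-precision formats such as FP8/BF16, which is the practical motivation discussed below.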

4. Algorithmic and Practical Consequences

μP and its variants enable robust hyperparameter transfer:

  1. Tune $\{\eta, \sigma, \alpha_i\}$ on a small proxy (width/density/size reduced by $4$–$8\times$)
  2. For full-scale or production deployment, reuse the tuned $\{\eta^*, \alpha^*\}$ without retuning
  3. For u-μP, a single 1D sweep over $\eta$ suffices for near-optimality even at extreme width/depth/batch scaling (see the sketch below)
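
A hedged sketch of steps 1–3 follows; the helpers `build_mup_model` and `val_loss` are placeholders for whatever model constructor and evaluation loop are in use, not a specific API:

```python
import numpy as np

def tune_then_transfer(build_mup_model, val_loss, proxy_width=256, full_width=8192):
    """Sweep the global LR on a narrow proxy, then reuse the winner at full width.

    Under muP the per-tensor width scalings (c_W) are already built into model and
    optimizer construction, so eta* itself is not rescaled when the width grows.
    """
    etas = np.logspace(-4, -1, num=8)   # the 1D sweep from step 3
    best_eta = min(etas, key=lambda e: val_loss(build_mup_model(proxy_width, e)))
    return build_mup_model(full_width, best_eta)
```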

In sparse regimes, SμPar generalizes μP to keep activation, gradient, and update norms invariant to both width and sparsity. Initialization standard deviations and learning rates are reparameterized by width and density multipliers $(m_d, m_\rho)$ so that optimal hyperparameters can be tuned once and transferred to settings with arbitrary sparsity. SμPar traces the Pareto frontier of validation loss across sparsity levels with no per-sparsity retuning (Dey et al., 2024).

5. Applications: Operator Networks, Local Learning, and Second-Order Optimization

μP has been rigorously extended beyond standard dense networks:

  • Operator networks: For Fourier Neural Operators (FNOs), μP's scaling for the number of modes $K$ is $b(K) = c(K) = \Theta(1/\sqrt{d \log K})$, providing zero-shot hyperparameter transfer for billion-parameter models (Li et al., 24 Jun 2025).
  • Local learning: In predictive coding (PC) and target propagation (TP), μP ensures feature learning in the infinite-width limit even where local gradient rules deviate from standard BP; PC interpolates between first-order and Gauss-Newton gradients depending on parameterization, while TP eliminates the lazy kernel regime entirely (Ishikawa et al., 2024).
  • Second-order methods: μP generalizations for K-FAC and Shampoo derive layerwise initializations, learning-rate scalings, and damping rules for high-width regimes. Hyperparameter transfer and feature learning are stabilized compared to PyTorch defaults, with increased generalization performance as model scale grows (Ishikawa et al., 2023).
  • SAM and perturbation-based optimization: μP$^2$ defines the unique layerwise SAM radius scaling that maintains effective perturbation and learning in all layers as width increases, in contrast with standard SAM's last-layer-only effect. The joint $(\eta^*, \rho^*)$ optimum for learning rate and SAM radius is width-invariant (Haas et al., 2024).
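
To make the operator-network rule concrete, the sketch below applies the $b(K) = c(K) = \Theta(1/\sqrt{d \log K})$ relation as a relative transfer factor between a proxy with $K_0$ modes and a target with $K$ modes; the dimension $d$ cancels in the ratio and the constants hidden by $\Theta(\cdot)$ are ignored, so `fno_mode_rescale` is an illustrative scaling sketch rather than the exact μTransfer-FNO prescription:

```python
import math

def fno_mode_rescale(k_proxy: int, k_target: int) -> float:
    """Multiplier for the spectral weights' init std and LR when moving K0 -> K modes."""
    # b(K) = c(K) ~ 1/sqrt(d * log K), so the K0 -> K ratio reduces to sqrt(log K0 / log K).
    return math.sqrt(math.log(k_proxy) / math.log(k_target))

sigma_target = 0.02 * fno_mode_rescale(16, 256)   # proxy-tuned init std, rescaled
lr_target = 3e-4 * fno_mode_rescale(16, 256)      # proxy-tuned LR, rescaled
```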

6. Training Dynamics, Hyperparameter Transfer, and Empirical Outcomes

Under μP, feature learning remains active: signals neither vanish nor explode, and every layer moves at O(1) per update—a sharp contrast to NTK/lazy regimes, where only the output layer evolves. Further, in u-μP, mutual independence of hyperparameters means that individual sweeps can be combined without retuning—a property not shared by vanilla μP. Empirical studies show stable losses, improved generalization, and low-precision training compatibility (FP8/BF16) out-of-the-box in u-μP, where traditional μP may underflow (Blake et al., 2024).

On large language modeling tasks, SμPar yields up to $11.9\%$ relative loss improvement at extreme sparsity (99.2%) and up to $4.1\times$ compute efficiency gain over standard parameterizations. For operator networks, μTransfer-FNO maintains test error while reducing compute cost to $0.30\times$ that of direct tuning on the full-scale model. In second-order optimization, μP's prescription ensures joint transferability of learning rate and damping parameters and avoids pathological dynamics observed in SP or uniform parametrizations (Dey et al., 2024, Li et al., 24 Jun 2025, Ishikawa et al., 2023).

7. Generalizations, Limitations, and Theoretical Insights

μP forms the “maximal update/mean-field” endpoint ($s = 1$) in a one-parameter meta-family of infinite-width scaling strategies, interpolating between the NTK limit ($s = 0$) and coupled width–depth limits ($0 < s < 1$, with depth scaling as $L \sim n^{1-s}$), with robust representation learning recovered at $s = 1$ (Yaida, 2022). The abc-symmetry provides a mathematical invariance, allowing redistribution of scale among $A_W$, $B_W$, and $C_W$ without affecting optimization or forward activations.
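
As a concrete illustration of this invariance (stated here for plain SGD-type updates, where $\Phi_t$ is linear in the gradient; adaptive optimizers such as Adam change only the rescaling rule for $C_W$), one admissible transformation is

$$(A_W,\ B_W,\ C_W) \;\longmapsto\; \left(\gamma A_W,\ \tfrac{B_W}{\gamma},\ \tfrac{C_W}{\gamma^2}\right), \qquad \gamma > 0,$$

which leaves the effective weights $W_t = A_W w_t$, their initial distribution, and their updates $\Delta W_t$ unchanged.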

The μP paradigm is generic:

  • It applies wherever model size (width, sparsity, number of modes/operators) increases systematically
  • The only caveat is delicate handling of residual branches and operator layers, which may require nontrivial spectral-norm scaling analysis (Li et al., 24 Jun 2025)
  • For very large architectures or alternative update rules (e.g., local or operator-based learning), explicit verification of O(1) update and signal norms may be required, following the detailed scaling laws for each structure

In sum, Maximal Update Parameterization provides a principled, scalable framework for hyperparameter tuning and transfer, ensuring robust feature-learning and stable dynamics across the entire spectrum of modern neural network families (Blake et al., 2024, Li et al., 24 Jun 2025, Dey et al., 2024, Yaida, 2022, Ishikawa et al., 2024, Ishikawa et al., 2023).
