
Maximal-Update Parameterization (μP)

Updated 31 December 2025
  • Maximal-Update Parameterization (μP) is a neural network framework that defines scaling rules for initialization, learning rates, and output weights to ensure O(1) feature updates as network width increases.
  • Under μP the optimal learning rate converges to a nonzero constant as width grows, yielding width-invariant optimization and enabling zero-shot hyperparameter transfer from small proxy models to full-scale networks.
  • Empirical benchmarks show μP outperforms standard and NTK parameterizations by maintaining stable, nontrivial feature learning across deep neural networks.

Maximal-Update Parameterization (μP) is a principled neural network parameterization framework designed to ensure stable and nontrivial feature learning at arbitrary width. μP uniquely prescribes how layerwise initialization, the scaling of the output weights, and learning rates are chosen so that the optimal learning rate converges to a nonzero constant as network width increases; this enables zero-shot hyperparameter transfer from small proxy models to full-scale networks and ensures robust, width-invariant optimization. μP stands in contrast to the standard and Neural Tangent parameterizations, under which the optimal learning rate respectively vanishes or diverges with width and feature updates fail to remain nontrivial in the infinite-width limit (Hayou, 3 Nov 2025).

1. Formal Definition and Scaling Rules

The Maximal-Update Parameterization is specified for deep multi-layer perceptrons as follows. Consider an $L$-layer MLP,

$$f(x) = V^T W_L \cdots W_1 W_0 x$$

where each hidden layer $W_\ell \in \mathbb{R}^{n \times n}$ has width $n$, $W_0 \in \mathbb{R}^{n \times d}$ is the input layer, and $V \in \mathbb{R}^n$ is the output readout. In μP, one chooses scaling exponents $\alpha_0, \ldots, \alpha_L, \alpha_V, \alpha_\eta$ such that:

  • Initializations:
    • $W_0^{ij} \sim \mathcal{N}(0, d^{-\alpha_0})$ with $\alpha_0 = 1$
    • $W_\ell^{ij} \sim \mathcal{N}(0, n^{-\alpha_\ell})$ with $\alpha_\ell = 1$ for $\ell = 1, \ldots, L$
    • $V^i \sim \mathcal{N}(0, n^{-\alpha_V})$ with $\alpha_V = 2$
  • Learning rate:
    • For full-batch gradient descent: $\eta = O(1)$ ($\alpha_\eta = 0$)
    • For Adam: $\eta = O(n^{-1})$ ($\alpha_\eta = 1$)

Equivalently, one can write, for $\ell \ge 1$:

  • $W_\ell \sim n^{-1/2} \widetilde W_\ell$, $V \sim n^{-1} \widetilde V$, $\eta \sim n^0$ (for GD)

A critical μP prescription is the “mean-field” scaling of the readout: its variance is one power of $n$ smaller than in the standard parameterization ($\alpha_V = 2$ rather than $\alpha_V = 1$) (Hayou, 3 Nov 2025).
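
To make these rules concrete, here is a minimal sketch, assuming PyTorch, plain SGD or Adam, and the linear MLP defined above, of how the μP exponents translate into initialization standard deviations and a width-scaled learning rate; the helper names `make_mup_mlp` and `mup_lr` are illustrative, not from the referenced paper.

```python
import torch
import torch.nn as nn

def make_mup_mlp(d_in: int, width: int, depth: int) -> nn.Sequential:
    """Linear MLP f(x) = V^T W_L ... W_1 W_0 x with muP initialization.

    Standard deviations follow the exponents above (variance = scale^(-alpha)):
      input  W_0 ~ N(0, d_in^-1)   -> std d_in^-0.5   (alpha_0 = 1)
      hidden W_l ~ N(0, width^-1)  -> std width^-0.5  (alpha_l = 1)
      readout V  ~ N(0, width^-2)  -> std width^-1    (alpha_V = 2, "mean-field" readout)
    """
    layers = [nn.Linear(d_in, width, bias=False)]
    layers += [nn.Linear(width, width, bias=False) for _ in range(depth)]
    layers += [nn.Linear(width, 1, bias=False)]
    with torch.no_grad():
        nn.init.normal_(layers[0].weight, std=d_in ** -0.5)
        for hidden in layers[1:-1]:
            nn.init.normal_(hidden.weight, std=width ** -0.5)
        nn.init.normal_(layers[-1].weight, std=width ** -1.0)
    return nn.Sequential(*layers)

def mup_lr(base_lr: float, width: int, optimizer: str = "sgd") -> float:
    """muP learning-rate scaling: eta = O(1) for SGD, eta = O(1/n) for Adam."""
    return base_lr if optimizer == "sgd" else base_lr / width
```

With this convention, widening the model leaves the SGD base rate unchanged ($\alpha_\eta = 0$), while the Adam rate is divided by the width ($\alpha_\eta = 1$).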

2. Theoretical Guarantees for Learning Rate Transfer

μP is characterized by rigorous theorems establishing the convergence of the optimal learning rate $\eta_n^{(t)}$ as the width $n$ grows:

Theorem 2.1: One-Step Transfer

With input Gram matrix $K$ and target vector $y$ such that $Ky \neq 0$, the unique minimizer of the infinite-width one-step loss

$$L_\infty^{(1)}(\eta) = \frac{1}{2m}\,\| -y + \eta\,(L/m)\,K y \|^2$$

is given by

$$\eta_\infty^{(1)} = \frac{m}{L}\,\frac{y^T K y}{\| K y \|^2} > 0$$

and, for any compact interval containing $\eta_\infty^{(1)}$,

$$\eta_n^{(1)} - \eta_\infty^{(1)} = O_P(n^{-1/2})$$
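
As a quick numerical check of Theorem 2.1, one can evaluate the closed-form minimizer and compare it with a grid search over the one-step loss. The sketch below uses NumPy with an arbitrary synthetic Gram matrix $K$ and targets $y$; the values of $m$, $L$, and the data are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, L = 200, 3                          # sample count and depth (illustrative values)
X = rng.standard_normal((m, 50))
K = X @ X.T / 50                       # a positive semi-definite input Gram matrix
y = rng.standard_normal(m)

def one_step_loss(eta: float) -> float:
    """Infinite-width one-step loss L_inf^{(1)}(eta) from Theorem 2.1."""
    return 0.5 / m * np.linalg.norm(-y + eta * (L / m) * K @ y) ** 2

# Closed-form minimizer: eta_inf^{(1)} = (m/L) * y^T K y / ||K y||^2
eta_star = (m / L) * (y @ K @ y) / np.linalg.norm(K @ y) ** 2

# The numerical argmin over a fine grid should agree with the closed form.
grid = np.linspace(0.1 * eta_star, 3.0 * eta_star, 2001)
numerical = grid[np.argmin([one_step_loss(e) for e in grid])]
print(f"closed form: {eta_star:.4f}, grid search: {numerical:.4f}")
```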

Theorem 2.2: $t$-Step Transfer

For any fixed number of steps $t$, the empirical loss curve $L_n^{(t)}(\eta)$ converges uniformly to a deterministic limit $L_\infty^{(t)}(\eta)$, whose minimizer $\eta_\infty^{(t)} > 0$ is the limit of the finite-width minimizers $\eta_n^{(t)}$.

The Standard Parameterization and the Neural Tangent Parameterization fail this transfer property: under SP, the optimal learning rate shrinks toward zero as $n$ grows; under NTK scaling, the optimum diverges (Hayou, 3 Nov 2025).

3. Mechanistic Rationale: Feature Updates and Infinite-Width Dynamics

In μP, every pre- and post-activation change $\Delta z$ per gradient step remains $\Theta(1)$ as $n \to \infty$. By contrast, SP either forces $\Delta z \to 0$ (the lazy/NTK regime) or causes updates to explode (the unstable regime). The core mechanistic insight is that μP “maximizes” feature updates: it yields fully nonlinear, mean-field-type dynamics in which features adapt and evolve with width, maintaining nontrivial feature learning (Hayou, 3 Nov 2025). This is achieved by scaling the output-layer variance and the learning rate so that each gradient update induces an $O(1)$ change to activations throughout training as width increases.
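
The $\Theta(1)$ feature-update property can be probed empirically. The sketch below, which reuses the hypothetical `make_mup_mlp` and `mup_lr` helpers from the Section 1 sketch, measures the RMS per-coordinate change of the last hidden features after a single SGD step at several widths; under the μP scaling this quantity should stay roughly constant as width grows. The training setup and constants are illustrative assumptions, not from the paper.

```python
import torch

def feature_update_rms(width: int, d_in: int = 100, depth: int = 3, lr: float = 0.5) -> float:
    """RMS per-coordinate change of the last hidden features after one SGD step.

    Under muP this should stay roughly Theta(1) as width grows.
    Reuses the hypothetical make_mup_mlp / mup_lr helpers sketched in Section 1.
    """
    torch.manual_seed(0)
    model = make_mup_mlp(d_in, width, depth)
    x, y = torch.randn(32, d_in), torch.randn(32, 1)

    body = torch.nn.Sequential(*list(model)[:-1])   # everything before the readout V
    z_before = body(x).detach()

    loss = 0.5 * ((model(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():                           # one plain SGD step, eta = O(1)
        for p in model.parameters():
            p -= mup_lr(lr, width, "sgd") * p.grad

    z_after = body(x).detach()
    return ((z_after - z_before) ** 2).mean().sqrt().item()

for n in (128, 512, 2048):
    print(n, feature_update_rms(n))                 # roughly width-independent under muP
```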

4. Empirical Evidence and Benchmarks

Extensive experiments on synthetic data (input dimension $d = 100$, $N = 1000$ samples), over MLPs with depths $L \in \{3, 9, 27\}$ and widths $n \in \{128, \ldots, 8192\}$, demonstrate:

  • Under μP, loss-vs-learning-rate curves for different widths overlap tightly, and the optimal learning rate converges rapidly to a nonzero value, with error decaying at $O(n^{-1/2})$.
  • Standard Parameterization yields optima that shrink toward zero with increasing width, along with misaligned loss curves.
  • μP with Adam exhibits stable width-invariance of the optimal learning rate even in very deep nets ($L = 27$).
  • Extended training widens the basin of near-optimal $\eta$, but the location of the optimum remains invariant (Hayou, 3 Nov 2025).

5. Practical Implementation: Guidelines for Hyperparameter Tuning

μP provides a robust workflow for practitioners:

  • Tune the learning rate $\eta$ on a proxy model of moderate width ($n_0 = 256$–$1024$) via a logarithmic grid sweep to locate the optimal basin.
  • For full width $n \gg n_0$, reuse the same $\eta$ with negligible retuning; performance changes are typically under $20\%$ even at $10\times$ scale.
  • For Adam, scale the learning rate with width as $\eta_\text{actual} = \eta_\text{grid}/n$.
  • When increasing depth $L$, theory and experiments suggest tuning $\eta$ near $O(1/L)$, since the infinite-width optimum is proportional to $1/L$.
  • This prescription ensures width-robust transferability and zero-shot deployment of hyperparameters as model size increases (Hayou, 3 Nov 2025); a code sketch of the workflow follows this list.
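
A minimal sketch of this workflow is given below, under the assumption that a training routine `train_one_model(width, lr) -> validation loss` already exists (a hypothetical callback, not part of any μP library); the grid bounds and widths are illustrative.

```python
import numpy as np

def sweep_base_lr(train_one_model, proxy_width: int, grid=None) -> float:
    """Logarithmic learning-rate sweep on a moderate-width proxy model."""
    grid = np.logspace(-3, 0, 13) if grid is None else np.asarray(grid)
    losses = [train_one_model(proxy_width, float(lr)) for lr in grid]
    return float(grid[int(np.argmin(losses))])

# Tune once on a narrow proxy, then reuse the result at full width (zero-shot transfer).
# eta_grid   = sweep_base_lr(train_one_model, proxy_width=512)
# full_width = 8192
# sgd_lr     = eta_grid               # eta = O(1): the SGD rate carries over unchanged
# adam_lr    = eta_grid / full_width  # eta_actual = eta_grid / n for Adam
```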

6. Broader Implications: Feature Learning and Parameterization Selection

The maximal-update property of μP fundamentally preserves nontrivial feature adaptation, enabling mean-field, nonlinear learning dynamics and edge-of-stability phenomena as observed in wide deep nets. Unlike SP and NTP, which drive networks into the lazy/NTK or unstable regimes, μP is the unique parameterization (within the abc-framework and linear MLP analysis) that yields both stable infinite-width learning rates and maximal feature updates. This regime is theorized to capture essential behaviors of practical deep networks, while freeing practitioners from costly, width-specific hyperparameter retuning (Hayou, 3 Nov 2025).
