Maximal-Update Parameterization (μP)
- Maximal-Update Parameterization (μP) is a neural network framework that defines scaling rules for initialization, learning rates, and output weights to ensure O(1) feature updates as network width increases.
- It guarantees width-invariant optimization: the optimal learning rate converges to a nonzero constant as width grows, enabling zero-shot hyperparameter transfer from small proxy models to full-scale networks.
- Empirical benchmarks show μP outperforms standard and NTK parameterizations by maintaining stable, nontrivial feature learning across deep neural networks.
Maximal-Update Parameterization (μP) is a principled neural network parameterization framework engineered to ensure stable and nontrivial feature learning at arbitrary width. μP uniquely prescribes how layerwise initialization, the scaling of output weights, and learning rates are chosen so that the optimal learning rate converges to a nonzero constant as network width increases; this enables zero-shot hyperparameter transfer from small proxy models to full-scale networks and ensures robust, width-invariant optimization. μP stands in contrast to the standard and Neural Tangent parameterizations, which suffer, respectively, from vanishing and exploding optimal learning rates, and from degenerate feature updates, in the infinite-width limit (Hayou, 3 Nov 2025).
1. Formal Definition and Scaling Rules
The Maximal-Update Parameterization is specified for deep multi-layer perceptrons as follows. Consider an $L$-layer MLP, $f(x) = W^{L+1}\,\phi\big(W^{L}\,\phi(\cdots\,\phi(W^{1}x)\cdots)\big)$, $x \in \mathbb{R}^{d}$,
where each hidden layer has width $n$, $W^{1} \in \mathbb{R}^{n \times d}$ is the input layer, and $W^{L+1} \in \mathbb{R}^{1 \times n}$ is the output readout. In μP, one chooses the scaling exponents of the initialization variances and learning rates with respect to the width $n$ such that:
- Initializations:
  - $W^{1}_{ij} \sim \mathcal{N}(0, 1/d)$,
  - $W^{\ell}_{ij} \sim \mathcal{N}(0, 1/n)$, for $2 \le \ell \le L$
  - $W^{L+1}_{ij} \sim \mathcal{N}(0, 1/n^{2})$,
- Learning rate (scaling of the hidden-layer learning rate with $n$):
  - For full-batch gradient descent: $\eta(n) = \eta_{0}$, independent of $n$ ($c = 0$)
  - For Adam: $\eta(n) = \eta_{0}/n$ ($c = 1$)
Equivalently, in abc-notation ($W^{\ell} = n^{-a_{\ell}} w^{\ell}$, $w^{\ell}_{ij} \sim \mathcal{N}(0, n^{-2 b_{\ell}})$, $\eta = \eta_{0} n^{-c}$), one can write for the hidden layers $2 \le \ell \le L$:
- $a_{\ell} = 0$, $b_{\ell} = 1/2$, $c = 0$ (for GD)
A critical μP prescription is the "mean-field" scaling of the readout: its initialization variance is one power of $n$ smaller than standard ($n^{-2}$ rather than $n^{-1}$), so readout entries are $\Theta(1/n)$ rather than $\Theta(1/\sqrt{n})$ (Hayou, 3 Nov 2025).
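As a concrete illustration, the sketch below builds such an MLP in PyTorch with μP initialization and per-layer SGD learning rates, following the widely used μP table (input layer: width-independent initialization with the rate scaled up by $n$; hidden layers: $1/n$ initialization variance with a width-independent rate; readout: $1/n^{2}$ variance with the rate scaled down by $n$). The helper names and the value of `base_lr` are illustrative assumptions, not from the paper; this is a minimal sketch of the prescription, not a reference implementation.

```python
import torch
import torch.nn as nn

def make_mup_mlp(d_in: int, width: int, depth: int, d_out: int = 1) -> nn.Sequential:
    """MLP f(x) = W^{L+1} phi(W^L ... phi(W^1 x)) with muP initialization."""
    mods = [nn.Linear(d_in, width, bias=False), nn.Tanh()]
    for _ in range(depth - 1):
        mods += [nn.Linear(width, width, bias=False), nn.Tanh()]
    mods.append(nn.Linear(width, d_out, bias=False))
    net = nn.Sequential(*mods)

    linears = [m for m in net if isinstance(m, nn.Linear)]
    nn.init.normal_(linears[0].weight, std=d_in ** -0.5)   # input layer:   Var = 1/d
    for lin in linears[1:-1]:
        nn.init.normal_(lin.weight, std=width ** -0.5)     # hidden layers: Var = 1/n
    nn.init.normal_(linears[-1].weight, std=1.0 / width)   # readout:       Var = 1/n^2 ("mean-field")
    return net

def make_mup_sgd(net: nn.Sequential, width: int, base_lr: float = 0.1) -> torch.optim.SGD:
    """Per-layer SGD rates under muP: input x n, hidden x 1, readout x 1/n."""
    linears = [m for m in net if isinstance(m, nn.Linear)]
    groups = [
        {"params": list(linears[0].parameters()), "lr": base_lr * width},
        {"params": [p for l in linears[1:-1] for p in l.parameters()], "lr": base_lr},
        {"params": list(linears[-1].parameters()), "lr": base_lr / width},
    ]
    return torch.optim.SGD(groups, lr=base_lr)

# The same base_lr is reused unchanged as `width` grows; only the fixed
# per-layer multipliers above depend on n.
net = make_mup_mlp(d_in=16, width=1024, depth=3)
opt = make_mup_sgd(net, width=1024)
```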
2. Theoretical Guarantees for Learning Rate Transfer
μP is characterized by rigorous theorems establishing the convergence of the optimal learning rate as the width $n \to \infty$:
Theorem 2.1: One-Step Transfer
With input Gram matrix $K$ and target vector $y$ satisfying a mild non-degeneracy condition, the infinite-width one-step loss $\mathcal{L}_{\infty}(\eta)$ admits a unique minimizer $\eta^{*}_{\infty} \in (0, \infty)$, given in closed form in terms of $K$ and $y$. Moreover, for any compact interval $I$ containing $\eta^{*}_{\infty}$, the finite-width minimizers $\eta^{*}_{n} = \arg\min_{\eta \in I} \mathcal{L}_{n}(\eta)$ converge to $\eta^{*}_{\infty}$ as $n \to \infty$.
Theorem 2.2: $k$-Step Transfer
For any fixed number of gradient steps $k$, the empirical loss curve $\eta \mapsto \mathcal{L}^{(k)}_{n}(\eta)$ converges uniformly (over compact intervals) to a deterministic limit $\mathcal{L}^{(k)}_{\infty}(\eta)$, whose minimizer is the limit of the finite-width minimizers.
The Standard Parameterization and the Neural Tangent Parameterization fail this transfer property: under SP, the optimal learning rate shrinks toward zero as $n$ grows; under NTK scaling, the optimum diverges (Hayou, 3 Nov 2025).
3. Mechanistic Rationale: Feature Updates and Infinite-Width Dynamics
In μP, every pre- and post-activation change per gradient step remains $\Theta(1)$ as $n \to \infty$. By contrast, SP either forces feature updates that vanish with width (lazy/NTK regime) or causes them to explode (unstable regime). The core mechanistic insight is that μP "maximizes" feature updates, yielding a fully nonlinear mean-field-type dynamics in which features adapt and evolve with width, maintaining nontrivial feature learning (Hayou, 3 Nov 2025). This is achieved by scaling the output-layer variance and learning rate to guarantee that each gradient update induces a $\Theta(1)$ change to activations throughout training as the width increases.
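This scaling claim can be checked numerically. The sketch below (plain PyTorch; the two-layer network, data sizes, and step size are illustrative assumptions, not the paper's setup) takes one full-batch gradient step at several widths and reports the root-mean-square per-coordinate change of the hidden activations: roughly constant under μP, decaying like $n^{-1/2}$ under SP for this layer.

```python
import torch

def rms_feature_update(width: int, mode: str, lr: float = 0.1, d: int = 8, m: int = 32) -> float:
    """RMS per-coordinate change of hidden activations after one full-batch GD step."""
    torch.manual_seed(0)
    x = torch.randn(m, d)
    y = torch.randn(m, 1)

    W = torch.randn(d, width) * d ** -0.5          # first layer: Var = 1/d in both schemes
    if mode == "mup":
        V = torch.randn(width, 1) / width          # readout std 1/n ("mean-field")
        lr_w, lr_v = lr * width, lr / width        # muP per-layer rates
    else:                                          # standard parameterization (SP)
        V = torch.randn(width, 1) * width ** -0.5  # readout std 1/sqrt(n)
        lr_w, lr_v = lr, lr                        # one global rate
    W.requires_grad_(True)
    V.requires_grad_(True)

    h0 = torch.tanh(x @ W).detach()                # features before the step
    loss = ((torch.tanh(x @ W) @ V - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        W -= lr_w * W.grad
        V -= lr_v * V.grad
        h1 = torch.tanh(x @ W)                     # features after the step
    return (h1 - h0).pow(2).mean().sqrt().item()

for n in (64, 256, 1024, 4096):
    print(f"n={n:5d}  muP: {rms_feature_update(n, 'mup'):.3f}   SP: {rms_feature_update(n, 'sp'):.3f}")
```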
4. Empirical Evidence and Benchmarks
Extensive experiments on synthetic datasets, using MLPs over a range of depths and widths, demonstrate the following (a minimal reproduction sketch follows this list):
- Under μP, loss-vs-learning-rate curves for different widths overlap tightly, and the optimal learning rate converges rapidly to a nonzero value, with the gap to its limit shrinking as the width grows.
- Standard Parameterization yields optimal learning rates that shrink toward zero with increasing width, together with loss curves that fail to align across widths.
- μP with Adam exhibits stable width-invariance of the optimal learning rate even in very deep networks.
- Extended training widens the basin of near-optimal learning rates, but the location of the optimum remains invariant (Hayou, 3 Nov 2025).
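A stripped-down version of such a benchmark is sketched below (PyTorch; the synthetic data, three-layer tanh MLP, widths, grid, and step count are illustrative choices, not the paper's configuration). For each width it runs a few full-batch GD steps under μP initialization and per-layer rates over a logarithmic grid of base learning rates, then reports the arg-min, which should settle near a width-independent value.

```python
import math
import torch

def final_loss(width: int, base_lr: float, steps: int = 20, d: int = 8, m: int = 64, depth: int = 3) -> float:
    """Full-batch GD on fixed synthetic data with muP init and per-layer rates."""
    g = torch.Generator().manual_seed(0)
    x = torch.randn(m, d, generator=g)
    y = torch.randn(m, 1, generator=g)

    # muP initialization: input Var 1/d, hidden Var 1/n, readout Var 1/n^2.
    Ws = [torch.randn(d, width, generator=g) * d ** -0.5]
    Ws += [torch.randn(width, width, generator=g) * width ** -0.5 for _ in range(depth - 1)]
    Ws += [torch.randn(width, 1, generator=g) / width]
    # muP per-layer SGD rates: input x n, hidden x 1, readout x 1/n.
    lrs = [base_lr * width] + [base_lr] * (depth - 1) + [base_lr / width]

    for W in Ws:
        W.requires_grad_(True)
    for _ in range(steps):
        h = x
        for W in Ws[:-1]:
            h = torch.tanh(h @ W)
        loss = ((h @ Ws[-1] - y) ** 2).mean()
        grads = torch.autograd.grad(loss, Ws)
        with torch.no_grad():
            for W, grad, lr in zip(Ws, grads, lrs):
                W -= lr * grad
    with torch.no_grad():
        h = x
        for W in Ws[:-1]:
            h = torch.tanh(h @ W)
        return ((h @ Ws[-1] - y) ** 2).mean().item()

grid = [10.0 ** e for e in torch.linspace(-2.0, 0.5, 11).tolist()]
for width in (128, 512, 2048):
    losses = [final_loss(width, lr) for lr in grid]
    losses = [l if math.isfinite(l) else float("inf") for l in losses]
    best = grid[min(range(len(grid)), key=losses.__getitem__)]
    print(f"width={width:5d}  best base learning rate ~ {best:.3g}")
```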
5. Practical Implementation: Guidelines for Hyperparameter Tuning
μP provides a robust workflow for practitioners (a small transfer helper is sketched after the list below):
- Tune the learning rate on a proxy model of moderate width via a logarithmic grid sweep to locate the optimal basin.
- For the full-width model, reuse the same learning rate with negligible retuning; the resulting performance gap relative to direct tuning is typically small even at scale.
- For Adam, scale the learning rate as $1/n$ with the width $n$ (i.e., multiply the tuned proxy rate by the ratio of proxy to target width when scaling up).
- When increasing the depth $L$, theory and empirics suggest tuning the learning rate near a value proportional to $1/L$, since the infinite-width optimum scales as $1/L$.
- These prescriptions ensure width-robust transferability and zero-shot deployment of hyperparameters as model size increases (Hayou, 3 Nov 2025).
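The guidelines above can be condensed into a small helper that converts a base learning rate tuned on a proxy model into the rate for a scaled-up model. The function name, arguments, and example values are illustrative (a sketch under the global-rate convention of Section 1, not an API from the paper).

```python
def transfer_lr(tuned_lr: float,
                proxy_width: int, target_width: int,
                proxy_depth: int, target_depth: int,
                optimizer: str = "sgd") -> float:
    """Map a rate tuned on a (proxy_width, proxy_depth) muP proxy to a larger model.

    - SGD: the base rate is width-invariant under muP (no width factor).
    - Adam: the rate scales as 1/n, so multiply by proxy_width / target_width.
    - Depth: the optimum scales roughly as 1/L, so multiply by
      proxy_depth / target_depth when the depth changes.
    """
    lr = tuned_lr
    if optimizer.lower() == "adam":
        lr *= proxy_width / target_width
    lr *= proxy_depth / target_depth
    return lr

# Example: a rate of 0.1 tuned on a width-256, depth-4 proxy, deployed at
# width 4096 and depth 16 with Adam: 0.1 * (256/4096) * (4/16).
print(transfer_lr(0.1, 256, 4096, 4, 16, optimizer="adam"))
```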
6. Broader Implications: Feature Learning and Parameterization Selection
The maximal-update property of μP fundamentally preserves nontrivial feature adaptation, enabling the mean-field, nonlinear learning dynamics and edge-of-stability phenomena observed in wide deep networks. Unlike SP and NTP, which drive networks into lazy/NTK or unstable regimes, μP is the unique parameterization (within the abc-framework and the linear MLP analysis) that yields both stable infinite-width optimal learning rates and maximal feature updates. This regime is theorized to capture essential behaviors of practical deep networks, while freeing practitioners from costly, width-specific hyperparameter retuning (Hayou, 3 Nov 2025).