Maximal Update Parameterization (μP)

Updated 21 August 2025
  • Maximal Update Parameterization (μP) is a framework that maintains O(1) feature updates in deep networks even as the width tends to infinity.
  • It employs an abc-parametrization that scales weights, initialization variances, and learning rates to differentiate feature-learning from kernel regimes.
  • μP enables effective hyperparameter transfer and improved performance in tasks like pretraining and transfer learning compared to traditional NTK methods.

Maximal Update Parameterization (μP) is a theoretically derived parameterization framework for deep neural networks that prescribes principled scaling rules for weights, initialization variances, and learning rates to maintain order-one (Θ(1)) feature updates in the infinite-width limit. μP defines a sharp dynamical distinction between “feature-learning” and “kernel” regimes by determining whether the internal representations (features) can evolve as width increases, thereby enabling robust transfer learning, zero-shot hyperparameter transfer, and scalable model design.

1. Motivation: Feature Learning in the Infinite-Width Limit

In the conventional scaling regimes—standard or Neural Tangent Kernel (NTK) parameterizations—the infinite-width limit leads to “kernel” dynamics where hidden features remain fixed during training and network outputs evolve linearly around their initialization. Specifically, in a multilayer perceptron (MLP) with hidden layers of width $n$, standard scaling makes the first-layer feature updates $\Delta x = O(1/\sqrt{n})$, so as $n \to \infty$, features are frozen and only the readout changes. This regime is analytically tractable but fails to model the dynamics required for tasks needing adaptation of internal representations, such as pretraining and transfer learning (e.g., BERT).

μP addresses this by constructing a scaling in which every parameter update remains as large as possible—maximal—without destabilizing the learning process. This enables nontrivial evolution of hidden features in the infinite-width limit, supporting the internal feature learning critical for pretraining and transfer-based tasks. The μP parameterization is defined in the “abc-parametrization” formalism:

W^\ell = n^{-a_\ell} w^\ell, \qquad w^\ell_{\alpha\beta} \sim \mathcal{N}(0, n^{-2b_\ell}), \qquad \eta = \text{base learning rate} \cdot n^{-c}

The specific exponents under μP for $L$-layer MLPs are:

a_1 = -\tfrac{1}{2}, \quad a_\ell = 0 \ \ (2 \leq \ell \leq L), \quad a_{L+1} = \tfrac{1}{2}, \qquad b_\ell = \tfrac{1}{2}\ \forall\ \ell, \qquad c = 0

By boosting the first layer and attenuating the final layer (as encoded in $a_\ell$), feature changes are ensured to remain $O(1)$ as width increases.
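
The following minimal NumPy sketch (illustrative, not taken from the source) instantiates these exponents: trainable weights $w^\ell$ are drawn with variance $n^{-2b_\ell}$, the effective weights are $W^\ell = n^{-a_\ell} w^\ell$, and the learning rate carries the factor $n^{-c}$. The input/output dimensions and the base learning rate are arbitrary choices for illustration.

```python
# Minimal sketch of the abc-parametrization with the muP exponents quoted above.
import numpy as np

def mup_exponents(L):
    """Per-layer (a_ell, b_ell) for ell = 1..L+1 and the learning-rate exponent c."""
    a = [-0.5] + [0.0] * (L - 1) + [0.5]   # a_1 = -1/2, a_ell = 0 (2 <= ell <= L), a_{L+1} = 1/2
    b = [0.5] * (L + 1)                    # b_ell = 1/2 for every layer
    c = 0.0                                # eta does not scale with n
    return a, b, c

def init_mup_mlp(d_in, width, d_out, L, base_lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    a, b, c = mup_exponents(L)
    shapes = [(width, d_in)] + [(width, width)] * (L - 1) + [(d_out, width)]
    weights = []
    for a_l, b_l, shape in zip(a, b, shapes):
        w = rng.normal(0.0, width ** (-b_l), size=shape)   # w^ell ~ N(0, n^{-2 b_ell})
        weights.append(width ** (-a_l) * w)                # effective W^ell = n^{-a_ell} w^ell
    lr = base_lr * width ** (-c)                           # eta = base_lr * n^{-c}
    return weights, lr

weights, lr = init_mup_mlp(d_in=10, width=1024, d_out=1, L=3)
print([W.shape for W in weights], lr)
```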

2. Tensor Programs: Rigorous Infinite-Width Analysis

The Tensor Programs technique enables formal, recursive computation of the deterministic infinite-width evolution of deep networks under general parameterizations. This approach models the forward and backward passes as a computation graph composed of matrix multiplications, coordinate-wise nonlinearities, and averaging operations.

The key structural result, the Master Theorem, ensures that for any activation vector $h \in \mathbb{R}^n$ whose coordinates converge in law to a random variable $Z^h$, and any pseudo-Lipschitz function $\psi$,

\frac{1}{n} \sum_{\alpha=1}^n \psi(h_\alpha) \to \mathbb{E}[\psi(Z^h)]

as $n \to \infty$.

This framework generalizes analyses previously confined to mean-field (MF), NNGP, or NTK regimes, allowing explicit computation of feature learning dynamics under μP, which are inaccessible to classical kernel or Gaussian process analyses. Recursive computation of both coordinate distributions and correlations permits explicit evaluation of the nonlinear evolution of both predictions and features.
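
As a concrete illustration of the Master Theorem statement above, the following toy sketch (an assumption of this article's editor, not part of the source) builds an activation vector with i.i.d. standard Gaussian coordinates and checks that the empirical average of a pseudo-Lipschitz test function approaches the corresponding Gaussian expectation as $n$ grows.

```python
# Monte Carlo illustration of (1/n) sum psi(h_alpha) -> E[psi(Z^h)].
import numpy as np

rng = np.random.default_rng(0)
psi = lambda z: np.tanh(z) ** 2          # a smooth pseudo-Lipschitz test function

d = 256
x = rng.choice([-1.0, 1.0], size=d)      # fixed input with ||x||^2 = d

for n in [64, 1024, 16384]:
    W = rng.normal(0.0, 1.0, size=(n, d))
    h = W @ x / np.sqrt(d)               # coordinates of h are i.i.d. N(0, 1) given x
    print(n, psi(h).mean())              # empirical average over coordinates

# Reference value E[psi(Z)] for Z ~ N(0, 1), itself estimated by Monte Carlo.
Z = rng.normal(0.0, 1.0, size=10_000_000)
print("E[psi(Z)] ~", psi(Z).mean())
```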

3. Dichotomy and Classification of Parameterizations

A fundamental contribution of μP is the classification of “abc-parametrizations” into feature-learning and kernel regimes—defined by a critical exponent $r$. For any stable parameterization:

r = \left[\min \left(a_{L+1} + b_{L+1},\ 2a_{L+1} + c \right) + c - 1 + \min_\ell \left(2a_\ell + 1_{\ell=1}\right) \right]

  • $r = 0$ denotes a feature-learning regime, i.e., hidden features can change substantially at infinite width.
  • $r > 0$ corresponds to a kernel regime, i.e., training reduces to kernel gradient descent with fixed features.

The NTK and standard parameterizations are prototypical of the $r > 0$ regime, while μP selects $r = 0$ and is unique (up to symmetry under layerwise rescaling) in maximizing all updates while maintaining stable activations and logits.
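
A small sketch (not from the source) that evaluates the exponent $r$ above for the μP values quoted in this article and for NTK-style exponents; the NTK values ($a_1 = 0$, $a_\ell = \tfrac{1}{2}$ for $\ell \geq 2$, $b_\ell = 0$, $c = 0$) are assumed from the usual presentation of that parameterization rather than taken from the source, and the layerwise minimum is taken over $\ell \in \{1, \dots, L\}$.

```python
def critical_r(a, b, c, L):
    """a, b are lists indexed 1..L+1 (index 0 unused); c is the learning-rate exponent."""
    readout = min(a[L + 1] + b[L + 1], 2 * a[L + 1] + c)
    body = min(2 * a[l] + (1 if l == 1 else 0) for l in range(1, L + 1))
    return readout + c - 1 + body

L = 3

# muP exponents from the text: a_1 = -1/2, a_ell = 0 (2 <= ell <= L), a_{L+1} = 1/2, b_ell = 1/2, c = 0.
a_mup = [None, -0.5] + [0.0] * (L - 1) + [0.5]
b_mup = [None] + [0.5] * (L + 1)
print("muP r =", critical_r(a_mup, b_mup, c=0.0, L=L))   # expected 0.0 (feature learning)

# NTK-style exponents (assumed): a_1 = 0, a_ell = 1/2 for ell >= 2, b_ell = 0, c = 0.
a_ntk = [None, 0.0] + [0.5] * L
b_ntk = [None] + [0.0] * (L + 1)
print("NTK r =", critical_r(a_ntk, b_ntk, c=0.0, L=L))   # expected 0.5 (kernel regime)
```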

4. Empirical and Theoretical Outcomes

Experiments on canonical feature-learning tasks—Word2Vec and few-shot MAML (Omniglot)—demonstrate that the μP infinite-width limit results in higher performance than the NTK/GP baselines and even than the best finite-width networks (whose performance approaches the μP limit as width increases). In these benchmarks:

  • NTK/GP parameterizations yield random, non-adaptive embeddings (Word2Vec), failing at semantic analogy tasks.
  • μP networks produce meaningful, structured word representations and features that transfer between tasks.
  • In MAML, μP-parameterized networks adapt internal features more efficiently than NTK or standard parameterized networks.

Thus, μP aligns the theoretical design of network scaling with empirically desirable properties in representation learning and adaptation.

5. Mathematical Formulation and Explicit Limits

For an MLP, the standard model is specified as:

h^1(\xi) = W^1 \xi, \qquad x^\ell(\xi) = \phi(h^\ell(\xi)), \qquad h^{\ell+1}(\xi) = W^{\ell+1} x^\ell(\xi), \qquad f(\xi) = W^{L+1} x^L(\xi)

The abc-parametrization for each weight $W^\ell$ is:

W^\ell = n^{-a_\ell} w^\ell, \quad w^\ell_{\alpha\beta} \sim \mathcal{N}(0, n^{-2b_\ell}), \quad \eta = \text{base learning rate} \cdot n^{-c}

μP is specified by:

a_1 = -\tfrac{1}{2}, \ a_\ell = 0\ (2 \leq \ell \leq L),\ a_{L+1} = \tfrac{1}{2}; \quad b_\ell = \tfrac{1}{2}; \quad c = 0

The Tensor Programs Master Theorem then guarantees that for such parameterizations, all macroscopic observables (loss, feature evolution) admit explicit deterministic limits.
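
As a numerical sanity check of the $O(1)$ feature-update property (a sketch under assumptions such as a tanh MLP with $L = 2$, a single toy input, a toy squared-error target, and one SGD step on the trainable $w^\ell$ tensors; not from the source), the following PyTorch snippet measures how much the last-layer features $x^L$ and the output $f$ move after one update at several widths. Under the μP exponents above, both quantities should stay roughly width-independent.

```python
import torch

def one_step_feature_change(n, d=16, lr=1.0, seed=0):
    torch.manual_seed(seed)
    # Trainable tensors are the w's; effective weights are W^ell = n^{-a_ell} w^ell.
    a = [-0.5, 0.0, 0.5]                  # a_1, a_2, a_{L+1} for L = 2
    b = [0.5, 0.5, 0.5]                   # b_ell = 1/2 for every layer
    shapes = [(n, d), (n, n), (1, n)]
    ws = [torch.randn(s) * n ** (-bl) for s, bl in zip(shapes, b)]
    for w in ws:
        w.requires_grad_(True)

    xi = torch.randn(d)                   # a single toy input

    def forward():
        h1 = n ** (-a[0]) * (ws[0] @ xi)
        x1 = torch.tanh(h1)
        h2 = n ** (-a[1]) * (ws[1] @ x1)
        x2 = torch.tanh(h2)               # last-layer features x^L
        f = n ** (-a[2]) * (ws[2] @ x2)   # scalar output f(xi)
        return x2, f

    x2_before, f_before = forward()
    loss = ((f_before - 1.0) ** 2).sum()  # toy target value 1.0
    loss.backward()
    with torch.no_grad():
        for w in ws:
            w -= lr * w.grad              # c = 0: learning rate does not scale with n
    x2_after, f_after = forward()
    return ((x2_after - x2_before).abs().mean().item(),
            (f_after - f_before).abs().item())

for n in [256, 1024, 4096]:
    dx, df = one_step_feature_change(n)
    print(f"n={n:5d}  mean |Delta x^L| = {dx:.4f}   |Delta f| = {df:.4f}")
```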

6. Practical and Theoretical Implications

Adopting μP yields several direct consequences:

  • Models can be designed so that feature learning survives the infinite-width limit, remedying the principal failure of the NTK or standard regimes for pretraining/transfer.
  • Hyperparameters (particularly the learning rate) tuned for a small model remain close to optimal as the model is scaled up, simplifying the hyperparameter tuning process across model families (μTransfer paradigm).
  • μP provides a “knob” for tuning the feature-learning vs. kernel regime tradeoff via choice of scaling exponents—offering a unified theoretical framework that incorporates and extends NTK, mean field, and standard parameterizations.

On canonical tasks, μP’s explicit feature-learning dynamics produce robust solutions at infinite width, outperforming kernel-regime approaches especially on transfer/representation-sensitive objectives.
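
To illustrate the μTransfer point above, here is a toy sketch (assumed setup: one-hidden-layer tanh MLP, synthetic regression data, plain SGD; not from the source) that trains μP-scaled networks at two widths over a grid of base learning rates. Under μP with $c = 0$, the loss-versus-learning-rate profile should look roughly similar across widths, so a rate tuned at the small width remains a sensible choice at the larger one.

```python
import torch

def train_mup(width, base_lr, steps=50, d=8, seed=0):
    torch.manual_seed(seed)
    xi = torch.randn(64, d)                        # toy inputs (identical across widths/lrs)
    y = torch.sin(xi.sum(dim=1, keepdim=True))     # toy regression targets
    n = width
    # muP for L = 1: a_1 = -1/2, a_{L+1} = 1/2, b_ell = 1/2, c = 0.
    w1 = (torch.randn(n, d) * n ** -0.5).requires_grad_(True)
    w2 = (torch.randn(1, n) * n ** -0.5).requires_grad_(True)

    def forward():
        return (n ** -0.5) * torch.tanh((n ** 0.5) * (xi @ w1.t())) @ w2.t()

    for _ in range(steps):
        loss = ((forward() - y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            for w in (w1, w2):
                w -= base_lr * w.grad              # eta = base_lr * n^{-c} with c = 0
                w.grad = None
    with torch.no_grad():
        return ((forward() - y) ** 2).mean().item()

for width in (256, 2048):
    losses = {lr: round(train_mup(width, lr), 4) for lr in (0.1, 0.5, 1.0, 2.0)}
    print(f"width={width}: final loss by base lr -> {losses}")
```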

7. Limitations and Scope

While μP is theoretically grounded and demonstrates strong empirical performance, its applicability presumes careful adherence to its scaling rules, particularly the distinct treatment of input, hidden-to-hidden, and output weights. The Tensor Programs framework supports arbitrary computation graphs composed of matrix multiplications and nonlinearities, but the underlying assumptions of independence and pseudo-Lipschitz operations must be checked when extending to unconventional architectures.

A further plausible implication is that as the community explores increasingly large-scale or modular neural architectures, μP provides the only currently known scaling that jointly permits controlled feature evolution and transfer of hyperparameters, serving as both a practical and analytical standard for future infinite-width analyses.


In summary, Maximal Update Parameterization (μP) bridges the gap between infinite-width analysis and practical feature learning, identifying a unique parameter scaling regime with precise conditions for feature adaptation, robust hyperparameter transfer, and tractable infinite-width dynamics. Its theoretical and empirical foundation establishes μP as central to the scalable and transferable design of modern deep neural networks (Yang et al., 2020).
