Maximal Update Parameterization (μP)

Updated 9 October 2025
  • Maximal Update Parameterization (μP) is a framework that scales neural network parameters to maintain O(1) updates across layers, ensuring stable feature learning as model width increases.
  • It enables nearly invariant hyperparameter transfer from small proxy models to full-scale networks, significantly reducing tuning costs and computational overhead.
  • Empirical results in language, vision, diffusion models, and sparse networks validate μP's ability to improve convergence speed, performance stability, and resource efficiency.

Maximal Update Parametrization (μP) is a principled framework for neural network parameterization that ensures all layers in a deep neural network update at commensurate scales as width increases. By enforcing what is termed "maximal feature learning" throughout training—where the magnitude of updates per layer remains order one even in the infinite-width limit—μP addresses the instability and inefficiency that arise in standard parameterization (SP) when scaling to large models. This core mechanism enables zero-shot hyperparameter transfer across model scales, leading to the muTransfer paradigm: tuning hyperparameters on small proxy models and directly applying them to full-sized networks without further tuning.

1. Mathematical Foundations and Scaling Rules

In μP, network parameters are scaled according to an "abc-parametrization" so that updates to activations and features in each layer retain an $O(1)$ magnitude as the width $n$ grows. Specifically, for a weight matrix $W^\ell$ in layer $\ell$, the parameterization is defined as:

$$W^\ell = \frac{w^\ell}{n^{a_\ell}}, \qquad w^\ell \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{n^{2b_\ell}}\right), \qquad \text{learning rate: } \eta^\ell = \frac{\eta_0}{n^{c_\ell}}$$

The exponents $a_\ell$, $b_\ell$, and $c_\ell$ are chosen so that the feature updates satisfy $\Delta W^\ell h^{\ell-1} = \Theta(1)$. For multilayer perceptrons and transformers, this leads to layer-specific scaling: for example, output layers often use an initialization variance of $1/n^2$ and a learning rate scaling as $\eta \cdot n^{-1}$, while hidden layers typically use variance scaling of $1/n$ and learning rates of order $1$.
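
As a rough illustration of these rules, the PyTorch sketch below rescales the initialization variance and the per-layer learning rate of a small MLP according to the hidden-layer and output-layer exponents quoted above. The base width, base learning rate, and helper names are illustrative assumptions (this is not the mup package, and the exact exponents depend on the optimizer and layer type; see the cited rule tables):

```python
import torch
import torch.nn as nn

# Minimal sketch of the layerwise muP rules quoted above (illustrative only):
#   hidden weights:  init variance ~ 1/fan_in,   learning rate O(1) in width;
#   readout weights: init variance ~ 1/fan_in^2, learning rate ~ 1/width.
# Input/embedding layers have their own rules in the full muP tables and are
# treated like hidden layers here purely for brevity.

BASE_WIDTH = 256   # width of the tuned proxy model (assumed)
BASE_LR = 1e-3     # learning rate tuned at BASE_WIDTH (assumed)

def build_mup_mlp(width: int, d_in: int = 32, d_out: int = 10):
    """Return (model, optimizer param groups) under the muP-style scaling above."""
    m = width / BASE_WIDTH  # width multiplier relative to the proxy
    hidden1 = nn.Linear(d_in, width)
    hidden2 = nn.Linear(width, width)
    readout = nn.Linear(width, d_out)

    # Hidden layers: variance proportional to 1/fan_in (std = fan_in^{-1/2}).
    for layer in (hidden1, hidden2):
        nn.init.normal_(layer.weight, std=layer.in_features ** -0.5)
        nn.init.zeros_(layer.bias)
    # Readout: variance proportional to 1/fan_in^2 (std = 1/fan_in).
    nn.init.normal_(readout.weight, std=1.0 / readout.in_features)
    nn.init.zeros_(readout.bias)

    model = nn.Sequential(hidden1, nn.ReLU(), hidden2, nn.ReLU(), readout)
    param_groups = [
        # Hidden layers: learning rate stays O(1) across widths.
        {"params": [p for l in (hidden1, hidden2) for p in l.parameters()],
         "lr": BASE_LR},
        # Readout: learning rate shrinks like 1/width relative to the proxy.
        {"params": list(readout.parameters()), "lr": BASE_LR / m},
    ]
    return model, param_groups

model, groups = build_mup_mlp(width=1024)
optimizer = torch.optim.SGD(groups, lr=BASE_LR)
```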

For second-order optimizers such as K-FAC and Shampoo, μP imposes explicit scaling rules for initialization, learning rate, and damping terms to ensure that preconditioned gradient steps maintain $\mathcal{O}(1)$ feature changes even as width increases. The scaling conditions for these optimizers are derived from a one-step perturbation analysis, leading to layerwise formulas that depend on the algorithm (e.g., exponents $e_A$, $e_B$ for K-FAC and $e$ for Shampoo) (Ishikawa et al., 2023).

Transformer architectures further generalize the μP rules: for example, the initialization variance for the multi-head projection matrices (queries, keys, values) scales as $1/M$ for width $M$, and the attention scaling factor is shifted from the standard $1/\sqrt{D}$ to $1/D$, which empirically proves crucial for learning rate transfer (Lingle, 8 Apr 2024).
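
In code, this attention change amounts to swapping a single scaling constant. The snippet below contrasts the two choices for a single head; the tensor shapes and head dimension are illustrative, not taken from any particular implementation:

```python
import torch

def attention_logits(q: torch.Tensor, k: torch.Tensor, mup: bool = True) -> torch.Tensor:
    """Dot-product attention logits for one head of dimension D.

    Standard parameterization scales the logits by 1/sqrt(D); muP uses 1/D,
    which the text above reports as important for learning-rate transfer.
    """
    D = q.shape[-1]
    scale = 1.0 / D if mup else D ** -0.5
    return torch.einsum("bid,bjd->bij", q, k) * scale

# Illustrative shapes: batch 2, sequence length 16, head dimension 64.
q, k = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
probs = torch.softmax(attention_logits(q, k, mup=True), dim=-1)
```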

2. Hyperparameter Stability and the muTransfer Paradigm

μP's most distinctive feature is its hyperparameter invariance property: the (near-)optimal hyperparameters (learning rate, momentum, damping) selected on a narrow, resource-efficient proxy model remain optimal as the model is scaled up—this is the essence of muTransfer.

Standard parameterization suffers from dramatic shifts in optimal hyperparameters as model width increases (sometimes by orders of magnitude), requiring laborious re-tuning. In μP, rigorous analysis and extensive experiments demonstrate that learning rate optima, momentum, and even per-layer multipliers remain stable under arbitrary width scaling, for both LLMs and vision architectures (Yang et al., 2022). For diffusion transformers, Fourier Neural Operators (FNOs), and even in the presence of sparsity and local learning rules, μP provides explicit scaling laws so that hyperparameter transfer holds robustly (Zheng et al., 21 May 2025, Li et al., 24 Jun 2025, Dey et al., 24 May 2024, Ishikawa et al., 4 Nov 2024).

The muTransfer workflow is:

  1. Reparameterize the network in μP.
  2. Tune hyperparameters on a smaller proxy model.
  3. Directly copy these hyperparameters to the large-scale target model.
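
The loop below is a schematic of these three steps for the toy MLP parameterization sketched in Section 1: sweep the learning rate on a narrow proxy, then reuse the best value unchanged at the target width. The widths, synthetic data, and sweep grid are placeholders, and the per-layer learning-rate groups are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic muTransfer workflow (illustrative): tune on a narrow proxy, then
# copy the tuned learning rate directly to a wider model. Real runs would use
# the full muP rule tables (or the mup package) and proper datasets.

torch.manual_seed(0)
X, y = torch.randn(512, 32), torch.randint(0, 10, (512,))

def make_mup_mlp(width: int) -> nn.Sequential:
    h1, h2, out = nn.Linear(32, width), nn.Linear(width, width), nn.Linear(width, 10)
    for layer in (h1, h2):                                   # hidden: var ~ 1/fan_in
        nn.init.normal_(layer.weight, std=layer.in_features ** -0.5)
    nn.init.normal_(out.weight, std=1.0 / out.in_features)   # readout: var ~ 1/fan_in^2
    return nn.Sequential(h1, nn.ReLU(), h2, nn.ReLU(), out)

def final_loss(width: int, lr: float, steps: int = 100) -> float:
    model = make_mup_mlp(width)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Steps 1-2: reparameterize (above) and tune the learning rate on the narrow proxy.
proxy_width, target_width = 64, 1024
best_lr = min((1e-3, 3e-3, 1e-2, 3e-2), key=lambda lr: final_loss(proxy_width, lr))

# Step 3: copy the proxy-tuned learning rate to the wide target model as-is.
print(f"lr {best_lr:.0e} reused at width {target_width}:",
      final_loss(target_width, best_lr))
```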

Quantitatively, this approach reduces hyperparameter search cost for billion-parameter models—e.g., hyperparameter tuning for GPT-3 6.7B consumes only 7% of the pretraining compute compared to direct large-model sweeps (Yang et al., 2022), and tuning for PixArt-α-μP uses merely 5.5% of one full run (Zheng et al., 21 May 2025).

3. Extension to Structured Architectures and Optimizers

μP's theoretical apparatus extends beyond classic MLP/Transformer models, with derivations and empirical validation for:

  • Second-order optimization: Explicit scaling (including damping) ensures maximal update, width-invariant learning rates, and feature learning in K-FAC and Shampoo, facilitating transfer across model sizes (Ishikawa et al., 2023).
  • Sparse networks: Sparse maximal update parameterization (SμPar) generalizes μP to static sparsity by correcting initialization and learning rate scales for both model width and density, yielding loss and compute efficiency gains as sparsity increases (Dey et al., 24 May 2024); a schematic sketch follows this list.
  • Diffusion transformers and FNOs: By expressing architectural primitives within the Tensor Programs framework, μP rules (the "abc-parametrization") can be systematically applied to advanced architectures (e.g., DiT, PixArt-α, MMDiT, FNOs), with mathematically derived scaling laws tailored to, for FNOs, the number of Fourier modes $K$ and the PDE dimensionality $d$: $b(K) = c(K) = \Theta(1/\sqrt{d \log K})$ (Li et al., 24 Jun 2025).
  • Local learning: Predictive coding and target propagation are shown, via tailored μP derivations, to allow robust hyperparameter transfer and enforce feature learning rather than kernel (lazy) regimes in the infinite-width limit (Ishikawa et al., 4 Nov 2024).
  • Novel parameterizations: u-μP (unit-scaled μP) combines maximal update with unit scaling of activations, weights, and gradients, yielding simpler hyperparameters, low-precision robustness (e.g., stable training in FP8), and lower error across hyperparameter sweeps (Blake et al., 24 Jul 2024).
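
As a rough schematic of the SμPar idea, the sketch below corrects initialization variance and learning rate using an effective fan-in that shrinks with density. Treating the effective fan-in as density × width is an assumption made here for illustration; the actual SμPar scaling rules and constants are those derived in Dey et al. (24 May 2024):

```python
import math

def spar_scales(width: int, density: float,
                base_width: int = 256, base_density: float = 1.0,
                base_lr: float = 1e-3, base_std: float = 0.02):
    """Illustrative width-and-density correction in the spirit of SμPar.

    Assumption (for illustration only): a sparse hidden layer behaves like a
    dense layer whose fan-in is density * width, so initialization variance
    and the hidden-layer learning rate are both corrected by that effective
    fan-in relative to a dense proxy of base_width.
    """
    eff = width * density
    base_eff = base_width * base_density
    std = base_std * math.sqrt(base_eff / eff)   # variance ~ 1/effective fan-in
    lr = base_lr * (base_eff / eff)              # lr ~ 1/effective fan-in
    return std, lr

# Example: hidden layers of a 4096-wide model at 99% sparsity (density 0.01).
std, lr = spar_scales(width=4096, density=0.01)
print(f"init std {std:.4f}, hidden lr {lr:.2e}")
```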

4. Empirical Results and Performance Metrics

Extensive empirical assessment confirms the robustness, efficiency, and accuracy benefits of μP:

  • LLMs: In BERT-large (350M), loss improved from 1.731 (SP) to 1.683 (μP) with transferred hyperparameters, and downstream validation accuracy (e.g., MNLI, QQP) improved slightly (Yang et al., 2022).
  • GPT-3 family: μP tuning matched or exceeded published performance for 6.7B and approached 13B by transferring from a 40M-parameter proxy at a fraction of the tuning cost.
  • Diffusion Transformers: DiT-XL-2-μP converged 2.9 times faster than the original at optimal transferred learning rates; PixArt-α-μP and MMDiT-μP achieved lower FID and higher alignment scores for text-to-image (Zheng et al., 21 May 2025).
  • FNOs: μTransfer-FNO consistently yielded stable loss curves and transferable optimal learning rates across more than three orders of magnitude in the number of Fourier modes (e.g., retaining accuracy with 0.3× the usual compute) (Li et al., 24 Jun 2025).
  • Sparse LMs: SμPar led to an 11.9% relative improvement in loss at 99.2% sparsity over SP, and a 2.1% improvement over μP (Dey et al., 24 May 2024).

These findings are summarized in the following table:

| Model/Domain | μP Hyperparameter Transfer Cost (fraction) | Relative Performance Gain |
|---|---|---|
| BERT-large | 1× BERT-large run | Improved loss over baseline |
| GPT-3 6.7B | 7% of pretraining | Comparable or better than baseline |
| DiT-XL-2-μP | Noted in FID and speedup | 2.9× faster convergence |
| PixArt-α (0.61B) | 5.5% of a full pretraining run | Lower FID, higher CLIP/GenEval |
| SμPar at 99.2% sparsity | N/A | 11.9% relative loss improvement |
| FNOs (Navier–Stokes) | 0.3× compute | Equal or better test RMSE |

5. Implementation Practices and Challenges

Practical implementation of μP involves:

  • PyTorch mup package: Automates conversion from standard parameterization to μP via set_base_shapes(model, base_model), tracks each tensor's "infinite shape," and adjusts initialization and learning rates per the μP rule tables (Yang et al., 2022); a usage sketch follows this list.
  • Manual architectural adaptations: Correct groupwise rescaling for transformers (e.g., per-parameter learning rates, attention scales) is critical. For diffusion transformers, all major modules are adapted within the Tensor Programs formalism (Zheng et al., 21 May 2025).
  • Sparsity: Only hidden-layer weights require correction; embeddings, biases, or attention logits remain unaltered. Practical minimal implementations are available and can be integrated into existing training pipelines (Dey et al., 24 May 2024).
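
A minimal usage sketch of the mup package, following the pattern in its README, is shown below. The model, widths, and hyperparameters are placeholders, and exact signatures may differ across package versions:

```python
import torch.nn as nn
from mup import MuReadout, MuAdam, set_base_shapes

class WidthScalableMLP(nn.Module):
    """Toy model whose hidden width is the dimension being scaled."""
    def __init__(self, width: int, d_in: int = 32, n_classes: int = 10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(),
                                  nn.Linear(width, width), nn.ReLU())
        # The output layer is replaced by MuReadout so the package applies the
        # muP output-layer scaling automatically.
        self.readout = MuReadout(width, n_classes)

    def forward(self, x):
        return self.readout(self.body(x))

target = WidthScalableMLP(width=1024)   # full-scale model
base = WidthScalableMLP(width=64)       # proxy width used as the base shape
delta = WidthScalableMLP(width=128)     # second width so mup can infer scaling dims
set_base_shapes(target, base, delta=delta)
# (The mup README also recommends re-initializing weights with mup.init helpers
# after set_base_shapes; omitted here for brevity.)

# mup's optimizer wrappers apply the per-parameter learning-rate rules.
optimizer = MuAdam(target.parameters(), lr=1e-3)
```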

Empirical studies note some implementation caveats:

  • Certain architectural nuances (trainable normalization gains, optimizer choices such as Lion, per-layer bias inclusion) can disrupt hyperparameter transfer if not aligned with μP-prescribed scaling (Lingle, 8 Apr 2024).
  • u-μP resolves several complexities of μP by enforcing unit-scaled tensors, which simplifies hyperparameter sweeps (hyperparameters become largely independent and can be swept one dimension at a time) and robustly enables low-precision training (Blake et al., 24 Jul 2024).
  • For finite-width errors in hyperparameter transfer, telescoping sweep protocols are proposed, controlling O(1/n) drift and sampling error with only a modest O(C log N) overhead for billion-parameter models (AI et al., 4 May 2025).

6. Comparative Perspective, Generalization, and Limitations

Recent work shows that while μP provides theoretically sound scaling for hyperparameter transfer, comparable transfer can be achieved in standard parameterization by explicitly engineering per-layer learning rates, e.g., $\eta_\text{embedding} = O(1)$ and $\eta_\text{hidden}, \eta_\text{readout} = O(1/n)$. Empirically, this setup (with finely tuned constant multipliers) can match or outperform μP in some settings (Everett et al., 8 Jul 2024). The Adam-atan2 optimizer eliminates the $\epsilon$ underflow issues that can break scale-invariance in very wide networks.
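
A minimal PyTorch sketch of this per-layer recipe in standard parameterization is given below; the module shapes, base width, and constant multipliers are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Per-layer learning rates for standard parameterization (SP), following the
# recipe quoted above: the embedding lr stays O(1) in width, while hidden and
# readout lrs shrink like 1/width. All concrete numbers are illustrative.

base_width, width, base_lr = 256, 4096, 1e-2
n_ratio = width / base_width

embedding = nn.Embedding(32000, width)
hidden = nn.Linear(width, width)
readout = nn.Linear(width, 32000)

param_groups = [
    {"params": embedding.parameters(), "lr": base_lr},          # O(1)
    {"params": [*hidden.parameters(), *readout.parameters()],
     "lr": base_lr / n_ratio},                                   # O(1/n)
]
optimizer = torch.optim.AdamW(param_groups)
```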

Extensions of μP to advanced architectures (diffusion models, FNOs, sparse and local learning methods) highlight its versatility, but in each instance, successful hyperparameter transfer relies on precise adherence to the derived scaling rules. Deviations (e.g., using the standard attention scale $1/\sqrt{D}$ instead of μP's $1/D$) can disrupt transfer and model quality (Lingle, 8 Apr 2024). For optimizers such as Lion, additional adaptation may be required.

7. Applications and Broader Implications

μP and its muTransfer methodology have direct impact in:

  • LLMs and Vision Networks: Drastically reducing hyperparameter tuning cost for models up to tens of billions of parameters.
  • Sparse and Efficient Networks: Enabling practical, stable scaling of sparse architectures for computational and hardware gains.
  • Structured Models and Non-backpropagation Training: Ensuring robust learning dynamics and tunable transfer in models leveraging local objectives or nonstandard feedback pathways.
  • Generative Modality Extensions: Allowing scalable, resource-efficient tuning for massive text-to-image and geometry-to-function generative models.

By unifying scaling for initialization, learning updates, and optimizer hyperparameters, μP both reveals key principles for stable deep learning dynamics and provides concrete tools to operationalize hyperparameter transfer at any scale. This ensures both the feasibility and efficiency of exploring, deploying, and scaling advanced deep neural networks for modern AI workloads.
