Maximal Update Parameterization
- Maximal Update Parameterization is a method that reparameterizes neural network training so that weight updates remain O(1) even as model size increases.
- Its abc-parameterization framework ensures that tuning hyperparameters on small proxy models directly transfers to larger architectures with minimal optimality loss.
- Extensions such as u-μP, SμPar, and μTransfer-FNO adapt μP to diverse scenarios including low-precision training, sparse regimes, and operator networks, yielding significant computational and performance gains.
Maximal Update Parameterization (μP) is a systematic scheme for reparameterizing neural network training such that model optimization hyperparameters—including learning rates and initialization scales—can be tuned on a small proxy model and directly transferred to arbitrarily larger, deeper, or sparser models with minimal loss of optimality. The defining principle is that, in the infinite-width (and more generally, infinite size) limit, weight updates in every layer remain order-one (O(1)), guaranteeing nontrivial and stable feature learning across layers while avoiding regimes where signals either vanish (“lazy/NTK”) or explode. μP is formulated through the so-called abc-parametrization of weights and updates, providing a unified foundation for width-, depth-, or modality-scaling across dense, sparse, operator-based, and local learning regimes (Blake et al., 2024, Li et al., 24 Jun 2025, Yaida, 2022, Ishikawa et al., 2023, Ishikawa et al., 2024, Dey et al., 2024).
1. Foundations and abc-Parameterization
At the core of μP is the abc-parametrization, whereby every weight tensor in a layer of width $n$ is represented via three width-dependent scales (a minimal code sketch follows this list):
- Raw parameterization: $W^\ell = n^{-a_\ell}\, w^\ell$, where $w^\ell$ is the trainable parameter
- Weight scaling (initialization): $w^\ell_{ij} \sim \mathcal{N}\!\left(0,\; n^{-2 b_\ell}\right)$
- Gradient update scaling (learning rate): $\eta_\ell = \eta\, n^{-c_\ell}$
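As a concrete illustration, the following minimal PyTorch sketch (the class name `ABCLinear` and the exponent values passed by the caller are illustrative choices, not taken from the cited papers) wires the three scales of the abc-parametrization into a single linear layer and exposes its per-layer learning rate for use in optimizer parameter groups.

```python
import torch
import torch.nn as nn

class ABCLinear(nn.Module):
    """Linear layer under the abc-parametrization (illustrative sketch).

    Effective weight:   W = n^{-a} * w            (parameter multiplier)
    Initialization:     w_ij ~ N(0, n^{-2b})      (init scale)
    Learning rate:      eta_layer = eta * n^{-c}  (per-layer LR)
    Here n is taken to be the layer's fan-in.
    """

    def __init__(self, fan_in: int, fan_out: int, a: float, b: float, c: float):
        super().__init__()
        self.n, self.c = fan_in, c
        self.w = nn.Parameter(torch.randn(fan_out, fan_in) * fan_in ** (-b))
        self.multiplier = fan_in ** (-a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.multiplier * self.w)

    def lr(self, base_lr: float) -> float:
        return base_lr * self.n ** (-self.c)

# Example: a hidden layer with illustrative exponents a=0, b=1/2, c=1 (Adam-style),
# hooked into a per-layer optimizer parameter group.
hidden = ABCLinear(1024, 1024, a=0.0, b=0.5, c=1.0)
opt = torch.optim.Adam([{"params": hidden.parameters(), "lr": hidden.lr(1e-3)}])
```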
The triple $(a_\ell, b_\ell, c_\ell)$ is specified with explicit dependence on the width, depth, or kernel size, and the associated scaling rules for the parameter multiplier, initialization, and learning rate are selected to ensure two properties:
- Feature-learning maximality: Each layer’s updates have the same order in model size, preventing “feature freezing”
- Hyperparameter transferability: Optimal learning rates, initialization scales, and related hyperparameters identified on a small proxy reliably extend to large-scale models
In typical Transformer-style blocks, these scalings take the form (width/fan-in $n$, depth $L$):

| Tensor type | Multiplier (param) | Scale (init) | LR (Adam) |
|-----------------|--------------------|----------------|-----------|
| Input weights | 1 | 1 | 1 |
| Hidden weights | 1 | $1/\sqrt{n}$ | $1/n$ |
| Output weights | $1/n$ | 1 | 1 |
| Residual branch | $1/\sqrt{L}$ | — | — |
This guarantees that per-layer feature updates remain $\Theta(1)$ regardless of width or depth (Blake et al., 2024, Yaida, 2022).
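The table above can be realized in a few lines. The sketch below is a rough illustration: the name-based heuristic for deciding which tensors count as "hidden" is an assumption, and the $1/n$ output multiplier (applied as a forward-pass scale on the readout) is omitted. It rescales hidden-weight initialization and Adam learning rates with width relative to a base width.

```python
import torch
import torch.nn as nn

def is_hidden(name: str, p: torch.Tensor) -> bool:
    # Assumed heuristic: trunk weight matrices are "hidden"; embeddings and the
    # readout head keep their base init and learning rate.
    return p.ndim == 2 and "embed" not in name and "head" not in name

def mup_init_(model: nn.Module, width: int) -> None:
    """Hidden weights: init std proportional to 1/sqrt(width)."""
    for name, p in model.named_parameters():
        if is_hidden(name, p):
            nn.init.normal_(p, mean=0.0, std=width ** -0.5)

def mup_adam_groups(model: nn.Module, base_lr: float, width: int, base_width: int):
    """Hidden weights: Adam LR scaled by base_width/width (i.e. proportional to 1/n);
    all other tensors keep the base learning rate."""
    hidden = [p for n_, p in model.named_parameters() if is_hidden(n_, p)]
    other = [p for n_, p in model.named_parameters() if not is_hidden(n_, p)]
    return [
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": other, "lr": base_lr},
    ]
```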
2. Mathematical Formulation and Infinite-Width Regime
μP formalizes these rules by choosing scaling exponents $(a_\ell, b_\ell, c_\ell)$ per layer (one exponent per layer can be fixed by convention via the abc-symmetry) in such a way that:
- Initializations and LRs obey $W^\ell = n^{-a_\ell} w^\ell$, $w^\ell_{ij} \sim \mathcal{N}(0, n^{-2 b_\ell})$, and $\eta_\ell = \eta\, n^{-c_\ell}$
- “Maximal update” ensures that the feature updates $\Delta h^\ell$ are $\Theta(1)$ coordinatewise for all hidden layers $\ell$
- The readout layer is scaled so that updates to the output features are similarly $\Theta(1)$
For standard backpropagation (SGD), the canonical choices are: $a_\ell = 0$, $b_\ell = 1/2$, $c_\ell = 0$ for hidden layers; $a_{L+1} = 1/2$, $b_{L+1} = 1/2$, $c_{L+1} = 0$ for the readout layer.
This structure enforces $\Theta(1)$ updates in every layer, achieving maximal feature learning and stable forward/backward signal propagation even as $n \to \infty$ (Yaida, 2022, Ishikawa et al., 2024).
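This claim can be checked numerically with a toy one-hidden-layer network. The sketch below (an illustration, not taken from the cited papers) applies μP-style SGD scalings (readout init and LR down-scaled by $1/n$, input-layer LR up-scaled by $n$) and reports the mean coordinatewise change of the hidden features after one step, which stays roughly constant as the width grows.

```python
import torch

def mean_feature_update(width: int, lr: float = 0.1) -> float:
    """One muP-style SGD step on a 1-hidden-layer net; returns the average
    coordinatewise change of the hidden features, which should be O(1) in width."""
    torch.manual_seed(0)
    d, n = 16, width
    x, y = torch.randn(64, d), torch.randn(64, 1)
    W1 = (torch.randn(n, d) / d ** 0.5).requires_grad_(True)   # input layer: O(1) entries
    W2 = (torch.randn(1, n) / n).requires_grad_(True)          # readout: entries O(1/n)
    h0 = torch.tanh(x @ W1.t())
    loss = ((h0 @ W2.t() - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        W1 -= lr * n * W1.grad      # muP SGD: input-layer LR scaled up by n
        W2 -= (lr / n) * W2.grad    # muP SGD: readout LR scaled down by 1/n
    h1 = torch.tanh(x @ W1.t())
    return (h1 - h0).abs().mean().item()

for w in (256, 1024, 4096):
    print(w, round(mean_feature_update(w), 4))   # roughly constant across widths
```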
3. Extensions: Unit Scaling and u-μP
Unit Scaling is an independent scheme in which all tensors (activations, weights, gradients) are scaled to have approximately unit variance, keeping their values near 1.0 for efficient floating-point representation. u-μP is the fusion of μP and Unit Scaling:
- All weights are initialized with unit variance ($\sigma = 1$)
- All matmuls carry a static $1/\sqrt{\text{fan-in}}$ output scale, keeping activations near unit scale
- Adam learning-rate multipliers are reduced to a single global $\eta$; no separate per-tensor multipliers are needed
In u-μP the per-tensor multipliers all default to 1, leaving essentially only the global learning rate $\eta$ to tune. Empirically, leaving every multiplier at its default of 1 yields losses within 1% of the best achievable, and hyperparameter sweeps become independent and trivially efficient (Blake et al., 2024).
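A minimal sketch of the unit-scaling half of this recipe (forward scaling only; the full scheme also rescales the backward pass, which is omitted here) looks as follows.

```python
import torch
import torch.nn as nn

class UnitScaledLinear(nn.Module):
    """Unit-scaled linear layer (illustrative sketch): unit-variance weight init plus a
    static 1/sqrt(fan_in) scale on the matmul output, so activations stay near unit scale."""

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))  # sigma = 1 at init
        self.scale = fan_in ** -0.5                               # static output scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * nn.functional.linear(x, self.weight)

x = torch.randn(8, 4096)                       # unit-variance input
print(UnitScaledLinear(4096, 4096)(x).std())   # ~1.0: comfortably inside FP8/BF16 range
```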
4. Algorithmic and Practical Consequences
μP and its variants enable robust hyperparameter transfer:
- Tune hyperparameters on a small proxy (width/density/size reduced by 4× or more)
- For full-scale or production deployment, reuse the tuned learning rates and initialization scales without retuning
- For u-μP, a single 1D sweep over the global learning rate $\eta$ suffices for near-optimality even at extreme width/depth/batch scaling (a schematic workflow sketch follows this list)
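The sketch below shows this transfer workflow schematically; `build_model` and `train_and_eval` are hypothetical placeholders standing in for whatever training harness is actually used.

```python
def sweep_on_proxy(build_model, train_and_eval, proxy_width=256,
                   lrs=(3e-4, 1e-3, 3e-3, 1e-2)):
    """Tune the global learning rate on a small muP-parameterized proxy model."""
    losses = {lr: train_and_eval(build_model(width=proxy_width), lr=lr) for lr in lrs}
    return min(losses, key=losses.get)          # best (lowest-loss) learning rate

def train_full_scale(build_model, train_and_eval, best_lr, target_width=8192):
    """Reuse the proxy-tuned learning rate at the target width without retuning:
    under muP the width dependence lives in the parameterization, not in the HPs."""
    return train_and_eval(build_model(width=target_width), lr=best_lr)
```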
In sparse regimes, SμPar generalizes μP to guarantee activation/gradient/update norms invariant to both width and sparsity. Initialization standard deviations and LRs are reparameterized by width and density multipliers so that optimal HPs can be tuned once and transferred to settings with arbitrary sparsity. SμPar defines the Pareto frontier of validation loss with no per-sparsity retuning (Dey et al., 2024).
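The sketch below illustrates one way such width-and-density multipliers could be computed, under the assumption (made here purely for illustration) that the relevant quantity is the effective fan-in $\text{density} \times \text{width}$; the exact SμPar rules are given in Dey et al. (2024).

```python
def spar_style_multipliers(width: int, density: float,
                           base_width: int, base_density: float = 1.0):
    """Illustrative width-and-density multipliers for hidden tensors, assuming the
    effective fan-in is density * width (an assumption made for this sketch)."""
    eff, base_eff = width * density, base_width * base_density
    init_std_mult = (base_eff / eff) ** 0.5   # init std shrinks with sqrt(effective fan-in)
    adam_lr_mult = base_eff / eff             # hidden Adam LR shrinks with effective fan-in
    return init_std_mult, adam_lr_mult
```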
5. Applications: Operator Networks, Local Learning, and Second-Order Optimization
μP has been rigorously extended beyond standard dense networks:
- Operator networks: For Fourier Neural Operators (FNOs), μP prescribes how initialization and learning-rate scales must vary with the number of Fourier modes, providing zero-shot hyperparameter transfer for billion-parameter models (Li et al., 24 Jun 2025).
- Local learning: In predictive coding (PC) and target propagation (TP), μP ensures feature learning in the infinite-width limit even where local gradient rules deviate from standard BP; PC interpolates between first-order and Gauss-Newton gradients depending on parameterization, while TP eliminates the lazy kernel regime entirely (Ishikawa et al., 2024).
- Second-order methods: μP generalizations for K-FAC and Shampoo derive layerwise initializations, learning-rate scalings, and damping rules for high-width regimes. Hyperparameter transfer and feature learning are stabilized compared to PyTorch defaults, with increased generalization performance as model scale grows (Ishikawa et al., 2023).
- SAM and perturbation-based optimization: μP defines the unique layerwise scaling of the SAM perturbation radius that maintains effective perturbations and learning in all layers as width increases, in contrast with standard SAM, whose effect concentrates in the last layer. The joint optimum of learning rate and SAM radius is width-invariant (Haas et al., 2024).
6. Training Dynamics, Hyperparameter Transfer, and Empirical Outcomes
Under μP, feature learning remains active: signals neither vanish nor explode, and every layer moves at O(1) per update—a sharp contrast to NTK/lazy regimes, where only the output layer evolves. Further, in u-μP, mutual independence of hyperparameters means that individual sweeps can be combined without retuning—a property not shared by vanilla μP. Empirical studies show stable losses, improved generalization, and low-precision training compatibility (FP8/BF16) out-of-the-box in u-μP, where traditional μP may underflow (Blake et al., 2024).
On large language modeling tasks, SμPar yields substantial relative loss improvements at extreme sparsity (99.2% sparse) and corresponding compute-efficiency gains over standard parameterizations. For operator networks, μTransfer-FNO maintains test error while reducing tuning compute to a small fraction of the cost of tuning directly on the full-scale model. In second-order optimization, μP's prescription ensures joint transferability of learning rate and damping parameters and avoids the pathological dynamics observed under SP or uniform parametrizations (Dey et al., 2024, Li et al., 24 Jun 2025, Ishikawa et al., 2023).
7. Generalizations, Limitations, and Theoretical Insights
μP forms the “maximal update/mean-field” endpoint of a one-parameter meta-family of infinite-width scaling strategies that interpolates between the NTK limit at one end and a family of intermediate “coupled” limits in between, with robust representation learning throughout the non-NTK portion of the family (Yaida, 2022). The abc-symmetry provides a mathematical invariance, allowing redistribution of scale among $a_\ell$, $b_\ell$, and $c_\ell$ without affecting optimization or forward activations.
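For SGD, this invariance can be verified directly from the definitions above; the short derivation sketch below introduces a shift parameter $\theta$ purely for illustration.

```latex
% Shift the abc exponents of a layer by an arbitrary theta:
\[
(a_\ell,\; b_\ell,\; c_\ell) \;\longmapsto\; (a_\ell + \theta,\; b_\ell - \theta,\; c_\ell - 2\theta).
\]
% The effective weight at initialization is unchanged, since
\[
W^\ell = n^{-(a_\ell+\theta)} w^\ell, \qquad
w^\ell_{ij} \sim \mathcal{N}\!\big(0,\, n^{-2(b_\ell-\theta)}\big)
\;\;\Longrightarrow\;\;
W^\ell_{ij} \sim \mathcal{N}\!\big(0,\, n^{-2(a_\ell+b_\ell)}\big),
\]
% and a single SGD step on the raw parameter w^\ell changes the effective weight by
\[
\Delta W^\ell
  = n^{-(a_\ell+\theta)}\,\Delta w^\ell
  = -\,\eta\, n^{-(c_\ell-2\theta)}\, n^{-2(a_\ell+\theta)}\, \nabla_{W^\ell}\mathcal{L}
  = -\,\eta\, n^{-(c_\ell+2a_\ell)}\, \nabla_{W^\ell}\mathcal{L},
\]
% which is independent of theta, so forward activations and training trajectories coincide.
```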
The μP paradigm is generic:
- It applies wherever model size (width, sparsity, number of modes/operators) increases systematically
- The only caveat is delicate handling of residual branches and operator layers, which may require nontrivial spectral-norm scaling analysis (Li et al., 24 Jun 2025)
- For very large architectures or alternative update rules (e.g., local or operator-based learning), explicit verification of O(1) update and signal norms may be required, following the detailed scaling laws for each structure
In sum, Maximal Update Parameterization provides a principled, scalable framework for hyperparameter tuning and transfer, ensuring robust feature-learning and stable dynamics across the entire spectrum of modern neural network families (Blake et al., 2024, Li et al., 24 Jun 2025, Dey et al., 2024, Yaida, 2022, Ishikawa et al., 2024, Ishikawa et al., 2023).