
Decoupled μP Parametrization

Updated 2 January 2026
  • Decoupled μP Parametrization is a neural network parameterization method that enforces unit variance across all layers, decoupling hyperparameters from model width.
  • It employs ABC multipliers in the forward and backward passes to maintain consistent activation and gradient scales, thereby ensuring stable training across different model sizes.
  • The approach streamlines hyperparameter tuning and enables efficient proxy-model sweeps, facilitating direct training and deployment in low-precision formats such as FP8.

Decoupled μP Parametrization, also referred to as unit-scaled μP ("u-μP"), is a methodology for neural network parametrization that ensures model hyperparameters, such as the learning rate, are independent of model width and compatible with low-precision training. u-μP combines the Maximal Update Parametrization (μP) with Unit Scaling, addressing key practical limitations of prior schemes by guaranteeing that all activations, weights, and gradients have unit variance at initialization and by maintaining consistent scaling throughout training. This approach directly enables zero-shot transfer of hyperparameters across model scales, efficient hyperparameter sweeps using proxy models, and robust operation in low-precision formats such as FP8, while further decoupling individual hyperparameters from one another (Blake et al., 2024).

1. Theoretical Foundations and Formal Definition

The u-μP parametrization is built on the abc-parametrization framework: given a linear layer $y = Wx$ with weight matrix $W$ of shape (out, in) and fan-in $f = \mathrm{in}$, one defines

$$w_0 \sim N(0, B_W^2), \qquad W = A_W \cdot w, \qquad w \gets w + C_W \cdot (-\nabla_w \mathcal{L}).$$

u-μP specifies, for each trainable matrix $W$:

  • Initialization multiplier: $B_W = 1$ (unit variance).
  • Forward-pass multiplier $A_W$ ("parameter scale"):
    • Input embeddings: $A_W = 1$
    • Hidden projection weights: $A_W = 1/\sqrt{f}$
    • Output (readout) weights: $A_W^{\text{fwd}} = 1/f$, $A_W^{\text{bwd}} = 1/\sqrt{f}$
    • Residual branches: $A_W = 1/\sqrt{\text{depth}}$
  • Backward-pass / learning-rate multiplier $C_W = \eta \cdot c_W$, where
    • Input & hidden: $c_W = 1/\sqrt{f}$
    • Output: $c_W = 1$
    • Residual branches: $c_W = 1/\sqrt{\text{depth}}$

These selections enforce scale independence with respect to the model width $f$. The scheme is summarized in Table 2 of (Blake et al., 2024) and yields width-agnostic training dynamics and update rules; a minimal code sketch of the hidden-layer case is given below.
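
The following PyTorch-style sketch (class and attribute names are illustrative, not taken from the paper's reference implementation) shows how the hidden-layer rules above can be realized: weights are initialized with unit variance ($B_W = 1$), the forward multiplier $A_W = 1/\sqrt{f}$ is applied at matmul time, and the learning-rate multiplier $c_W$ is recorded for later use by the optimizer.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class UmupHiddenLinear(nn.Module):
    """Illustrative u-muP hidden projection: B_W = 1, A_W = c_W = 1/sqrt(fan_in)."""

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        # B_W = 1: every weight entry is drawn i.i.d. from N(0, 1).
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))
        # A_W = 1/sqrt(f): forward-pass multiplier for hidden weights.
        self.forward_mult = 1.0 / math.sqrt(fan_in)
        # c_W = 1/sqrt(f): per-parameter learning-rate multiplier, consumed
        # later when the optimizer's parameter groups are built.
        self.lr_mult = 1.0 / math.sqrt(fan_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight W = A_W * w keeps activations at unit scale.
        return F.linear(x, self.forward_mult * self.weight)
```

Output (readout) weights use different forward and backward multipliers ($1/f$ vs. $1/\sqrt{f}$ in the list above), which cannot be expressed as a single scalar on the weight and would require a custom autograd scaling function; the sketch covers only the hidden case.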

2. Initialization and Unit Variance Enforcement

Under u-μP, every weight entry $w_{0,ij}$ is initialized i.i.d. from $N(0,1)$. The effective forward weight $W_{ij} = A_W w_{ij}$ thus has variance $A_W^2$. For hidden layers, $A_W = 1/\sqrt{f}$ yields $\mathrm{Var}(W_{ij}) = 1/f$. Given inputs $x_j \sim N(0,1)$, the preactivation $(Wx)_i$ satisfies

$$\mathbb{E}\left[(Wx)_i^2\right] = f \cdot (1/f) \cdot 1 = 1,$$

which is independent of $f$. By contrast, standard μP uses $B_W = 1$ for input and output layers, $B_W = 1/\sqrt{f}$ for hidden layers, and $A_W = 1$ everywhere, so only hidden layers are "Xavier-like" at initialization. Hence, u-μP enforces unit variance across all layers, guaranteeing numerically safe ranges at initialization.
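
A quick numerical check (a NumPy sketch, not code from the paper) makes the width independence concrete: with unit-variance weights scaled by $1/\sqrt{f}$, the preactivation variance stays near 1 whether the fan-in is 256 or 4096.

```python
import numpy as np

rng = np.random.default_rng(0)

for fan_in in (256, 4096):
    # B_W = 1: unit-variance weights; A_W = 1/sqrt(fan_in) applied as a forward multiplier.
    w = rng.standard_normal((fan_in, fan_in))
    W = w / np.sqrt(fan_in)
    x = rng.standard_normal((fan_in, 1000))  # unit-variance inputs, 1000 samples
    y = W @ x
    print(fan_in, np.var(y))  # ~1.0 at both widths
```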

3. Forward and Backward Scaling

The forward and backward passes in u-μP are explicitly designed to ensure that activations, gradients, and weight updates remain at $O(1)$ or $O(\eta)$ scale, independent of width.

Forward: For hidden weights,

  • $w_{ij} \sim N(0,1)$, $W = w/\sqrt{f}$, $x_j \sim N(0,1)$
  • $y_i = \sum_j W_{ij} x_j \sim N(0,1)$

Backward: The parameter update is

$$\Delta w = -C_W \cdot \partial\mathcal{L}/\partial w,$$

so that

$$\Delta W = A_W \cdot \Delta w = -(A_W C_W)\,\partial\mathcal{L}/\partial w = -\eta\,(A_W c_W)\,\partial\mathcal{L}/\partial w.$$

For hidden weights, $A_W C_W = (1/\sqrt{f})(\eta/\sqrt{f}) = \eta/f$, so each coordinate of $W$ shifts by $O(\eta/f)$. Since the weight gradient of a linear layer has the outer-product form $g\,x^\top$ (output gradient times input activation), with unit-scale entries under Unit Scaling, the induced change in the layer output is $\Delta W\,x \approx -(\eta/f)\,g\,(x^\top x) \approx -\eta\,g$, which is $O(\eta)$ independent of width. All relevant quantities are therefore constant-scale with respect to $f$.
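
This cancellation can be checked numerically. The sketch below (illustrative NumPy, not code from the paper) applies an update with the combined multiplier $\eta/f$ to a unit-scale, outer-product weight gradient and confirms that the resulting change in the layer output has RMS roughly $\eta$ at both widths.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 1.0

for fan_in in (256, 4096):
    x = rng.standard_normal(fan_in)        # unit-scale input activation
    g = rng.standard_normal(fan_in)        # unit-scale gradient w.r.t. the layer output
    grad_W = np.outer(g, x)                # dL/dW for a linear layer: outer product g x^T
    # Combined multiplier A_W * C_W = (1/sqrt(f)) * (eta/sqrt(f)) = eta / f.
    delta_W = -(eta / fan_in) * grad_W
    delta_y = delta_W @ x                  # = -(eta/f) * g * (x.x) ~ -eta * g
    print(fan_in, np.sqrt(np.mean(delta_y**2)))  # ~eta, independent of width
```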

4. Comparison with Other Parametrizations

For comparison, the abc multipliers of the main parametrization schemes are tabulated below:

| Parametrization | $A_W$ | $B_W$ | $C_W$ |
|---|---|---|---|
| SP/Xavier | $1/\sqrt{f}$ | $1/\sqrt{f}$ | $\eta$ |
| NTK | $1/\sqrt{f}$ | $1/\sqrt{f}$ | $\eta/f$ |
| μP | $1$ (all layers) | $1/\sqrt{f}$ (hidden), $1$ (others) | $\eta/f$ (hidden), $\eta$ (others) |
| u-μP | $1/\sqrt{f}$ (hidden), $1$ (others) | $1$ | $\eta/\sqrt{f}$ (hidden), $\eta$ (output) |

u-μP discards the "base shape" and $\sigma_{\text{init}}$ hyperparameters of standard μP, instead initializing all weights to unit variance ($B_W = 1$) and using the abc multipliers exclusively to cancel width-dependent factors in the forward and backward computations. α-multipliers are attached to non-homogeneous operations for interpretability. Embedding learning rates are handled via $c_{\text{emb}} = \eta/\sqrt{d_{\text{emb}}}$, which improves empirical transfer across widths. Residual-branch scaling is further decoupled into two hyperparameters, $\alpha_{\text{res}}$ and $\alpha_{\text{res-attn-ratio}}$, which separately control the overall residual contribution and the relative weighting of the attention and FFN branches. In practice, the per-parameter learning-rate multipliers can be implemented as optimizer parameter groups, as sketched below.
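
The following sketch (the function name, parameter-naming heuristics, and the treatment of biases are illustrative assumptions, not the paper's reference code) builds PyTorch optimizer parameter groups that assign $\eta/\sqrt{\text{fan-in}}$ to hidden weights, $\eta$ to the readout, and $\eta/\sqrt{d_{\text{emb}}}$ to embeddings.

```python
import math
import torch

def umup_param_groups(model: torch.nn.Module, eta: float, d_emb: int):
    """Per-parameter learning rates following the u-muP c_W rules (illustrative)."""
    groups = []
    for name, p in model.named_parameters():
        if "embed" in name:
            lr = eta / math.sqrt(d_emb)        # c_emb = eta / sqrt(d_emb)
        elif "readout" in name or "lm_head" in name:
            lr = eta                           # output weights: c_W = 1
        elif p.dim() >= 2:
            lr = eta / math.sqrt(p.shape[-1])  # hidden weights: c_W = 1 / sqrt(fan_in)
        else:
            lr = eta                           # biases / norm gains (not covered by the table above)
        groups.append({"params": [p], "lr": lr})
    return groups

# Hypothetical usage:
# optimizer = torch.optim.Adam(umup_param_groups(model, eta=2**1.5, d_emb=1024))
```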

5. Default Hyperparameters and Low-Precision Compatibility

Because unit scale is enforced throughout, most α-multipliers sit near $1$ at their optimum. The recommended defaults, collected in the configuration sketch at the end of this section, are:

  • Global learning rate $\eta = 2^{1.5} \approx 2.8$
  • $\alpha_{\text{res}} = 1$, $\alpha_{\text{res-attn-ratio}} = 1$, $\alpha_{\text{ffn-act}} = 1$, $\alpha_{\text{attn}} = 1$, $\alpha_{\text{out}} = 1$

Empirical 1D sweeps show that only $\eta$ benefits from tuning; with all other α's fixed at $1$, the loss remains within $1\%$ of optimal. For low-precision training (e.g., FP8 in the E4M3 format), all matrix-multiplication inputs, weights, and output gradients have $\mathrm{RMS} \approx 1$, obviating the need for per-tensor dynamic scaling. The few layers whose scale grows during training (e.g., the final attention/FFN projections) can use the wider-range E5M2 format or minimal rescaling, keeping degradation below $0.1$ percentage points. By contrast, μP without Unit Scaling diverges due to gradient underflow in FP8.
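
As a concrete illustration, the defaults above can be collected in a single configuration object (a hypothetical dictionary, not the paper's code), leaving $\eta$ as the one value worth sweeping:

```python
# Recommended u-muP defaults from the text; only "eta" typically needs tuning.
UMUP_DEFAULTS = {
    "eta": 2 ** 1.5,               # global learning rate, ~2.8
    "alpha_res": 1.0,              # overall residual-branch scale
    "alpha_res_attn_ratio": 1.0,   # attention vs. FFN residual weighting
    "alpha_ffn_act": 1.0,
    "alpha_attn": 1.0,
    "alpha_out": 1.0,
}
```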

6. Decoupling Properties and Practical Implications

u-μP achieves full decoupling of hyperparameters (such as the learning rate and α-multipliers) from model width, depth, batch size, and precision. The mechanism entails:

  • Initializing $w \sim N(0,1)$ and scaling via $A_W = 1/\sqrt{f}$ in hidden layers.
  • Assigning per-parameter learning rates $C_W = \eta/\sqrt{f}$, so that $A_W C_W = \eta/f$.
  • Matching width-dependent factors, which yields $O(1)$ dynamics at all stages and updates of the form $\Delta W = -(\eta/f)\,\partial\mathcal{L}/\partial w$.

Auxiliary engineering choices (such as $\sigma_{\text{init}}$ and the base shape) are removed, retaining only interpretable α-multipliers and a global $\eta$. The result is a two-stage hyperparameter search: a single $\eta$ sweep, followed by optional 1D sweeps for the α-multipliers. This process is computationally efficient, making hyperparameter optimization tractable at proxy scale and directly transferable to large models.
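
A sketch of the resulting workflow (the train_and_eval callback, widths, and sweep grid are illustrative assumptions): tune $\eta$ once on a narrow proxy model, then reuse the best value unchanged at the target width.

```python
# Hypothetical two-stage search: under u-muP, the eta found on a small proxy
# transfers zero-shot to the full-width model.
ETA_GRID = [2.0 ** k for k in range(-2, 5)]  # candidate learning rates bracketing the default 2**1.5

def sweep_eta(train_and_eval, proxy_width: int) -> float:
    """Stage 1: 1D eta sweep on a cheap proxy model; returns the best eta."""
    losses = {eta: train_and_eval(width=proxy_width, eta=eta) for eta in ETA_GRID}
    return min(losses, key=losses.get)

def train_target(train_and_eval, target_width: int, eta: float):
    """Stage 2: train the target-width model with the proxy's eta (zero-shot transfer)."""
    return train_and_eval(width=target_width, eta=eta)

# Hypothetical usage, given a user-supplied train_and_eval(width=..., eta=...) -> loss:
# best_eta = sweep_eta(train_and_eval, proxy_width=256)
# final_loss = train_target(train_and_eval, target_width=4096, eta=best_eta)
```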

This suggests that practical model training can proceed with minimal bespoke tuning and direct export to mixed- or low-precision hardware, while hyperparameter transferability is guaranteed.

7. Summary and Broader Impact

u-μP constitutes a decoupled parametrization framework in which core hyperparameters are rendered independent of network architecture and compute regime. Key outcomes include:

  • Zero-shot learning rate transfer from small proxy models.
  • Efficient, independent hyperparameter search, often only a 1D $\eta$ sweep.
  • Out-of-the-box compatibility with FP8-level low-precision formats.
  • Full interpretability of scaling multipliers as unit-variance ratios.

These advances streamline large-scale model development and facilitate robust deployment across varying model and hardware configurations, using only minor code adjustments (Blake et al., 2024).

References

  • Blake et al. (2024). u-μP: The Unit-Scaled Maximal Update Parametrization. arXiv preprint.
