
Decoupled μP Parametrization

Updated 2 January 2026
  • Decoupled μP Parametrization is a neural network parameterization method that enforces unit variance across all layers, decoupling hyperparameters from model width.
  • It employs ABC multipliers in the forward and backward passes to maintain consistent activation and gradient scales, thereby ensuring stable training across different model sizes.
  • The approach streamlines hyperparameter tuning and enables efficient proxy-model sweeps, facilitating direct training and deployment in low-precision formats such as FP8.

Decoupled μP Parametrization, also referred to as unit-scaled μP ("u-μP"), is a methodology for neural network parametrization that ensures model hyperparameters, such as the learning rate, are independent of model width and compatible with low-precision training. u-μP combines the Maximal Update Parametrization (μP) with Unit Scaling, addressing key practical limitations of prior schemes by guaranteeing that all activations, weights, and gradients have unit variance at initialization and by maintaining consistent scaling throughout training. This approach directly enables zero-shot transfer of hyperparameters across model scales, efficient hyperparameter sweeps using proxy models, and robust operation in low-precision formats such as FP8, while further decoupling individual hyperparameters from one another (Blake et al., 2024).

1. Theoretical Foundations and Formal Definition

The u-μP parametrization is built on the abc-parametrization framework: given a linear layer $y = Wx$ with weight matrix $W$ of shape (out, in) and fan-in $f = \mathrm{in}$, one defines

$$w_0 \sim N(0, B_W^2), \qquad W = A_W \cdot w, \qquad w \gets w + C_W \cdot (-\nabla_w \mathcal{L}).$$

u-μP specifies, for each trainable matrix $W$:

  • Initialization multiplier: $B_W = 1$ (unit variance).
  • Forward-pass multiplier $A_W$ ("parameter scale"):
    • Input embeddings: $A_W = 1$
    • Hidden projection weights: $A_W = 1/\sqrt{f}$
    • Output (readout) weights: $A_W^{\text{fwd}} = 1/f$, $A_W^{\text{bwd}} = 1/\sqrt{f}$
    • Residual branches: $A_W = 1/\sqrt{\text{depth}}$
  • Backward-pass / learning-rate multiplier $C_W = \eta \cdot c_W$, where
    • Input & hidden: $c_W = 1/\sqrt{f}$
    • Output: $c_W = 1$
    • Residual branches: $c_W = 1/\sqrt{\text{depth}}$

These selections enforce scale independence with respect to the model width $f$. The scheme is summarized in Table 2 of (Blake et al., 2024) and yields width-agnostic training dynamics and update rules; a minimal code sketch of the hidden-layer case is given below.
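
The following PyTorch-style sketch (class and attribute names are illustrative, not taken from the paper's reference implementation) shows how the hidden-layer rules above can be realized: weights are initialized with unit variance ($B_W = 1$), the forward multiplier $A_W = 1/\sqrt{f}$ is applied at matmul time, and the learning-rate multiplier $c_W$ is recorded for later use by the optimizer.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class UmupHiddenLinear(nn.Module):
    """Illustrative u-muP hidden projection: B_W = 1, A_W = c_W = 1/sqrt(fan_in)."""

    def __init__(self, fan_in: int, fan_out: int):
        super().__init__()
        # B_W = 1: every weight entry is drawn i.i.d. from N(0, 1).
        self.weight = nn.Parameter(torch.randn(fan_out, fan_in))
        # A_W = 1/sqrt(f): forward-pass multiplier for hidden weights.
        self.forward_mult = 1.0 / math.sqrt(fan_in)
        # c_W = 1/sqrt(f): per-parameter learning-rate multiplier, consumed
        # later when the optimizer's parameter groups are built.
        self.lr_mult = 1.0 / math.sqrt(fan_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight W = A_W * w keeps activations at unit scale.
        return F.linear(x, self.forward_mult * self.weight)
```

Output (readout) weights use different forward and backward multipliers ($1/f$ vs. $1/\sqrt{f}$ in the list above), which cannot be expressed as a single scalar on the weight and would require a custom autograd scaling function; the sketch covers only the hidden case.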

2. Initialization and Unit Variance Enforcement

Under u-μP, every weight entry $w_{0,ij}$ is initialized i.i.d. from $N(0,1)$. The effective forward weight $W_{ij} = A_W w_{ij}$ thus has variance $A_W^2$. For hidden layers, $A_W = 1/\sqrt{f}$ yields $\mathrm{Var}(W_{ij}) = 1/f$. Given inputs $x_j \sim N(0,1)$, the preactivation $(Wx)_i$ satisfies

$$\mathbb{E}\left[(Wx)_i^2\right] = f \cdot (1/f) \cdot 1 = 1,$$

which is independent of $f$. By contrast, standard μP uses $B_W = 1$ for input and output layers, $B_W = 1/\sqrt{f}$ for hidden layers, and $A_W = 1$ everywhere, so only hidden layers are "Xavier-like" at initialization. Hence, u-μP enforces unit variance across all layers, guaranteeing numerically safe ranges at initialization.
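
A quick numerical check (a NumPy sketch, not code from the paper) makes the width independence concrete: with unit-variance weights scaled by $1/\sqrt{f}$, the preactivation variance stays near 1 whether the fan-in is 256 or 4096.

```python
import numpy as np

rng = np.random.default_rng(0)

for fan_in in (256, 4096):
    # B_W = 1: unit-variance weights; A_W = 1/sqrt(fan_in) applied as a forward multiplier.
    w = rng.standard_normal((fan_in, fan_in))
    W = w / np.sqrt(fan_in)
    x = rng.standard_normal((fan_in, 1000))  # unit-variance inputs, 1000 samples
    y = W @ x
    print(fan_in, np.var(y))  # ~1.0 at both widths
```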

3. Forward and Backward Scaling

The forward and backward passes in u-μP are explicitly designed to ensure that activations, gradients, and weight updates remain at $O(1)$ or $O(\eta)$ scale, independent of width.

Forward: For hidden weights,

  • $w_{ij} \sim N(0,1)$, $W = w/\sqrt{f}$, $x_j \sim N(0,1)$
  • $y_i = \sum_j W_{ij} x_j \sim N(0,1)$

Backward: The parameter update is

$$\Delta w = -C_W \cdot \partial\mathcal{L}/\partial w,$$

so that

$$\Delta W = A_W \cdot \Delta w = -(A_W C_W)\,\partial\mathcal{L}/\partial w = -\eta\,(A_W c_W)\,\partial\mathcal{L}/\partial w.$$

For hidden weights, $A_W C_W = (1/\sqrt{f})(\eta/\sqrt{f}) = \eta/f$, so each coordinate of $W$ shifts by $O(\eta/f)$. Since the weight gradient of a linear layer has the outer-product form $g\,x^\top$ (output gradient times input activation), with unit-scale entries under Unit Scaling, the induced change in the layer output is $\Delta W\,x \approx -(\eta/f)\,g\,(x^\top x) \approx -\eta\,g$, which is $O(\eta)$ independent of width. All relevant quantities are therefore constant-scale with respect to $f$.
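
This cancellation can be checked numerically. The sketch below (illustrative NumPy, not code from the paper) applies an update with the combined multiplier $\eta/f$ to a unit-scale, outer-product weight gradient and confirms that the resulting change in the layer output has RMS roughly $\eta$ at both widths.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 1.0

for fan_in in (256, 4096):
    x = rng.standard_normal(fan_in)        # unit-scale input activation
    g = rng.standard_normal(fan_in)        # unit-scale gradient w.r.t. the layer output
    grad_W = np.outer(g, x)                # dL/dW for a linear layer: outer product g x^T
    # Combined multiplier A_W * C_W = (1/sqrt(f)) * (eta/sqrt(f)) = eta / f.
    delta_W = -(eta / fan_in) * grad_W
    delta_y = delta_W @ x                  # = -(eta/f) * g * (x.x) ~ -eta * g
    print(fan_in, np.sqrt(np.mean(delta_y**2)))  # ~eta, independent of width
```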

4. Comparison with Other Parametrizations

For comparison, the abc multipliers of the main parametrization schemes are tabulated below:

| Parametrization | $A_W$ | $B_W$ | $C_W$ |
|---|---|---|---|
| SP/Xavier | $1/\sqrt{f}$ | $1/\sqrt{f}$ | $\eta$ |
| NTK | $1/\sqrt{f}$ | $1/\sqrt{f}$ | $\eta/f$ |
| μP | $1$ (all layers) | $1/\sqrt{f}$ (hidden), $1$ (others) | $\eta/f$ (hidden), $\eta$ (others) |
| u-μP | $1/\sqrt{f}$ (hidden), $1$ (others) | $1$ | $\eta/\sqrt{f}$ (hidden), $\eta$ (output) |

u-μP discards the "base shape" and $\sigma_{\text{init}}$ hyperparameters of standard μP, instead initializing all weights to unit variance ($B_W = 1$) and using the abc multipliers exclusively to cancel width-dependent factors in the forward and backward computations. α-multipliers are attached to non-homogeneous operations for interpretability. Embedding learning rates are handled via $c_{\text{emb}} = \eta/\sqrt{d_{\text{emb}}}$, which improves empirical transfer across widths. Residual-branch scaling is further decoupled into two hyperparameters, $\alpha_{\text{res}}$ and $\alpha_{\text{res-attn-ratio}}$, which separately control the overall residual contribution and the relative weighting of the attention and FFN branches. In practice, the per-parameter learning-rate multipliers can be implemented as optimizer parameter groups, as sketched below.
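
The following sketch (the function name, parameter-naming heuristics, and the treatment of biases are illustrative assumptions, not the paper's reference code) builds PyTorch optimizer parameter groups that assign $\eta/\sqrt{\text{fan-in}}$ to hidden weights, $\eta$ to the readout, and $\eta/\sqrt{d_{\text{emb}}}$ to embeddings.

```python
import math
import torch

def umup_param_groups(model: torch.nn.Module, eta: float, d_emb: int):
    """Per-parameter learning rates following the u-muP c_W rules (illustrative)."""
    groups = []
    for name, p in model.named_parameters():
        if "embed" in name:
            lr = eta / math.sqrt(d_emb)        # c_emb = eta / sqrt(d_emb)
        elif "readout" in name or "lm_head" in name:
            lr = eta                           # output weights: c_W = 1
        elif p.dim() >= 2:
            lr = eta / math.sqrt(p.shape[-1])  # hidden weights: c_W = 1 / sqrt(fan_in)
        else:
            lr = eta                           # biases / norm gains (not covered by the table above)
        groups.append({"params": [p], "lr": lr})
    return groups

# Hypothetical usage:
# optimizer = torch.optim.Adam(umup_param_groups(model, eta=2**1.5, d_emb=1024))
```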

5. Default Hyperparameters and Low-Precision Compatibility

Because unit scale is enforced throughout, most α-multipliers sit near $1$ at their optimum. The recommended defaults, collected in the configuration sketch at the end of this section, are:

  • Global learning rate $\eta = 2^{1.5} \approx 2.8$
  • $\alpha_{\text{res}} = 1$, $\alpha_{\text{res-attn-ratio}} = 1$, $\alpha_{\text{ffn-act}} = 1$, $\alpha_{\text{attn}} = 1$, $\alpha_{\text{out}} = 1$

Empirical 1D sweeps show that only $\eta$ benefits from tuning; with all other α's fixed at $1$, the loss remains within $1\%$ of optimal. For low-precision training (e.g., FP8 in the E4M3 format), all matrix-multiplication inputs, weights, and output gradients have $\mathrm{RMS} \approx 1$, obviating the need for per-tensor dynamic scaling. The few layers whose scale grows during training (e.g., the final attention/FFN projections) can use the wider-range E5M2 format or minimal rescaling, keeping degradation below $0.1$ percentage points. By contrast, μP without Unit Scaling diverges due to gradient underflow in FP8.
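
As a concrete illustration, the defaults above can be collected in a single configuration object (a hypothetical dictionary, not the paper's code), leaving $\eta$ as the one value worth sweeping:

```python
# Recommended u-muP defaults from the text; only "eta" typically needs tuning.
UMUP_DEFAULTS = {
    "eta": 2 ** 1.5,               # global learning rate, ~2.8
    "alpha_res": 1.0,              # overall residual-branch scale
    "alpha_res_attn_ratio": 1.0,   # attention vs. FFN residual weighting
    "alpha_ffn_act": 1.0,
    "alpha_attn": 1.0,
    "alpha_out": 1.0,
}
```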

6. Decoupling Properties and Practical Implications

u-μP achieves full decoupling of hyperparameters (such as the learning rate and α-multipliers) from model width, depth, batch size, and precision. The mechanism entails:

  • Initializing $w \sim N(0,1)$ and scaling via $A_W = 1/\sqrt{f}$ in hidden layers.
  • Assigning per-parameter learning rates $C_W = \eta/\sqrt{f}$, so that $A_W C_W = \eta/f$.
  • Matching width-dependent factors, which yields $O(1)$ dynamics at all stages and updates of the form $\Delta W = -(\eta/f)\,\partial\mathcal{L}/\partial w$.

Auxiliary engineering choices (such as $\sigma_{\text{init}}$ and the base shape) are removed, retaining only interpretable α-multipliers and a global $\eta$. The result is a two-stage hyperparameter search: a single $\eta$ sweep, followed by optional 1D sweeps for the α-multipliers. This process is computationally efficient, making hyperparameter optimization tractable at proxy scale and directly transferable to large models.
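
A sketch of the resulting workflow (the train_and_eval callback, widths, and sweep grid are illustrative assumptions): tune $\eta$ once on a narrow proxy model, then reuse the best value unchanged at the target width.

```python
# Hypothetical two-stage search: under u-muP, the eta found on a small proxy
# transfers zero-shot to the full-width model.
ETA_GRID = [2.0 ** k for k in range(-2, 5)]  # candidate learning rates bracketing the default 2**1.5

def sweep_eta(train_and_eval, proxy_width: int) -> float:
    """Stage 1: 1D eta sweep on a cheap proxy model; returns the best eta."""
    losses = {eta: train_and_eval(width=proxy_width, eta=eta) for eta in ETA_GRID}
    return min(losses, key=losses.get)

def train_target(train_and_eval, target_width: int, eta: float):
    """Stage 2: train the target-width model with the proxy's eta (zero-shot transfer)."""
    return train_and_eval(width=target_width, eta=eta)

# Hypothetical usage, given a user-supplied train_and_eval(width=..., eta=...) -> loss:
# best_eta = sweep_eta(train_and_eval, proxy_width=256)
# final_loss = train_target(train_and_eval, target_width=4096, eta=best_eta)
```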

This suggests that practical model training can proceed with minimal bespoke tuning and direct export to mixed- or low-precision hardware, while hyperparameter transferability is guaranteed.

7. Summary and Broader Impact

u-μP constitutes a decoupled parametrization framework in which core hyperparameters are rendered independent of network architecture and compute regime. Key outcomes include:

  • Zero-shot learning rate transfer from small proxy models.
  • Efficient, independent hyperparameter search, often only a 1D $\eta$ sweep.
  • Out-of-the-box compatibility with FP8-level low-precision formats.
  • Full interpretability of scaling multipliers as unit-variance ratios.

These advances streamline large-scale model development and facilitate robust deployment across varying model and hardware configurations, using only minor code adjustments (Blake et al., 2024).

References

  • Blake et al. (2024). u-μP: The Unit-Scaled Maximal Update Parametrization. arXiv preprint.
