Universality of the √d weight-decay scaling across architectures

Determine whether the empirically observed weight-decay scaling rule λ₂ ∝ √d for matrix-like parameters trained with AdamW (used to preserve sublayer-gain invariance across widths in LLaMA-style Transformers) is universal across architectures such as mixture-of-experts and alternatives to self-attention, or whether the scaling factor depends on specific architectural choices.

Background

The paper studies how to achieve width-robust hyperparameter transfer under AdamW by tuning weight decay in conjunction with μP learning-rate rules. Empirically, in the optimizer-governed steady state, singular-value magnitudes of matrix parameters scale with √(η/λ) while the spectrum shape is approximately invariant. Under width scaling d, the top singular value scales as √(η/λ)·d^0.75. Combining this with μP's matrix learning-rate rule η₂ ∝ d^−1 implies a weight-decay scaling λ₂ ∝ √d for matrix-like parameters, while vector-like parameters use η₁ = Θ_d(1) and λ₁ = 0.
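
These rules translate directly into per-parameter-group optimizer settings. The snippet below is a minimal sketch, not the authors' released code, of how one might scale AdamW hyperparameters from a tuned base width to a target width; the helper name make_adamw, the base_lr/base_wd values, and the ndim-based split into matrix-like and vector-like parameters are illustrative assumptions.

```python
# Sketch: width-scaled AdamW parameter groups following eta_2 ∝ d^-1, lambda_2 ∝ sqrt(d)
# for matrix-like parameters and eta_1 = Θ_d(1), lambda_1 = 0 for vector-like parameters.
import torch


def make_adamw(model, width, base_width, base_lr, base_wd):
    """Build AdamW with width-scaled per-group hyperparameters (illustrative)."""
    ratio = width / base_width
    matrix_params = [p for p in model.parameters() if p.ndim >= 2]
    vector_params = [p for p in model.parameters() if p.ndim < 2]  # biases, norm gains
    groups = [
        {"params": matrix_params,
         "lr": base_lr / ratio,                 # eta_2 ∝ d^-1 (muP matrix rule)
         "weight_decay": base_wd * ratio**0.5}, # lambda_2 ∝ sqrt(d)
        {"params": vector_params,
         "lr": base_lr,                         # eta_1 = Θ_d(1)
         "weight_decay": 0.0},                  # lambda_1 = 0
    ]
    return torch.optim.AdamW(groups)


# Example (hypothetical): transfer hyperparameters tuned at width 256 to width 1024.
# model = build_model(width=1024)
# opt = make_adamw(model, width=1024, base_width=256, base_lr=3e-3, base_wd=0.1)
```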

The authors validate this scheme on LLaMA-style Transformers and a synthetic two-layer FFN, and propose matching top singular values across widths as a diagnostic for sublayer-gain invariance. However, they explicitly note uncertainty about whether this λ₂ ∝ √d rule generalizes beyond the studied setting, particularly to mixture-of-experts architectures, alternatives to self-attention, or other architectural choices that may alter the scaling factor.
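
To make the diagnostic concrete, the sketch below shows one way to compare top singular values across widths. It assumes the steady-state relation σ_max ≈ c·√(η/λ)·d^0.75 quoted above; the function names and the normalization choice are illustrative, not the paper's actual evaluation script.

```python
# Sketch: top-singular-value diagnostic for sublayer-gain invariance across widths.
import torch


@torch.no_grad()
def top_singular_values(model):
    """Largest singular value of each matrix-like parameter (spectral norm)."""
    return {name: torch.linalg.matrix_norm(p, ord=2).item()
            for name, p in model.named_parameters() if p.ndim >= 2}


@torch.no_grad()
def normalized_gains(model, lr, weight_decay, width):
    """Divide out the predicted sqrt(lr / weight_decay) * width**0.75 factor.

    If the lambda_2 ∝ sqrt(d), eta_2 ∝ d^-1 rules hold, these normalized values
    should be approximately width-independent, i.e. sublayer gains are preserved.
    """
    scale = (lr / weight_decay) ** 0.5 * width ** 0.75
    return {name: s / scale for name, s in top_singular_values(model).items()}


# Example (hypothetical): compute normalized gains for two trained models at
# widths 256 and 1024; approximate per-layer agreement indicates the transfer
# rule is preserving sublayer gains.
```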

References

It is not obvious whether the observed scaling rule $\lambda_2 \propto \sqrt{d}$ is universal across all architectures: mixture-of-experts architectures, alternatives to self-attention, or other architectural choices might alter the scaling factor, and this would be interesting to study.

Robust Layerwise Scaling Rules by Proper Weight Decay Tuning (Fan et al., arXiv:2510.15262, 17 Oct 2025), Conclusion, Scope paragraph