Universality of the √d weight-decay scaling across architectures
Determine whether the empirically observed weight-decay scaling rule $\lambda_2 \propto \sqrt{d}$ for matrix-like parameters trained with AdamW, used to preserve sublayer-gain invariance across widths in LLaMA-style Transformers, is universal across architectures such as mixture-of-experts and alternatives to self-attention, or whether the scaling factor depends on specific architectural choices.
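For concreteness, the following is a minimal sketch (not taken from the paper) of how the $\lambda_2 \propto \sqrt{d}$ rule might be applied when transferring AdamW settings across widths; the names `base_width`, `base_weight_decay`, and the cutoff "matrix-like means 2-D parameters" are illustrative assumptions.

```python
# Hypothetical sketch: scale weight decay for matrix-like parameters as
# lambda_2(d) = lambda_base * sqrt(d / d_base), leaving vector-like
# parameters (biases, norm gains) undecayed. Names are illustrative.
import math

import torch
from torch import nn


def make_adamw(model: nn.Module, width: int,
               base_width: int = 1024,
               base_weight_decay: float = 0.1,
               lr: float = 3e-4) -> torch.optim.AdamW:
    """Build AdamW with sqrt(width)-scaled decay on matrix-like parameters."""
    scaled_wd = base_weight_decay * math.sqrt(width / base_width)

    matrix_params, vector_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # Matrix-like: 2-D weights (linear / attention projections);
        # vector-like: 1-D parameters such as biases and norm gains.
        (matrix_params if p.ndim >= 2 else vector_params).append(p)

    return torch.optim.AdamW(
        [
            {"params": matrix_params, "weight_decay": scaled_wd},
            {"params": vector_params, "weight_decay": 0.0},
        ],
        lr=lr,
    )
```

Under this sketch, whether the same $\sqrt{d}$ factor remains appropriate for, e.g., expert weights in a mixture-of-experts block or for attention-free token mixers is exactly the open question stated above.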
References
It is not obvious whether the observed scaling rule $\lambda_2 \propto \sqrt{d}$ is universal across all architectures: mixture-of-experts architectures, alternatives to self-attention, or other architectural choices might alter the scaling factor, and this would be interesting to study.
— Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
(arXiv:2510.15262, Fan et al., 17 Oct 2025), Conclusion, Scope paragraph