Scaling rules for learnable multipliers under μP-style model-size scaling
Determine whether the learning-rate and weight-decay hyperparameters of learnable multipliers should be scaled with model size (width) so that activation magnitudes and feature-learning strength are preserved when classical μP scaling is generalized to architectures with such multipliers.
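To make the question concrete, the following is a minimal sketch of width-dependent hyperparameter transfer. It assumes the standard μP rule for Adam on hidden matrices (LR scaled by `base_width / width`) and exposes the multiplier treatment as a free exponent `mult_lr_exp`, since the correct value for multipliers is precisely the open question; the function and parameter names are hypothetical, not from the paper.

```python
def scale_hyperparams(base_lr, base_wd, base_width, width,
                      matrix_lr_exp=1.0, mult_lr_exp=0.0, mult_wd_exp=0.0):
    """Scale per-group LR/WD from a base width to a target width.

    matrix_lr_exp=1.0 encodes the standard muP Adam rule for hidden
    matrices (LR proportional to 1/width). mult_lr_exp and mult_wd_exp
    are the unknown exponents for learnable multipliers; 0.0 means
    "keep the base value unchanged" (hypothetical default).
    """
    ratio = base_width / width
    return {
        "hidden_matrices": {"lr": base_lr * ratio ** matrix_lr_exp,
                            "wd": base_wd},
        "multipliers": {"lr": base_lr * ratio ** mult_lr_exp,
                        "wd": base_wd * ratio ** mult_wd_exp},
    }
```

For example, transferring from width 256 to width 1024 with the defaults shrinks the matrix LR by 4x while leaving multiplier LR and WD untouched; sweeping `mult_lr_exp` and `mult_wd_exp` across widths would be one empirical way to probe the scaling rule the authors ask about.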
References
Yet, many questions are left open. The next set of questions concerns developing a complete set of scaling rules, generalizing classical μP scaling to the presence of learnable multipliers. For example, should we scale the LR and WD (which constrains symmetries) of multipliers with model size?
— Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers
(2601.04890 - Velikanov et al., 8 Jan 2026) in Section 6: Conclusion and discussion