Scaling rules for learnable multipliers under μP-style model-size scaling

Determine whether the learning rate (LR) and weight decay (WD) of learnable multipliers should be scaled with model size (width), so as to preserve activation magnitudes and feature-learning strength when classical μP scaling is generalized to architectures with such multipliers.

Background

Classical μP scaling prescribes width-scaling rules that keep activations and updates stable as the model grows, but learnable multipliers alter these scaling dynamics. The authors ask whether the LR and WD of the multipliers should be scaled as width grows, both to maintain the desired properties and to resolve symmetry-induced drift.
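To make the question concrete, below is a minimal PyTorch sketch of μP-style Adam parameter groups extended with a separate group for learnable multipliers. The `ScaledLinear` module, the `make_mup_param_groups` helper, and the `mult_lr_exponent` / `mult_wd_exponent` knobs are hypothetical illustrations, not the paper's implementation; the default exponent of 0.0 (no width scaling for multipliers) is a placeholder standing in for the open question.

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Linear layer with a learnable scalar multiplier on its output.
    Illustrative stand-in for a 'learnable multiplier' matrix layer."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in**0.5)
        self.g = nn.Parameter(torch.ones(()))  # learnable multiplier

    def forward(self, x):
        return self.g * (x @ self.weight.t())

def make_mup_param_groups(model, width, base_width=256,
                          base_lr=1e-3, base_wd=0.1,
                          mult_lr_exponent=0.0, mult_wd_exponent=0.0):
    """Adam-style muP groups: hidden matrix LRs scale as 1/width.
    How (or whether) to scale the multipliers' LR and WD with width
    is the open question; exponent 0.0 means 'do not scale'."""
    ratio = base_width / width
    matrix_params, mult_params = [], []
    for name, p in model.named_parameters():
        (mult_params if name.endswith(".g") else matrix_params).append(p)
    return [
        # classical muP rule for matrix-like parameters under Adam
        {"params": matrix_params, "lr": base_lr * ratio,
         "weight_decay": base_wd},
        # hypothetical rule for multipliers, parameterized by exponents
        {"params": mult_params,
         "lr": base_lr * ratio**mult_lr_exponent,
         "weight_decay": base_wd * ratio**mult_wd_exponent},
    ]

width = 1024
model = nn.Sequential(ScaledLinear(width, width), ScaledLinear(width, width))
opt = torch.optim.AdamW(make_mup_param_groups(model, width))
```

Whether the right exponents are 0, 1, or something else is exactly what remains to be determined; since WD constrains the scale symmetry between a multiplier and its matrix, the choice also bears on symmetry-induced drift.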

References

Yet, many questions are left open. The next set of questions relates to developing a complete set of scaling rules, generalizing classical μP scaling to the presence of learnable multipliers. For example, should we scale the LR and WD (which constrains the symmetries) of the multipliers with model size?

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers (2601.04890 - Velikanov et al., 8 Jan 2026) in Section 6: Conclusion and discussion