Theoretical justification for the new embedding learning-rate scaling rule
Establish a theoretical justification for scaling the embedding-layer learning rate by 1/sqrt(fan-in) in the Unit-Scaled Maximal Update Parametrization (u-μP), clarifying why this rule should replace the Maximal Update Parametrization's (μP's) constant-width embedding learning-rate rule and under what assumptions it ensures improved hyperparameter transfer across width.
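To make the two rules concrete, here is a minimal sketch contrasting them. The helper `embedding_lr` and the scheme names are hypothetical (not from the paper); it only encodes the scalings stated above: μP keeps the embedding learning rate constant in width, while u-μP multiplies it by 1/sqrt(fan-in).

```python
import math

def embedding_lr(base_lr: float, fan_in: int, scheme: str = "u-mup") -> float:
    """Embedding-layer learning rate as a function of width.

    Hypothetical illustration of the two rules in question:
      - "mup":   muP's constant-width embedding LR.
      - "u-mup": u-muP's 1/sqrt(fan-in) scaling, the rule whose
                 theoretical justification is sought here.
    """
    if scheme == "mup":
        return base_lr                      # independent of width
    if scheme == "u-mup":
        return base_lr / math.sqrt(fan_in)  # shrinks as width grows
    raise ValueError(f"unknown scheme: {scheme}")

# Doubling the width leaves the muP embedding LR unchanged but
# shrinks the u-muP embedding LR by a factor of sqrt(2).
for width in (1024, 2048):
    print(width, embedding_lr(1e-3, width, "mup"), embedding_lr(1e-3, width, "u-mup"))
```

Any justification would need to explain why the 1/sqrt(fan-in) factor, rather than a constant, yields width-stable optimal embedding learning rates.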
References
We offer no theoretical justification for our rule, which we leave to further work.
                — u-μP: The Unit-Scaled Maximal Update Parametrization
                (arXiv:2407.17465, Blake et al., 24 Jul 2024), Section “A new embedding LR rule” (sec:umup:emb_lr_rule)