Explain the sudden change in convergence rate of the optimal learning rate at large width under μP
Determine the cause of the observed sudden change, as width grows, in the empirical convergence rate of the optimal learning rate for linear multi-layer perceptrons parametrized with Maximal Update Parametrization (μP) after one gradient step: the rate matches the theoretical O(n^{-1/2}) prediction up to width n=1024 but becomes much smaller at larger widths. Characterize the true asymptotic convergence rate and identify the conditions under which the upper bound O(n^{-1/2}) is not tight.
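To make the measured quantity concrete, the sketch below shows one way such a convergence rate could be estimated empirically. It is not the paper's code: the synthetic data, the three-layer linear network, the particular μP-style scaling (init standard deviations 1/sqrt(d_in), 1/sqrt(n), 1/n with per-layer SGD learning rates η·n, η, η/n), the seed averaging, and the use of the largest width as a stand-in for the infinite-width optimum are all illustrative assumptions. The "optimal learning rate" is taken to be the base rate η that minimizes the loss after a single full-batch gradient step, and the convergence exponent is read off from a log-log fit of |η*(n) − η*(n_max)| against n.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): estimate how
# fast the one-step optimal base learning rate converges as the width n grows.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
d_in, n_samples = 16, 256
X = rng.standard_normal((n_samples, d_in))   # synthetic inputs (assumption)
y = rng.standard_normal((n_samples, 1))      # synthetic targets (assumption)


def make_one_step_loss(n, seed):
    """Return eta -> loss of a width-n linear MLP after one full-batch SGD step."""
    rng = np.random.default_rng(seed)
    # Assumed muP-style scaling for a 3-layer linear MLP (may differ from the paper):
    #   input  W1: init std 1/sqrt(d_in), per-layer LR eta * n
    #   hidden W2: init std 1/sqrt(n),    per-layer LR eta
    #   output W3: init std 1/n,          per-layer LR eta / n
    W1 = rng.standard_normal((d_in, n)) / np.sqrt(d_in)
    W2 = rng.standard_normal((n, n)) / np.sqrt(n)
    W3 = rng.standard_normal((n, 1)) / n

    # One forward/backward pass; the gradients do not depend on eta.
    H1 = X @ W1
    H2 = H1 @ W2
    g = (H2 @ W3 - y) / n_samples            # d(0.5 * MSE) / d(predictions)
    G3 = H2.T @ g                            # dL/dW3
    G2 = H1.T @ (g @ W3.T)                   # dL/dW2
    G1 = X.T @ (g @ (W2 @ W3).T)             # dL/dW1

    def loss(eta):
        W1u = W1 - eta * n * G1
        W2u = W2 - eta * G2
        W3u = W3 - (eta / n) * G3
        pred = (X @ W1u) @ (W2u @ W3u)       # O(n^2) with this multiplication order
        return 0.5 * np.mean((pred - y) ** 2)

    return loss


def optimal_lr(n, n_seeds=3):
    """Base learning rate minimizing the one-step loss, averaged over seeds."""
    etas = []
    for seed in range(n_seeds):
        loss = make_one_step_loss(n, seed)
        # Assumes the one-step loss has a single minimum inside this bracket.
        res = minimize_scalar(loss, bounds=(1e-3, 1e1), method="bounded")
        etas.append(res.x)
    return float(np.mean(etas))


widths = [64, 128, 256, 512, 1024, 2048, 4096]
etas = np.array([optimal_lr(n) for n in widths])
gaps = np.abs(etas[:-1] - etas[-1])          # largest width as proxy for the n -> inf limit
slope, _ = np.polyfit(np.log(widths[:-1]), np.log(gaps + 1e-12), 1)

print("optimal base LR by width:", {n: round(float(e), 4) for n, e in zip(widths, etas)})
print(f"fitted rate: |eta*(n) - eta*(inf)| ~ n^{slope:.2f}  (the bound predicts -0.5)")
```

One design note: because the network is linear, the one-step prediction is a cubic polynomial in η, so the minimizer could also be obtained by expanding that polynomial in closed form instead of calling a scalar optimizer; the scalar search is used here only to keep the sketch short. The finite grid of widths and the use of the largest width as a proxy for the limit both limit the precision of the fitted exponent.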
References
Interestingly, the empirical convergence rate seems to match the theoretical prediction of n^{-1/2} up to width n=1024, then becomes much smaller for larger widths. This indicates that our upper bound O(n^{-1/2}) is likely not tight for large widths, and we currently do not have an explanation for this sudden change in convergence rate.