Explain the sudden change in convergence rate of the optimal learning rate at large width under μP
Determine the cause of the observed sudden change, as width grows, in the empirical convergence rate of the optimal learning rate for linear multi-layer perceptrons parametrized with Maximal Update Parametrization (μP) after one gradient step: the rate matches the theoretical O(n^{-1/2}) prediction up to width n=1024 but becomes much smaller at larger widths. Characterize the true asymptotic convergence rate and identify the conditions under which the upper bound O(n^{-1/2}) is not tight.
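To make the measured quantity concrete, the sketch below shows one way such a convergence rate could be estimated empirically. It is not the paper's code: the synthetic data, the three-layer linear network, the particular μP-style scaling (init standard deviations 1/sqrt(d_in), 1/sqrt(n), 1/n with per-layer SGD learning rates η·n, η, η/n), the seed averaging, and the use of the largest width as a stand-in for the infinite-width optimum are all illustrative assumptions. The "optimal learning rate" is taken to be the base rate η that minimizes the loss after a single full-batch gradient step, and the convergence exponent is read off from a log-log fit of |η*(n) − η*(n_max)| against n.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): estimate how
# fast the one-step optimal base learning rate converges as the width n grows.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
d_in, n_samples = 16, 256
X = rng.standard_normal((n_samples, d_in))   # synthetic inputs (assumption)
y = rng.standard_normal((n_samples, 1))      # synthetic targets (assumption)


def make_one_step_loss(n, seed):
    """Return eta -> loss of a width-n linear MLP after one full-batch SGD step."""
    rng = np.random.default_rng(seed)
    # Assumed muP-style scaling for a 3-layer linear MLP (may differ from the paper):
    #   input  W1: init std 1/sqrt(d_in), per-layer LR eta * n
    #   hidden W2: init std 1/sqrt(n),    per-layer LR eta
    #   output W3: init std 1/n,          per-layer LR eta / n
    W1 = rng.standard_normal((d_in, n)) / np.sqrt(d_in)
    W2 = rng.standard_normal((n, n)) / np.sqrt(n)
    W3 = rng.standard_normal((n, 1)) / n

    # One forward/backward pass; the gradients do not depend on eta.
    H1 = X @ W1
    H2 = H1 @ W2
    g = (H2 @ W3 - y) / n_samples            # d(0.5 * MSE) / d(predictions)
    G3 = H2.T @ g                            # dL/dW3
    G2 = H1.T @ (g @ W3.T)                   # dL/dW2
    G1 = X.T @ (g @ (W2 @ W3).T)             # dL/dW1

    def loss(eta):
        W1u = W1 - eta * n * G1
        W2u = W2 - eta * G2
        W3u = W3 - (eta / n) * G3
        pred = (X @ W1u) @ (W2u @ W3u)       # O(n^2) with this multiplication order
        return 0.5 * np.mean((pred - y) ** 2)

    return loss


def optimal_lr(n, n_seeds=3):
    """Base learning rate minimizing the one-step loss, averaged over seeds."""
    etas = []
    for seed in range(n_seeds):
        loss = make_one_step_loss(n, seed)
        # Assumes the one-step loss has a single minimum inside this bracket.
        res = minimize_scalar(loss, bounds=(1e-3, 1e1), method="bounded")
        etas.append(res.x)
    return float(np.mean(etas))


widths = [64, 128, 256, 512, 1024, 2048, 4096]
etas = np.array([optimal_lr(n) for n in widths])
gaps = np.abs(etas[:-1] - etas[-1])          # largest width as proxy for the n -> inf limit
slope, _ = np.polyfit(np.log(widths[:-1]), np.log(gaps + 1e-12), 1)

print("optimal base LR by width:", {n: round(float(e), 4) for n, e in zip(widths, etas)})
print(f"fitted rate: |eta*(n) - eta*(inf)| ~ n^{slope:.2f}  (the bound predicts -0.5)")
```

One design note: because the network is linear, the one-step prediction is a cubic polynomial in η, so the minimizer could also be obtained by expanding that polynomial in closed form instead of calling a scalar optimizer; the scalar search is used here only to keep the sketch short. The finite grid of widths and the use of the largest width as a proxy for the limit both limit the precision of the fitted exponent.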
References
Interestingly, the empirical convergence rate seems to match the theoretical prediction of n^{-1/2} up to width n=1024, then becomes much smaller for larger widths. This indicates that our upper bound O(n^{-1/2}) is likely not tight for large widths, and we currently do not have an explanation for this sudden change in convergence rate.