Extend the μP learning-rate transfer proof to nonlinear MLPs and other optimizers
Establish a rigorous proof of learning-rate transfer with width under the Maximal Update Parametrization (μP) for non-linear multi-layer perceptrons (e.g., with ReLU activations) and for training algorithms beyond gradient descent (e.g., Adam), including the development of proof techniques to control large-width deviations in these settings.
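The sketch below is a minimal, illustrative experiment (not from the paper) of the phenomenon such a proof would formalize: a ReLU MLP trained with Adam under a commonly used μP recipe, where the best base learning rate is expected to stay roughly stable as width grows. The assumed scalings here are 1/fan_in initialization for hidden and readout weights, a 1/width readout multiplier, and an Adam learning rate of base_lr/width for the hidden (matrix-like) weights; the class name `MuMLP`, widths, toy data, and step counts are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

D_IN, D_OUT = 16, 1

class MuMLP(nn.Module):
    """ReLU MLP with an assumed muP-style 1/width readout multiplier."""
    def __init__(self, width):
        super().__init__()
        self.width = width
        self.w_in = nn.Linear(D_IN, width)
        self.w_hid = nn.Linear(width, width, bias=False)
        self.w_out = nn.Linear(width, D_OUT, bias=False)
        # Hidden and readout weights ~ N(0, 1/fan_in); the input layer keeps
        # its width-independent default init (fan_in = D_IN is fixed).
        for w in (self.w_hid.weight, self.w_out.weight):
            nn.init.normal_(w, std=w.shape[1] ** -0.5)

    def forward(self, x):
        h = torch.relu(self.w_in(x))
        h = torch.relu(self.w_hid(h))
        return self.w_out(h) / self.width  # 1/width output multiplier

def final_loss(width, base_lr, steps=200):
    torch.manual_seed(0)
    x = torch.randn(256, D_IN)
    y = torch.sin(x.sum(dim=1, keepdim=True))  # toy regression target
    model = MuMLP(width)
    # Assumed muP rule for Adam: hidden (matrix-like) weights get lr/width,
    # input and readout layers keep the width-independent base lr.
    opt = torch.optim.Adam([
        {"params": model.w_hid.parameters(), "lr": base_lr / width},
        {"params": list(model.w_in.parameters()) + list(model.w_out.parameters()),
         "lr": base_lr},
    ])
    loss = None
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Sweep base learning rates at several widths; under muP the loss-vs-lr
# curves (and hence the best base lr) should be roughly width-independent.
for width in (64, 256, 1024):
    losses = {lr: final_loss(width, lr) for lr in (1e-3, 1e-2, 1e-1)}
    best = min(losses, key=losses.get)
    print(f"width={width:4d}  best base lr={best:g}  losses={losses}")
```

Proving that this empirical stability holds rigorously in the non-linear, non-GD setting is precisely the open question stated above.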
References
While our results are limited to linear networks trained with GD, we believe they can be extended to non-linear MLPs and different optimizers. However, this will likely require different proof machinery, especially when dealing with large-width deviations. We leave this question for future work.
— A Proof of Learning Rate Transfer under μP
(2511.01734 - Hayou, 3 Nov 2025) in Discussion and Limitations