
Extend the μP learning-rate transfer proof to nonlinear MLPs and other optimizers

Establish a rigorous proof of learning-rate transfer with width under Maximal Update Parametrization (μP) for non-linear multi-layer perceptrons with activation functions (e.g., ReLU) and for training algorithms beyond gradient descent (e.g., Adam), including the development of proof techniques to control large-width deviations in these settings.
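As a rough formalization (ours, not necessarily the paper's exact statement): writing $L_t(\eta; n)$ for the training loss after $t$ steps at width $n$ with learning rate $\eta$, learning-rate transfer asks that the optimal learning rate stabilize as width grows,

$$\eta^\star_t(n) \;=\; \arg\min_{\eta>0} L_t(\eta; n), \qquad \eta^\star_t(n) \;\xrightarrow[n\to\infty]{}\; \eta^\star_t \in (0,\infty),$$

so that a learning rate tuned on a narrow model remains (near-)optimal for much wider ones.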


Background

The main theoretical results in the paper address linear networks trained with gradient descent, proving learning-rate transfer under μP at any training step and contrasting it with failures under standard parameterizations.
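For context, μP can be characterized informally by the requirement that, as the width $n \to \infty$, every layer's update moves its features and the network output by $\Theta(1)$, neither vanishing nor blowing up:

$$\Delta h^{(l)}_t = \Theta_n(1) \ \text{coordinatewise for every layer } l, \qquad f_t(x) - f_{t-1}(x) = \Theta_n(1),$$

where $h^{(l)}_t$ denotes the layer-$l$ activations after $t$ steps. Roughly speaking, standard parameterizations violate this balance, which is what drives the optimal learning rate to drift with width.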

Empirical results suggest learning-rate transfer may persist for nonlinear ReLU MLPs trained with Adam, but the authors do not provide a formal proof for these settings.
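A minimal sketch of such an empirical check, assuming a common μTransfer-style instantiation of μP for Adam (hidden and readout matrices get the base learning rate divided by their fan-in, input weights and biases keep the base rate, hidden weights are initialized with variance 1/fan_in, and the readout is zero-initialized); this is our illustrative setup, not necessarily the exact recipe used in the paper:

```python
# Sketch (not the paper's code): learning-rate-transfer sweep for ReLU MLPs
# trained with Adam under a muP-style parametrization (assumptions noted above).
import torch
import torch.nn as nn

def mup_mlp(d_in, width, depth, d_out=1):
    dims = [d_in] + [width] * depth + [d_out]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    net = nn.Sequential(*layers)
    linears = [m for m in net if isinstance(m, nn.Linear)]
    for lin in linears:
        nn.init.normal_(lin.weight, std=lin.in_features ** -0.5)  # var 1/fan_in
        nn.init.zeros_(lin.bias)
    nn.init.zeros_(linears[-1].weight)  # common muP choice: zero-init readout
    return net, linears

def final_loss(width, base_lr, steps=200, d_in=16, depth=3, seed=0):
    torch.manual_seed(seed)
    net, linears = mup_mlp(d_in, width, depth)
    X = torch.randn(512, d_in)
    y = torch.sin(X.sum(dim=1, keepdim=True))  # fixed synthetic regression task
    groups = []
    for j, lin in enumerate(linears):
        # Input layer keeps the base LR; hidden/readout matrices get base_lr / fan_in.
        lr_w = base_lr if j == 0 else base_lr / lin.in_features
        groups.append({"params": [lin.weight], "lr": lr_w})
        groups.append({"params": [lin.bias], "lr": base_lr})
    opt = torch.optim.Adam(groups)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(X) - y) ** 2).mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return ((net(X) - y) ** 2).mean().item()

# Under muP-style scaling, the best base_lr should be roughly width-independent.
for width in [128, 512, 2048]:
    losses = {lr: final_loss(width, lr) for lr in [1e-3, 1e-2, 1e-1, 1.0]}
    print(width, min(losses, key=losses.get), losses)
```

If transfer holds, the arg-min base learning rate printed for each width should roughly coincide; under a standard parametrization it would be expected to shift as the width increases.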

The authors explicitly defer this extension, noting it likely requires different proof machinery to handle large-width deviations, thus making the generalization to nonlinear models and other optimizers an open problem.

References

While our results are limited to linear networks trained with GD, we believe they can be extended to non-linear MLPs and different optimizers. However, this will likely require different proof machinery, especially when dealing with large-width deviations. We leave this question for future work.

A Proof of Learning Rate Transfer under $μ$P (2511.01734 - Hayou, 3 Nov 2025) in Discussion and Limitations