Transfer of SGD learning-rate schedules to adaptive methods
Explain, under a formal optimization framework, why learning-rate schedules designed and analyzed for stochastic gradient descent (SGD), such as linear decay or constant-plus-cooldown, often transfer effectively to adaptive methods like Adam, and derive conditions under which such cross-optimizer schedule transfer is guaranteed.
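To make the notion of "schedule transfer" concrete, the sketch below applies the same multiplicative learning-rate schedule to both SGD and Adam, changing only the base step size and the optimizer. This is a minimal PyTorch illustration, not code from the paper: the toy linear model, the base learning rates, and the helper names `linear_decay` and `constant_plus_cooldown` are all illustrative assumptions.

```python
import torch

def linear_decay(step, total_steps):
    # Linear-decay schedule commonly analyzed for SGD: eta_t = eta_0 * (1 - t / T).
    return max(0.0, 1.0 - step / total_steps)

def constant_plus_cooldown(step, total_steps, cooldown_frac=0.2):
    # Constant learning rate, then a linear cooldown over the last fraction of training.
    cooldown_start = int((1.0 - cooldown_frac) * total_steps)
    if step < cooldown_start:
        return 1.0
    return max(0.0, 1.0 - (step - cooldown_start) / (total_steps - cooldown_start))

# Toy models and illustrative (untuned) base learning rates.
model_sgd = torch.nn.Linear(10, 1)
model_adam = torch.nn.Linear(10, 1)
opt_sgd = torch.optim.SGD(model_sgd.parameters(), lr=0.1)
opt_adam = torch.optim.Adam(model_adam.parameters(), lr=1e-3)

# The schedule is reused verbatim across optimizers as a multiplier on the base LR.
# Swapping in constant_plus_cooldown for linear_decay changes the schedule, not the setup.
T = 1000
sched_sgd = torch.optim.lr_scheduler.LambdaLR(opt_sgd, lambda t: linear_decay(t, T))
sched_adam = torch.optim.lr_scheduler.LambdaLR(opt_adam, lambda t: linear_decay(t, T))

x = torch.randn(32, 10)
y = torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

for step in range(T):
    for model, opt, sched in ((model_sgd, opt_sgd, sched_sgd),
                              (model_adam, opt_adam, sched_adam)):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        sched.step()
```

The open question asks why this verbatim reuse of an SGD-derived multiplier tends to work well for Adam, and under what formal conditions it is guaranteed to.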
References
An open question is why learning rate schedules designed and analyzed for SGD often transfer effectively to adaptive methods like Adam.
— Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale
(2512.18373 - Nagwekar, 20 Dec 2025) in Subsection “Interplay with μP and Optimizer Choice” within Section “Learning Rate Schedules”