Transfer of SGD learning-rate schedules to adaptive methods

Explain, under a formal optimization framework, why learning-rate schedules designed and analyzed for stochastic gradient descent (such as linear decay or constant-plus-cooldown) often transfer effectively to adaptive methods like Adam, and derive conditions under which such cross-optimizer schedule transfer is guaranteed.
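
One way to make the question concrete (the notation below is an assumption of this write-up, not taken from the paper) is to view both optimizers as driven by the same scalar schedule and ask when guarantees that depend only on the schedule's shape carry over:

    % Assumed notation: g_t is the stochastic gradient, \hat m_t and \hat v_t are Adam's
    % bias-corrected first and second moment estimates, and \varepsilon is its stability constant.
    \[
      \text{SGD:}\quad x_{t+1} = x_t - \eta_t\, g_t,
      \qquad
      \text{Adam:}\quad x_{t+1} = x_t - \eta_t\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon},
      \qquad
      \eta_t = \eta_{\max}\, s(t/T).
    \]
    \[
      s(u) = 1 - u \quad \text{(linear decay)},
      \qquad
      s(u) = \min\{1,\ (1 - u)/c\} \quad \text{(constant + cooldown over the last } cT \text{ steps)}.
    \]

Schedule transfer then asks: for which shapes s, and under which assumptions on the stochastic gradients, does a convergence guarantee proved for the SGD recursion as a function of s alone remain valid (up to constants) for the Adam recursion?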

Background

The paper surveys common schedule families (warmup, stable phase, decay) and recent theoretical results supporting linear decay and constant+cooldown for first-order methods. It then observes that practitioners routinely apply these schedules to adaptive optimizers, despite differences in update geometry and noise handling.
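
As a concrete illustration of that practice, here is a minimal PyTorch sketch (the model, step counts, and base learning rates are illustrative assumptions, not values from the paper) in which one warmup/stable/cooldown multiplier is handed unchanged to both SGD and Adam:

    import torch

    # Illustrative step counts for the three phases (assumed, not from the paper).
    warmup_steps, stable_steps, cooldown_steps = 100, 800, 100
    total_steps = warmup_steps + stable_steps + cooldown_steps

    def schedule(step):
        """Multiplier on the base LR: linear warmup, constant plateau, linear cooldown."""
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        if step < warmup_steps + stable_steps:
            return 1.0
        return max((total_steps - step) / cooldown_steps, 0.0)

    model = torch.nn.Linear(10, 1)  # stand-in model

    # The schedule shape is identical; only the base learning rate is retuned per optimizer.
    for opt_cls, base_lr in [(torch.optim.SGD, 1e-1), (torch.optim.Adam, 1e-3)]:
        opt = opt_cls(model.parameters(), lr=base_lr)
        sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=schedule)
        for step in range(total_steps):
            opt.zero_grad()
            loss = model(torch.randn(32, 10)).pow(2).mean()  # synthetic objective
            loss.backward()
            opt.step()
            sched.step()

The point of the sketch is that nothing Adam-specific enters the schedule: the multiplier s(t/T) is defined once and reused, which is exactly the transfer the question above asks to justify.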

This leaves a theoretical gap: there is no rigorous explanation of why SGD-designed schedules work for Adam and related methods, and no characterization of when schedule transfer should be expected to hold.
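
One frequently invoked heuristic, stated here only as an assumption and not as a result from the paper, is that Adam's step is SGD preconditioned by a diagonal matrix that often varies slowly late in training; a rigorous answer would presumably formalize a condition of roughly this shape:

    % Heuristic sketch only (not from the paper). Adam's update as preconditioned SGD:
    \[
      x_{t+1} = x_t - \eta_t\, P_t\, \hat m_t,
      \qquad
      P_t = \mathrm{diag}\bigl(\sqrt{\hat v_t} + \varepsilon\bigr)^{-1}.
    \]
    % Candidate condition: if along the trajectory there are constants 0 < \mu \le L with
    % \mu I \preceq P_t \preceq L I and P_{t+1} \approx P_t for all t, then Adam behaves like
    % SGD with an effective step size in [\mu\,\eta_t, L\,\eta_t], so analyses that depend
    % only on the schedule shape could plausibly carry over. Making this precise (and
    % identifying when it fails) is the open question stated above.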

References

"An open question is why learning rate schedules designed and analyzed for SGD often transfer effectively to adaptive methods like Adam."

Nagwekar, "Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale" (arXiv:2512.18373, 20 Dec 2025), Subsection "Interplay with μP and Optimizer Choice", Section "Learning Rate Schedules".