The Road Less Scheduled (2405.15682v4)

Published 24 May 2024 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.

Summary

  • The paper introduces Schedule-Free learning, a novel method that replaces preset learning rate schedules with a specific iterate averaging and interpolation strategy to achieve optimal convergence.
  • The approach uses a flexible momentum-like parameter, enabling performance that matches or outperforms highly tuned schedules across 28 diverse machine learning tasks.
  • A new online-to-batch conversion theorem unifies prior averaging methods, ensuring optimal worst-case convergence rates for non-smooth convex functions without increasing computational or memory costs.

This paper, "The Road Less Scheduled" (2405.15682), addresses a long-standing practical challenge in machine learning optimization: the reliance on learning rate schedules that require knowing the total training duration ($T$) in advance. While convex optimization theory often suggests iterate averaging methods like Polyak-Ruppert achieve optimal convergence rates, empirical practice strongly favors using the last iterate of gradient descent with carefully tuned learning rate schedules (like cosine decay or step decay), despite theoretical gaps. The need to pre-specify $T$ for schedules is a significant practical limitation, as the optimal training time is often unknown.

The authors propose a novel optimization approach called Schedule-Free (SF) learning that aims to bridge this theory-practice gap. The core idea is to replace learning rate schedules entirely with a specific form of iterate averaging combined with an interpolated point for gradient evaluation. The method maintains three sequences of parameters:

  1. $z_t$: A fast-moving sequence updated similarly to standard SGD or Adam (e.g., $z_{t+1} = z_t - \gamma \nabla f(y_t, \zeta_t)$).
  2. $x_t$: An equal-weighted average of the $z_t$ sequence up to time $t$ (specifically, $x_{t+1} = (1-c_{t+1})x_t + c_{t+1}z_{t+1}$ with $c_{t+1} = 1/(t+1)$ in the basic version, or weighted by $\gamma_t^2$ during warmup).
  3. $y_t$: The point where the gradient is computed, defined as an interpolation between $z_t$ and $x_t$: $y_t = (1-\beta)z_t + \beta x_t$.

The method introduces a single momentum-like hyperparameter $\beta \in [0,1]$. Setting $\beta = 0$ recovers Polyak-Ruppert averaging (gradient evaluated at $z_t$), and $\beta = 1$ recovers Primal averaging (gradient evaluated at $x_t$). The authors find that values of $\beta$ around $0.9$ work well in practice, analogous to typical momentum values.
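To make the updates concrete, the following is a minimal sketch of the basic Schedule-Free SGD loop described above, with a constant learning rate and equal-weight averaging. The oracle `grad_fn`, the NumPy types, and all variable names are illustrative assumptions, not the reference implementation.

```python
# Minimal sketch of basic Schedule-Free SGD (no warmup weighting, no weight decay).
# `grad_fn(y)` is an assumed stochastic-gradient oracle; names are illustrative.
import numpy as np

def schedule_free_sgd(grad_fn, x0, lr=1.0, beta=0.9, steps=1000):
    z = np.asarray(x0, dtype=float).copy()  # fast-moving SGD-like sequence z_t
    x = z.copy()                            # running average x_t of the z iterates
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x     # interpolation point y_t for the gradient
        z = z - lr * grad_fn(y)             # SGD-style step on the z sequence
        c = 1.0 / (t + 1)                   # equal-weight averaging coefficient c_{t+1}
        x = (1.0 - c) * x + c * z           # x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}
    return x                                # the averaged sequence is what gets evaluated
```

Setting `beta=0.0` makes `y` coincide with `z` (Polyak-Ruppert averaging), while `beta=1.0` makes it coincide with `x` (Primal averaging), matching the two limits described above.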

A key theoretical contribution is a new online-to-batch conversion theorem (Theorem 1 and the general Theorem 3 in Appendix A/B). This theorem shows how bounds on the "regret" of the $z_t$ sequence in online convex optimization translate directly into convergence guarantees for the averaged sequence $x_T$ in stochastic optimization. The theorem unifies previous online-to-batch results, including those corresponding to Polyak averaging, Primal averaging, and even the recent linear decay schedule theory. For Schedule-Free SGD, the theory shows that it achieves the optimal worst-case convergence rate of $\mathcal{O}(DG/\sqrt{T})$ for non-smooth convex functions, regardless of the choice of $\beta \in [0,1]$. This is a notable theoretical advantage over traditional momentum, which can worsen worst-case rates in this setting. The paper also explores generalizations using Bregman divergences (Appendix C) and shows potential for accelerated rates with optimistic gradient methods (Appendix D) and improved rates for strongly convex problems (Appendix E).
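As a rough illustration of the shape such conversions typically take (with uniform weighting; the paper's exact statement and conditions differ in its details), a regret bound for the $z_t$ iterates against linearized losses $g_t$ evaluated at $y_t$ yields a rate for the averaged point $x_T$:

```latex
% Hedged sketch of the online-to-batch shape; g_t is the stochastic gradient at y_t,
% x_star a minimizer. See the paper for the precise theorem and its conditions.
\[
  \operatorname{Regret}_T(x_\star) = \sum_{t=1}^{T} \langle g_t,\, z_t - x_\star \rangle,
  \qquad
  \mathbb{E}\bigl[f(x_T)\bigr] - f(x_\star) \;\le\; \frac{\mathbb{E}\bigl[\operatorname{Regret}_T(x_\star)\bigr]}{T},
\]
```

so the classical $\mathcal{O}(DG\sqrt{T})$ regret of online gradient descent translates into the optimal $\mathcal{O}(DG/\sqrt{T})$ rate for $x_T$.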

Empirically, Schedule-Free learning demonstrates state-of-the-art performance across 28 diverse problems, from convex logistic regression to large-scale deep learning tasks in computer vision, natural language processing, recommendation systems, and medical imaging. Experiments show that SF methods consistently match or outperform highly tuned learning rate schedules (cosine decay, linear decay) and significantly outperform traditional averaging methods. A key finding is that Schedule-Free methods track the Pareto frontier of loss versus training time during a single run, providing good performance at any stopping point without pre-specification. The experiments also show that Schedule-Free momentum ($\beta < 1$) enables the use of larger learning rates, which may contribute to faster convergence.

Practical Implementation Details:

  • Memory: Schedule-Free variants have the same memory requirements as their base optimizers. For example, Schedule-Free SGD needs to store $x$ and $z$, similar to how standard SGD with momentum stores the current parameters and a momentum buffer. The intermediate point $y_t$ does not need separate storage, as it can be computed on the fly from $x_t$ and $z_t$.
  • Batch Normalization: Models using BatchNorm layers require special handling. Since the gradient is computed at $y_t$, the running statistics (mean and variance) used by BatchNorm layers need to reflect the $x_t$ parameters at evaluation time. This can be achieved by running a small number of training batches through the model using the $x_t$ parameters before each evaluation (a PyTorch-style sketch follows this list).
  • Warmup: Learning rate warmup is still beneficial for Schedule-Free methods in deep learning. When using warmup, the authors found it improves performance to weight the averaging coefficient $c_{t+1}$ by the square of the current learning rate $\gamma_t$, using $c_{t+1} = \gamma_t^2 / \sum_{i=1}^{t} \gamma_i^2$. This weighting strategy is motivated by the general theory (Theorem 3); see the second sketch after this list.
  • Weight Decay: Weight decay can be applied to either the $y_t$ or the $z_t$ sequence. Applying decay to $y_t$ aligns with the interpretation of weight decay as an L2 regularization term added to the loss function.
  • Hyperparameters: Schedule-Free learning removes the need for schedule-specific hyperparameters (such as the total number of steps $T$ for cosine decay). It introduces the $\beta$ parameter, but the authors show that a default value of 0.9 works well across many tasks, requiring minimal tuning. However, users still need to tune the base learning rate $\gamma$ and weight decay, and the optimal values may differ from those used with scheduled optimizers.
  • An open-source implementation is provided for researchers and practitioners to use.
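The BatchNorm recalibration mentioned above can be done with a short pass over training data after the averaged parameters $x$ have been loaded into the model. Below is a hedged PyTorch-style sketch; `model`, `calib_loader`, and the prior parameter swap are illustrative assumptions, not the paper's or the library's API.

```python
# Hedged PyTorch-style sketch: refresh BatchNorm running statistics so they reflect
# the averaged parameters x before evaluation. Assumes the model's weights have
# already been set to x; `calib_loader` is an assumed DataLoader of training batches.
import torch

@torch.no_grad()
def refresh_bn_stats(model, calib_loader, num_batches=50):
    # Reset running mean/variance so stale statistics (accumulated at y_t) are discarded.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.reset_running_stats()
    model.train()  # BatchNorm only updates running stats in train mode
    for i, (inputs, _targets) in enumerate(calib_loader):
        if i >= num_batches:
            break
        model(inputs)  # forward passes only; no gradients or optimizer steps
    model.eval()
```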
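And here is how the warmup-weighted averaging coefficient from the Warmup bullet can be folded into the basic loop shown earlier; the linear warmup shape, `grad_fn`, and all names are illustrative assumptions.

```python
# Hedged sketch: Schedule-Free SGD with learning-rate warmup, where the averaging
# coefficient is weighted by the squared learning rate,
# c_{t+1} = gamma_t^2 / sum_{i<=t} gamma_i^2, as described in the Warmup bullet.
import numpy as np

def schedule_free_sgd_warmup(grad_fn, x0, base_lr=1.0, beta=0.9,
                             warmup_steps=100, steps=1000):
    z = np.asarray(x0, dtype=float).copy()
    x = z.copy()
    lr_sq_sum = 0.0
    for t in range(1, steps + 1):
        lr = base_lr * min(1.0, t / warmup_steps)  # linear warmup, then constant (an assumption)
        y = (1.0 - beta) * z + beta * x            # gradient evaluation point y_t
        z = z - lr * grad_fn(y)
        lr_sq_sum += lr ** 2
        c = lr ** 2 / lr_sq_sum                    # warmup-weighted coefficient c_{t+1}
        x = (1.0 - c) * x + c * z
    return x
```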

In summary, Schedule-Free learning offers a promising alternative to learning rate schedules, providing competitive or superior empirical performance without the need to know the training duration beforehand. It is presented as a viable drop-in replacement for schedules in various machine learning tasks, with comparable computational and memory costs.
