The Road Less Scheduled (2405.15682v4)
Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly outperformed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing schedules entirely, while exhibiting state-of-the-art performance relative to schedules across a wide family of problems, from convex optimization to large-scale deep learning. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.
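As a concrete illustration of the scheduling/averaging connection described above, below is a minimal sketch of a Schedule-Free SGD-style update on a toy quadratic, written in the three-sequence (y, z, x) form: gradients are evaluated at an interpolation y of the base iterate z and its running average x. The toy objective, variable names, and hyper-parameter values are illustrative assumptions, not the reference implementation; see the linked repository for the official Schedule-Free SGD and AdamW optimizers.

```python
# Minimal sketch of a Schedule-Free SGD-style update on a toy quadratic.
# Assumptions: the toy objective, step size, and interpolation constant are
# illustrative only; this is not the official PyTorch optimizer.
import torch

torch.manual_seed(0)
target = torch.tensor([3.0, -2.0])

def stochastic_grad(w):
    # Noisy gradient of f(w) = 0.5 * ||w - target||^2
    return (w - target) + 0.1 * torch.randn_like(w)

lr, beta, steps = 0.5, 0.9, 200
z = torch.zeros(2)   # base iterate, updated by plain gradient steps
x = z.clone()        # running average of the z iterates; used for evaluation

for t in range(1, steps + 1):
    y = (1 - beta) * z + beta * x    # gradient is taken at the interpolated point y
    z = z - lr * stochastic_grad(y)  # constant step size: no schedule, no stopping time T
    c = 1.0 / t                      # uniform averaging weight
    x = (1 - c) * x + c * z          # online average of the z sequence

print("averaged iterate x:", x)      # x (not z) is the model used at test time
```

The point of the sketch is that a fixed step size plus online averaging of the z sequence plays the role a decaying schedule would otherwise play, without requiring the stopping step T in advance.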