
No learning rates needed: Introducing SALSA -- Stable Armijo Line Search Adaptation (2407.20650v1)

Published 30 Jul 2024 in cs.LG and cs.AI

Abstract: In recent studies, line search methods have been demonstrated to significantly enhance the performance of conventional stochastic gradient descent techniques across various datasets and architectures, while making an otherwise critical choice of learning rate schedule superfluous. In this paper, we identify problems in current state-of-the-art line search methods, propose enhancements, and rigorously assess their effectiveness. Furthermore, we evaluate these methods on datasets orders of magnitude larger, and on more complex data domains, than previously done. More specifically, we enhance the Armijo line search method by speeding up its computation and incorporating a momentum term into the Armijo criterion, making it better suited for stochastic mini-batching. Our optimization approach outperforms both the previous Armijo implementation and a tuned learning rate schedule for the Adam and SGD optimizers. Our evaluation covers a diverse range of architectures, such as Transformers, CNNs, and MLPs, as well as data domains, including NLP and image data. Our work is publicly available as a Python package, which provides a simple PyTorch optimizer.
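
To make the abstract's core idea concrete, below is a minimal sketch, not the authors' released package or its API, of a backtracking Armijo line search for SGD in which the acceptance test uses exponentially smoothed ("momentum") estimates of the loss and of the squared gradient norm, so the criterion is less sensitive to mini-batch noise. The parameter names `c`, `beta`, `decay`, `lr_max`, and the `state` dict are assumptions made for this illustration.

```python
# Illustrative sketch of a momentum-smoothed Armijo line search step for SGD.
# Not the SALSA package itself; hyperparameter names are assumptions.
import torch


def armijo_sgd_step(params, closure, state, lr_max=1.0, c=0.1, beta=0.9,
                    decay=0.5, max_backtracks=10):
    """One SGD step whose step size is chosen by a smoothed Armijo search.

    params  : list of tensors with requires_grad=True.
    closure : re-evaluates and returns the current mini-batch loss.
    state   : dict persisting the smoothed estimates across steps.
    """
    loss = closure()
    loss.backward()
    grads = [p.grad.detach().clone() for p in params]
    grad_sq = sum((g * g).sum().item() for g in grads)

    # Momentum-smoothed quantities entering the Armijo criterion.
    state["loss"] = beta * state.get("loss", loss.item()) + (1 - beta) * loss.item()
    state["grad_sq"] = beta * state.get("grad_sq", grad_sq) + (1 - beta) * grad_sq

    eta = lr_max
    with torch.no_grad():
        old = [p.detach().clone() for p in params]
        for _ in range(max_backtracks):
            # Trial step: w <- w0 - eta * g
            for p, w0, g in zip(params, old, grads):
                p.copy_(w0 - eta * g)
            new_loss = closure().item()
            # Sufficient-decrease test against the smoothed estimates
            # rather than the raw, noisy mini-batch values.
            if new_loss <= state["loss"] - c * eta * state["grad_sq"]:
                break
            eta *= decay  # shrink the step and try again
    for p in params:
        p.grad = None
    return eta
```

In a training loop one would call `armijo_sgd_step(list(model.parameters()), closure, state)` once per mini-batch, with `closure` recomputing the loss on the current batch; the accepted step size then plays the role that a hand-tuned learning rate schedule would otherwise play.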
