Improving Line Search Methods for Large Scale Neural Network Training (2403.18519v1)

Published 27 Mar 2024 in cs.LG and cs.AI

Abstract: In recent studies, line search methods have shown significant improvements in the performance of traditional stochastic gradient descent techniques, eliminating the need for a specific learning rate schedule. In this paper, we identify existing issues in state-of-the-art line search methods, propose enhancements, and rigorously evaluate their effectiveness. We test these methods on larger datasets and more complex data domains than in previous work. Specifically, we improve the Armijo line search by integrating the momentum term from Adam into its search direction, enabling efficient large-scale training, a task that was previously prone to failure with Armijo line search methods. Our optimization approach outperforms both the previous Armijo implementation and tuned learning rate schedules for Adam. Our evaluation focuses on Transformers and CNNs in the domains of NLP and image data. Our work is publicly available as a Python package, which provides a hyperparameter-free PyTorch optimizer.
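
The sketch below illustrates the core idea the abstract describes: a backtracking Armijo line search evaluated along a momentum-based search direction, written in PyTorch. It is a minimal illustration, not the authors' released package; the names armijo_search, closure, and direction, as well as the default constants, are assumptions made for the example.

import torch

def armijo_search(params, closure, direction, lr_init=1.0, c=0.1, shrink=0.5, max_steps=20):
    """Backtracking Armijo line search along `direction` (one tensor per parameter).

    `closure` recomputes the mini-batch loss (forward pass only). In the paper's
    setting the direction would be built from the Adam momentum term; any descent
    direction works for this sketch.
    """
    with torch.no_grad():
        f0 = closure()
        # Directional derivative g^T d; it must be negative for a descent direction.
        g_dot_d = sum((p.grad * d).sum() for p, d in zip(params, direction))
        start = [p.detach().clone() for p in params]

        lr = lr_init
        for _ in range(max_steps):
            # Trial step: x + lr * d
            for p, x0, d in zip(params, start, direction):
                p.copy_(x0 + lr * d)
            # Armijo sufficient-decrease condition: f(x + lr*d) <= f(x) + c * lr * g^T d
            if closure() <= f0 + c * lr * g_dot_d:
                return lr
            lr *= shrink
        return lr  # accept the smallest trial step if the condition never held

In practice the search direction would be the negative (running) momentum estimate rather than the raw negative gradient; per the abstract, the authors package this logic as a hyperparameter-free PyTorch optimizer, so no step-size schedule needs to be tuned by hand.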
