Faster Convergence for Transformer Fine-tuning with Line Search Methods (2403.18506v1)
Abstract: Recent work has shown that line search methods greatly improve the performance of traditional stochastic gradient descent methods on a variety of datasets and architectures [1], [2]. In this work we extend line search methods to the highly popular Transformer architecture and to dataset domains from natural language processing. More specifically, we combine the Armijo line search with the Adam optimizer and extend it by subdividing the network architecture into sensible units and performing the line search separately on each of these local units. Our optimization method outperforms the traditional Adam optimizer, achieving significant performance improvements for small datasets or small training budgets while performing equally well or better in the other tested cases. Our work is publicly available as a Python package that provides a hyperparameter-free PyTorch optimizer compatible with arbitrary network architectures.
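To make the core step concrete, the following is a minimal sketch of how an Armijo backtracking line search can be wrapped around an Adam-preconditioned update direction in PyTorch. It is not the authors' released package: the function name `armijo_adam_step`, the closure-style `loss_fn`, the `state` dictionary, and all default hyperparameters are illustrative assumptions, and the sketch performs a single global line search rather than the paper's per-unit (local) variant, which would run the same backtracking loop separately for each parameter group (e.g., per Transformer layer).

```python
# Minimal sketch (assumed names, not the authors' released package):
# one optimization step that takes an Adam-preconditioned direction and
# then backtracks the step size eta until the Armijo condition
#   f(theta - eta * d) <= f(theta) - c * eta * <grad f(theta), d>
# holds on the current mini-batch.
import torch


def armijo_adam_step(params, loss_fn, state, eta_max=1.0, c=0.1,
                     shrink=0.5, betas=(0.9, 0.999), eps=1e-8,
                     max_backtracks=10):
    params = list(params)
    for p in params:
        p.grad = None
    loss = loss_fn()                     # mini-batch forward pass
    loss.backward()
    grads = [p.grad.detach().clone() for p in params]

    # Adam moment estimates and bias-corrected search direction d.
    state["t"] += 1
    t = state["t"]
    dirs = []
    for g, m, v in zip(grads, state["m"], state["v"]):
        m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
        v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
        m_hat = m / (1 - betas[0] ** t)
        v_hat = v / (1 - betas[1] ** t)
        dirs.append(m_hat / (v_hat.sqrt() + eps))

    # Directional derivative term used by the Armijo condition.
    g_dot_d = sum((g * d).sum() for g, d in zip(grads, dirs))

    eta = eta_max
    orig = [p.detach().clone() for p in params]
    with torch.no_grad():
        for _ in range(max_backtracks):
            for p, o, d in zip(params, orig, dirs):
                p.copy_(o - eta * d)     # trial point theta - eta * d
            new_loss = loss_fn()         # loss on the same mini-batch
            if new_loss <= loss - c * eta * g_dot_d:
                break                    # sufficient decrease reached
            eta *= shrink                # otherwise shrink eta and retry
    return new_loss.item(), eta          # params keep the last trial eta


# Usage sketch (model, x, y are assumed to exist):
# state = {"t": 0,
#          "m": [torch.zeros_like(p) for p in model.parameters()],
#          "v": [torch.zeros_like(p) for p in model.parameters()]}
# closure = lambda: torch.nn.functional.cross_entropy(model(x), y)
# new_loss, eta = armijo_adam_step(model.parameters(), closure, state)
```

Each backtracking iteration costs an extra forward pass on the current mini-batch, so this kind of step-size search trades additional compute per step for removing the learning rate as a tunable hyperparameter.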
References:
- S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien, “Painless stochastic gradient: Interpolation, line-search, and convergence rates,” NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
- S. Vaswani, F. Kunstner, I. H. Laradji, S. Y. Meng, M. Schmidt, and S. Lacoste-Julien, “Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search),” CoRR, vol. abs/2006.06835, 2020.
- M. Mahsereci and P. Hennig, “Probabilistic line searches for stochastic optimization,” Advances in Neural Information Processing Systems, vol. 28, 2015.
- R. Bollapragada, J. Nocedal, D. Mudigere, H.-J. Shi, and P. T. P. Tang, “A progressive batching L-BFGS method for machine learning,” in International Conference on Machine Learning, pp. 620–629, PMLR, 2018.
- C. Paquette and K. Scheinberg, “A stochastic line search method with expected complexity analysis,” SIAM Journal on Optimization, vol. 30, no. 1, pp. 349–376, 2020.
- L. Armijo, “Minimization of functions having Lipschitz continuous first partial derivatives,” Pacific Journal of Mathematics, vol. 16, no. 1, pp. 1–3, 1966.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2015.
- F. Kunstner, J. Chen, J. W. Lavington, and M. Schmidt, “Noise is not the main factor behind the gap between SGD and Adam on transformers, but sign descent might be,” in The Eleventh International Conference on Learning Representations, 2023.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers) (J. Burstein, C. Doran, and T. Solorio, eds.), pp. 4171–4186, Association for Computational Linguistics, 2019.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, (Brussels, Belgium), pp. 353–355, Association for Computational Linguistics, Nov. 2018.
- H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
- J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 61, pp. 2121–2159, 2011.
- G. Hinton, N. Srivastava, and K. Swersky, “Lecture notes: Neural networks for machine learning,” 2014.
- K. Nar and S. S. Sastry, “Step size matters in deep learning,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, (Red Hook, NY, USA), pp. 3440–3448, Curran Associates Inc., 2018.
- Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
- A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. rep., University of Toronto, 2009.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
- G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017.
- P. Kenneweg, A. Schulz, S. Schröder, and B. Hammer, “Intelligent learning rate distribution to reduce catastrophic forgetting in transformers,” in Intelligent Data Engineering and Automated Learning – IDEAL 2022 (H. Yin, D. Camacho, and P. Tino, eds.), (Cham), pp. 252–261, Springer International Publishing, 2022.
- L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” in International Conference on Learning Representations, 2020.
- K. Chandra, A. Xie, J. Ragan-Kelley, and E. Meijer, “Gradient descent: The ultimate optimizer,” in Advances in Neural Information Processing Systems (A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, eds.), 2022.
- B. Singh, S. De, Y. Zhang, T. Goldstein, and G. Taylor, “Layer-specific adaptive learning rates for deep networks,” Oct. 2015.
- G. B. Arous, R. Gheissari, and A. Jagannath, “High-dimensional limit theorems for SGD: Effective dynamics and critical scaling,” in Advances in Neural Information Processing Systems (A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, eds.), 2022.
- L. Galli, A. Galligari, and M. Sciandrone, “A unified convergence framework for nonmonotone inexact decomposition methods,” Computational Optimization and Applications, vol. 75, no. 1, pp. 113–144, 2020.
- T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “HuggingFace’s Transformers: State-of-the-art natural language processing,” 2019.