A Stochastic Quasi-Newton Method for Non-convex Optimization with Non-uniform Smoothness (2403.15244v2)
Abstract: Classical convergence analyses for optimization algorithms rely on the widely adopted uniform smoothness assumption. However, recent experimental studies have demonstrated that many machine learning problems exhibit non-uniform smoothness, meaning the smoothness factor is a function of the model parameter rather than a universal constant. In particular, the smoothness has been observed to grow with the gradient norm along the training trajectory. Motivated by this phenomenon, the recently introduced $(L_0, L_1)$-smoothness is a more general notion than traditional $L$-smoothness that captures this positive relationship between smoothness and gradient norm. Under this type of non-uniform smoothness, existing work has designed stochastic first-order algorithms that use gradient clipping to attain the optimal $\mathcal{O}(\epsilon^{-3})$ sample complexity for finding an $\epsilon$-approximate first-order stationary point. Nevertheless, quasi-Newton methods remain largely unstudied in this setting. Given the higher accuracy and greater robustness of quasi-Newton methods, in this paper we propose a fast stochastic quasi-Newton method for problems with non-uniform smoothness. By leveraging gradient clipping and variance reduction, our algorithm achieves the best-known $\mathcal{O}(\epsilon^{-3})$ sample complexity and enjoys a convergence speedup with simple hyperparameter tuning. Our numerical experiments show that the proposed algorithm outperforms state-of-the-art approaches.
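
As background, here is a minimal sketch of the definitions the abstract relies on. These are standard statements from the $(L_0, L_1)$-smoothness literature (as popularized in work on why gradient clipping accelerates training), not quotations from this paper's own text, and the clipping threshold $\gamma$ below is an illustrative symbol rather than the paper's notation.

```latex
% (L_0, L_1)-smoothness, twice-differentiable form: the local smoothness (Hessian norm)
% may grow linearly with the gradient norm; setting L_1 = 0 recovers classical L-smoothness.
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \|\nabla f(x)\| \qquad \text{for all } x .
\]
% Gradient clipping, the operation commonly paired with this assumption:
% a stochastic gradient g is rescaled so its norm never exceeds the threshold \gamma > 0.
\[
  \operatorname{clip}_{\gamma}(g) \;=\; \min\!\Bigl\{1, \tfrac{\gamma}{\|g\|}\Bigr\}\, g .
\]
% An \epsilon-approximate first-order stationary point is any x with
% \mathbb{E}\,\|\nabla f(x)\| \le \epsilon; the O(\epsilon^{-3}) figure counts
% stochastic gradient samples needed to produce such a point.
```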