A Stochastic Quasi-Newton Method for Non-convex Optimization with Non-uniform Smoothness (2403.15244v2)

Published 22 Mar 2024 in cs.LG and math.OC

Abstract: Classical convergence analyses for optimization algorithms rely on the widely adopted uniform smoothness assumption. However, recent experimental studies have demonstrated that many machine learning problems exhibit non-uniform smoothness, meaning the smoothness factor is a function of the model parameter rather than a universal constant. In particular, it has been observed that the smoothness grows with the gradient norm along the training trajectory. Motivated by this phenomenon, the recently introduced $(L_0, L_1)$-smoothness is a more general notion than traditional $L$-smoothness that captures this positive relationship between smoothness and gradient norm. Under this type of non-uniform smoothness, existing work has designed stochastic first-order algorithms that use gradient clipping to obtain the optimal $\mathcal{O}(\epsilon^{-3})$ sample complexity for finding an $\epsilon$-approximate first-order stationary solution. Quasi-Newton methods, however, remain understudied in this setting. Given the higher accuracy and greater robustness of quasi-Newton methods, in this paper we propose a fast stochastic quasi-Newton method for problems with non-uniform smoothness. Leveraging gradient clipping and variance reduction, our algorithm achieves the best-known $\mathcal{O}(\epsilon^{-3})$ sample complexity and enjoys a convergence speedup with simple hyperparameter tuning. Our numerical experiments show that the proposed algorithm outperforms state-of-the-art approaches.
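For reference, the $(L_0, L_1)$-smoothness condition mentioned above is commonly stated in the gradient-clipping literature as a gradient-norm-dependent bound on the Hessian, $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$. Setting $L_1 = 0$ recovers standard $L$-smoothness, while $L_1 > 0$ lets the local smoothness grow with the gradient norm, which is what motivates clipping the step size.

To make the ingredients named in the abstract concrete, the sketch below combines gradient clipping, a recursive (SARAH/SPIDER-style) variance-reduced gradient estimator, and a crude diagonal stand-in for a quasi-Newton preconditioner. All names (clip, sketch_run, grad_fn) and the diagonal curvature update are hypothetical choices made for illustration; this is not the algorithm proposed in the paper.

import numpy as np

def clip(g, gamma):
    # Rescale g so its Euclidean norm never exceeds gamma.
    norm = np.linalg.norm(g)
    return g if norm <= gamma else (gamma / norm) * g

def sketch_run(grad_fn, x0, n_iters=100, lr=0.1, gamma=1.0, eps=1e-8):
    # grad_fn(x) returns a stochastic gradient estimate at x (e.g., a minibatch gradient).
    x = x0.copy()
    v = grad_fn(x)                    # initial gradient estimate (a large batch in practice)
    h = np.ones_like(x)               # diagonal curvature estimate standing in for a BFGS matrix
    for _ in range(n_iters):
        x_new = x - lr * clip(v / (h + eps), gamma)   # preconditioned, clipped step
        g_new, g_old = grad_fn(x_new), grad_fn(x)     # same minibatch for both points in practice
        v = v + (g_new - g_old)       # recursive variance-reduced gradient estimator
        s, y = x_new - x, g_new - g_old
        h = np.abs(y) / (np.abs(s) + eps)             # secant-inspired diagonal curvature update
        x = x_new
    return x

The clipping threshold gamma plays the role the $(L_0, L_1)$ analysis assigns to controlling the step length when the gradient, and hence the local smoothness, is large; in a full quasi-Newton method the diagonal h would be replaced by a limited-memory Hessian approximation.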
