On the fast convergence of minibatch heavy ball momentum (2206.07553v4)

Published 15 Jun 2022 in cs.LG, cs.DS, cs.NA, math.NA, math.OC, and stat.ML

Abstract: Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size. The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical illustrations demonstrating that our bounds are reasonably sharp.
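The method analyzed is the plain heavy ball update applied to minibatch stochastic gradients of a quadratic (least-squares) objective, which for a consistent linear system can be read as a minibatch randomized Kaczmarz iteration with momentum. Below is a minimal NumPy sketch of that iteration; the step size `alpha`, momentum `beta`, batch size, and iteration count are illustrative assumptions, not the paper's tuned parameters or exact formulation.

```python
# Minimal sketch of minibatch heavy ball momentum on a least-squares problem:
#   x_{k+1} = x_k - alpha * g_k + beta * (x_k - x_{k-1}),
# where g_k is a minibatch stochastic gradient of 0.5 * ||A x - b||^2.
# Parameters below are illustrative, not the paper's tuned choices.
import numpy as np

def minibatch_heavy_ball(A, b, alpha=0.1, beta=0.9, batch_size=64, n_iters=3000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x_prev = np.zeros(n)
    x = np.zeros(n)
    for _ in range(n_iters):
        idx = rng.choice(m, size=batch_size, replace=False)   # sample a row minibatch
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size   # minibatch gradient estimate
        x_next = x - alpha * grad + beta * (x - x_prev)        # heavy ball update
        x_prev, x = x, x_next
    return x

# Consistent synthetic system (b = A x_star), the setting where the Kaczmarz view applies.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 50))
x_star = rng.standard_normal(50)
b = A @ x_star
x_hat = minibatch_heavy_ball(A, b)
print("error:", np.linalg.norm(x_hat - x_star))
```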
