
Random-reshuffled SARAH does not need a full gradient computations (2111.13322v2)

Published 26 Nov 2021 in cs.LG and math.OC

Abstract: The StochAstic Recursive grAdient algoritHm (SARAH) is a variance-reduced variant of Stochastic Gradient Descent (SGD) that periodically requires a full gradient of the objective function. In this paper, we remove the necessity of a full gradient computation. This is achieved by using a randomized reshuffling strategy and aggregating the stochastic gradients obtained in each epoch. The aggregated stochastic gradients serve as an estimate of a full gradient in the SARAH algorithm. We provide a theoretical analysis of the proposed approach and conclude the paper with numerical experiments that demonstrate its efficiency.
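
The abstract only outlines the mechanism, so the following is a minimal NumPy sketch of the idea as described there: run SARAH's recursive estimator over a randomly reshuffled pass through the data, accumulate the per-sample gradients seen during the epoch, and reuse their average in place of the full-gradient computation that standard SARAH performs at the start of each epoch. The function names (`rr_sarah`, `grad_i`), the way the estimator is seeded on the very first step, and the exact point at which the epoch aggregate replaces the estimator are assumptions for illustration, not the paper's precise algorithm or the parameter choices from its analysis.

```python
import numpy as np

def rr_sarah(grad_i, w0, n, epochs, lr, rng=None):
    """Minimal sketch of random-reshuffled SARAH (names and details assumed).

    grad_i(w, i) should return the gradient of the i-th component function at w.
    Instead of a full-gradient pass at the start of every epoch, the recursive
    estimator is re-seeded with the average of the stochastic gradients
    accumulated while sweeping over a fresh random permutation of the data
    in the previous epoch.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w0, dtype=float).copy()
    w_prev = w.copy()
    v = None  # gradient estimator; seeded with a plain stochastic gradient (assumption)

    for _ in range(epochs):
        running_sum = np.zeros_like(w)
        perm = rng.permutation(n)          # random reshuffling: sample without replacement
        for i in perm:
            g_new = grad_i(w, i)
            running_sum += g_new           # aggregate this epoch's stochastic gradients
            if v is None:
                v = g_new
            else:
                v = g_new - grad_i(w_prev, i) + v   # SARAH recursive update
            w_prev = w.copy()
            w = w - lr * v
        v = running_sum / n                # aggregate replaces the full-gradient restart
    return w
```

For a quick smoke test one can take a least-squares finite sum, e.g. `grad_i = lambda w, i: A[i] * (A[i] @ w - y[i])` for data rows `A[i]` and targets `y[i]`; the step size and epoch budget passed to the sketch are placeholders, not the values prescribed by the paper's convergence theory.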
