High Probability Guarantees for Random Reshuffling (2311.11841v2)
Abstract: We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems. $\mathsf{RR}$ finds broad applications in practice, notably in training neural networks. In this work, we first investigate the concentration property of $\mathsf{RR}$'s sampling procedure and establish a new high probability sample complexity guarantee for driving the gradient norm (without expectation) below $\varepsilon$, which effectively characterizes the efficiency of a single $\mathsf{RR}$ execution. Our derived complexity matches the best existing in-expectation one up to a logarithmic term, without imposing additional assumptions or changing $\mathsf{RR}$'s updating rule. Furthermore, by leveraging our derived high probability descent property and bound on the stochastic error, we propose a simple and computable stopping criterion for $\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, at which point $\mathsf{RR}$-$\mathsf{sc}$ returns an iterate whose gradient norm is below $\varepsilon$ with high probability. Moreover, building on the proposed stopping criterion, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that adds a randomized perturbation procedure near stationary points. We show that $\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and efficiently returns a second-order stationary point with high probability, without making any sub-Gaussian tail-type assumptions on the stochastic gradient errors. Finally, we conduct numerical experiments on neural network training to support our theoretical findings.
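To make the abstract's three ingredients concrete, the sketch below shows plain $\mathsf{RR}$ (one pass per epoch over a freshly sampled permutation, i.e. sampling without replacement), a computable stopping test, and an optional random perturbation near candidate stationary points. This is a minimal illustration, not the paper's algorithms: the stopping quantity used here (the norm of the averaged per-epoch displacement) is an assumed proxy, whereas the paper's $\mathsf{RR}$-$\mathsf{sc}$ criterion is built from its high-probability descent and stochastic-error bounds, and the perturbation step only mimics the spirit of $\mathsf{p}$-$\mathsf{RR}$. All names and parameter choices (`rr_epoch`, `rr_sc_sketch`, step size, radius, tolerance) are hypothetical.

```python
import numpy as np


def rr_epoch(x, grad_i, n, lr, rng):
    """One epoch of random reshuffling: a single pass over the n component
    gradients in a freshly sampled random order (sampling without replacement)."""
    for i in rng.permutation(n):
        x = x - lr * grad_i(x, i)
    return x


def rr_sc_sketch(grad_i, x0, n, lr=1e-3, eps=1e-2, max_epochs=500,
                 perturb=False, perturb_radius=1e-2, seed=0):
    """Random reshuffling with a computable stopping test and, optionally, a
    random perturbation when the test fires (loosely in the spirit of
    RR-sc / p-RR; the actual criteria in the paper are different).

    grad_i(x, i) returns the gradient of the i-th component function at x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for epoch in range(max_epochs):
        x_prev = x.copy()
        x = rr_epoch(x, grad_i, n, lr, rng)
        # ||x_{k+1} - x_k|| / (lr * n) equals the norm of the average of the
        # component gradients evaluated along the epoch's inner iterates,
        # so it is a cheap, computable stationarity proxy.
        if np.linalg.norm(x - x_prev) / (lr * n) <= eps:
            if not perturb:
                return x, epoch + 1  # RR-sc-style: stop and return the iterate
            # p-RR-style: inject a small random perturbation near the candidate
            # stationary point and keep iterating, so strict saddles can be escaped.
            x = x + perturb_radius * rng.standard_normal(x.shape)
    return x, max_epochs


# Toy usage: a consistent least-squares finite sum
# f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 with b = A x_true.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200, 10
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    b = A @ x_true
    grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
    x_hat, epochs = rr_sc_sketch(grad_i, np.zeros(d), n)
    print(f"stopped after {epochs} epochs, error {np.linalg.norm(x_hat - x_true):.2e}")
```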