Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds (2306.12498v2)
Abstract: Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning. Contrary to the empirical practice of sampling from the dataset without replacement and (possibly) reshuffling at each epoch, the theoretical counterpart of SGD usually relies on the assumption of sampling with replacement. It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed. For convex finite-sum problems with $n$ components and under the $L$-smoothness assumption on each component function, there are matching upper and lower bounds, under sufficiently small -- $\mathcal{O}(\frac{1}{nL})$ -- step sizes. Yet those bounds appear too pessimistic -- in fact, the predicted performance is generally no better than that of full gradient descent -- and do not agree with empirical observations. In this work, to narrow the gap between the theory and practice of shuffled SGD, we sharpen the focus from general finite-sum problems to empirical risk minimization with linear predictors. This allows us to take a primal-dual perspective and interpret shuffled SGD as a primal-dual method with cyclic coordinate updates on the dual side. Leveraging this perspective, we prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds. Notably, our bounds predict much faster convergence than the existing analyses -- by a factor of order $\sqrt{n}$ in some cases. We empirically demonstrate that on common machine learning datasets our bounds are indeed much tighter. We further extend our analysis to nonsmooth convex problems and more general finite-sum problems, with similar improvements.
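To make the setting concrete, the sketch below shows shuffled SGD for empirical risk minimization with a linear predictor: each epoch draws a fresh permutation of the $n$ examples and performs one pass of component-gradient steps over them. This is a minimal NumPy sketch for intuition only, not the paper's algorithmic statement; the function name `shuffled_sgd`, its arguments, and the constant step size are illustrative choices, and the step size would in practice be set according to the relevant theory.

```python
# Illustrative sketch (not from the paper): shuffled SGD for ERM with a linear
# predictor, minimizing (1/n) * sum_i loss(a_i @ w, b_i).
import numpy as np

def shuffled_sgd(A, b, loss_grad, step_size, epochs, rng=None):
    """A: (n, d) data matrix; b: (n,) targets; loss_grad(z, y) is the derivative
    of the per-example loss with respect to the prediction z = a_i @ w."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)      # reshuffle: sample without replacement
        for i in perm:                 # one pass over all n components
            z = A[i] @ w               # linear prediction on example i
            w -= step_size * loss_grad(z, b[i]) * A[i]  # chain rule: d loss / d w
    return w

# Example per-example gradient: logistic loss log(1 + exp(-y * z)), y in {-1, +1}.
logistic_grad = lambda z, y: -y / (1.0 + np.exp(y * z))
```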