Convergence and concentration properties of constant step-size SGD through Markov chains (2306.11497v2)
Abstract: We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD) and study its properties through the prism of Markov chains. We show that, for unbiased gradient estimates with mildly controlled variance, the iterates converge to an invariant distribution in total variation distance. We also establish this convergence in Wasserstein-2 distance, in a more general setting than previous work. Thanks to the invariance property of the limit distribution, our analysis shows that the latter inherits sub-Gaussian or sub-exponential concentration properties when these hold for the gradient. This allows the derivation of high-confidence bounds for the final estimate. Finally, under such conditions in the linear case, we obtain a dimension-free deviation bound for the Polyak-Ruppert average of a tail sequence. All our results are non-asymptotic, and their consequences are discussed through a few applications.
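To make the setting concrete, the following is a minimal sketch (not code from the paper) of constant step-size SGD on a least-squares objective, the "linear case" mentioned in the abstract, together with a Polyak-Ruppert average of a tail sequence of iterates. The dimension, step size, burn-in length, and noise level are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

# Sketch: constant step-size SGD on f(theta) = E[(x^T theta - y)^2] / 2,
# with a Polyak-Ruppert average taken over the tail of the iterate chain.
# All hyperparameters below are illustrative choices.

rng = np.random.default_rng(0)
d, n_steps, burn_in = 10, 20_000, 10_000
gamma = 0.05  # constant step size

theta_star = rng.normal(size=d)   # minimizer of the population objective
theta = np.zeros(d)               # initial iterate
tail_sum = np.zeros(d)            # running sum for the tail average

for t in range(n_steps):
    # Unbiased stochastic gradient from a fresh sample (x, y),
    # y = x^T theta_star + sub-Gaussian noise.
    x = rng.normal(size=d)
    y = x @ theta_star + 0.1 * rng.normal()
    grad = (x @ theta - y) * x
    theta = theta - gamma * grad
    if t >= burn_in:              # only average iterates from the tail
        tail_sum += theta

theta_bar = tail_sum / (n_steps - burn_in)  # Polyak-Ruppert tail average
print("last iterate error :", np.linalg.norm(theta - theta_star))
print("tail average error :", np.linalg.norm(theta_bar - theta_star))
```

With a constant step size, the last iterate keeps fluctuating around the optimum (reflecting the invariant distribution), while the tail average typically concentrates much closer to it; this is the behavior the paper's deviation bounds quantify.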