Stochastic Gradient Descent under Markovian Sampling Schemes (2302.14428v3)

Published 28 Feb 2023 in math.OC and cs.LG

Abstract: We study a variation of vanilla stochastic gradient descent in which the optimizer only has access to a Markovian sampling scheme. These schemes encompass applications ranging from decentralized optimization with a random walker (token algorithms) to reinforcement learning and online system identification problems. We focus on obtaining rates of convergence under the least restrictive assumptions possible on the underlying Markov chain and on the functions optimized. We first establish the theoretical lower bound for methods that sample stochastic gradients along the path of a Markov chain, which reveals a dependence on the hitting time of the underlying Markov chain. We then study Markov chain SGD (MC-SGD) under much milder regularity assumptions than prior works (e.g., no bounded gradients or domain, and infinite state spaces). We finally introduce MC-SAG, a variance-reduced alternative to MC-SGD whose rate depends only on the hitting time of the Markov chain, yielding a communication-efficient token algorithm.
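The core object of the abstract is gradient sampling along a Markov chain trajectory rather than i.i.d. sampling: at each step, the optimizer can only use the gradient of the component indexed by the chain's current state. The sketch below is a minimal illustration of that idea in Python; the ring graph, quadratic components, step size, and all variable names are illustrative assumptions, not the paper's actual setup or algorithmic details.

```python
import numpy as np

# Minimal sketch of Markov-chain SGD (MC-SGD): the component index i_t is NOT
# drawn i.i.d. but follows a random walk on a graph (a "token" visiting nodes).
# Problem data (ring graph, quadratics, step size) are illustrative assumptions.

rng = np.random.default_rng(0)
n_nodes, dim, n_steps, step_size = 10, 5, 5000, 0.05

# Node i holds the component f_i(x) = 0.5 * ||x - a_i||^2, so
# grad f_i(x) = x - a_i and the global minimizer is the mean of the a_i.
targets = rng.normal(size=(n_nodes, dim))

def grad(i, x):
    return x - targets[i]

x = np.zeros(dim)
state = 0  # current node of the Markov chain (the token's position)
for t in range(n_steps):
    # SGD step using only the gradient available at the current node.
    x -= step_size * grad(state, x)
    # Markov transition: lazy random walk on a ring (stay, left, or right).
    state = (state + rng.choice([-1, 0, 1])) % n_nodes

print("distance to optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```

In the token-algorithm reading of this sketch, each transition of the chain corresponds to the token hopping to a neighbouring node, so every optimization step costs one communication; this is why the convergence rates discussed in the abstract are governed by quantities such as the chain's hitting time.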
