Accelerating Distributed Stochastic Optimization via Self-Repellent Random Walks (2401.09665v1)
Abstract: We study a family of distributed stochastic optimization algorithms where gradients are sampled by a token traversing a network of agents in random-walk fashion. Typically, these random walks are chosen to be Markov chains that asymptotically sample from a desired target distribution, and they play a critical role in the convergence of the optimization iterates. In this paper, we take a novel approach by replacing the standard linear Markovian token with one that follows a nonlinear Markov chain - namely the Self-Repellent Random Walk (SRRW). Defined for any given 'base' Markov chain, the SRRW, parameterized by a positive scalar {\alpha}, is less likely to transition to states that were highly visited in the past, hence the name. In the context of MCMC sampling on a graph, a recent breakthrough in Doshi et al. (2023) shows that the SRRW achieves an O(1/{\alpha}) decrease in the asymptotic variance for sampling. We propose the use of a 'generalized' version of the SRRW to drive token algorithms for distributed stochastic optimization in the form of stochastic approximation, termed SA-SRRW. We prove that the optimization iterate errors of the resulting SA-SRRW converge to zero almost surely and establish a central limit theorem, deriving the explicit form of the asymptotic covariance matrix of the iterate errors. This asymptotic covariance is always smaller than that of an algorithm driven by the base Markov chain and decreases at rate O(1/{\alpha}^2) - the performance benefit of using the SRRW is thereby amplified in the stochastic optimization context. Empirical results support our theoretical findings.
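The repellence mechanism described in the abstract can be sketched concretely. Following the kernel form in Doshi et al. (2023), the SRRW reweights each base-chain transition probability P(i, j) by x_j^{-alpha}, where x is the empirical visit distribution so far, so that frequently visited states become less attractive. The snippet below is a minimal illustration on a 4-cycle; the function name `srrw_step`, the graph, and the simulation parameters are our own choices for demonstration, not taken from the paper.

```python
import numpy as np

def srrw_step(P, x, i, alpha, rng):
    """One step of the Self-Repellent Random Walk (SRRW).

    P     : base Markov transition matrix (row-stochastic)
    x     : empirical visit distribution so far (sums to 1)
    i     : current state
    alpha : repellence strength (alpha = 0 recovers the base chain)
    """
    # Reweight transitions by x_j^{-alpha}: states visited more often
    # in the past become proportionally less likely to be chosen next.
    weights = P[i] * np.power(x, -alpha)
    probs = weights / weights.sum()
    return rng.choice(len(x), p=probs), probs

# Base chain: simple random walk on a 4-cycle (uniform stationary dist.)
P = np.array([[0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.5, 0.0, 0.5, 0.0]])

rng = np.random.default_rng(0)
n = P.shape[0]
counts = np.ones(n)   # initial pseudo-counts avoid division by zero
state = 0
for _ in range(5000):
    x = counts / counts.sum()
    state, _ = srrw_step(P, x, state, alpha=2.0, rng=rng)
    counts[state] += 1

print(counts / counts.sum())  # empirical distribution, close to uniform
```

Note that the walk remains a valid Markov chain only jointly with the visit distribution x, which is what makes it a nonlinear (self-interacting) Markov chain: the kernel depends on the chain's own empirical measure.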
- Reversible Markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/~aldous/RWG/book.html.
- Non-linear Markov chain Monte Carlo. In ESAIM: Proceedings, volume 19, pp. 79–84. EDP Sciences, 2007.
- Private weighted random walk stochastic gradient descent. IEEE Journal on Selected Areas in Information Theory, 2(1):452–463, 2021.
- Convergence and dynamical behavior of the Adam algorithm for nonconvex stochastic optimization. SIAM Journal on Optimization, 31(1):244–274, 2021.
- Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance. Electronic Journal of Statistics, 15(2):3892–3947, 2021.
- M. Benaim and Bertrand Cloez. A stochastic approximation approach to quasi-stationary distributions on finite spaces. Electronic Communications in Probability, 20(37):1–14, 2015.
- Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.
- V.S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint: Second Edition. Texts and Readings in Mathematics. Hindustan Book Agency, 2022. ISBN 9788195196111.
- Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
- Pierre Brémaud. Markov chains: Gibbs fields, Monte Carlo simulation, and queues, volume 31. Springer Science & Business Media, 2013.
- LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
- Nonlinear dynamical systems and control: A Lyapunov-based approach. Princeton University Press, 2008.
- On the convergence of decentralized federated learning under imperfect information sharing. arXiv preprint arXiv:2303.10695, 2023.
- Han-Fu Chen. Stochastic approximation and its applications, volume 64. Springer Science & Business Media, 2006.
- Explicit mean-square error bounds for Monte-Carlo and linear stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp. 4173–4183. PMLR, 2020a.
- Finite-sample analysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874, 2020b.
- Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623, 2022.
- Burgess Davis. On the integrability of the martingale square function. Israel Journal of Mathematics, 8:187–190, 1970.
- Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, volume 1, 2014.
- Interacting Markov chain Monte Carlo methods for solving nonlinear measure-valued equations. The Annals of Applied Probability, 20(2):593–639, 2010.
- Self-interacting Markov chains. Stochastic Analysis and Applications, 24(3):615–660, 2006.
- Bernard Delyon. Stochastic approximation with decreasing gain: Convergence and asymptotic theory. Technical report, Université de Rennes, 2000.
- Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, pp. 94–128, 1999.
- Zap Q-learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2232–2241, 2017.
- Q-learning with uniformly bounded variance. IEEE Transactions on Automatic Control, 2021.
- Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1626–1635. PMLR, 2019.
- Thinh T Doan. Finite-time convergence rates of nonlinear two-time-scale stochastic approximation under Markovian noise. arXiv preprint arXiv:2104.01627, 2021.
- Convergence rates of accelerated Markov gradient descent with applications in reinforcement learning. arXiv preprint arXiv:2002.02873, 2020.
- Self-repellent random walks on general graphs: achieving minimal sampling variance via nonlinear Markov chains. In International Conference on Machine Learning. PMLR, 2023.
- Marie Duflo. Algorithmes stochastiques, volume 23. Springer, 1996.
- Mathieu Even. Stochastic gradient descent under Markovian sampling schemes. In International Conference on Machine Learning, 2023.
- Gersende Fort. Central limit theorems for stochastic approximation with controlled Markov chain dynamics. ESAIM: Probability and Statistics, 19:60–80, 2015.
- Stochastic heavy ball. Electronic Journal of Statistics, 12:461–529, 2018.
- Escaping saddle points efficiently with occupation-time-adapted perturbations. arXiv preprint arXiv:2005.04507, 2020.
- Martingale Limit Theory and Its Application. Elsevier Science, 2014.
- Hadrien Hendrikx. A principled framework for the design and analysis of token algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 470–489. PMLR, 2023.
- A two-timescale stochastic algorithm framework for bilevel optimization: Complexity analysis and application to actor-critic. SIAM Journal on Optimization, 33(1):147–180, 2023.
- Efficiency ordering of stochastic gradient descent. In Advances in Neural Information Processing Systems, 2022.
- How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724–1732. PMLR, 2017.
- Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference on Learning Theory, pp. 1042–1085. PMLR, 2018.
- On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM (JACM), 68(2):1–29, 2021.
- Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pp. 1944–1974. PMLR, 2019.
- Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1):130–151, 2018.
- Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
- Adam: A method for stochastic optimization. In ICLR, 2015.
- Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 14(2):796–819, 2004.
- Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
- Fully decentralized federated learning. In Advances in Neural Information Processing Systems, 2018.
- SNAP Datasets: Stanford large network dataset collection, 2014.
- Markov chains and mixing times, volume 107. American Mathematical Society, 2017.
- State dependent performative prediction with stochastic approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3164–3186. PMLR, 2022.
- Revisiting the central limit theorems for the SGD-type methods. arXiv preprint arXiv:2207.11755, 2022.
- Online statistical inference for nonlinear stochastic approximation with Markovian data. arXiv preprint arXiv:2302.07690, 2023.
- Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
- Sean Meyn. Control systems and reinforcement learning. Cambridge University Press, 2022.
- The compact law of the iterated logarithm for multivariate stochastic approximation algorithms. Stochastic Analysis and Applications, 23(1):181–203, 2005.
- Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.
- Success and failure of adaptation-diffusion algorithms with decaying step size in multiagent networks. IEEE Transactions on Signal Processing, 65(11):2798–2813, 2017.
- On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Conference on Learning Theory, pp. 2947–2997. PMLR, 2020.
- Angelia Nedic. Distributed gradient methods for convex machine learning problems in networks: Distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101, 2020.
- Alex Olshevsky. Asymptotic network independence and step-size for a distributed subgradient method. Journal of Machine Learning Research, 23(69):1–32, 2022.
- Mariane Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Processes and their Applications, 78(2):217–244, 1998.
- On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.
- The network data repository with interactive graph analytics and visualization. In AAAI, 2015.
- Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017.
- On Markov chain gradient descent. In Advances in Neural Information Processing Systems, volume 31, 2018.
- Decentralized learning with random walks and communication-efficient adaptive optimization. In Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022), 2022.
- RelaySum for decentralized deep learning on heterogeneous data. In Advances in Neural Information Processing Systems, volume 34, pp. 28004–28015, 2021.
- MATCHA: Speeding up decentralized SGD via matching decomposition sampling. In 2019 Sixth Indian Control Conference (ICC), pp. 299–300. IEEE, 2019.
- Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research, 45(4):1405–1444, 2020.
- Decentralized federated learning with unreliable communications. IEEE Journal of Selected Topics in Signal Processing, 16(3):487–500, 2022.
- A two-time-scale stochastic optimization framework with applications in control and reinforcement learning. arXiv preprint arXiv:2109.14756, 2021.