Continuous-Time Distributed Dynamic Programming for Networked Multi-Agent Markov Decision Processes (2307.16706v7)
Abstract: The main goal of this paper is to investigate continuous-time distributed dynamic programming (DP) algorithms for networked multi-agent Markov decision processes (MAMDPs). We adopt a distributed multi-agent framework in which each agent has access only to its own reward and has no knowledge of the rewards of the other agents. Each agent can share its parameters with neighboring agents through a communication network represented by a graph. We first introduce a novel distributed DP algorithm inspired by the distributed optimization method of Wang and Elia. Next, we derive a second distributed DP algorithm through a decoupling process. The convergence of both algorithms is proved from a systems and control perspective. The results in this paper set the stage for new distributed temporal difference learning algorithms.
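To fix ideas, the following is a minimal sketch of the flavor of such dynamics, not the paper's actual algorithm: each agent runs an Euler-discretized ODE that combines its local Bellman residual (computed from its own reward only) with a Wang-Elia-style Laplacian consensus term and an integral correction state, so that the network jointly evaluates the value function of the averaged reward. The specific dynamics, gains, and network below are illustrative assumptions.

```python
# Minimal, assumption-laden sketch (NOT the paper's exact algorithm):
# Euler-discretized continuous-time distributed policy evaluation.
# Agent i sees only its local reward R[i]; the network's goal is to
# evaluate the value function of the averaged reward (1/N) sum_i R[i].
import numpy as np

np.random.seed(0)
n_states, n_agents, gamma = 4, 3, 0.9

# Shared transition matrix P under a fixed policy (rows sum to 1).
P = np.random.rand(n_states, n_states)
P /= P.sum(axis=1, keepdims=True)

# Local rewards; the team objective is their average.
R = np.random.rand(n_agents, n_states)
R_avg = R.mean(axis=0)

# Graph Laplacian of a ring communication network over the agents.
A = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    A[i, (i + 1) % n_agents] = A[i, (i - 1) % n_agents] = 1.0
L = np.diag(A.sum(axis=1)) - A

# Coupled ODE per agent (hypothetical Wang-Elia-style dynamics):
#   dV_i/dt = R_i + gamma * P V_i - V_i - sum_j L_ij (V_j + W_j)
#   dW_i/dt = sum_j L_ij V_j
# The integral state W drives the agents' estimates to consensus on
# the value function of the averaged reward.
V = np.zeros((n_agents, n_states))
W = np.zeros((n_agents, n_states))
dt = 0.05
for _ in range(20_000):
    bellman = R + gamma * V @ P.T - V   # local Bellman residuals
    dV = bellman - L @ (V + W)          # residual + consensus feedback
    dW = L @ V                          # integral correction
    V = V + dt * dV
    W = W + dt * dW

# Compare against the centralized solution (I - gamma*P) V* = R_avg.
V_star = np.linalg.solve(np.eye(n_states) - gamma * P, R_avg)
print("max deviation from centralized value:", np.abs(V - V_star).max())
```

At the equilibrium, the integral state forces consensus (L V = 0), and summing the V-dynamics over agents cancels the Laplacian terms, leaving the Bellman equation for the averaged reward; this mirrors the role of the dual variable in Wang and Elia's distributed optimization dynamics.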
- M. L. Puterman, “Markov decision processes,” Handbooks in Operations Research and Management Science, vol. 2, pp. 331–434, 1990.
- K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
- D. Lee, N. He, P. Kamalaruban, and V. Cevher, “Optimization for reinforcement learning: From a single agent to cooperative agents,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 123–135, 2020.
- A. Jadbabaie, J. Lin, and A. S. Morse, “Coordination of groups of mobile autonomous agents using nearest neighbor rules,” IEEE Transactions on Automatic Control, vol. 48, no. 6, pp. 988–1001, 2003.
- A. Nedić and A. Ozdaglar, “Subgradient methods for saddle-point problems,” Journal of Optimization Theory and Applications, vol. 142, no. 1, pp. 205–228, 2009.
- A. Nedic, A. Ozdaglar, and P. A. Parrilo, “Constrained consensus and optimization in multi-agent networks,” IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
- W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
- J. Wang and N. Elia, “Control approach to distributed optimization,” in 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2010, pp. 557–561.
- H. K. Khalil, “Nonlinear Systems,” 3rd ed. Prentice Hall, 2002.
- S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed, “Distributed policy evaluation under multiple behavior strategies,” IEEE Transactions on Automatic Control, vol. 60, no. 5, pp. 1260–1274, 2015.
- M. S. Stanković and S. S. Stanković, “Multi-agent temporal-difference learning with linear function approximation: weak convergence under time-varying network topologies,” in American Control Conference (ACC), 2016, pp. 167–172.
- X. Sha, J. Zhang, K. Zhang, K. You, and T. Başar, “Asynchronous policy evaluation in distributed reinforcement learning over networks,” arXiv preprint arXiv:2003.00433, 2020.
- T. Doan, S. Maguluri, and J. Romberg, “Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning,” in International Conference on Machine Learning, 2019, pp. 1626–1635.
- H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via double averaging primal-dual optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 9649–9660.
- L. Cassano, K. Yuan, and A. H. Sayed, “Multiagent fully decentralized value function learning with linear convergence rates,” IEEE Transactions on Automatic Control, vol. 66, no. 4, pp. 1497–1512, 2020.
- D. Ding, X. Wei, Z. Yang, Z. Wang, and M. R. Jovanović, “Fast multi-agent temporal-difference learning via homotopy stochastic primal-dual optimization,” arXiv preprint arXiv:1908.02805, 2019.
- P. Heredia and S. Mou, “Finite-sample analysis of multi-agent policy evaluation with kernelized gradient temporal difference,” in 2020 59th IEEE Conference on Decision and Control (CDC), 2020, pp. 5647–5652.
- D. Lee, D. W. Kim, and J. Hu, “Distributed off-policy temporal difference learning using primal-dual method,” IEEE Access, vol. 10, pp. 107077–107094, 2022.
- R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora, “Fast gradient-descent methods for temporal-difference learning with linear function approximation,” in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 993–1000.
- R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
- V. S. Borkar and S. P. Meyn, “The ODE method for convergence of stochastic approximation and reinforcement learning,” SIAM Journal on Control and Optimization, vol. 38, no. 2, pp. 447–469, 2000.
- D. Lee and N. He, “A unified switching system perspective and convergence analysis of Q-learning algorithms,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
- D. Lee, J. Hu, and N. He, “A discrete-time switching system analysis of Q-learning,” SIAM Journal on Control and Optimization (accepted), 2022.