Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms (1911.10635v2)

Published 24 Nov 2019 in cs.LG, cs.AI, cs.MA, and stat.ML

Abstract: Recent years have witnessed significant advances in reinforcement learning (RL), which has registered great success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history, and has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field on the mark, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as continuing stimulus for researchers interested in working on this exciting while challenging topic.

This paper provides a selective overview of multi-agent reinforcement learning (MARL), focusing on algorithms with theoretical guarantees. It contrasts MARL with single-agent RL, highlighting the unique challenges and frameworks inherent in multi-agent scenarios.

1. Introduction and Background

MARL deals with sequential decision-making problems involving multiple autonomous agents interacting within a common environment. Unlike single-agent RL, where an agent optimizes its own return against a stationary environment typically modeled as a Markov Decision Process (MDP), MARL agents' optimal strategies depend on the concurrent actions and learning processes of others, making the environment non-stationary.

The paper primarily discusses two theoretical frameworks for MARL:

  • Markov/Stochastic Games (MGs): A direct generalization of MDPs to multiple agents, described by the tuple $(\mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, \mathcal{P}, \{R^i\}_{i\in\mathcal{N}}, \gamma)$. Each agent $i$ aims to find a policy $\pi^i: \mathcal{S} \to \Delta(\mathcal{A}^i)$ that maximizes its own long-term discounted reward $V^i_{\pi^i, \pi^{-i}}$. The standard solution concept is the Nash equilibrium (NE), a joint policy $\pi^*$ from which no agent can improve its reward by unilaterally changing its policy (the formal condition is given after this list). MGs encompass cooperative (common reward, $R^i = R$), competitive (zero-sum, $\sum_{i\in\mathcal{N}} R^i = 0$), and mixed (general-sum) settings.
  • Extensive-Form Games (EFGs): Suitable for modeling sequential actions and imperfect information, described by $(\mathcal{N}\cup\{c\}, \mathcal{H}, \mathcal{Z}, \mathcal{A}, \{R^i\}_{i\in\mathcal{N}}, \tau, \pi^c, \mathcal{S})$. Histories $h \in \mathcal{H}$ track sequences of actions, $\tau(h)$ indicates the acting agent (or the chance player $c$), and information sets $s \in \mathcal{S}$ group histories indistinguishable to an agent. Under perfect recall, behavioral policies $\pi^i: \mathcal{S}^i \to \Delta(\mathcal{A}(s))$ suffice. The goal is often to find an $\epsilon$-Nash equilibrium.
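
For concreteness, the NE condition referenced above can be stated as follows (a standard formulation consistent with the notation here, not a verbatim quote from the paper): a joint policy $\pi^* = (\pi^{1,*}, \dots, \pi^{N,*})$ is an NE if, for every agent $i$ and every state $s$,

$$V^i_{\pi^{i,*}, \pi^{-i,*}}(s) \;\ge\; V^i_{\pi^{i}, \pi^{-i,*}}(s) \quad \text{for all } \pi^i, \qquad \text{where } V^i_{\pi^i, \pi^{-i}}(s) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t R^i(s_t, a_t) \,\Big|\, s_0 = s\Big]$$

and the expectation is over trajectories generated by the joint policy $(\pi^i, \pi^{-i})$ and the transition kernel $\mathcal{P}$.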

2. Challenges in MARL Theory

Developing theoretical guarantees for MARL algorithms faces several hurdles beyond single-agent RL:

  • Non-Unique Learning Goals: Convergence to NE is common but not always the most relevant or achievable goal. Alternatives include stability, rationality against specific opponent classes, regret minimization, communication efficiency, and robustness.
  • Non-Stationarity: As agents learn concurrently, the environment from each agent's perspective changes, violating standard RL assumptions. Independent learning often fails.
  • Scalability (Combinatorial Nature): The joint action space grows exponentially with the number of agents, making algorithms computationally expensive (illustrated in the sketch after this list).
  • Information Structures: Agents typically have partial information about others' states, actions, rewards, or policies. Learning schemes vary based on information availability (centralized, decentralized with networked agents, fully decentralized).
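
To make the combinatorial blow-up concrete, the following sketch compares the memory footprint of a centralized joint-action Q-table with independent per-agent tables; the state and action counts are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch of the scalability challenge: the joint action space,
# and hence a centralized tabular Q-function over joint actions, grows
# exponentially in the number of agents.
n_states = 1_000              # assumed example sizes
n_actions_per_agent = 5
for n_agents in (2, 5, 10):
    joint_q_entries = n_states * n_actions_per_agent ** n_agents
    per_agent_entries = n_agents * n_states * n_actions_per_agent
    print(f"N={n_agents:2d}: joint Q-table {joint_q_entries:>16,d} entries, "
          f"independent per-agent tables {per_agent_entries:,d} entries")
```

This gap is exactly what factored value functions and the decentralized schemes reviewed below try to close.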

3. MARL Algorithms with Theoretical Guarantees

The paper reviews algorithms categorized by game type:

3.1 Cooperative Setting

  • Homogeneous Agents (Common Reward/Markov Teams):
    • Joint-action Q-learning converges to the optimal Q-function, but policy convergence requires coordination (e.g., Optimal Adaptive Learning - OAL).
    • Scalability approaches include distributed Q-learning for deterministic MMDPs, value factorization (with recent theoretical justifications), and common interest games.
    • Policy-based methods (e.g., actor-critic fictitious play) have limited guarantees, primarily for specific game types.
    • Markov Potential Games (MPGs) allow reducing the problem to single-agent RL.
    • Mean-Field Games (MFGs) and Mean-Field Control (MFC) simplify analysis for large populations by focusing on average effects. RL algorithms like mean-field Q-learning and PG methods are emerging.
  • Decentralized Paradigm with Networked Agents (Heterogeneous Rewards, Team-Average Goal):
    • Algorithms often involve consensus mechanisms over communication networks.
    • $\mathcal{QD}$-learning combines Q-learning with neighbor estimate averaging.
    • Decentralized Actor-Critic algorithms use consensus for critic (value function) updates and local information for actor (policy) updates. Convergence is shown for linear function approximation (LFA). Variants handle continuous spaces and off-policy learning.
    • Decentralized Fitted Q-Iteration (FQI) provides finite-sample analysis, considering errors from decentralized computation.
    • Policy Evaluation: Focuses on estimating value functions for fixed policies. Distributed TD(0), TD($\lambda$), and gradient TD methods (using the MSPBE objective and saddle-point formulations) exist with convergence guarantees (asymptotic and finite-time); a minimal consensus-based TD(0) sketch appears after this list.
    • Communication Efficiency: Algorithms like LAPG, hierarchical methods, and scalar transmission strategies aim to reduce communication overhead.
  • Partially Observed Model (Dec-POMDPs):
    • Solving Dec-POMDPs optimally is NEXP-complete in general. Most algorithms use centralized learning with decentralized execution.
    • Methods include converting to NOMDPs/oMDPs, using finite-state controllers (FSCs), sampling-based approaches (MCTS), and PG methods.
    • Decentralized learning is possible under specific structures (e.g., common information, common random number generator).
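
The consensus idea behind the networked-agent algorithms above can be illustrated with a minimal sketch of decentralized TD(0) with linear function approximation: each agent updates a local parameter vector using only its private reward, then averages parameters with its neighbors through a doubly stochastic weight matrix. This is a simplified sketch under assumed inputs (`env_step`, `phi`, and `W` are placeholders supplied by the user), not the paper's exact algorithm.

```python
import numpy as np

def decentralized_td0(env_step, phi, W, n_agents, dim,
                      alpha=0.05, gamma=0.95, n_steps=10_000):
    """Sketch of consensus-based decentralized TD(0).

    env_step() -> (s, rewards, s_next): one transition under a fixed joint
                  behavior policy; rewards is an array of private rewards.
    phi(s)     -> shared feature vector of length dim.
    W          -> (n_agents, n_agents) doubly stochastic mixing matrix
                  matching the communication graph.
    """
    w = np.zeros((n_agents, dim))        # one local weight vector per agent
    for _ in range(n_steps):
        s, rewards, s_next = env_step()
        f, f_next = phi(s), phi(s_next)
        # Local TD(0) update driven by each agent's private reward.
        for i in range(n_agents):
            delta = rewards[i] + gamma * f_next @ w[i] - f @ w[i]
            w[i] = w[i] + alpha * delta * f
        # Consensus step: mix parameter vectors with neighbors.
        w = W @ w
    return w
```

Under standard assumptions the local estimates reach consensus and track the value function of the team-average reward, which is the flavor of guarantee reviewed above for decentralized policy evaluation.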

3.2 Competitive Setting (Mainly Two-Player Zero-Sum)

  • Value-Based Methods:
    • Focus on finding the optimal value function $V^*$, the unique fixed point of the minimax Bellman operator $\mathcal{T}^* V = \text{Value}[Q_V]$.
    • Value Iteration ($V_{t+1} = \mathcal{T}^* V_t$) and Policy Iteration converge linearly; a minimal sketch of minimax value iteration appears after this list.
    • Minimax Q-learning is a model-free, tabular extension converging to the optimal Q-function $Q^*$. Extensions exist for LQ games.
    • Batch RL (FQI for zero-sum games) offers finite-sample bounds with general function approximation.
    • Algorithms for Turn-Based Stochastic Games (TBSGs) achieve near-optimal sample complexity with generative models.
    • Online learning algorithms (e.g., UCSG) achieve sublinear regret against arbitrary opponents.
    • For EFGs, sequence-form representation allows solving via LP (for small games). MCTS methods converge to minimax solutions in turn-based games.
  • Policy-Based Methods:
    • Aim for no-regret learning, where time-average policies converge to approximate NE in self-play for zero-sum games.
    • Fictitious Play (FP): Agents play best-response to opponents' average historical play. Continuous-time FP is Hannan consistent; discrete variants with smoothing converge. Neural Fictitious Self-Play (NFSP) applies this to large games using deep RL. Actor-critic methods can implement smooth FP.
    • Counterfactual Regret Minimization (CFR): Minimizes regret at each information set, bounding overall regret. Vanilla CFR requires full tree traversal. Variants use sampling (MCCFR), function approximation (Deep CFR), pruning, and variance reduction. CFR is linked to policy gradient methods (e.g., A2C).
    • Policy Gradient in Continuous Games: Vanilla PG often fails to converge (limit cycles, non-Nash stable points). Remedies include optimistic/extragradient methods, second-order information, or focusing on Stackelberg equilibria. Global convergence is rare but shown for specific LQ game settings.
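
The value-based machinery above can be made concrete with a short sketch of minimax value iteration for a small tabular zero-sum Markov game: at each state, the operator $\mathcal{T}^*$ evaluates the matrix game with payoffs $Q_V(s, a^1, a^2) = R(s, a^1, a^2) + \gamma \sum_{s'} \mathcal{P}(s'|s, a^1, a^2) V(s')$, which can be solved by a linear program from the maximizer's perspective. The tensor layout of `R` and `P`, the use of `scipy.optimize.linprog`, and the function names are illustrative assumptions; this is a sketch of the textbook operator, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(A):
    """Value of the zero-sum matrix game max_x min_y x^T A y, via an LP."""
    m, n = A.shape
    # Variables: x (row player's mixed strategy) and v (game value).
    c = np.zeros(m + 1); c[-1] = -1.0                  # maximize v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])           # v <= x^T A e_j for all j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                               # x sums to 1
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def minimax_value_iteration(R, P, gamma=0.9, iters=200):
    """R: (S, A1, A2) rewards; P: (S, A1, A2, S) transition kernel."""
    S = R.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V                            # shape (S, A1, A2)
        V = np.array([matrix_game_value(Q[s]) for s in range(S)])
    return V
```

Policy iteration and minimax Q-learning replace the exact operator application with policy evaluation steps or stochastic samples, respectively, but the per-state matrix-game solve is the same ingredient.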

3.3 Mixed Setting (General-Sum)

  • Theoretically most challenging; finding NE is hard, and standard methods may fail.
  • Value-Based Methods:
    • Nash Q-learning converges under strong uniqueness assumptions.
    • Friend-or-Foe Q-learning simplifies by assuming fixed opponent types.
    • Correlated Q-learning targets correlated equilibria.
    • Batch RL methods can find approximate NE via Bellman residual minimization or FQI in specific team vs. team settings.
    • Decentralized Q-learning converges for weakly acyclic games using two timescales.
  • Policy-Based Methods:
    • Convergence guarantees are often limited to specific game classes (e.g., Morse-Smale) or local analysis around stable equilibria. Methods like Symplectic Gradient Adjustment are proposed.
    • Self-play with no-regret algorithms guarantees convergence to coarse correlated equilibria in normal-form games (a minimal sketch follows this list).
  • Mean-Field Regime:
    • Mean-field Nash Q-learning approximates opponent actions via empirical averages.
    • MFG equilibrium finding often uses fixed-point iterations, solvable via single-agent RL (Q-learning, PG) combined with mean-field estimation or fictitious play updates.
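
As a concrete instance of the no-regret self-play result mentioned above, the sketch below runs regret matching (one standard no-regret algorithm) in a randomly generated two-player general-sum normal-form game; vanishing average external regret for every player implies that the empirical distribution of joint play approaches the set of coarse correlated equilibria. The payoff matrices and iteration count are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
U1 = rng.uniform(size=(3, 3))     # player 1's payoff matrix (assumed example)
U2 = rng.uniform(size=(3, 3))     # player 2's payoff matrix (assumed example)

def regret_matching_strategy(regrets):
    """Play each action proportionally to its positive cumulative regret."""
    pos = np.maximum(regrets, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(regrets), 1.0 / len(regrets))

T = 20_000
cum_regret = [np.zeros(3), np.zeros(3)]
joint_counts = np.zeros((3, 3))   # empirical distribution of joint play
for _ in range(T):
    p1 = regret_matching_strategy(cum_regret[0])
    p2 = regret_matching_strategy(cum_regret[1])
    a1 = rng.choice(3, p=p1)
    a2 = rng.choice(3, p=p2)
    joint_counts[a1, a2] += 1
    # External regret: payoff of each fixed action minus the realized payoff.
    cum_regret[0] += U1[:, a2] - U1[a1, a2]
    cum_regret[1] += U2[a1, :] - U2[a1, a2]

avg_regret = [max(float(np.max(r)) / T, 0.0) for r in cum_regret]
print("average external regret per player:", avg_regret)
print("empirical joint play:\n", joint_counts / T)
```

Note that this guarantees a coarse correlated equilibrium of the empirical joint distribution, not a Nash equilibrium, which matches the weaker solution concepts typically attainable in the general-sum setting.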

4. Application Highlights

MARL techniques drive progress in:

  • Cooperative: UAV coordination (coverage, communication links, spectrum sharing), learning emergent communication protocols.
  • Competitive: Game playing like Go (AlphaGo/AlphaZero using deep RL, MCTS, self-play) and Poker (Libratus/Pluribus using CFR variants, abstraction, subgame solving).
  • Mixed: Multiplayer Poker (Pluribus), complex video games like StarCraft II (AlphaStar) and Dota 2 (OpenAI Five) using deep RL, self-play, and large-scale training, and modeling social dilemmas.

5. Conclusion and Future Directions

MARL theory is advancing but faces significant challenges. Key open areas for future theoretical work include:

  • MARL in partially observed settings (POSGs).
  • Establishing theoretical foundations for deep MARL.
  • Developing efficient model-based MARL algorithms.
  • Understanding the global convergence properties of policy gradient methods in general MARL.
  • Incorporating robustness and safety guarantees into MARL algorithms.
Authors (3)
  1. Kaiqing Zhang (70 papers)
  2. Zhuoran Yang (155 papers)
  3. Tamer Başar (200 papers)
Citations (1,061)