Decentralized Multi-Agent Reinforcement Learning

Updated 28 April 2026

Decentralized multi-agent reinforcement learning is an approach where multiple agents learn and interact without centralized control, relying on limited local information.
It employs algorithmic paradigms like independent Q-learning, actor–critic methods, and selective communication to address non-stationarity, scalability, and privacy challenges.
Practical applications include multi-robot systems, autonomous vehicles, and smart grids, demonstrating effective, resilient coordination in distributed environments.

Decentralized multi-agent reinforcement learning (MARL) concerns the design of algorithms and theoretical frameworks wherein multiple agents interact in a common environment without recourse to a central controller. Each agent typically observes only local information, may communicate (or not) with limited neighbors, and acts to maximize its own or a shared reward, often under constraints imposed by non-stationarity, partial observability, scalability, and privacy. Decentralized MARL methods have emerged in response to challenges such as coordination, scalability, robustness to single-point failures, and the need for practical algorithms in distributed systems such as autonomous vehicle fleets, robotic swarms, sensor networks, and smart grids. This article surveys key models, algorithmic paradigms, scalability principles, representative results, limitations, and current research directions in decentralized MARL, with an emphasis on both classic and recently published approaches.

1. Formal Models and Problem Settings

The prototypical decentralized MARL scenario is formalized as a multi-agent Markov decision process (MMDP) or, more generally, a Markov game. Let $n$ agents indexed by $i=1,\dots,n$ interact on a finite or continuous state space $S$, each with a local action set $A_i$, which together define the joint action space $A=\prod_{i=1}^n A_i$. The global state transitions are given by a Markov kernel $P(s'|s,a_1,\dots,a_n)$, and the rewards, whether global or local, by $r(s,a_1,\dots,a_n)$. Each agent typically has access only to (a) its own observations or local neighborhood information, and (b) possibly restricted communication with neighbors on a fixed or time-varying graph $G=(N,E)$ (Zhang et al., 2019, Lidard et al., 2021, Hassan et al., 2023). Common objectives include maximizing the sum or average of local cumulative rewards, finding Nash or correlated equilibria in general-sum games, or optimizing subject to safety or privacy constraints.
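
As an illustration of the "average of local cumulative rewards" objective, one common way to write it is sketched below. The discount factor $\gamma\in[0,1)$, the per-agent rewards $r_i$, the local observations $o_{i,t}$, and the local policies $\pi_i$ are notation assumed here for illustration; the cited papers may use different conventions.

```latex
J(\pi_1,\dots,\pi_n)
  = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
      \frac{1}{n}\sum_{i=1}^{n} r_i\bigl(s_t, a_{1,t},\dots,a_{n,t}\bigr)\right],
\qquad
a_{i,t} \sim \pi_i(\cdot \mid o_{i,t}),
\quad
s_{t+1} \sim P(\cdot \mid s_t, a_{1,t},\dots,a_{n,t}).
```

The decentralization constraint shows up in the conditioning: each $\pi_i$ depends only on agent $i$'s own observation, even though the reward and the transition depend on the joint action.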

Variants of the formal model address:

Information structure: Full state observability, partial observability (Dec-POMDP), action observability, or restricted communication.
Communication: No communication, explicit message-passing, or communication embedded in environmental signals (stigmergy).
Reward structure: Fully cooperative (identical or shared reward), general-sum, or mixed cooperative–competitive.
Constraints: Safety, privacy, or resource constraints expressed as average or peak utilities (Hassan et al., 2023).
Offline/online learning: Access to offline datasets versus environment interaction (Jiang et al., 2021).
Stability: Closed-loop stability and safety via control-theoretic constraints (Zhang et al., 2020).

2. Principal Algorithmic Approaches

Decentralized MARL algorithms can be grouped by the degree of centralization, communication, and theoretical properties.

2.1 Independent Q-learning and Minimalist Approaches

The simplest method, Independent Q-Learning (IQL), trains local agents' Q-values in isolation, treating other agents as part of the (non-stationary) environment (Su et al., 2022). This is highly scalable but suffers from non-stationarity, leading to instability or suboptimal equilibria. The MA2QL protocol extends IQL by having agents take turns updating their Q-functions in a round-robin fashion, guaranteeing $\varepsilon$-convergence to a Nash equilibrium via alternating single-agent stationary learning (Su et al., 2022).
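
The difference between simultaneous and turn-based updates can be sketched on a toy problem. The example below is a minimal illustration, not the published MA2QL algorithm: it assumes a stateless two-agent cooperative matrix game, and the names `train`, `payoff`, and `turn_length` are hypothetical.

```python
import random


def train(payoff, steps=2000, alpha=0.1, epsilon=0.2, turn_length=None, seed=0):
    """Two agents learn independent Q-values for a repeated cooperative matrix game.

    payoff[a0][a1] is the reward shared by both agents.
    turn_length=None : both agents update on every step (IQL-style).
    turn_length=k    : only one agent updates at a time, and the learner
                       switches every k steps (round-robin, in the spirit of MA2QL).
    """
    rng = random.Random(seed)
    n_actions = [len(payoff), len(payoff[0])]
    q = [[0.0] * n_actions[0], [0.0] * n_actions[1]]

    def act(i):
        # Epsilon-greedy action selection on agent i's own Q-values only.
        if rng.random() < epsilon:
            return rng.randrange(n_actions[i])
        return max(range(n_actions[i]), key=lambda a: q[i][a])

    for step in range(steps):
        actions = [act(0), act(1)]
        reward = payoff[actions[0]][actions[1]]
        if turn_length is None:
            learners = [0, 1]
        else:
            learners = [(step // turn_length) % 2]
        for i in learners:
            # Stateless Q-update: move the estimate toward the observed reward.
            q[i][actions[i]] += alpha * (reward - q[i][actions[i]])
    return q
```

Calling `train(payoff)` gives the simultaneous-update behaviour, while `train(payoff, turn_length=100)` freezes one agent's Q-values during the other agent's turn, so the learner faces a fixed target estimate within each turn (the frozen agent still explores with probability `epsilon`).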

2.2 Actor–Critic and Policy-Gradient Frameworks

Decentralized actor–critic protocols leverage local critics and actors. Convergence guarantees exist for both stochastic policies (Zhang et al., 2019) and deterministic policies in continuous action spaces (Grosnit et al., 2021). The core pattern is that each agent maintains local policy parameters and a value function (critic) and exchanges only summarized statistics (e.g., parameter vectors, gradients) with neighbors to reach consensus; raw environmental data and policies are never shared, which preserves privacy (Li et al., 2021, Grosnit et al., 2021). Off-policy variants enable efficient sample reuse and asynchronous updates (Li et al., 2021).
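
The consensus pattern can be sketched as a local update followed by neighbor averaging. This is a minimal sketch under stated assumptions: the Metropolis weighting rule is one standard choice of mixing matrix rather than the scheme of any particular cited paper, and `metropolis_weights` and `consensus_critic_step` are hypothetical names. The local temporal-difference updates are passed in as a precomputed array.

```python
import numpy as np


def metropolis_weights(adjacency):
    """Build a symmetric, doubly stochastic mixing matrix from an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    degree = adjacency.sum(axis=1)
    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j] > 0:
                weights[i, j] = 1.0 / (1.0 + max(degree[i], degree[j]))
        # Remaining mass stays on the agent's own parameters.
        weights[i, i] = 1.0 - weights[i].sum()
    return weights


def consensus_critic_step(critic_params, local_td_updates, weights, step_size):
    """One round: each agent takes a local step, then averages with its neighbors.

    critic_params    : array of shape (n_agents, n_params), one row per agent
    local_td_updates : array of the same shape, computed from local data only
    """
    locally_updated = critic_params + step_size * local_td_updates
    # Row i of the product mixes only rows j with weights[i, j] > 0, i.e. neighbors.
    return weights @ locally_updated
```

Only parameter rows cross the network in `consensus_critic_step`; the data used to compute `local_td_updates` never leaves the agent. Because the mixing matrix is doubly stochastic, the network-wide average of the parameters is unaffected by the mixing itself.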

2.3 Stigmergic and Rule-Based Decentralization

In environments with no explicit communication, stigmergic approaches replace messages with modifications to the environment: agents leave local traces, such as "pheromone" fields, which guide future agent behavior (Nguyen, 2021). For example, ant-colony-inspired algorithms exploit pheromones for path planning and environment shaping, supporting high scalability and parallel exploration, but at the cost of generality and theoretical optimality guarantees.
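
A pheromone field of this kind can be sketched as a grid that evaporates over time and is reinforced wherever agents stand. The sketch below is a generic illustration, not the method of the cited work; `update_pheromone`, `greedy_move`, and the `deposit` and `evaporation` parameters are hypothetical.

```python
import numpy as np


def update_pheromone(field, agent_positions, deposit=1.0, evaporation=0.1):
    """Evaporate the whole field, then add a deposit at each agent's cell."""
    new_field = (1.0 - evaporation) * field  # returns a new array; input is untouched
    for row, col in agent_positions:
        new_field[row, col] += deposit
    return new_field


def greedy_move(field, position):
    """Move to the 4-neighbor cell (or stay put) holding the most pheromone."""
    rows, cols = field.shape
    r, c = position
    candidates = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    candidates = [(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols]
    return max(candidates, key=lambda cell: field[cell])
```

Agents never address each other: the only shared state is `field`, which every agent reads locally and writes locally. In a learning setting the greedy rule would typically be replaced by a policy that takes the local pheromone values as part of its observation.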

2.4 Communication-Limited and Goal-Aware Methods

Recent work emphasizes selective and context-aware communication: agents communicate only under specific conditions (e.g., shared subgoals, local observations, constraints on visibility), which reduces information overload and enables modular intra-team coordination (Du et al., 15 Nov 2025). Goal-aware selective communication leads to higher effectiveness than indiscriminate information sharing or fully independent learning in tasks such as multi-agent navigation.
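
A communication gate of this kind can be sketched as a filter over potential recipients. The rule below (same goal and within a fixed range) is an assumed example of such a condition, not the criterion used in the cited work; `AgentState`, `select_recipients`, and `comm_range` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentState:
    agent_id: int
    position: tuple  # (x, y) coordinates
    goal_id: int     # identifier of the subgoal the agent is pursuing


def select_recipients(sender, others, comm_range):
    """Return ids of agents that share the sender's goal and are within range."""
    return [
        other.agent_id
        for other in others
        if other.agent_id != sender.agent_id
        and other.goal_id == sender.goal_id
        and math.dist(sender.position, other.position) <= comm_range
    ]
```

For example, with agents at `(0, 0)` and `(1, 0)` pursuing the same goal and `comm_range=2.0`, the first agent would message the second but not a nearby agent with a different goal, nor a same-goal agent five units away.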

2.5 Optimization with Constraints and Privacy

Primal–dual, momentum-based policy gradient techniques have been developed for decentralized multi-agent policy optimization under safety, privacy, and fairness constraints. Methods such as DePAint combine decentralized gradient-tracking over communication graphs, Lagrangian duality formulations, and local policy parameterization to attain convergence to constraint-respecting policies. Empirically, these match or outperform centralized baselines as the agent count grows and offer privacy by preventing the sharing of raw rewards or constraints (Hassan et al., 2023).
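
Of the ingredients listed above, gradient tracking is the easiest to isolate. The sketch below shows only that ingredient, on a toy problem: each agent holds a private quadratic objective $\tfrac12(x - c_i)^2$ standing in for its local policy objective, and no Lagrangian, momentum, or policy gradient is included. It is not the DePAint algorithm, and `gradient_tracking`, `local_targets`, and `mixing` are hypothetical names.

```python
import numpy as np


def gradient_tracking(local_targets, mixing, step_size=0.1, iterations=500):
    """Decentralized gradient tracking on local objectives 0.5 * (x_i - c_i)^2.

    local_targets : c_i, one private value per agent (never exchanged directly)
    mixing        : doubly stochastic matrix matching the communication graph
    Returns each agent's final estimate of the minimizer of the summed objective.
    """
    targets = np.asarray(local_targets, dtype=float)

    def local_gradients(x):
        # Agent i only evaluates the gradient of its own objective.
        return x - targets

    x = np.zeros_like(targets)
    y = local_gradients(x)  # y_i tracks the network-average gradient
    for _ in range(iterations):
        x_next = mixing @ x - step_size * y
        y = mixing @ y + local_gradients(x_next) - local_gradients(x)
        x = x_next
    return x
```

With a three-agent path graph and the mixing matrix `[[2/3, 1/3, 0], [1/3, 1/3, 1/3], [0, 1/3, 2/3]]`, all three estimates approach the average of the private targets, which is the minimizer of the summed objective, even though agents exchange only `x` and `y` with their neighbors.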

2.6 Networked and Mean-Field Approximation

For large populations or structured graphs, localized actor–critic architectures restrict each agent's critic and policy to its $k$-hop neighborhood. Theoretical results show that, because an agent's influence decays exponentially with graph distance due to discounting and local coupling, this truncation allows efficient learning with an approximation error that decays exponentially in $k$ (Gu et al., 2021, Hu et al., 2021). Mean-field network formulations further approximate the dynamics for very large or spatially structured agent settings.
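
The $k$-hop neighborhood that such a truncated critic would read from can be computed with a bounded breadth-first search. This is a minimal sketch; `k_hop_neighborhood` and the adjacency-dictionary representation are assumptions made for illustration.

```python
from collections import deque


def k_hop_neighborhood(adjacency, source, k):
    """Return the set of nodes within k hops of source, including source itself.

    adjacency maps each node to an iterable of its direct neighbors.
    """
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond the k-hop frontier
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance)
```

On a five-node line graph `0-1-2-3-4`, the 1-hop neighborhood of node 2 is `{1, 2, 3}`. A localized critic for agent 2 would then take as input only the states and actions of those three agents, so its input size depends on $k$ and the graph degree rather than on the total number of agents.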

2.7 Imitation and Distribution Matching

Imitation learning frameworks enable decentralized agents to learn joint behavior by mimicking centralized expert demonstrations. This bootstraps coordinated behavior while maintaining decentralized execution, provided the expert and demonstration conditions ensure recoverability of joint policies (Lin et al., 2019). Alternatively, distribution-matching approaches such as DM$^2$ have agents independently match their marginal occupancy distributions to a coordinated expert, achieving joint policy recovery without explicit communication (Wang et al., 2022).
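
What "matching a marginal occupancy distribution" measures can be sketched with empirical counts. The example below is only a diagnostic on discrete data, using total variation distance as an assumed stand-in for whatever divergence or learned discriminator a given method uses; `occupancy` and `total_variation` are hypothetical names.

```python
from collections import Counter


def occupancy(trajectory):
    """Empirical distribution over (observation, action) pairs for one agent.

    trajectory is a non-empty list of hashable (observation, action) tuples.
    """
    counts = Counter(trajectory)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)
```

Each agent can evaluate `total_variation(occupancy(own_trajectory), occupancy(expert_trajectory_for_that_agent))` using only its own data and its own slice of the expert demonstration, which is what makes the matching objective decentralized.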

3. Scalability Principles and Theoretical Guarantees

A principal challenge in decentralized MARL is avoiding exponential sample complexity in the number of agents, known as the curse of multiagents. Several algorithmic and analytical advances address this:

Clique-based communication: Regret and sample complexity bounds, stated per communication group, improve to match centralized rates as the communication graph becomes dense or agents' information-sharing radius increases (Lidard et al., 2021).
Locality in critic/policy: By truncating value/policy networks to $k$-hop neighborhoods, the approximation error decays exponentially in $k$, even in tasks with locally coupled dynamics (Gu et al., 2021, Hu et al., 2021).
Variance reduction: Decentralized momentum-based stochastic gradient methods accelerate convergence and stabilize learning (Mao et al., 2021, Hassan et al., 2023).
Alternating update protocols: Turn-based policy or Q-function updates restore single-agent stationarity, control non-stationarity, and guarantee convergence to Nash or correlated equilibria (Su et al., 2022, Mao et al., 2021).
Offline learning convergence: Modified Bellman operators via value deviation and transition normalization guarantee contraction under certain conditions, supporting provable convergence in decentralized offline MARL (Jiang et al., 2021).

Empirical results broadly confirm these theoretical principles, especially in structured environments and under communication topologies where local information suffices due to decaying influence with distance or time (Gu et al., 2021, Hu et al., 2021, Lidard et al., 2021).

4. Representative Results and Applications

Algorithmic Performance

A sample of quantitative results illustrates key trends:

Paper	Scenario	Agents	Success rate / Reward improvement
(Nguyen, 2021)	Multi-agent box-pushing (stigmergic RL)	4–6	[…]–[…] in hard tasks
(Hassan et al., 2023)	CoopNav, predator–prey, with constraints	3–5	Reward matches/exceeds centralized
(Du et al., 15 Nov 2025)	Goal-aware navigation	3–10	[…] time-to-goal reduction over baseline
(Lidard et al., 2021)	Decentralized Q-learning with multi-hop communication	up to 4+	Regret improves as the communication radius increases, converging to centralized rates

Key observations:

Stigmergic protocols scale to more agents with improved convergence speed, limited only by composability of RL subpolicies (Nguyen, 2021).
Goal-aware, selective communication improves early learning speed and asymptotic performance relative to independent or over-sharing baselines (Du et al., 15 Nov 2025).
Local truncation in networked architectures is effective; performance saturates at a moderate neighborhood radius $k$, and empirical results show strong reward improvement in real-world tasks (UAV delivery, pandemic mitigation) (Hu et al., 2021).
Decentralized training with privacy/safety constraints matches or improves on centralized constrained methods as the number of agents increases (Hassan et al., 2023).
Minimally invasive protocols (MA2QL) outperform IQL by restoring stationarity without added communication (Su et al., 2022).

Application Domains

Decentralized MARL methods are applied in:

Multi-robot path planning, box-pushing, and formation control (Nguyen, 2021, Gu et al., 2021, Zhang et al., 2019)
Cooperative navigation and smart-grid demand response (Nguyen, 2021, Mishra et al., 2020)
Autonomous vehicles and multi-drone systems (Du et al., 15 Nov 2025)
Pandemic mitigation (region-based lockdowns/control) (Hu et al., 2021)
StarCraft Multi-Agent Challenge for large-scale team coordination (Wang et al., 2022, Su et al., 2022)

5. Limitations, Compositionality, and Open Challenges

Despite progress, decentralized MARL faces fundamental and practical obstacles:

Composability and generalization: RL policies trained in isolation or with specific stigmergic rules do not always compose reliably in new, larger, or more complex multi-agent settings (Nguyen, 2021).
Partial observability and non-stationarity: Strong theoretical guarantees often require full state observability; extension to Dec-POMDPs is non-trivial (Mao et al., 2021, Yoshida et al., 28 May 2025).
Function approximation: Most finite-sample theory is limited to tabular or linear approximations; rigorous results with nonlinear networks are scarce (Gu et al., 2021, Hu et al., 2021).
Communication limits and robustness: Balancing the cost and benefit of information sharing, robustness to communication delays or failures, and prevention of deceptive or adversarial messaging remain open (Lidard et al., 2021, Yoshida et al., 28 May 2025).
Safety and privacy: Real systems require enforcement of per-agent constraints and privacy-preserving learning; only a subset of methods address this formally (Hassan et al., 2023).
Scalability to large agent numbers: While locality and mean-field methods help, scalability is still challenged by state/action complexity and the stability of decentralized updates (Gu et al., 2021, Jiang et al., 2021).

Future research directions focus on integrating deep function approximation with decentralized guarantees, extending to heterogeneous and competitive environments, scalable communication architectures, robust coordination under adversarial or uncertain conditions, and theoretical understanding of emergent behaviors in large decentralized systems.

6. Notable Methodological Innovations

Several methodological trends differentiate recent advances:

Stigmergic and environmental signaling: Using environment modifications as an implicit low-bandwidth communication medium (Nguyen, 2021).
Goal/context-based communication gating: Restricting information exchange to agents with shared objectives or proximity, leveraging structured collaboration (Du et al., 15 Nov 2025).
Decentralized variance-reduction for policy gradients: Applying momentum-based techniques and decentralized gradient tracking, both for privacy and scalability (Hassan et al., 2023, Mao et al., 2021).
Distribution-matching for coordination: Achieving decentralized policy alignment by matching marginal state-action distributions to a (jointly optimal) demonstration, fully decoupling policy learning (Wang et al., 2022).
Reward-independent communication learning: Leveraging auxiliary message channels based on predictive coding, independent of agents’ reward alignment (Yoshida et al., 28 May 2025).
Turn-taking for stability: Structuring update schedules (e.g., MA2QL) to re-establish single-agent stationarity and streamline convergence (Su et al., 2022).
Networked truncation for large-scale systems: Exploiting locality and exponential decay of influence to allow scalable learning in state- or agent-networked settings (Gu et al., 2021, Hu et al., 2021).

7. Conclusion

Decentralized multi-agent reinforcement learning encompasses a diverse set of methodologies unified by the absence of centralized control and the reliance on local information, limited communication, or implicit coordination signals. Advances in algorithmic scalability, robustness to non-stationarity, privacy, constraint satisfaction, and sample efficiency have broadened the class of feasible large-scale cooperative and mixed-agent tasks. Nonetheless, key challenges persist, particularly in deep function approximation, non-stationary and partially observed environments, and theoretical understanding of learning in networked multi-agent systems. Continued integration of locality, communication-efficient protocols, stability constraints, and imitation or distribution-matching paradigms is poised to further advance the practical and theoretical boundaries of decentralized MARL (Nguyen, 2021, Hassan et al., 2023, Du et al., 15 Nov 2025, Lidard et al., 2021, Li et al., 2021, Grosnit et al., 2021, Su et al., 2022, Jiang et al., 2021, Yoshida et al., 28 May 2025, Gu et al., 2021, Mao et al., 2021, Zhang et al., 2019, Wang et al., 2022, Hu et al., 2021, Lin et al., 2019, Altabaa et al., 2023, Zhang et al., 2020, Mishra et al., 2020).