
Multi-Agent A2C: Decentralized MARL Framework

Updated 21 November 2025
  • Multi-Agent A2C (MA2C) is a decentralized MARL framework that extends A2C to cooperative, partially observable environments using local policy evaluation and inter-agent communication.
  • It employs spatial reward mixing and advanced neural architectures, including dual-branch FC layers, LSTM, and graph attention networks, to enhance stability and scalability.
  • Empirical results in domains such as traffic control, vehicular coordination, and resource management demonstrate significant improvements over independent actor-critic baselines.

Multi-Agent Advantage Actor-Critic (MA2C) is a decentralized multi-agent reinforcement learning (MARL) framework extending the Advantage Actor-Critic (A2C) approach to cooperative, partially observable environments. MA2C addresses critical challenges in MARL such as non-stationarity, credit assignment, observability, and scalability through local policy evaluation, explicit inter-agent communication, spatial reward mixing, and advanced neural architectures. The methodology has demonstrated state-of-the-art performance across diverse domains including large-scale adaptive traffic signal control, multi-vehicle coordination, resource management, beamforming, and ride-sourcing, substantially improving robustness, efficiency, and learning stability compared to independent actor-critic baselines.

1. Mathematical Foundations and Algorithmic Structure

Let the environment be formalized as a set of $N$ agents indexed by $i \in \mathcal{V}$. Each agent $i$ operates in a partially observable domain, maintains its own policy (actor) parameterized by $\theta_i$ and a critic parameterized by $\psi_i$, and interacts with its neighboring agents $\mathcal{N}_i$. The agent's local observation at step $t$ is $s_{t,i}$, and at each step agents exchange "policy fingerprint" vectors, typically the most recent action distribution $\pi_{t-1,i}(\cdot)$. The generic MA2C update procedure comprises:

  • Observation Augmentation: Each agent receives the softmax outputs ("fingerprints") of its neighbors' policies from the previous time step, forming an augmented input of the form $(s_{t,i}, \pi_{t-1,\mathcal{N}_i})$.
  • Spatial Reward Mixing: To address the non-stationarity from evolving neighbor policies and improve credit assignment, an agent's own reward $r_{t,i}$ is blended with those of its neighbors via a spatial discounting coefficient $\alpha$:

\tilde{r}_{t,i} = \frac{1}{|\mathcal{V}_i|}\left(r_{t,i} + \alpha \sum_{j\in\mathcal{N}_i} r_{t,j} \right)

where $\mathcal{V}_i = \{ i \} \cup \mathcal{N}_i$.

  • Return and Advantage Estimation: The $n$-step return is

\tilde{R}_{t,i} = \sum_{\tau=t}^{t_B-1} \gamma^{\tau-t} \tilde{r}_{\tau,i} + \gamma^{t_B-t} V_{\psi_i^-}(\tilde{h}^V_{t_B,\mathcal{V}_i})

with $V_{\psi_i^-}$ a target critic. The advantage is computed as

\tilde{A}_{t,i} = \tilde{R}_{t,i} - V_{\psi_i^-}(\tilde{h}^V_{t,\mathcal{V}_i})

  • Loss Functions: The policy (actor) loss, with entropy regularization weight $\beta$ and entropy term $H$, is

L_\pi(\theta_i) = -\sum_{t=0}^{t_B-1} \left[ \log \pi_{\theta_i}(u_{t,i}|\tilde{h}^\pi_{t,\mathcal{V}_i}) \cdot \tilde{A}_{t,i} + \beta H(\pi_{\theta_i}(\cdot|\tilde{h}^\pi_{t,\mathcal{V}_i})) \right]

The critic (value) loss is

L_v(\psi_i) = \frac{1}{2}\sum_{t=0}^{t_B-1} \left(\tilde{R}_{t,i} - V_{\psi_i}(\tilde{h}^V_{t,\mathcal{V}_i}) \right)^2

  • Parameter Update: RMSProp or Adam with gradient clipping is standard; updates are local to each agent but benefit from fingerprint-based coordination (Fazzini et al., 2021, Chu et al., 2019). A minimal code sketch of one such update appears after this list.
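The following sketch, assuming a PyTorch implementation, instantiates the per-agent computations above: spatial reward mixing, $n$-step returns bootstrapped from a (target) critic, and the actor and critic losses. Function and variable names are illustrative and do not come from the cited codebases.

```python
# Minimal sketch (not the authors' code) of one MA2C update step for agent i.
import torch

def spatially_mixed_reward(r_i, neighbor_rewards, alpha):
    """Blend the agent's reward with spatially discounted neighbor rewards."""
    return (r_i + alpha * sum(neighbor_rewards)) / (1 + len(neighbor_rewards))

def n_step_returns(mixed_rewards, bootstrap_value, gamma):
    """Backward recursion R_t = r~_t + gamma * R_{t+1}, seeded by the target critic."""
    returns, R = [], bootstrap_value
    for r in reversed(mixed_rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

def ma2c_losses(log_probs, values, entropies, returns, beta=0.01):
    """Actor loss (policy gradient + entropy bonus) and 1/2 squared-error critic loss."""
    returns = torch.stack(returns).detach()
    values = torch.stack(values)
    advantages = (returns - values).detach()          # A~_t = R~_t - V(h_t)
    policy_loss = -(torch.stack(log_probs) * advantages
                    + beta * torch.stack(entropies)).sum()
    value_loss = 0.5 * (returns - values).pow(2).sum()
    return policy_loss, value_loss
```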

2. Network Architectures and Communication Protocols

A broad variety of neural architectures are employed in MA2C. The canonical traffic signal implementation (Chu et al., 2019, Fazzini et al., 2021) utilizes:

  • Dual-Branch FC Layers: Parallel fully connected layers encode the local state and neighbor fingerprints; their outputs are concatenated and processed by LSTM units to handle temporal dependencies (a code sketch follows this list).
  • Separated Actor-Critic Heads: The policy head outputs a softmax over discrete actions (e.g., traffic light phases), while the critic head estimates local value functions.
  • Recurrent, Parameter-Sharing, and Graph Neural Extensions: LSTM modules are standard for temporal dependencies. Parameter sharing across agents is adopted in homogeneous or symmetric multi-agent domains, e.g., vehicle coordination (Zhou et al., 2021), air traffic control (Brittain et al., 2019), and ride-sourcing (Ke et al., 2019). Graph attention networks (GATs) have been integrated in network resource allocation to encode complex spatial inter-agent dependencies (Shao et al., 2021).
  • Inter-Agent Communication: Agents at each timestep share compact policy fingerprints (softmax action distributions), which serve to mitigate environment non-stationarity and provide contextual intent modeling, significantly stabilizing multi-agent training (Fazzini et al., 2021, Chu et al., 2019).
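A minimal sketch of this canonical dual-branch architecture, assuming PyTorch; layer sizes and names are placeholders rather than the exact configuration of the cited implementations.

```python
# Dual-branch FC encoders for local state and neighbor fingerprints, an LSTM over
# their concatenation, and separate actor (softmax) and critic (value) heads.
import torch
import torch.nn as nn

class MA2CAgentNet(nn.Module):
    def __init__(self, state_dim, fingerprint_dim, n_actions, hidden=64):
        super().__init__()
        self.state_fc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.fp_fc = nn.Sequential(nn.Linear(fingerprint_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.actor_head = nn.Linear(hidden, n_actions)   # softmax over discrete actions
        self.critic_head = nn.Linear(hidden, 1)          # local state-value estimate

    def forward(self, state, fingerprints, hidden_state=None):
        # state: (batch, T, state_dim); fingerprints: (batch, T, fingerprint_dim)
        x = torch.cat([self.state_fc(state), self.fp_fc(fingerprints)], dim=-1)
        out, hidden_state = self.lstm(x, hidden_state)
        policy = torch.softmax(self.actor_head(out), dim=-1)
        value = self.critic_head(out).squeeze(-1)
        return policy, value, hidden_state
```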

3. Theoretical Rationale: Partial Observability, Non-Stationarity, and Credit Assignment

MA2C’s mixed observability model formally addresses the inherent non-stationarity that arises when multiple adaptive agents’ policies evolve concurrently. By conditioning local policies and critics on neighbors’ recent actions (through fingerprints), the effective observation space approximates a stationary local sub-MDP, which allows standard actor-critic convergence guarantees to hold more robustly. Spatial mixing of rewards and localized critics further improves both sample efficiency and credit assignment, focusing the learning on directly relevant sub-domains while still promoting coordinated behavior (Fazzini et al., 2021, Chu et al., 2019).

4. Domain-Specific Adaptations and Hyperparameters

Traffic Systems

MA2C for adaptive traffic signal control typifies the algorithm's success in high-dimensional, distributed settings. Each intersection acts as an agent: agents optimize local signal phases, exchange fingerprints with direct neighbors, and use spatially discounted rewards (Chu et al., 2019, Fazzini et al., 2021). Hyperparameters such as the spatial discount $\alpha \in [0.75, 0.9]$, temporal discount $\gamma = 0.99$, entropy regularization weight $\beta = 0.01$, LSTM hidden units (64), and batch sizes (40–120 timesteps) contribute to robust learning.
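For concreteness, a minimal configuration sketch collecting the values quoted above; the field names and any value not stated in the text (e.g., the gradient clipping norm) are illustrative assumptions, not the keys used in the cited codebases.

```python
from dataclasses import dataclass

@dataclass
class MA2CTrafficConfig:
    spatial_discount_alpha: float = 0.9     # typically in [0.75, 0.9]
    temporal_discount_gamma: float = 0.99
    entropy_weight_beta: float = 0.01
    lstm_hidden_units: int = 64
    batch_timesteps: int = 120              # reported range: 40-120 timesteps
    optimizer: str = "rmsprop"              # RMSProp or Adam with gradient clipping
    max_grad_norm: float = 40.0             # illustrative value (assumption)

config = MA2CTrafficConfig()
```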

Multi-Agent Vehicular Control

For cooperative lane-changing of automated vehicles (AVs) among human-driven vehicles (HDVs), all AV agents share an actor-critic network that processes matrix-valued relative positions and velocities. Local rewards are multi-objective, balancing safety, efficiency (log-headway, speed), and comfort (acceleration/lane-change penalties), with reward weights (e.g., $\omega_s = 200$, $\omega_d = 4$) tuned for behavioral priorities (Zhou et al., 2021).
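A minimal sketch of a multi-objective local reward of this kind; the functional form, thresholds, and all weights other than the two quoted above are illustrative assumptions.

```python
import math

def lane_change_reward(headway, speed, target_speed, accel, lane_changed,
                       w_safety=200.0, w_headway=4.0, w_speed=1.0,
                       w_accel=0.1, w_lane=0.5, min_headway=1.0):
    """Weighted sum of safety, efficiency (log-headway, speed), and comfort terms."""
    safety = -w_safety if headway < min_headway else 0.0          # collision-risk penalty
    efficiency = w_headway * math.log(max(headway, 1e-3)) \
                 - w_speed * abs(speed - target_speed)            # headway and speed terms
    comfort = -w_accel * accel ** 2 - (w_lane if lane_changed else 0.0)
    return safety + efficiency + comfort
```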

Communications and Resource Management

GAT-based MA2C encodes local and neighborhood demand features for resource slicing in dense cellular networks, yielding significant improvements in utility and slice satisfaction (Shao et al., 2021). For extremely large RIS-aided beamforming, a sequential MA2C approach assigns sub-arrays or phase blocks to specialized agents, with a centralized critic and sequential activation to ensure tractable credit assignment (Chai et al., 12 Jun 2025).
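As a rough illustration of the graph-attention idea, a single-head neighbor-attention sketch in PyTorch; this is a generic formulation, not the architecture of Shao et al. (2021) or Chai et al. (2025).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborAttention(nn.Module):
    """Each agent attends over its neighbors' embeddings to build a context vector."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, own_feat, neighbor_feats):
        # own_feat: (feat_dim,); neighbor_feats: (num_neighbors, feat_dim)
        h_i = self.proj(own_feat)
        h_j = self.proj(neighbor_feats)
        scores = self.attn(torch.cat([h_i.expand_as(h_j), h_j], dim=-1))
        weights = torch.softmax(F.leaky_relu(scores), dim=0)      # attention over neighbors
        return (weights * h_j).sum(dim=0)                         # aggregated neighborhood context
```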

Ride-Sourcing and Routing

In ride-sourcing matching, each passenger request is an agent, with states including both global and local gridwise features, and centralized training/sharing of network parameters. MA2C produces significant reductions in pickup time and variance, outperforming DQN and tabular approaches (Ke et al., 2019). For network routing in access-backhaul scenarios, MA2C can be enhanced with “relational” encodings to bias packet policies based on destination group identities, attaining near-centralized optimal performance even under full decentralization (Yamin et al., 2023).

5. Comparative Empirical Results

Across domains, extensive experiments demonstrate that MA2C consistently outperforms both independent actor-critic (IA2C/IAC) and Q-learning baselines:

Domain | MA2C Return | Best Baseline | Notes
--- | --- | --- | ---
Traffic signal control (synthetic grid) | Highest | IA2C | ~50–70% peak queue reduction, lower delays (Chu et al., 2019)
Lane-changing in AV-HDV mixtures | 58.00 ± 9.31 | 47.45 ± 27.95 (MADQN) | More stable, robust safety/comfort (Zhou et al., 2021)
Air-traffic control (conflict-free %) | 99.97–100% | N/A (no baseline match) | Centralized training, decentralized execution (Brittain et al., 2019)
Beamforming (sum-SE improvement) | Outperforms ZF | ZF, alternating BCD | Robust to channel estimation error (Chai et al., 12 Jun 2025)
Resource slicing (utility) | +5–7% | Vanilla A2C | GAT-MA2C converges faster (Shao et al., 2021)

Ablation studies confirm that communication (fingerprints), spatial reward mixing, and parameter sharing each independently contribute to improved convergence, credit assignment, and stability (Fazzini et al., 2021, Chu et al., 2019, Zhou et al., 2021).

6. Limitations, Extensions, and Open Problems

MA2C can exhibit sample inefficiency in extremely large networks due to purely on-policy updates, motivating proposals for off-policy or hybrid variants (e.g., using DDPG or SAC in continuous domains) (Zhou et al., 2021). Safety is enforced only through reward shaping rather than formal guarantees; incorporating hard constraints (e.g., control barrier functions or shielded RL) remains a research challenge. Scalability bottlenecks may arise with dense communication; related innovations involve partial communication or graph-based message passing (Shao et al., 2021).

Recent work on value-decomposition actor-critics (VDACs) demonstrates that monotonic mixing of local state-value critics into a global $V_\text{tot}$ can be naturally integrated into A2C-style training, improving coordination in high-dimensional domains. VDAC-mix outperforms QMIX under A2C rollouts on challenging multi-agent benchmarks (Su et al., 2020).
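A minimal sketch of monotonic value mixing in this spirit, assuming a QMIX-style hypernetwork mixer whose non-negative weights ensure $\partial V_\text{tot} / \partial V_i \geq 0$; this is a generic illustration, not the exact network of Su et al. (2020).

```python
import torch
import torch.nn as nn

class MonotonicValueMixer(nn.Module):
    """Mixes per-agent values into V_tot with state-conditioned, non-negative weights."""
    def __init__(self, n_agents, state_dim, hidden=32):
        super().__init__()
        self.w1 = nn.Linear(state_dim, n_agents * hidden)    # hypernetwork: first-layer weights
        self.b1 = nn.Linear(state_dim, hidden)
        self.w2 = nn.Linear(state_dim, hidden)                # hypernetwork: second-layer weights
        self.b2 = nn.Linear(state_dim, 1)
        self.n_agents, self.hidden = n_agents, hidden

    def forward(self, local_values, global_state):
        # local_values: (batch, n_agents); global_state: (batch, state_dim)
        w1 = torch.abs(self.w1(global_state)).view(-1, self.n_agents, self.hidden)
        h = torch.relu(torch.bmm(local_values.unsqueeze(1), w1).squeeze(1)
                       + self.b1(global_state))
        w2 = torch.abs(self.w2(global_state))                 # abs() enforces monotonicity
        v_tot = (h * w2).sum(dim=-1, keepdim=True) + self.b2(global_state)
        return v_tot.squeeze(-1)                              # (batch,) global value
```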

Variants such as Latent Interactive A2C (LIA2C) incorporate encoder-decoder architectures to model latent state and agent populations, especially under population uncertainty or open-agent systems, yielding lower variance and faster convergence (He et al., 2023).

7. Summary Table: Architectural and Algorithmic Features

Feature/Property | Canonical MA2C | GAT-MA2C | Sequential MA2C (RIS)
--- | --- | --- | ---
Critic type | Local, recurrent | Local + graph, attention-based | Centralized, global
Inter-agent comm. | Softmax fingerprints | GAT over local embeddings | Sequential block update
Reward aggregation | Spatially discounted | Local cell utility | Episode sum-SE
Policy sharing | Optional | Shared or local | Distinct per sub-array/BS
Target update | Delayed/target critic | Target value | Huber loss/no baseline
Key domains | Traffic, multi-robot | Resource management | Beamforming
