
Actor-Attention-Critic (MAAC)

Updated 24 February 2026
  • Actor-Attention-Critic (MAAC) is a multi-agent reinforcement learning framework that integrates centralized, attention-based critics with decentralized actor policies.
  • It uses a soft actor-critic formulation to optimize policies in both cooperative and competitive settings while addressing non-stationarity and scalability challenges.
  • Empirical results show that MAAC outperforms traditional methods, effectively handling complex multi-agent scenarios with fixed per-agent computations and interpretable attention mechanisms.

The Actor-Attention-Critic (MAAC) architecture is a multi-agent reinforcement learning (MARL) algorithm that integrates centralized critics with learned attention mechanisms to address the challenges of scalability, non-stationarity, and partial observability endemic to multi-agent domains. MAAC supports both cooperative and competitive settings, requires only decentralized execution, and achieves state-of-the-art performance on complex multi-agent benchmarks. The MAAC framework forms the basis for scalable MARL in both standard and constrained environments, and provides practical advantages over previous centralized-critic approaches by employing attention to selectively aggregate information from other agents (Iqbal et al., 2018, Jeon et al., 2020, Parnika et al., 2021).

1. Multi-Agent Markov Games and Centralized Training with Decentralized Execution

MAAC builds on the formalism of an $N$-agent partially observable Markov game $(S, \{O_i\}, \{A_i\}, T, \{R_i\}, \gamma)$, where each agent $i$ receives a local observation $o_i \in O_i$ and takes an action $a_i \in A_i$. The joint transition kernel $T$ and agent-specific rewards $R_i$ are available only during training, enabling centralized critic learning, while each actor policy $\pi_{\theta_i}(a_i \mid o_i)$ is conditioned solely on its own observation for decentralized execution.

This centralized training with decentralized execution (CTDE) paradigm addresses the non-stationarity inherent to MARL by providing each critic with access to the joint observation and action space during learning, while ensuring that the learned policies are deployable without global state (Iqbal et al., 2018).

2. Soft Actor-Critic Formulation and Attention-Based Critic Architecture

MAAC employs a maximum-entropy (soft) actor-critic algorithm. Each agent $i$ maintains:

  • A decentralized stochastic actor $\pi_{\theta_i}(a_i \mid o_i)$;
  • A centralized, attention-based critic $Q_i^{\psi}(o_1, \ldots, o_N, a_1, \ldots, a_N)$.

The critics employ an intra-critic attention mechanism, wherein each agent's local embedding $e_i = g_i(o_i, a_i)$ is projected into query, key, and value spaces by shared learnable matrices $W_q$, $W_k$, $V$. Unnormalized attention weights $\tilde\alpha_{ij}$ and normalized weights $\alpha_{ij}$ are computed from key–query dot products, and the context vector $x_i$ is formed as an attention-weighted sum over the other agents' value vectors. The critic output is then $Q_i^{\psi}(o, a) = f_i(e_i, x_i)$, with $f_i$ a learned multi-layer perceptron. Extensions to $H$-head attention allow richer relational reasoning (Iqbal et al., 2018, Jeon et al., 2020).
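The per-agent attention step above can be sketched in NumPy (single head; function and variable names are illustrative, not the reference implementation):

```python
import numpy as np

def attention_context(e, Wq, Wk, Wv, i):
    """Attention-weighted context x_i for agent i over the OTHER agents'
    embeddings e (shape N x d). Single-head sketch of the MAAC critic's
    intra-critic attention; matrix names Wq, Wk, Wv are shared across agents."""
    q = e[i] @ Wq                                # query from agent i's embedding
    others = [j for j in range(len(e)) if j != i]
    keys = e[others] @ Wk                        # keys from the other agents
    vals = e[others] @ Wv                        # values from the other agents
    scores = keys @ q / np.sqrt(Wk.shape[1])     # scaled key-query dot products
    alpha = np.exp(scores - scores.max())        # numerically stable softmax
    alpha /= alpha.sum()                         # normalized weights alpha_ij
    return alpha @ vals                          # context vector x_i
```

The critic would then feed $(e_i, x_i)$ through the output MLP $f_i$; multi-head attention repeats this with $H$ independent projection triples and concatenates the resulting contexts.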

3. Policy Optimization and Off-Policy Training Dynamics

Policy optimization in MAAC uses the soft actor-critic gradient, with an advantage function derived from the centralized critic and a multi-agent baseline $b_i(o, a_{\setminus i})$:

$$\nabla_{\theta_i} J_i(\theta_i) = \mathbb{E}_{o,a}\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\left(-\alpha \log \pi_{\theta_i}(a_i \mid o_i) + Q_i^{\psi}(o,a) - b_i(o, a_{\setminus i})\right)\right]$$

with $b_i$ marginalizing over agent $i$'s actions for variance reduction. The critic is trained off-policy by minimizing the squared soft Bellman error across agents.
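For discrete action spaces the baseline is exact: $b_i(o, a_{\setminus i}) = \mathbb{E}_{a_i \sim \pi_i}[\,Q_i^{\psi}(o, (a_i, a_{\setminus i}))\,]$. A minimal sketch (helper names are illustrative):

```python
import numpy as np

def counterfactual_baseline(q_values, pi_i):
    """Exact multi-agent baseline for discrete actions: the expectation of
    agent i's centralized Q over its own policy, holding the other agents'
    actions fixed. q_values[k] = Q_i(o, (a_i = k, a_{-i})); pi_i is agent
    i's policy distribution over its actions."""
    return float(np.dot(pi_i, q_values))

def advantage(q_taken, q_values, pi_i):
    """Variance-reduced advantage A_i = Q_i(o, a) - b_i(o, a_{-i}) used in
    the MAAC policy gradient."""
    return q_taken - counterfactual_baseline(q_values, pi_i)
```

Continuous action spaces lack this closed form, so the expectation must instead be approximated by Monte Carlo sampling or an auxiliary value network.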

The training loop alternates between interaction with the environment (joint sampling of $(o_1, \ldots, o_N, a_1, \ldots, a_N, r_1, \ldots, r_N, o'_1, \ldots, o'_N)$), critic update steps (minimizing squared temporal-difference error), actor update steps (via policy gradient with advantage), and target-network Polyak averaging. Experience is stored in a shared off-policy replay buffer (Iqbal et al., 2018, Jeon et al., 2020).
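The buffer and target-update machinery can be sketched in plain Python (a minimal illustration with parameters as scalar dicts; class and function names are assumptions, not the reference implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared off-policy buffer of joint transitions
    (o_1..N, a_1..N, r_1..N, o'_1..N); oldest entries are evicted first."""
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def push(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

def polyak_update(target_params, online_params, tau):
    """Soft target update applied after each gradient step:
    theta_target <- (1 - tau) * theta_target + tau * theta."""
    return {k: (1.0 - tau) * target_params[k] + tau * online_params[k]
            for k in target_params}
```

In practice the parameters are network tensors rather than scalars, but the update rule is applied elementwise in exactly this form.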

4. Scalability, Network Design, and Key Hyperparameters

Conventional centralized-critic methods, such as MADDPG and COMA, concatenate all agents' observations and actions, so critic input and network sizes scale linearly with the number of agents $N$. In contrast, MAAC's attention-based critics maintain fixed per-agent input and parameter dimensions; only computation grows with $N$, at $O(HN)$ critic forward passes and $O(N^2 d_{\text{enc}})$ pairwise attention cost in practice, enabling parameter sharing across critics for efficient multi-task learning.
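The scaling difference can be made concrete by comparing critic input dimensions (function names are illustrative):

```python
def concat_critic_input_dim(n_agents, d_obs, d_act):
    """MADDPG/COMA-style critic: all observations and actions are
    concatenated, so the input (and first-layer weights) grow with N."""
    return n_agents * (d_obs + d_act)

def attention_critic_input_dim(d_obs, d_act):
    """MAAC-style critic: the per-agent encoder sees a fixed-size input
    regardless of N; other agents enter only through the fixed-width
    attention context x_i."""
    return d_obs + d_act
```

With 12 agents, 10-dimensional observations, and 4-dimensional actions, the concatenated critic's input is 168-dimensional while the attention critic's per-agent input stays at 14, independent of population size.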

Key architectural elements (typical settings from Jeon et al., 2020):

  • Shared encoder MLP: hidden layers 128 units;
  • Local/global critic branches: 64 units;
  • Key/query/value attention projections: 64 units each;
  • Output MLP: 128 units;
  • Replay buffer size: $1.25 \times 10^6$;
  • Critic/actor learning rate: $1 \times 10^{-3}$;
  • Polyak averaging ($\tau$): $0.0005$;
  • Minibatch size: 1000;
  • Entropy bonus: $0.01$.

Replay buffer size and batch size are critical for stability, as is slow target-network updating for large agent populations. Single-head attention suffices for $N \leq 16$; multiple heads enhance capacity for larger populations (Jeon et al., 2020).
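Gathered into a configuration sketch (values taken from the list above; the key names are illustrative, not from any released codebase):

```python
# Typical MAAC hyperparameters reported by Jeon et al. (2020).
MAAC_CONFIG = {
    "encoder_hidden": 128,     # shared encoder MLP width
    "critic_branch": 64,       # local/global critic branch width
    "attn_dim": 64,            # key/query/value projection width
    "output_hidden": 128,      # output MLP width
    "buffer_size": 1_250_000,  # replay buffer capacity
    "lr": 1e-3,                # critic/actor learning rate
    "tau": 5e-4,               # Polyak averaging coefficient
    "batch_size": 1000,        # minibatch size
    "entropy_coef": 0.01,      # entropy bonus weight
}
```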

5. Constrained Extensions and Multiple Attention Modes

MAAC has been extended to constrained cooperative settings through the introduction of multiple critics, each equipped with its own attention mechanism. For constraints expressed as expectations of single-stage costs, a Lagrangian framework is adopted: the primary critic optimizes the cumulative reward plus weighted constraint costs, while separate "penalty critics" estimate each constraint.

Each critic utilizes separate sets of key/query/value projections, supporting distinct attention "modes" for optimizing the cooperative objective and satisfying constraints. Dual ascent in the Lagrangian multipliers is applied on a slower timescale, and attention heatmap analysis reveals interpretable specialization in how critics aggregate information most relevant to their respective objectives or constraints (Parnika et al., 2021).
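The slower-timescale multiplier update can be sketched as projected dual ascent (a sketch under the Lagrangian formulation described above; names and the running-average cost estimate are assumptions):

```python
def dual_ascent_step(lmbda, avg_cost, threshold, lr):
    """Projected dual ascent on a single Lagrange multiplier: increase
    lambda when the estimated expected cost exceeds its bound, decrease
    it (clipped at zero) otherwise. The step size lr is taken much
    smaller than the critic/actor learning rates (slower timescale)."""
    return max(0.0, lmbda + lr * (avg_cost - threshold))
```

One such step runs per constraint after a batch of critic/actor updates; as training proceeds, each multiplier settles near a value that just enforces its constraint.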

6. Empirical Evaluation and Benchmarks

MAAC has been empirically validated on several multi-agent tasks:

  • Cooperative Navigation (3 agents);
  • Cooperative Treasure Collection (up to 12 agents, mixed rewards);
  • Rover-Tower Communication (8 agents).

MAAC consistently outperforms or matches baseline CTDE approaches (MADDPG, COMA), with the performance margin growing with the number of agents. In the Cooperative Treasure Collection, the relative advantage over MADDPG+SAC scales from 17% ($N=4$) to 208% ($N=12$). Attention analysis shows that the network learns to prioritize relevant agent pairs (e.g., each rover attends to its tower), correlating with specialized interaction patterns (Iqbal et al., 2018, Jeon et al., 2020).

In constrained MARL settings, MAAC-based algorithms (e.g., MACAAC) achieve constraint satisfaction (collision avoidance, safe exploration) while optimizing the main reward, outperforming constrained MADDPG both in constraint violation rates and main objective performance. Analysis of learned attention for different critics reveals interpretable source selection: e.g., penalty critics focus on fellow agents for constraint estimation, while the Lagrangian critic focuses on task-relevant subgroups (Parnika et al., 2021).

7. Limitations and Directions for Extension

MAAC requires access to all agents' local observations and actions during training, but not necessarily the global environment state. Discrete action counterfactual baselines are exact; continuous versions require Monte Carlo estimation or auxiliary value networks. The scalability benefit depends on parameter sharing across critics: if agent reward structures diverge, partial decoupling may be needed. The architecture is highly extensible, admitting hierarchical attention, richer inter-agent communication, integration with centralized-policy attention, adversarial training, and applications in multi-agent inverse RL (Iqbal et al., 2018, Jeon et al., 2020, Parnika et al., 2021).

Actor-Attention-Critic sets the state of the art for scalable, interpretable, and extensible multi-agent actor-critic learning, providing methodological advances that generalize to large-scale and constrained MARL domains.
