Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Published 15 Apr 2026 in cs.LG, cs.AI, and cs.MA | (2604.13472v1)

Abstract: Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at: https://github.com/RS2002/CMAT.

Authors (3)

Summary

The paper introduces CMAT, a Transformer-based approach that generates an order-independent latent consensus for multi-agent coordination.
It employs a hierarchical SARL formulation with joint policy optimization via PPO, effectively addressing action-order biases and credit assignment issues.
Empirical evaluations on benchmarks like StarCraft II and Google Research Football demonstrate superior performance and reduced training variance compared to conventional MARL methods.

Consensus Multi-Agent Transformer (CMAT): Bridging Cooperative MARL to Hierarchical SARL via Latent Consensus

Motivation and Problem Statement

Cooperative multi-agent reinforcement learning (MARL) is widely applied in scenarios where centralized control is feasible and global coordination is required (e.g., ride-hailing dispatch, traffic signal control, robotic swarms). The joint observation and action spaces grow exponentially with the number of agents, which imposes severe scalability constraints known as the Curse of Dimensionality (CoD). Decentralization alleviates this at the cost of non-stationarity, poor credit assignment, and weakened theoretical guarantees. Centralized Training with Decentralized Execution (CTDE) methods use centralized critics to stabilize training, but because agents still act on local information at execution time, the coordination they can achieve during deployment is limited. Centralized solutions such as the Multi-Agent Transformer (MAT) capture inter-agent dependencies by generating actions sequentially, conditioned on the joint observation, but they have two fundamental limitations: the result depends on the order in which agents act, and optimization converges only to Nash equilibria, which are typically suboptimal in cooperative settings.

CMAT introduces a hierarchical consensus-generation mechanism that recasts cooperative MARL as a SARL problem. The architecture models all agents as a unified entity and uses a Transformer encoder to process the global observation. Its principal innovation is a latent consensus vector that the Transformer decoder generates autoregressively and that does not depend on any ordering of the agents. All agent action policies are conditioned on this consensus simultaneously, which avoids the sensitivity to action-generation order inherent to MAT and related methods.

Figure 1: Comparison between CMAT and conventional decentralized MARL methods, depicting the transition from order-dependent sequential decision-making to order-independent consensus-based coordination.

Architecture and Methodological Innovations

The CMAT network architecture consists of an order-independent Transformer encoder, consensus-generating decoder, and action modules:

The encoder omits positional embeddings and utilizes bi-directional self-attention to extract agent-relational features, producing a joint embedding sequence.
A Critic-Compressor compresses this sequence into an initial consensus vector serving dual roles: value estimation and iterative refinement via the decoder.
The decoder autoregressively refines this latent consensus over $m$ iterations (where $m$ is typically set to the number of agents), retaining positional embeddings to preserve convergence dynamics.
An Actor-Compressor aggregates all consensus vectors into a final consensus representation, preventing information loss during iteration.
The Actor-MLP modules independently condition agent actions on their local observation embeddings and the global consensus vector.
Figure 2: CMAT network architecture showcasing encoder-driven feature extraction, V-value estimation, iterative consensus generation in the decoder, and simultaneous agent action conditioning on the final consensus.
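
The data flow through the five components above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: single-head attention, mean pooling for both compressors, a linear actor head, a sinusoidal embedding over the iteration index, the random weights, and all names (`cmat_forward`, `W`, `D`, `N_ACTIONS`) are assumptions made here for brevity. The point it illustrates is structural: with no positional embedding over agents, permuting the agents leaves the consensus unchanged and merely permutes the per-agent action distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # embedding width (assumed)
N_ACTIONS = 5  # discrete actions per agent (assumed)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Random weights stand in for trained parameters (hypothetical names).
shapes = {
    "enc_q": (D, D), "enc_k": (D, D), "enc_v": (D, D),
    "dec_q": (D, D), "dec_k": (D, D), "dec_v": (D, D),
    "value": (D,), "actor": (2 * D, N_ACTIONS),
}
W = {name: rng.normal(scale=0.3, size=shape) for name, shape in shapes.items()}

def cmat_forward(obs_emb, m=None):
    """obs_emb: (n_agents, D) observation embeddings; m: consensus iterations."""
    n = obs_emb.shape[0]
    m = n if m is None else m  # the text sets m to the number of agents

    # Encoder: bidirectional self-attention, no positional embedding over agents.
    h = obs_emb + attention(obs_emb @ W["enc_q"], obs_emb @ W["enc_k"], obs_emb @ W["enc_v"])

    # Critic-Compressor (assumed: mean pooling) -> initial consensus and value.
    c0 = h.mean(axis=0)
    value = float(c0 @ W["value"])

    # Decoder: m autoregressive refinement steps. Each step attends to the agent
    # embeddings and to all earlier consensus vectors; the positional embedding
    # indexes the iteration, not the agent.
    consensus = [c0]
    for t in range(m):
        pos = np.sin((t + 1) * np.arange(1, D + 1) / D)
        query = (consensus[-1] + pos)[None, :]
        memory = np.vstack([h, np.stack(consensus)])
        update = attention(query @ W["dec_q"], memory @ W["dec_k"], memory @ W["dec_v"])[0]
        consensus.append(consensus[-1] + update)

    # Actor-Compressor (assumed: mean over all consensus vectors, not just the last).
    c = np.mean(consensus, axis=0)

    # Actor head: each agent sees its own embedding concatenated with the consensus.
    logits = np.concatenate([h, np.tile(c, (n, 1))], axis=1) @ W["actor"]
    return softmax(logits), value, c

obs = rng.normal(size=(4, D))
probs, value, consensus_vec = cmat_forward(obs)
perm = np.array([2, 0, 3, 1])
probs_perm, _, consensus_perm = cmat_forward(obs[perm])
print(np.allclose(probs_perm, probs[perm]), np.allclose(consensus_perm, consensus_vec))
```

The final two calls compare a run on the original agent order with a run on a shuffled order. Up to floating-point rounding, the consensus vectors should agree and the action distributions should be the same rows in shuffled order, which is the sense in which this toy is order-independent.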

CMAT formulates the joint policy as $\pi(\mathcal{A}|\mathcal{O}) = \pi^c(c|\mathcal{O}) \prod_{i=1}^n \pi^i(a^i|\mathcal{O},c)$, where the deterministic consensus $c = \mu_\theta(\mathcal{O})$ serves as a coordination signal.
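
Since the consensus is described as deterministic, one way to read this factorization is to treat $\pi^c$ as a point mass at $\mu_\theta(\mathcal{O})$ and integrate it out. The sketch below follows that reading; it is an interpretation offered here, not a derivation taken from the paper, and the ratio symbol $r(\theta)$ and the subscript $\theta_{\text{old}}$ are notation introduced for illustration.

```latex
\begin{align*}
\pi_\theta(\mathcal{A}\mid\mathcal{O})
  &= \int \delta\big(c - \mu_\theta(\mathcal{O})\big)
     \prod_{i=1}^{n} \pi^i_\theta(a^i \mid \mathcal{O}, c)\, dc
   = \prod_{i=1}^{n} \pi^i_\theta\big(a^i \mid \mathcal{O}, \mu_\theta(\mathcal{O})\big), \\
\log \pi_\theta(\mathcal{A}\mid\mathcal{O})
  &= \sum_{i=1}^{n} \log \pi^i_\theta\big(a^i \mid \mathcal{O}, \mu_\theta(\mathcal{O})\big), \\
r(\theta)
  &= \frac{\pi_\theta(\mathcal{A}\mid\mathcal{O})}{\pi_{\theta_{\text{old}}}(\mathcal{A}\mid\mathcal{O})}
   = \prod_{i=1}^{n}
     \frac{\pi^i_\theta\big(a^i \mid \mathcal{O}, \mu_\theta(\mathcal{O})\big)}
          {\pi^i_{\theta_{\text{old}}}\big(a^i \mid \mathcal{O}, \mu_{\theta_{\text{old}}}(\mathcal{O})\big)}.
\end{align*}
```

Under this reading the joint log-probability is a sum of per-agent terms that all share the same consensus, so a single probability ratio over the joint action is enough to apply a single-agent policy-gradient method, and gradients reach the consensus generator through every agent's term.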

Training, Fine-Tuning, and Theoretical Justification

CMAT is trained from a SARL perspective: the joint policy is optimized with Proximal Policy Optimization (PPO) using advantages estimated by Generalized Advantage Estimation (GAE). The consensus generator and the agent policies are first optimized jointly. A subsequent fine-tuning phase then updates only one of the two, either consensus enhancement (action heads frozen) or action policy enhancement (consensus generator frozen), which helps disentangle the two components and yields further performance improvements. The theoretical justification, given in the appendix, argues that CMAT realizes a cooperative Stackelberg game, achieving order independence and Pareto improvements over sequential frameworks like MAT. Under tabular assumptions, block coordinate ascent ensures monotonic improvement towards Stackelberg equilibria.
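
To make the single-agent view of training concrete, here is a minimal NumPy sketch of the two standard ingredients named above: GAE and the clipped PPO surrogate, applied to one probability ratio over the whole joint action. It assumes a single trajectory segment with no episode termination inside it, and it assumes the joint log-probability is the sum of per-agent log-probabilities given the shared consensus. The function names and default hyperparameters are illustrative choices, not values from the paper.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one non-terminating segment.

    rewards, values: length-T arrays; last_value: bootstrap value V(s_T).
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        adv[t] = running
        next_value = values[t]
    return adv

def joint_ppo_loss(new_logp, old_logp, adv, clip=0.2):
    """Clipped PPO surrogate with ONE ratio per timestep for the joint action.

    new_logp, old_logp: (T, n_agents) per-agent log-probabilities.
    Summing over agents gives the joint log-probability (assumed factorization).
    """
    ratio = np.exp(new_logp.sum(axis=1) - old_logp.sum(axis=1))
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * adv
    return -np.mean(np.minimum(unclipped, clipped))  # negate: optimizers minimize

# Usage: three steps, two agents, team reward of 1 per step.
rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros(3)
advantages = gae(rewards, values, last_value=0.0)
old_logp = np.log(np.full((3, 2), 0.5))
loss = joint_ppo_loss(old_logp, old_logp, advantages)
```

In a framework with automatic differentiation, the two fine-tuning variants would amount to running the same loss while excluding either the action heads or the consensus generator from the optimizer's parameter list.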

Experimental Evaluation

CMAT is evaluated on challenging cooperative MARL benchmarks: StarCraft II, Multi-Agent MuJoCo, and Google Research Football. Across all tasks (e.g., MMM2, 6h_vs_8z, 3s5z_vs_3s6z, Ant-8 $\times$ 1, HalfCheetah-6 $\times$ 1, Walker2d-6 $\times$ 1, Football Academy scenarios), CMAT and its fine-tuned variants outperform MAT, PMAT, Triple-BERT, HAPPO, MAPPO, and other recent baselines, with better training curves and lower performance variance.

Figure 3: Training curves under 5 random seeds across multiple benchmarks; CMAT and its fine-tuned variants exhibit consistently superior performance and reduced variance.

The ablation study further examines the role of the Actor-Compressor (a mixture of all consensus vectors vs. only the last one) and the effect of the number of consensus iterations. Performance degrades when only the last consensus vector is used, indicating information loss, and is best when the number of iterations matches the number of agents. Too few iterations lead to under-coordination; too many introduce unnecessary noise and complexity.

Figure 4: Ablation study results demonstrate the necessity of consensus mixture and proper iteration count for optimal performance on key benchmarks.

Practical and Theoretical Implications

CMAT's transition from MARL to SARL, anchored on order-independent consensus generation, addresses several long-standing issues in cooperative MARL: it mitigates actor-critic inconsistency, credit assignment ambiguities, and order-induced bias. Empirically, CMAT outperforms the recent centralized and sequential MARL baselines it is compared against. The theoretical framework suggests that consensus-guided hierarchical optimization admits a richer policy class, enabling solutions beyond Nash equilibria towards global optima. Practically, CMAT is positioned for deployment in fully observable, centralized operational environments (e.g., urban resource allocation, dynamic fleet management), contingent on scalability and communication considerations.

Future Directions

Despite its demonstrated efficacy, CMAT's reliance on full observability and centralized computation is a constraint; future investigations should focus on large-scale, realistic deployments with communication-efficient variants. The potential for in-context generalization, transfer learning, and few-shot adaptation leveraging Transformer-based latent consensus remains unexplored. Rigorous theoretical convergence proofs under deep function approximation are still open challenges, as are new consensus mechanisms for partially observable and decentralized settings.

Conclusion

CMAT introduces a hierarchical SARL formulation for cooperative MARL through order-independent latent consensus generation. The architecture eliminates sensitivity to action-generation order, comes with stronger theoretical guarantees than sequential frameworks such as MAT, and outperforms the compared baselines across diverse cooperative tasks. Its modular consensus mechanism and optimization regime mark a substantial advance in the design and analysis of centralized cooperative MARL systems.
