- The paper introduces CMAT, a Transformer-based approach that generates an order-independent latent consensus for multi-agent coordination.
- It employs a hierarchical single-agent RL (SARL) formulation with joint policy optimization via PPO, addressing action-order bias and credit assignment issues.
- Empirical evaluations on benchmarks like StarCraft II and Google Research Football demonstrate superior performance and reduced training variance compared to conventional MARL methods.
Motivation and Problem Statement
Cooperative multi-agent reinforcement learning (MARL) systems are widely deployed in scenarios where centralized control is feasible and global coordination is required (e.g., ride-hailing dispatch, traffic signal control, robotic swarms). The joint observation and action spaces grow exponentially with the number of agents, which imposes severe scalability constraints known as the Curse of Dimensionality (CoD). Decentralization alleviates the CoD at the cost of non-stationarity, poor credit assignment, and weakened theoretical guarantees. Centralized Training Decentralized Execution (CTDE) methods use centralized critics to stabilize training, but agents still act on decentralized policies at deployment, which limits the coordination they can achieve. Centralized solutions such as Multi-Agent Transformer (MAT) capture inter-agent dependencies by generating actions sequentially, conditioned on joint observations, but they have two fundamental limitations: their output depends on the order in which agents act, and their optimization converges only to Nash equilibria, which are typically suboptimal in cooperative settings.
CMAT introduces a hierarchical consensus-generation mechanism that recasts cooperative MARL as a SARL problem. The CMAT architecture models the agents as a unified entity and employs a Transformer encoder for global observation processing. Its principal innovation is to generate a latent consensus vector autoregressively in the Transformer decoder, independent of agent order, and then to condition all agents' action policies on that consensus simultaneously. This avoids the sensitivity to action-generation order inherent to MAT and related methods.
Figure 1: Comparison between CMAT and conventional decentralized MARL methods, depicting the transition from order-dependent sequential decision-making to order-independent consensus-based coordination.
Architecture and Methodological Innovations
The CMAT network architecture consists of an order-independent Transformer encoder, a consensus-generating decoder, and per-agent action modules:
- The encoder omits positional embeddings and utilizes bi-directional self-attention to extract agent-relational features, producing a joint embedding sequence.
- A Critic-Compressor compresses this sequence into an initial consensus vector serving dual roles: value estimation and iterative refinement via the decoder.
- The decoder autoregressively refines this latent consensus over m iterations (where m is typically set to the number of agents), retaining positional embeddings to preserve convergence dynamics.
- An Actor-Compressor aggregates all consensus vectors into a final consensus representation, preventing information loss during iteration.
- The Actor-MLP modules independently condition agent actions on their local observation embeddings and the global consensus vector.
Figure 2: CMAT network architecture showcasing encoder-driven feature extraction, V-value estimation, iterative consensus generation in the decoder, and simultaneous agent action conditioning on the final consensus.
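The data flow described above can be sketched in a few lines of plain Python. This is only an illustrative toy under loud assumptions, not the authors' implementation: there are no learned weights, `attend` is a weight-free dot-product attention, `mean_pool` stands in for both the Critic-Compressor and the Actor-Compressor, the `position` scalar stands in for the decoder's positional embedding, and the names `cmat_forward`, `encode`, `attend`, and `mean_pool` are hypothetical. What it does show is why an encoder without positional embeddings plus permutation-invariant pooling makes the consensus independent of agent order.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Weight-free dot-product attention; keys double as values (assumption)."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(len(query))]

def mean_pool(vectors):
    """Permutation-invariant stand-in for the Critic- and Actor-Compressor."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def encode(observations):
    # Every agent attends over all agents; no positional embedding is added,
    # so permuting the agents only permutes the output embeddings.
    return [attend(obs, observations) for obs in observations]

def cmat_forward(observations, num_iters=None):
    embeddings = encode(observations)
    m = len(observations) if num_iters is None else num_iters
    consensus = mean_pool(embeddings)      # initial consensus vector
    value = sum(consensus)                 # toy value head (hypothetical)
    history = [consensus]
    for k in range(m):
        context = attend(consensus, embeddings)
        position = (k + 1) / m             # iteration index, not agent index
        consensus = [math.tanh(c + x + position) for c, x in zip(consensus, context)]
        history.append(consensus)
    final_consensus = mean_pool(history)   # mixture of all consensus vectors
    # Each actor sees its own embedding concatenated with the shared consensus.
    actor_inputs = [emb + final_consensus for emb in embeddings]
    return value, final_consensus, actor_inputs

if __name__ == "__main__":
    obs = [[0.1, 0.2], [0.5, -0.3], [-0.4, 0.9]]
    _, c_fwd, _ = cmat_forward(obs)
    _, c_rev, _ = cmat_forward(obs[::-1])
    print(all(abs(a - b) < 1e-9 for a, b in zip(c_fwd, c_rev)))
```

Reversing the agent order leaves the final consensus unchanged up to floating-point rounding, while the per-agent actor inputs are permuted along with the agents. Note that the positional signal in the refinement loop indexes iterations rather than agents, which is why it does not reintroduce order dependence in this toy.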
CMAT formulates the joint policy as $\pi(A \mid O) = \pi_c(c \mid O) \prod_{i=1}^{n} \pi_i(a_i \mid O, c)$, where the deterministic consensus $c = \mu_\theta(O)$ serves as a coordination signal.
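One way to read this factorization, stated here as an assumption rather than as the paper's exact training code: because the consensus is a deterministic function of the joint observation, it adds no stochastic term of its own, so the joint log-probability reduces to a sum of per-agent log-probabilities evaluated at that consensus. The helper name `joint_log_prob` and the tabular probability lists are hypothetical.

```python
import math

def joint_log_prob(agent_probs, actions):
    """Sum of per-agent log-probabilities under a shared consensus.

    agent_probs[i][a] is assumed to hold pi_i(a | O, c) with c already
    computed from O; the deterministic consensus contributes no extra term.
    """
    return sum(math.log(probs[a]) for probs, a in zip(agent_probs, actions))

if __name__ == "__main__":
    # Two agents with two discrete actions each (illustrative numbers only).
    probs = [[0.5, 0.5], [0.25, 0.75]]
    print(joint_log_prob(probs, [0, 1]))  # equals log(0.5 * 0.75)
```

Since all factors are conditioned on the same consensus, the agents' actions can be sampled simultaneously rather than one after another.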
Training, Fine-Tuning, and Theoretical Justification
CMAT training takes a SARL perspective, optimizing the joint policy with Proximal Policy Optimization (PPO) and advantages estimated by Generalized Advantage Estimation (GAE). The consensus generator and the agent policies are first optimized jointly. A subsequent fine-tuning phase then freezes one component and trains the other: consensus enhancement fixes the action heads, while action policy enhancement fixes the consensus generator. This is intended to disentangle the two components and yield further performance improvements. The theoretical justification, given in the appendix, shows that CMAT realizes a cooperative Stackelberg game, achieving order independence and Pareto improvements over sequential frameworks like MAT; under tabular assumptions, block coordinate ascent ensures monotonic improvement towards Stackelberg equilibria.
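For readers unfamiliar with the two ingredients named here, the following is a minimal sketch of standard GAE and the standard PPO clipped surrogate for a single sample, written in plain Python. It is not taken from the paper: the function names, the default hyperparameters (`gamma`, `lam`, `clip_eps`), and the single-episode assumption (no termination flags inside the trajectory) are all choices made for illustration. Under the SARL view, `new_logp` and `old_logp` would be joint log-probabilities of the whole team action.

```python
import math

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one uninterrupted trajectory."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then the exponentially weighted backward sum.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clip_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

if __name__ == "__main__":
    adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
    print(adv)
    print(ppo_clip_objective(math.log(1.5), 0.0, 1.0))
```

With `gamma = lam = 1` and zero value estimates, the advantages reduce to the reward-to-go, which is a quick sanity check on the recursion. In the fine-tuning phase described above, the same objective would presumably be applied while gradients are blocked for the frozen component.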
Experimental Evaluation
CMAT is evaluated on challenging cooperative MARL benchmarks: StarCraft II, Multi-Agent MuJoCo, and Google Research Football. Across all tasks (e.g., MMM2, 6h vs 8z, 3s5z vs 3s6z, Ant-8×1, HalfCheetah-6×1, Walker2d-6×1, Football Academy scenarios), CMAT and its fine-tuned variants decisively outperform MAT, PMAT, Triple-BERT, HAPPO, MAPPO, and other recent baselines, as evidenced by superior training curves and lower performance variance.
Figure 3: Training curves under 5 random seeds across multiple benchmarks; CMAT and its fine-tuned variants exhibit consistently superior performance and reduced variance.
The ablation study further examines the role of the Actor-Compressor (mixture of all consensus vectors vs. last consensus vector only) and the effect of the number of consensus iterations. Performance degrades when only the last consensus vector is used, indicating information loss, and is best when the number of iterations matches the agent count. Too few iterations lead to under-coordination; too many introduce unnecessary noise and complexity.
Figure 4: Ablation study results demonstrate the necessity of consensus mixture and proper iteration count for optimal performance on key benchmarks.
Practical and Theoretical Implications
CMAT's transition from MARL to SARL, anchored on order-independent consensus generation, targets several long-standing issues in cooperative MARL: it mitigates actor-critic inconsistency, credit assignment ambiguities, and order-induced bias. Empirically, CMAT outperforms the centralized and sequential MARL baselines it is compared against. The theoretical framework suggests that consensus-guided hierarchical optimization admits a richer policy class, enabling solutions beyond Nash equilibria towards global optima. Practically, CMAT is positioned for deployment in fully observable, centralized operational environments (e.g., urban resource allocation, dynamic fleet management), contingent on scalability and communication considerations.
Future Directions
Despite its demonstrated efficacy, CMAT's reliance on full observability and centralized computation is a constraint; future investigations should focus on large-scale, realistic deployments with communication-efficient variants. The potential for in-context generalization, transfer learning, and few-shot adaptation leveraging Transformer-based latent consensus remains unexplored. Rigorous theoretical convergence proofs under deep function approximation are still open challenges, as are new consensus mechanisms for partially observable and decentralized settings.
Conclusion
CMAT introduces a hierarchical SARL formulation for cooperative MARL through order-independent latent consensus generation. The architecture eliminates sensitivity to sequential action order, achieves stronger empirical results and theoretical guarantees than sequential frameworks such as MAT, and sets a new benchmark across diverse cooperative tasks. Its modular consensus mechanism and two-phase optimization regime mark a substantial advance in the design and analysis of centralized cooperative MARL systems.