
Asynchronous Cooperative Multi-Agent Reinforcement Learning with Limited Communication (2502.00558v2)

Published 1 Feb 2025 in cs.MA

Abstract: We consider the problem setting in which multiple autonomous agents must cooperatively navigate and perform tasks in an unknown, communication-constrained environment. Traditional multi-agent reinforcement learning (MARL) approaches assume synchronous communications and perform poorly in such environments. We propose AsynCoMARL, an asynchronous MARL approach that uses graph transformers to learn communication protocols from dynamic graphs. AsynCoMARL can accommodate infrequent and asynchronous communications between agents, with edges of the graph only forming when agents communicate with each other. We show that AsynCoMARL achieves similar success and collision rates as leading baselines, despite 26\% fewer messages being passed between agents.

Authors (4)
  1. Sydney Dolan (4 papers)
  2. Siddharth Nayak (11 papers)
  3. Jasmine Jerry Aloor (6 papers)
  4. Hamsa Balakrishnan (12 papers)

Summary

AsynCoMARL: Asynchronous Cooperative Multi-Agent Reinforcement Learning with Limited Communication

The paper "Asynchronous Cooperative Multi-Agent Reinforcement Learning with Limited Communication" (Dolan et al., 1 Feb 2025) introduces AsynCoMARL, a novel MARL framework designed to address the challenges of cooperative multi-agent tasks in communication-constrained environments. Traditional MARL approaches often assume synchronous operations and frequent communication, rendering them unsuitable for many real-world scenarios. AsynCoMARL tackles this limitation by explicitly modeling asynchronous agent execution and learning communication protocols from dynamic graphs using graph transformers. The framework is evaluated on Cooperative Navigation and Rover-Tower tasks, demonstrating comparable or superior performance to leading baselines with significantly reduced communication overhead.

Key Methodological Components

Asynchronous Formulation and Selective Replay

AsynCoMARL deviates from the conventional synchronous MARL paradigm by assigning each agent i an independent timescale τ<sup>(i)</sup>, which dictates when the agent takes actions. The interval between an agent's consecutive action times, τ<sub>k</sub><sup>(i)</sup> and τ<sub>k+1</sub><sup>(i)</sup>, is governed by a value μ sampled at random during training; this randomization improves generalization across varying degrees of asynchronicity. An agent is deemed "active" at a global time t only when it is poised to execute an action. Furthermore, AsynCoMARL employs a selective replay buffer, storing only those time steps corresponding to an agent's actions within its τ sequence.
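
The mechanism can be illustrated with a minimal sketch: each agent tracks its own next action time, acts only when the global clock reaches it, and stores only its own action steps. The uniform sampling range for μ and the buffer interface here are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of per-agent asynchronous timescales with a selective replay
# buffer. The sampling distribution for mu is an assumption.
import random

class AsyncAgent:
    def __init__(self, agent_id, mu_range=(1, 5)):
        self.agent_id = agent_id
        self.mu_range = mu_range
        self.next_action_time = random.randint(*mu_range)  # first tau_k
        self.buffer = []  # selective replay: only this agent's action steps

    def is_active(self, t):
        """An agent is 'active' at global time t only when it acts."""
        return t == self.next_action_time

    def step(self, t, observation, action, reward):
        # Store only transitions at the agent's own action times tau^(i).
        self.buffer.append((t, observation, action, reward))
        # Resample the inter-action interval mu at random during training,
        # which encourages generalization across degrees of asynchronicity.
        mu = random.randint(*self.mu_range)
        self.next_action_time = t + mu

# Usage: at each global step, only the currently active agents act and log data.
agents = [AsyncAgent(i) for i in range(3)]
for t in range(1, 20):
    for ag in agents:
        if ag.is_active(t):
            ag.step(t, observation=None, action=0, reward=0.0)
```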

Dynamic Graph Representation and Communication

The environment state is abstracted as a dynamic graph, where nodes represent agents, obstacles, or goals. The feature vector for each node encapsulates information such as relative position, velocity, goal position relative to the agent, and entity type. Edges in the graph are not static; they form dynamically based on proximity and activity. An edge between two agents exists if they are within a communication radius λ and are both active at the same global time step t. Edges also connect agents to obstacles or goals if they are within the communication radius λ, given that the latter are either static or passively observable. The edges are weighted by the Euclidean distance between the connected entities, and the graph is directed. The connectivity and weights are represented by an adjacency matrix 𝔸, which is then masked using an activity matrix 𝔻 to restrict interactions to only currently active nodes, yielding 𝔸<sup>masked</sup> = 𝔸 ∘ 𝔻. A death mask is also applied to remove agents that have completed their tasks.
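
The masked adjacency construction 𝔸<sup>masked</sup> = 𝔸 ∘ 𝔻 can be sketched as follows. The entity layout, radius value, and the exact death-mask handling are illustrative assumptions.

```python
# Sketch of the activity-masked adjacency described above, using NumPy.
import numpy as np

def build_masked_adjacency(positions, active, done, comm_radius):
    """positions: (n, 2) entity positions; active/done: boolean (n,) arrays."""
    n = len(positions)
    # Pairwise Euclidean distances weight the (directed) edges.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # An edge exists only within the communication radius (no self-loops).
    A = np.where((dist <= comm_radius) & ~np.eye(n, dtype=bool), dist, 0.0)
    # Activity matrix D: both endpoints must be active at global time t.
    D = np.outer(active, active).astype(float)
    A_masked = A * D                      # elementwise (Hadamard) product
    # Death mask: drop agents that have already completed their tasks.
    alive = ~done
    A_masked *= np.outer(alive, alive)
    return A_masked

positions = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
active = np.array([True, True, False])
done = np.array([False, False, False])
print(build_masked_adjacency(positions, active, done, comm_radius=2.0))
```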

Graph Transformer for Message Passing

AsynCoMARL utilizes a graph transformer, specifically based on the UniMP model, to process the dynamic graph structure and facilitate message passing between agents. The graph transformer leverages a multi-head dot-product attention mechanism, enabling agents to selectively prioritize information from connected neighbors based on relevance. This attention mechanism operates on both node and edge features, facilitating the learning of relational representations between entities.
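
A hedged sketch of this kind of edge-aware multi-head attention is shown below, using PyTorch Geometric's TransformerConv, which implements UniMP-style message passing; the feature dimensions, head count, and single-layer configuration are assumptions rather than the paper's architecture.

```python
# UniMP-style message passing: multi-head dot-product attention over
# node features, with edge features entering the attention computation.
import torch
from torch_geometric.nn import TransformerConv

node_dim, edge_dim, hidden, heads = 8, 1, 32, 4
conv = TransformerConv(node_dim, hidden // heads, heads=heads,
                       edge_dim=edge_dim)  # edge features modulate attention

x = torch.randn(5, node_dim)                 # 5 entity nodes
edge_index = torch.tensor([[0, 1, 2, 3],     # directed edges: source row,
                           [1, 0, 3, 4]])    # destination row
edge_attr = torch.rand(4, edge_dim)          # e.g. Euclidean edge weights

g = conv(x, edge_index, edge_attr)           # per-node graph encodings
print(g.shape)                               # torch.Size([5, 32])
```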

Centralized Training Decentralized Execution (CTDE) with MAPPO

AsynCoMARL adopts the CTDE paradigm, where training is centralized but execution is decentralized. Each agent possesses an individual actor network that receives its local observation o<sub>τ</sub><sup>(i)</sup> and its local graph encoding g<sub>τ</sub><sup>(i)</sup> as input to select an action a<sub>τ</sub><sup>(i)</sup>. A centralized critic, accessible only during training, receives global state information and an aggregated representation of the entire graph, X<sub>agg</sub>, obtained via mean pooling of the individual agent graph encodings. The critic evaluates actions and guides policy updates. The training procedure adheres to the MAPPO update, employing Generalized Advantage Estimation (GAE) and the Adam optimizer. The reward structure incorporates individual rewards for proximity to the goal, reaching the goal (awarded only once), and penalties for collisions. Critically, the total reward at a global time step t is the sum of individual rewards over only the agents active at that step, incentivizing collaboration among concurrently active agents.
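
The critic's mean-pooled graph input and the active-agent reward sum can be illustrated as follows; the tensor shapes and reward values are placeholders, not the paper's.

```python
# Sketch of the centralized-critic graph aggregation and the
# active-agents-only reward sum at a global time step t.
import torch

n_agents, enc_dim = 4, 32
graph_enc = torch.randn(n_agents, enc_dim)   # per-agent encodings g_tau^(i)
active = torch.tensor([1., 1., 0., 1.])      # agents active at global time t

# Aggregated graph representation X_agg for the critic via mean pooling.
X_agg = graph_enc.mean(dim=0)                # shape: (enc_dim,)

# Total reward at t sums individual rewards over active agents only.
individual_rewards = torch.tensor([0.5, -0.1, 0.3, 1.0])
r_t = (individual_rewards * active).sum()    # inactive agents excluded
print(X_agg.shape, r_t.item())
```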

Experimental Evaluation and Results

The efficacy of AsynCoMARL was assessed in two environments: Cooperative Navigation, a simulation of satellite rendezvous with complex dynamics, and Rover-Tower, a planetary exploration scenario where rovers depend on communication from stationary towers. The performance metrics included Communication Frequency (f<sub>comm</sub>), Success Rate (S%), Fraction of Episode Completed (T), and Average Collisions (#col).

In Cooperative Navigation, AsynCoMARL achieved high success rates (e.g., 97% for N=3, 86% for N=10) and relatively low collision rates, on par with strong baselines like AAC and asyncMAPPO. Importantly, it accomplished this while using approximately 26% fewer messages, with f<sub>comm</sub> consistently lower across N=3, 5, 7, and 10. AsynCoMARL substantially outperformed methods like GCS, TransfQmix, CACOM, and DGN in terms of success rate under communication constraints.

In the Rover-Tower environment, AsynCoMARL attained a success rate of 50%, comparable to the best baseline (AAC, 56%), which utilized separate networks for rovers and towers. However, AsynCoMARL exhibited considerably less communication (f<sub>comm</sub> = 0.14 vs. 0.21) and faster task completion (T = 0.55 vs. 0.84), demonstrating efficiency and generalizability.

Ablation Studies and Analysis

Ablation studies revealed that removing the graph transformer led to a significant degradation in performance (lower success rates, higher collisions), especially when fewer agents were simultaneously active, underscoring the importance of learned graph representations. Furthermore, the specific reward structure, featuring a single goal reward shared only among currently active agents, proved superior to alternative reward schemes, highlighting its effectiveness in promoting collaboration in the asynchronous setting. Analysis of the graph transformer's attention weights showed that they adapted dynamically, focusing on agents that were nearby and/or communicated more frequently, indicating that the model learns to balance proximity and communication reliability.

Conclusion

AsynCoMARL demonstrates a practical approach to MARL in communication-constrained environments. The combination of asynchronous agent execution, dynamic graph representations, and graph transformer-based communication enables effective coordination with significantly reduced communication overhead. The tailored reward structure and the graph-based attention mechanism are key to coordinating agents asynchronously, offering a promising direction for future research.
