
Inferring Latent Temporal Sparse Coordination Graph for Multi-Agent Reinforcement Learning

Published 28 Mar 2024 in cs.LG and cs.MA | (2403.19253v2)

Abstract: Effective agent coordination is crucial in cooperative Multi-Agent Reinforcement Learning (MARL). While agent cooperation can be represented by graph structures, prevailing graph learning methods in MARL are limited. They rely solely on one-step observations and neglect crucial historical experiences, which leads to deficient graphs that foster redundant or detrimental information exchange. Additionally, the high computational cost of action-pair calculations in dense graphs impedes scalability. To address these challenges, we propose inferring a Latent Temporal Sparse Coordination Graph (LTS-CG) for MARL. LTS-CG uses agents' historical observations to calculate an agent-pair probability matrix, from which a sparse graph is sampled and used for knowledge exchange between agents, thereby capturing both agent dependencies and relation uncertainty. The computational complexity of this procedure depends only on the number of agents. The graph learning process is further augmented by two characteristics: Predict-Future, which enables agents to foresee upcoming observations, and Infer-Present, which ensures a thorough grasp of the environmental context from limited data. These features allow LTS-CG to construct temporal graphs from historical and real-time information, promoting knowledge exchange during policy learning and effective collaboration. Graph learning and agent training occur simultaneously in an end-to-end manner. Our results on the StarCraft II benchmark demonstrate LTS-CG's superior performance.
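The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the per-agent history embeddings, and the dot-product-plus-sigmoid scoring rule are all assumptions made for illustration; only the overall idea (an agent-pair probability matrix from which a graph is sampled) comes from the abstract.

```python
import math
import random


def edge_probabilities(embeddings):
    """Build an N x N agent-pair probability matrix.

    `embeddings` holds one vector per agent, standing in for an encoding of
    that agent's observation history. Assumption: the score for a pair is the
    dot product of the two vectors, squashed by a sigmoid. The paper's actual
    scoring network is not reproduced here.
    """
    n = len(embeddings)
    probs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # this sketch uses no self-edges
            score = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            probs[i][j] = 1.0 / (1.0 + math.exp(-score))
    return probs


def sample_graph(probs, rng):
    """Sample a 0/1 adjacency matrix with one independent Bernoulli draw per pair."""
    n = len(probs)
    return [
        [1 if i != j and rng.random() < probs[i][j] else 0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three hypothetical agents with 2-dimensional history embeddings.
    history_embeddings = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    p = edge_probabilities(history_embeddings)
    adjacency = sample_graph(p, random.Random(0))
    for row in adjacency:
        print(row)
```

Both functions loop over agent pairs, so the cost in this sketch grows with the square of the number of agents and does not depend on the size of the action space. A differentiable relaxation such as Gumbel-softmax would replace the hard Bernoulli draw during training; it is omitted here to keep the example short.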

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

  • Generalization beyond SMAC: The approach is only validated on StarCraft II; its effectiveness on other cooperative benchmarks (e.g., MPE, GRF, traffic control), continuous-control tasks, and real-world-like simulators remains untested.
  • Scaling to very large agent populations: Empirical scalability is shown up to ~30 agents; performance, runtime, and memory behavior for 100+ agents (or swarms) are unknown.
  • Variable/dynamic agent sets: The method assumes a fixed set of agents; handling agents entering/leaving or variable N across episodes is not addressed.
  • Dependence on privileged global state s_t: Infer-Present relies on access to the centralized state during training; many environments do not expose a reliable global state, and alternatives (self-supervised or proxy targets) are not explored.
  • Trajectory content limitation: The trajectory encoder uses only observation histories; it ignores action and reward histories that could capture causal and coordination structure more directly.
  • Predict-Future training–inference gap: The forecasting loss uses future observations during training (teacher forcing), but the paper does not analyze potential leakage or the mismatch at test time, when future data are unavailable; the impact on stability and policy improvement is unknown.
  • Absence of theoretical guarantees: There are no results on identifiability of the latent graph, convergence of joint graph–policy learning, or guarantees that the inferred graph improves value factorization or team return.
  • Sparsity control is unspecified: Although the method “samples a sparse graph,” there is no explicit sparsity constraint/regularizer or expected-degree prior; how sparsity is calibrated and its impact on performance remains unclear.
  • Independent-edge assumption: Edges are sampled independently from Bernoulli parameters, ignoring higher-order dependencies and group interactions; the potential gains of structured priors, hypergraphs, or motif constraints are unexplored.
  • Static structure vs. dynamic topology: The paper suggests sampling a structure from trajectories and then only updating edge weights online; whether resampling structure per episode/interval or fully dynamic topology per step yields better coordination remains open.
  • Replay and nonstationarity: Graphs stored in the replay buffer evolve during training; the paper does not analyze off-policy bias, distribution shift, or the need for corrections (e.g., importance weights) when training with stale graphs.
  • Runtime and memory profiling: Claims of better complexity are not supported by wall-clock time, FLOPs, or memory usage comparisons; the scaling impact of N, trajectory length T, diffusion degree K, and GNN depth is unquantified.
  • Communication constraints not modeled: Message size, bandwidth limits, latency, and packet loss are ignored; the approach’s performance under communication budgets or noise is unknown.
  • Robustness: Sensitivity to observation noise, occlusions, stochasticity, adversarial or non-cooperative agents, and distribution shifts is not evaluated; no mechanisms for online adaptation of the graph under drift are assessed.
  • Hyperparameter sensitivity: No systematic study of the Gumbel temperature schedule s, graph-loss weight λ (beyond a narrow sweep), trajectory length T, diffusion degree K, or GNN depth; practical tuning guidance is missing.
  • Mixed/competitive settings: The method is tailored to fully cooperative Dec-POMDPs; extension to mixed-motive or competitive MARL (e.g., signed relations, antagonistic edges) is an open question.
  • Interpretability and validation of learned graphs: There is no analysis of whether learned edges align with ground-truth interaction patterns (e.g., proximity, line-of-sight, role complementarities) or contribute to human-understandable strategies.
  • Baseline breadth: Comparisons omit several strong baselines (e.g., MAAC, QPLEX, QTRAN, FACMAC, transformer-based MARL); comparative conclusions may not hold against these alternatives.
  • Encoder choice for trajectories: The observation-history extractor is convolutional; whether transformers, gated RNNs, or contrastive sequence encoders yield better graphs is untested.
  • Uncertainty quantification: Relation “uncertainty” is represented by point estimates of Bernoulli parameters; Bayesian graph learning, ensembles, or risk-sensitive policies to propagate graph uncertainty are not investigated.
  • Compatibility with QMIX monotonicity: Adding message features m_i to per-agent Q_i may interact with QMIX’s monotonic mixing constraint; the representational limits and credit-assignment implications are not analyzed.
  • Edge symmetry and constraints: It is unclear whether adjacency is constrained to be symmetric or acyclic; the effects of enforcing symmetry, degree bounds, or stability constraints are unexplored.
  • Evaluation metrics: Results focus on win rate with limited seeds; no statistical tests, sample-efficiency metrics, compute–performance trade-offs, or ablations on computational overhead are reported.
  • Transfer and multi-task learning: Whether the inferred graph transfers across maps/tasks, and how to adapt it with few samples in new scenarios, remains unaddressed.
  • Safety and failure modes: There is no analysis of failure cases where incorrect edges induce harmful coordination; detection, recovery, or conservative graph updates are not considered.
  • Automated hyperparameter selection: The method lacks procedures for automatic or robust hyperparameter tuning of graph learning and GNN components across tasks.
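
Two of the gaps listed above, sparsity control and edge symmetry, can be made concrete with a short sketch. Everything in it is hypothetical: the expected-degree penalty, the `target_degree` parameter, and the averaging-based symmetrization are illustrative options that the paper is not reported to use.

```python
def expected_degrees(probs):
    """Expected out-degree of each agent under independent Bernoulli edges."""
    return [
        sum(p for j, p in enumerate(row) if j != i)
        for i, row in enumerate(probs)
    ]


def degree_penalty(probs, target_degree):
    """Mean squared gap between expected degree and a target degree.

    Hypothetical regularizer: adding this term to the training loss would
    push the edge probabilities toward a chosen sparsity level.
    """
    degrees = expected_degrees(probs)
    return sum((d - target_degree) ** 2 for d in degrees) / len(degrees)


def symmetrize(probs):
    """Average p[i][j] and p[j][i] so the sampled relation is undirected in expectation."""
    n = len(probs)
    return [[0.5 * (probs[i][j] + probs[j][i]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Illustrative edge probabilities for three agents (not taken from the paper).
    probs = [
        [0.0, 0.5, 0.5],
        [1.0, 0.0, 0.0],
        [0.25, 0.25, 0.0],
    ]
    print(expected_degrees(probs))
    print(degree_penalty(probs, target_degree=1.0))
    print(symmetrize(probs)[0][1])
```

Whether such a penalty or symmetry constraint would improve coordination is exactly the open question raised in the list; the sketch only shows how the quantities could be computed from an agent-pair probability matrix.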

Authors (3)
