
Hierarchical Graph Structures in MARL

Updated 28 April 2026
  • Hierarchical graph structures in MARL are graph-based abstractions that decompose agent coordination into layered roles, clusters, and subtasks.
  • They employ dynamic graph operators and GNN-based message-passing to adapt communication paths and optimize credit assignment in complex environments.
  • Empirical findings show that these architectures enhance scalability, interpretability, and transferability across diverse, partially observable and sparse-reward tasks.

Hierarchical graph structures in multi-agent reinforcement learning (MARL) are graph-based abstractions that encode relational, compositional, and control hierarchies among agents and their tasks. In contrast to flat communication or monolithic joint policies, these architectures introduce explicit multi-level topologies in which agent roles, clusters, or specialized modules are dynamically composed, enabling more efficient coordination, better credit assignment, and the integration of action abstractions or domain priors. Such hierarchical graphs underpin many state-of-the-art approaches for scaling MARL to large, complex, partially observable, or sparse-reward domains.

1. Formal Definitions and Core Graph Constructs

Hierarchical graph structures in MARL formalize the agent system as a directed acyclic graph (DAG) or layered multi-partite graph, where vertices denote agents, clusters, skills, roles, or subtasks, and edges encode superior-subordinate relations, communication paths, or delegation of goals and options.

Examples of Formalisms

  • Extensible Cooperation Graph (ECG) (Fu et al., 2024): The ECG at time $t$ is $G_t = (\mathcal{V}, \mathcal{E}_t)$, where nodes are partitioned into agent nodes $\mathcal{V}_{\mathrm{agent}}$, cluster nodes $\mathcal{V}_{\mathrm{cluster}}$, and target nodes $\mathcal{V}_{\mathrm{target}}$. Edges $\mathcal{E}_{\mathrm{AC}}(t)$ connect agents to clusters, and edges $\mathcal{E}_{\mathrm{CT}}(t)$ connect clusters to targets (either primitive or cooperative action nodes).
  • Hierarchical Message-Passing Graph (Marzi et al., 31 Jul 2025): A three-level directed acyclic graph $G_t = (V_t, E_t)$ with manager, sub-manager, and worker nodes. Edges include downward hierarchical links (goal delegation) and lateral intra-level communication (GNN-based message-passing).
  • Reinforcement Networks (Kryzhanovskiy et al., 28 Dec 2025): The MARL system is a DAG $G = (V, E)$, where vertices represent agents that exchange messages, rewards, and instructions according to edge relations; each agent’s policy depends on inputs/outputs from adjacent graph vertices.
  • Cooperation Graph (CG) (Fu et al., 2022): A three-layer graph $G_{\mathrm{CG}}(t)$ with agent, cluster, and target nodes, split into bipartite Agent–Clustering and Cluster–Designating subgraphs.
  • Hierarchical Reward Machine Graphs (Zheng et al., 2024): Nodes represent reward machines (finite-state automata specifying task subtasks), and edges encode subtask-decomposition in a hierarchical DAG, forming a "graph of graphs."

Each formalism enables modularity, explicit action abstraction, and hierarchical decomposition.
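The layered agent–cluster–target structure shared by the ECG and CG formalisms can be sketched as a small data structure. This is a minimal illustration under assumed names (`LayeredGraph`, `assign_agent`, etc.), not any paper's implementation:

```python
from dataclasses import dataclass, field

# Sketch of a layered cooperation graph: agent, cluster, and target nodes,
# with time-varying agent->cluster (E_AC) and cluster->target (E_CT) edges.
# All class and method names are illustrative.

@dataclass
class LayeredGraph:
    n_agents: int
    n_clusters: int
    n_targets: int
    agent_to_cluster: dict = field(default_factory=dict)   # E_AC(t)
    cluster_to_target: dict = field(default_factory=dict)  # E_CT(t)

    def assign_agent(self, agent: int, cluster: int) -> None:
        """Attach an agent node to a cluster node."""
        assert 0 <= cluster < self.n_clusters
        self.agent_to_cluster[agent] = cluster

    def designate_target(self, cluster: int, target: int) -> None:
        """Attach a cluster node to a target (action) node."""
        assert 0 <= target < self.n_targets
        self.cluster_to_target[cluster] = target

    def target_of_agent(self, agent: int):
        """Resolve an agent's goal by following agent -> cluster -> target."""
        cluster = self.agent_to_cluster.get(agent)
        return None if cluster is None else self.cluster_to_target.get(cluster)

g = LayeredGraph(n_agents=4, n_clusters=2, n_targets=3)
g.assign_agent(0, 1)
g.designate_target(1, 2)
print(g.target_of_agent(0))  # -> 2
```

The key property this captures is that agents never point at targets directly; all delegation flows through the cluster layer, which is what the graph operators of Section 2 rewire.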

2. Hierarchical Graph Operations and Dynamic Topology Adaptation

Hierarchical graph structures are not static; their topology adapts on-line to reflect cooperative organization, delegation assignments, or subtask progress.

  • Graph Operators (HCGL, ECG) (Fu et al., 2024): Four operator modules control graph rewiring: agent–cluster operators reassign agents between clusters, and cluster–target operators rewire cluster assignments to targets (primitive or cooperative). These operator policies are trained by MARL (e.g., MAPPO), and action masking forbids invalid rewiring (e.g., moves from an empty source cluster).
  • Manager-Worker Delegation (Feudal HRL, Message Passing) (Marzi et al., 31 Jul 2025): The hierarchy is dynamically instantiated at each time step, with high-level nodes assigning sub-goals to lower levels; lateral GNN links enable coordination within each level. Temporal abstraction arises by updating higher levels less frequently.
  • Adaptive Grouping and Routing (Sheng et al., 2020): Agent "communication weights" (from auxiliary RL or DQN) guide dynamic election of cluster leaders and cluster membership, shaping the two-level communication topology.
  • Hierarchical Reward Machine Decomposition (Zheng et al., 2024): The hierarchy is traversed (depth-first) during execution, with each RM node delegating subproblems as its lower-level edges become enabled, and higher-level state transitions trigger new subtask calls.

This dynamic adaptation allows rapid reaction to environmental changes or transitions between subtasks.
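The masked-rewiring mechanics described above can be sketched as follows. The cluster representation, mask layout, and function names are illustrative assumptions, and the learned operator policy (which would select among unmasked moves) is omitted:

```python
# Sketch of one agent-cluster rewiring step with action masking.
# clusters: list of sets of agent ids; cluster s is a valid source only if
# it is non-empty, mirroring the masking that forbids invalid rewiring.

def action_mask(clusters):
    """mask[s][d] is True iff moving one agent from cluster s to d is valid."""
    n = len(clusters)
    return [[bool(clusters[s]) and s != d for d in range(n)] for s in range(n)]

def apply_rewire(clusters, src, dst):
    """Move one agent from cluster src to dst; caller must respect the mask."""
    agent = clusters[src].pop()
    clusters[dst].add(agent)
    return agent

clusters = [{0, 1}, set(), {2}]
mask = action_mask(clusters)
print(mask[1][0])  # -> False: cluster 1 is empty, so it is masked as a source
apply_rewire(clusters, 0, 1)
```

In the actual frameworks the choice of (source, destination) comes from a trained actor (e.g., under MAPPO), with the mask applied to its action logits before sampling.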

3. Integration of Action and Abstraction Hierarchies

Hierarchical graph structures unify primitive and high-level actions or strategies in the agent system, enabling integration of domain knowledge and flexible policy composition.

  • Unified Action-Space via Target Nodes (ECG) (Fu et al., 2024): Target nodes correspond either to primitive actions (in one-to-one correspondence with the primitive action set) or to cooperative, parameterized group actions. Upon assignment, cluster members receive either a direct primitive action or the output of a pre-coded translator for the cooperative macro-action.
  • Skill Graphs for Multi-Task Transfer (Zhu et al., 9 Jul 2025): A skill graph defines relations between environments, tasks, and "skills" (low-level MARL policies) embedded in a joint space. The high-level graph selects, blends, or fine-tunes skills for a given task/environment tuple, independent of the low-level RL. Blending is performed via continuous scoring and weighted composition of candidate skills.
  • Latent-Strategy Conditioning (Ibrahim et al., 2022): Hierarchical latent policies sample both individual and relational latent variables that are injected into the agents' local value functions or a shared team-level critic, enabling explicit stratagems and social plans generated via graph attention.
  • Macro-actions via Cluster-Actions (Fu et al., 2022): Cluster nodes invoke user-defined macro-actions for their groups, implementing domain-specific behaviors such as formation control, pursuit, or interception.

This explicit modularity and action abstraction reduce the effective search space and improve sample efficiency.
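The skill-graph blending step, "continuous scoring and weighted composition of candidate skills," can be illustrated with dot-product scoring and a softmax over candidate skills. The embeddings and scoring rule here are stand-in assumptions, not the paper's model:

```python
import math

# Sketch of skill blending: score each candidate skill against a task
# embedding, softmax the scores, and mix the skills' action proposals by
# the resulting weights. Scoring and embeddings are illustrative.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def blend_skills(task_emb, skills):
    """skills: list of (skill_embedding, action_vector) pairs."""
    scores = [sum(t * e for t, e in zip(task_emb, emb)) for emb, _ in skills]
    weights = softmax(scores)
    dim = len(skills[0][1])
    return [sum(w * act[i] for w, (_, act) in zip(weights, skills))
            for i in range(dim)]

skills = [([1.0, 0.0], [1.0, 0.0]),   # candidate skill A
          ([0.0, 1.0], [0.0, 1.0])]   # candidate skill B
print(blend_skills([1.0, 1.0], skills))  # equal scores -> average: [0.5, 0.5]
```

The useful property is that blending is differentiable in the scores, so the high-level selector can be trained independently of the frozen low-level skill library.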

4. Learning, Optimization, and Credit Assignment in Hierarchical Graphs

Hierarchical MARL frameworks leverage decentralized or semi-centralized optimization protocols that track the graph topology.

  • Graph-Operator as Agents (HCGL, ECG) (Fu et al., 2024, Fu et al., 2022): The operators controlling the graph structure are themselves treated as agents, optimized in a standard multi-agent actor-critic (e.g., MAPPO with PPO and entropy loss, with a cross-entity encoder).
  • Reward Decomposition and Advantage Propagation (Marzi et al., 31 Jul 2025): Levels are trained to maximize advantage functions computed from upper-level goals, and reward assignment is temporally and hierarchically aligned to the actual option intervals. Theoretical results show consistency with the global return under the constructed hierarchy.
  • Independent Graph-Component Optimization (Reinforcement Networks) (Kryzhanovskiy et al., 28 Dec 2025): Each agent vertex's policy, message, and reward function is trained as an independent MDP, using only local inputs from its subgraph. Proxy rewards and communication signals propagate upward, enabling local credit assignment.
  • Auxiliary Objectives for Graph Structure Retention and Predictive Planning (Ibrahim et al., 2022): Reconstruction errors, mutual information criteria, and reward prediction losses are incorporated to ensure that hierarchical graph latents and structure are informative, predictive, and aligned with future returns.

This modularity in training leverages graph decomposition for scalability, interpretable policy learning, and efficient credit assignment across compositional hierarchies.
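The temporally aligned reward assignment described above, where each high-level option is credited with the return accumulated over its own interval, can be sketched for a two-level hierarchy with a fixed option length. The interval length `k` and the discounting are assumptions for illustration:

```python
# Sketch of hierarchical reward alignment: the high level acts every k steps
# and receives one (discounted) return per option interval, while the low
# level is trained on the per-step rewards. Fixed k is an assumption.

def option_returns(rewards, k, gamma=1.0):
    """Aggregate per-step rewards into one return per high-level option."""
    returns = []
    for start in range(0, len(rewards), k):
        g, discount = 0.0, 1.0
        for r in rewards[start:start + k]:
            g += discount * r
            discount *= gamma
        returns.append(g)
    return returns

print(option_returns([1, 1, 1, 1, 2, 2], k=2))  # -> [2.0, 2.0, 4.0]
```

Advantage estimates for the high-level policy are then computed against these option-level returns rather than per-step rewards, which is what keeps the hierarchy's credit assignment consistent with the global return.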

5. Empirical Findings and Comparative Performance

Hierarchical graph-structured MARL methods consistently outperform flat and non-hierarchical baselines across a range of benchmarks:

| Framework / Task | Key Result(s) |
| --- | --- |
| HCGL (CSI-27/3/9, sparse-reward) | Solves the benchmark (success 0.97±0.03); zero-shot transfer to larger scales (success 0.77–0.65), fine-tuned to 0.95 (Fu et al., 2024). |
| HiMPPO (LBFwS, VMAS, SMACv2) | Final return ≈450 on LBFwS-Hard (vs. MAPPO 120, IPPO 80); SMAC win rates comparable or superior to MAPPO, with the dynamic graph outperforming the static one (Marzi et al., 31 Jul 2025). |
| CG-MARL (AII, HCT benchmarks) | Near-100% success after 400K episodes on AII (flat methods fail); optimal convergence on HCT with optimal cluster count (Fu et al., 2022). |
| Skill Graph (real/sim transfer) | 100% high-level decision accuracy in complex multi-stage tasks, invariant to the low-level MARL skill library (Zhu et al., 9 Jul 2025). |
| Soft-HGRN (scalability) | Outperforms DQN, CommNet, MAAC, DGN by a large margin at scale; ablations show a sharp drop without hierarchical graph attention or temporal memory (Ye et al., 2021). |
| MAHRM (Pass, concurrent events) | Outperforms flat and independent RM decompositions and avoids exponential blowup; strong gains in tightly coupled, concurrent domains (Zheng et al., 2024). |

In ablations, design choices such as cluster count, action granularity, or hierarchical message-passing critically determine performance, showing that the graph structure must be well matched to the MARL domain.

6. Scalability, Interpretability, and Transfer

Hierarchical graphs facilitate both scalability and interpretability:

  • Scalability: By reducing the dimension of the effective control or communication space (e.g., through grouping, clustering, or macro-action abstraction), methods such as HCGL and Soft-HGRN maintain robust performance up to hundreds of agents or tasks of increasing complexity (Fu et al., 2024, Ye et al., 2021).
  • Interpretability: Explicit graph nodes (clusters, skills, subtasks) and learned attention weights expose which agents, groups, or strategies are active at any point. For example, in HAMA (Ryu et al., 2019), attention visualization reveals strategic shifts (e.g., pincer vs. solo pursuit).
  • Transferability: Fixed-dimension graph-based representations (HGATs, skill embeddings) support seamless transfer of policies across varying agent counts, scenario complexity, and unrelated tasks, with empirical confirmation (Zhu et al., 9 Jul 2025, Ryu et al., 2019).

7. Open Directions and Practical Considerations

Recent research has highlighted several open directions:

  • Automated Hierarchical Structure Design: Optimal topology search for graph composition (DAG or multi-level) remains unresolved (Kryzhanovskiy et al., 28 Dec 2025).
  • Communication and Proxy-Reward Learning: Improved auxiliary tasks, multi-objective optimization, and GNN-based exploration strategies are open for investigation (Kryzhanovskiy et al., 28 Dec 2025).
  • Hierarchical Curriculum and Graph Morphogenesis: Progressive construction or adaptation of the hierarchical graph during training (compositional curricula) offers a path to scalability and continual learning.
  • Integration with LLMs: Using LLMs as adaptable graph agents for instruction, communication, and logical coordination is proposed as an avenue for future MARL architectures (Kryzhanovskiy et al., 28 Dec 2025).
  • Non-cooperative and Mixed-motivation Settings: Most hierarchical graph frameworks focus on cooperation; generalizing these models to mixed or adversarial settings with compositional task decomposition remains an area of development.

Hierarchical graph structures are central to the next generation of MARL, providing a common foundation for modular, scalable, and interpretable agent coordination across both simulated and real-world environments.
