Hierarchical Multi-agent Reinforcement Learning

Updated 21 April 2026

Hierarchical Multi-agent Reinforcement Learning (HMARL) is a framework that organizes agents into multiple levels using temporal, task, and role abstractions to address scalability and coordination challenges.
It leverages techniques like value decomposition, centralized training, and intrinsic rewards to enhance policy learning across high-level and low-level controllers.
Empirical results in domains such as StarCraft II and drone control demonstrate HMARL's superior sample efficiency, stability, and transferability compared to flat MARL.

Hierarchical Multi-agent Reinforcement Learning (HMARL) refers to a class of methodologies that incorporate temporal, structural, or functional abstraction into multi-agent reinforcement learning (MARL). By organizing agents, their policies, or the task itself into multiple levels, HMARL addresses the scalability, coordination, and exploration challenges endemic to flat MARL in high-dimensional, compositional, or long-horizon environments. This article surveys principal formulations, algorithmic mechanisms, theoretical insights, and empirical results for hierarchical MARL, with coverage of both value-based and policy-gradient approaches.

1. Hierarchical Problem Formulations and Abstractions

HMARL approaches support several hierarchical abstractions:

Temporal abstraction: Multi-level agent policies select temporally extended options or skills at a low frequency and execute low-level (primitive) actions at a higher frequency. Each agent maintains a stack of policies $\{\pi^{h},\pi^{l}\}$; for example, $\pi^{h}$ selects a macro-action (option), which the low-level policy $\pi^{l}$ then executes through primitive actions for $k$ environment steps (Xu et al., 2021, Yang et al., 2019).
Task/subtask or state abstraction: The overall cooperative task is decomposed into subtasks or goals, enabling modular learning or recursive task decomposition, frequently specified by logical or automata-based structures such as reward machines (RM) or LTL (linear temporal logic) specifications (Zheng et al., 2024, Liu et al., 2024). The assignment of agents or agent groups to subtasks may itself be dynamically learned.
Role or grouping abstraction: Agents may be clustered dynamically—e.g., into teams, pairings, or clusters—by a high-level grouping policy; within each group, agents coordinate on local primitives (Hu, 11 Jan 2025, Fu et al., 2024).
Plan/action space abstraction: The action space is compressed via high-level plan selection, with lower levels handling plan instantiation by either RL, collective learning, or model-based control (Qin et al., 22 Sep 2025, Studt et al., 19 Sep 2025).
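To make the task/subtask abstraction concrete, the following is a minimal sketch of a reward machine, written as a plain finite automaton over high-level events. The class name `RewardMachine`, the event labels, and the two-subtask "button then door" task are hypothetical illustrations and are not taken from the cited papers.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite automaton over high-level events; each transition emits a reward."""
    initial: str
    terminal: frozenset
    # Maps (state, event) -> (next_state, reward).
    delta: dict = field(default_factory=dict)

    def step(self, state, event):
        # Events without a matching transition leave the state unchanged, reward 0.
        return self.delta.get((state, event), (state, 0.0))

    def run(self, events):
        state, total = self.initial, 0.0
        for event in events:
            if state in self.terminal:
                break
            state, reward = self.step(state, event)
            total += reward
        return state, total


# Hypothetical cooperative task: one agent presses a button, then another opens a door.
rm = RewardMachine(
    initial="u0",
    terminal=frozenset({"u_done"}),
    delta={
        ("u0", "button_pressed"): ("u1", 0.0),
        ("u1", "door_opened"): ("u_done", 1.0),
    },
)

# The first "door_opened" is ignored because the button has not been pressed yet.
final_state, total_reward = rm.run(["door_opened", "button_pressed", "door_opened"])
```

Each automaton state corresponds to a subtask that could be assigned to an agent or agent group; how that assignment is learned is outside the scope of this sketch.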

Formally, a hierarchical MARL system can be described by a hierarchical Markov (or Semi-Markov) Game $\mathcal{G}$, equipped with high-level and low-level action spaces $\mathcal{A}^{H},\mathcal{A}^{L}$, state spaces, and reward structures. Temporal abstraction leads to a two-timescale stochastic process or a semi-MDP (Yang et al., 2019, Selmonaj et al., 13 May 2025).
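The two-timescale structure can be sketched as a rollout loop in which each agent re-selects an option every $k$ steps and emits a primitive action at every step. Both policies below are random placeholders and the environment transition is a toy; the function names and parameters are assumptions made for illustration only.

```python
import random


def high_level_policy(obs, num_options, rng):
    # Placeholder for pi^h: pick an option (macro-action) uniformly at random.
    return rng.randrange(num_options)


def low_level_policy(obs, option, num_actions, rng):
    # Placeholder for pi^l: a primitive action conditioned on the active option.
    return (option + rng.randrange(num_actions)) % num_actions


def rollout(num_agents=3, horizon=12, k=4, num_options=2, num_actions=5, seed=0):
    rng = random.Random(seed)
    obs = [0] * num_agents            # toy per-agent observations
    options = [None] * num_agents
    trace = []                        # one (t, options, actions) record per step
    for t in range(horizon):
        if t % k == 0:
            # Slow timescale: every k steps each agent re-selects its option.
            options = [high_level_policy(o, num_options, rng) for o in obs]
        # Fast timescale: one primitive action per agent at every step.
        actions = [
            low_level_policy(o, opt, num_actions, rng)
            for o, opt in zip(obs, options)
        ]
        obs = [o + a for o, a in zip(obs, actions)]   # toy transition
        trace.append((t, tuple(options), tuple(actions)))
    return trace


trace = rollout()
```

In this sketch all agents switch options synchronously; asynchronous option termination would require tracking a separate timer per agent.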

2. Core Algorithmic Mechanisms and Architectures

2.1 Two-level (and deeper) Policy Hierarchies

Canonical architectures involve two levels:

High-level (macro) controller $\pi^{h}$: Issues skills, options, or abstract goals at intervals; trained via centralized critics (CTDE) or multi-agent value decomposition (QMIX/VDN).
Low-level (primitive) controller $\pi^{l}$: Executes agent-level primitives conditioned on the current macro-command, possibly via independent or parameter-shared agents. Learning is performed either independently or using local critics, with intrinsic rewards guiding skill acquisition (Yang et al., 2019, Xu et al., 2021).

Deeper hierarchies are supported by generalized frameworks (e.g., TAG's LevelEnv (Paolo et al., 21 Feb 2025)), with modularity that allows heterogeneous learners, arbitrary depth, and interoperability between levels.

2.2 Value Decomposition and Centralized Training

Credit assignment in cooperative HMARL utilizes value decomposition (e.g., QMIX, VDN) at both macro and micro levels (Xu et al., 2021). Centralized critics at each level observe joint observations and actions during training, while execution remains decentralized.
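As a rough illustration of value decomposition, the sketch below contrasts VDN-style additive mixing with a QMIX-style monotonic mixer. The mixer is reduced to a single linear layer with fixed non-negative weights, which is a simplification: QMIX itself produces the mixing weights from a state-conditioned hypernetwork. The utilities, weights, and bias are made-up numbers.

```python
from itertools import product


def vdn_mix(agent_qs):
    # VDN: the joint value is the plain sum of per-agent utilities.
    return sum(agent_qs)


def monotonic_mix(agent_qs, weights, bias):
    # QMIX-style constraint, reduced to one linear layer for illustration:
    # taking abs() keeps weights non-negative, so Q_tot never decreases in any Q_i.
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


# Hypothetical per-agent utilities Q_i(a_i) for 2 agents with 3 actions each.
q_tables = [[0.1, 0.7, 0.3], [0.5, 0.2, 0.9]]
weights, bias = [1.5, -0.4], 0.2   # stand-ins for hypernetwork outputs

# Decentralized execution: each agent greedily maximizes its own utility.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_tables)


def joint_value(joint_action):
    chosen = [q[a] for q, a in zip(q_tables, joint_action)]
    return monotonic_mix(chosen, weights, bias)


# Centralized check: brute-force maximization of Q_tot over joint actions.
best_joint = max(product(range(3), repeat=2), key=joint_value)
```

Because the mixer is monotonic in each agent's utility, the per-agent greedy choice coincides with the joint maximizer here, which is the property that permits decentralized execution after centralized training.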

2.3 Dual/Intrinsic Reward Structures and Advantage Guidance

To keep the levels aligned, intrinsic reward signals at the lower level are shaped using the high-level advantage $A_h$ (e.g., $r^i_t=A_h(s_T,u^h_T)/k$, where $k$ is the number of low-level steps per high-level decision). This aligns low-level policy updates with high-level objectives and stabilizes the flow of credit between levels (Xu et al., 2021, Marzi et al., 31 Jul 2025).
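The advantage-based intrinsic reward can be sketched in a few lines. The code assumes the common definition $A_h = Q_h - V_h$ for the high-level advantage, which the text does not spell out, and uses made-up values for $Q_h$ and $V_h$; the function names are hypothetical.

```python
def high_level_advantage(q_h, v_h):
    # Assumed definition: A_h(s_T, u_T^h) = Q_h(s_T, u_T^h) - V_h(s_T).
    return q_h - v_h


def intrinsic_rewards(q_h, v_h, k):
    # Spread the high-level advantage evenly over the k low-level steps
    # executed under one high-level decision: r_t^i = A_h / k.
    a_h = high_level_advantage(q_h, v_h)
    return [a_h / k for _ in range(k)]


# Made-up critic outputs for one high-level decision lasting k = 4 steps.
rewards = intrinsic_rewards(q_h=2.5, v_h=1.0, k=4)
```

With these numbers each low-level step receives 0.375, and the per-step rewards sum back to the high-level advantage of 1.5, so a macro-action that is better than average yields positive intrinsic reward throughout its execution.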

2.4 Graph- and Message-Passing-based Hierarchies

Message-passing architectures leverage communication graphs for intra-level coordination and hierarchical assignment of goals, combined with advantage-based reward mechanisms across levels (Marzi et al., 31 Jul 2025, Fu et al., 2024). Extensible cooperation graphs (ECGs) dynamically group agents and assign them to cluster-based primitives or cooperative behaviors, with hierarchical graph operators optimizing graph rewiring for efficient scaling (Fu et al., 2024).

3. Theoretical Insights and Algorithmic Benefits

3.1 Sample Efficiency and Scalability

Hierarchical decomposition constrains the effective joint decision space, often reducing it from exponential to polynomial (or even linear) complexity in the number of levels and agents per cluster. Complexity analyses for permutation-invariant grouping (Hu, 11 Jan 2025), reward machine decomposition (Zheng et al., 2024), and plan/grouping abstractions (Qin et al., 22 Sep 2025) report exponential savings over flat MARL.
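A back-of-the-envelope count illustrates the scale of the reduction. The sketch assumes that, once agents are partitioned into equal-sized groups, each group can be optimized separately; it ignores the cost of the high-level grouping decision itself, so it is an idealized counting argument rather than a result from the cited analyses.

```python
def flat_joint_actions(num_agents, num_actions):
    # Flat MARL: one joint action is a tuple with one primitive action per agent.
    return num_actions ** num_agents


def grouped_joint_actions(num_agents, num_actions, group_size):
    # Idealized hierarchy: groups are treated as decoupled, so the search cost
    # is the sum (not the product) of the per-group joint action spaces.
    if num_agents % group_size != 0:
        raise ValueError("num_agents must be divisible by group_size")
    num_groups = num_agents // group_size
    return num_groups * num_actions ** group_size


flat = flat_joint_actions(num_agents=12, num_actions=5)                      # 5**12
grouped = grouped_joint_actions(num_agents=12, num_actions=5, group_size=3)  # 4 * 5**3
```

For 12 agents with 5 actions each, the flat joint space has 244,140,625 elements, while four decoupled groups of three require searching only 500 group-level joint actions; for a fixed group size the latter grows linearly in the number of agents.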

3.2 Monotonic Improvement and Stable Optimization

The use of advantage-weighted intrinsic rewards and off-policy value targets provides monotonic improvement guarantees under policy iteration, provided that only one level is updated at a time (Xu et al., 2021). Sequential update schemes across nested critic hierarchies further prevent destructive gradient interference and improve convergence (Eckel et al., 25 Feb 2026).
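The intuition behind updating one level at a time can be shown with a toy analogy: block coordinate ascent on a two-variable concave objective, where `h` and `l` stand in for high-level and low-level parameters. This is not the cited algorithm, and the objective is invented for illustration; it only shows why freezing one block while exactly improving the other cannot decrease a shared objective.

```python
def objective(h, l):
    # Invented concave objective coupling the two "levels".
    return -(h - 1.0) ** 2 - (l - 2.0) ** 2 - (h - l) ** 2


def update_high(l):
    # Exact maximizer over h with l frozen: dJ/dh = 0  =>  h = (1 + l) / 2.
    return (1.0 + l) / 2.0


def update_low(h):
    # Exact maximizer over l with h frozen: dJ/dl = 0  =>  l = (2 + h) / 2.
    return (2.0 + h) / 2.0


def alternate(h=0.0, l=0.0, rounds=20):
    history = [objective(h, l)]
    for _ in range(rounds):
        h = update_high(l)              # update the high level only
        history.append(objective(h, l))
        l = update_low(h)               # then the low level only
        history.append(objective(h, l))
    return h, l, history


h_final, l_final, history = alternate()
```

Each recorded objective value is at least as large as the previous one, and the iterates approach the joint maximizer at $h=4/3$, $l=5/3$. Simultaneous updates of both blocks carry no such per-step guarantee in general.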

3.3 Handling Non-Stationarity and Credit Assignment

Hierarchical reward assignment, by propagating high-level advantage signals to low-level policies, aligns short-term execution with long-term goals. This mitigates the temporal credit assignment problem and stabilizes concurrent inter-agent and inter-level optimization (Xu et al., 2021, Marzi et al., 31 Jul 2025).

4. Empirical Applications, Benchmarks, and Results

4.1 Cooperative Multi-Agent Games

HMARL methods consistently outperform flat MARL baselines on canonical benchmarks such as StarCraft II micromanagement (SMAC) and Google Research Football (GRF), especially in hard and super-hard tasks where credit assignment and long-horizon coordination are critical (Xu et al., 2021, Ibrahim et al., 2022). In SMAC's MMM2 and 27m_vs_30m, HAVEN achieves 100% and >95% win rates, respectively, versus ~60–70% for QMIX (Xu et al., 2021); HISMA solves all super-hard StarCraft II scenarios with a >99% win rate (Ibrahim et al., 2022).

4.2 Realistic Control and Industrial Domains

Safe navigation in road networks, predator-prey, or drone escort is addressed by hierarchical frameworks combining MARL/MPC or MARL/CBF (control barrier functions) architectures. These approaches achieve near-perfect safety and high task success, substantially surpassing independent or mean-field MARL in episode length and energy cost (Studt et al., 19 Sep 2025, Ahmad et al., 20 Jul 2025).

4.3 Large-Scale and Sparse-Reward Environments

Self-clustering and graph-operator-based models (e.g., HCGL (Fu et al., 2024)) exhibit robust scaling to hundreds of agents, with zero-shot transfer and high success rates in large sparse-reward swarm tasks (e.g., >0.9 final success in CSI-216/24/9, where value decomposition baselines fail).

4.4 Healthcare and Cyber-Physical Systems

Hierarchical MARL has been applied to multi-organ clinical decision-making (Tan et al., 2024) and cyber network defense (Singh et al., 2024). Decomposing action and observation spaces into clinical (organ-specific) agents yields a 45.9% reduction in estimated sepsis mortality versus clinician baselines, while hierarchical PPO in cyber defense accelerates learning, increases precision, and reduces false positives relative to flat or centralized MARL.

5. Design Trade-offs, Limitations, and Future Directions

5.1 Abstraction Choice and Depth

Selecting the appropriate abstraction (e.g., cluster-size, depth of hierarchy, primitives versus learned skills) is task-dependent. Overly coarse or fine abstractions can slow learning or impair expressivity (Fu et al., 2024, Paolo et al., 21 Feb 2025).

5.2 Hand-Crafted versus Learned Decompositions

Most frameworks require hand-designed skills, reward functions, or subtask hierarchies, though recent advances attempt to automate the discovery of task decompositions (e.g., via unsupervised skill discovery, reward machine induction, or flexible graph operators) (Yang et al., 2019, Zheng et al., 2024).

5.3 Stability and Coordination

Joint optimization of deep or non-linear hierarchies remains challenging; sequential or curriculum-based training, and specialized reward assignment, are used to maintain stability (Xu et al., 2021, Eckel et al., 25 Feb 2026, Selmonaj et al., 13 Oct 2025). Fully decentralized learning across all levels without central critics is enabled by frameworks like TAG (Paolo et al., 21 Feb 2025).

5.4 Generalization and Transfer

HMARL approaches demonstrate notable transfer capabilities, e.g., in cyber defense, transferring sub-policies using fine-tuning (Singh et al., 2024), or via zero-shot adaptation to increases in team size or environmental complexity (Fu et al., 2024).

5.5 Open Questions

Key open research questions include scaling guarantees for deep and asynchronous hierarchies, explicit communication protocols within learned groupings, dynamic or learned abstraction selection, safety/fairness constraints, and extensions to partially observable or adversarial environments.

6. Summary Table: Core HMARL Approaches and Features

Method/Paper	Abstraction	Training	Coordination	Key Benchmarks
HAVEN (Xu et al., 2021)	2-level QMIX	CTDE, VDN/QMIX	Dual reward/advantage	SMAC, GRF
TAG (Paolo et al., 21 Feb 2025)	Arbitrary-depth	Decentralized	LevelEnv, local reward	MPE–Spread, VMAS–Balance
HCGL (Fu et al., 2024)	Clustering, ECG	CTDE, MAPPO	Graph operator agents	Swarm Interception
HRCL (Qin et al., 22 Sep 2025)	Plan/grouping	PPO + DCL	EPOS collective opt.	Energy, Drone, Synthetic
HLC (Eckel et al., 25 Feb 2026)	Nested critics	CTDE, SAC	Sequential updates	SimpleSpread, Escort
MHLRS (Liu et al., 2024)	LTL logic, 2-lvl	DQN, LTL	Value iteration shape	Minecraft-GRID
HMARL-CBF (Ahmad et al., 20 Jul 2025)	Skill, CBF	QMIX/PPO	QP, pointwise safety	MetaDrive
RM hierarchy (Zheng et al., 2024)	Reward machines	Option-based	Automata, subtask assign	Navigation, Minecraft
HISMA (Ibrahim et al., 2022)	Latent strat., 2-lvl	QMIX + info plan	Graph attention	SMAC, GRF