Hierarchical Multi-Agent Reinforcement Learning
- Hierarchical multi-agent reinforcement learning is a framework that decomposes complex decision-making into multiple levels, using high-level controllers (meta-controllers) and low-level skill policies.
- It leverages architectures like centralized training with decentralized execution and dynamic termination to enhance sample efficiency, safety, and interpretability in environments from air combat to logistics.
- Empirical evidence shows HMARL improves performance metrics such as win rates, profit margins, and safety guarantees, making it suitable for diverse applications including tactical decision-making and resource management.
Hierarchical Multi-Agent Reinforcement Learning (HMARL) refers to the organization of multi-agent decision-making processes into multiple levels of temporal or structural abstraction. In this paradigm, agents operate not only at the primitive action level but also over extended macro-actions, skills, or roles that are coordinated by higher-level controllers. HMARL architectures decompose the complexity inherent in multi-agent systems—such as combinatorial action spaces, partial observability, heterogeneity, and coordination requirements—by introducing clear modular separations of responsibility and facilitating scalable, sample-efficient, and interpretable policy learning. Recent research demonstrates that HMARL frameworks are effective in domains ranging from air combat and energy market arbitrage to distributed scheduling, cyber-physical system security, and medical decision support.
1. Hierarchical Architectures: Abstractions, Layering, and Policy Decomposition
Hierarchical architectures in multi-agent reinforcement learning instantiate multiple policy levels aligned with distinct temporal or functional resolutions. A typical topology is a two-level structure comprising:
- High-Level Controllers ("Commander" or "Meta-Controller" policies): These agents operate on temporally extended time-scales or over aggregate/abstracted state representations. Actions at this level correspond to options, skills, or macro-actions, which may include mission-phase selections (e.g., attack/defend/engage in air combat (Selmonaj et al., 13 Oct 2025, Selmonaj et al., 13 May 2025)), goal assignment (e.g., warehouse zone allocation (Krnjaic et al., 2022)), cluster formation and task distribution (e.g., in cooperation graphs (Fu et al., 26 Mar 2024)), or trajectory-level strategies (e.g., mixture agents for organ treatments (Tan et al., 6 Sep 2024)).
- Low-Level Controllers ("Worker" or "Skill" policies): These policies execute fine-grained control, often conditioned on high-level commands or contexts. Examples include combat maneuver execution in simulated aircraft (Selmonaj et al., 13 Oct 2025), power control during spatial reuse in WLANs (Yu et al., 17 Jun 2025), or safety-critical actuation subject to control barrier functions (Ahmad et al., 20 Jul 2025).
Variants may introduce additional hierarchical layers (e.g., secondary clusters, leaf agents, mixture agents (Tan et al., 6 Sep 2024, Fu et al., 26 Mar 2024)), or include dynamically evolving hierarchies—as in extensible cooperation graphs, where graph operators adaptively rewire agent-cluster-target relations in response to observations and task demands (Fu et al., 26 Mar 2024).
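The two-level decomposition above can be made concrete with a minimal sketch: a commander re-selects an option at fixed decision epochs, and each worker conditions its primitive action on the active option. The class names, the gym-style environment interface, and the random placeholder policies are illustrative assumptions rather than any cited framework's implementation.

```python
import numpy as np

class Commander:
    """High-level policy: picks a temporally extended option (e.g., attack/defend/engage)."""
    def __init__(self, n_options, rng):
        self.n_options = n_options
        self.rng = rng

    def select_option(self, team_obs):
        # Placeholder for a learned policy over aggregated/abstracted team observations.
        return self.rng.integers(self.n_options)

class Worker:
    """Low-level policy: outputs primitive actions conditioned on the active option."""
    def __init__(self, n_actions, rng):
        self.n_actions = n_actions
        self.rng = rng

    def act(self, obs, option):
        # Placeholder for an option-conditioned policy, e.g., a network fed (obs, one-hot option).
        return self.rng.integers(self.n_actions)

def run_episode(env, commander, workers, option_horizon=10):
    """Roll out one episode with a fixed option horizon between high-level decisions."""
    obs = env.reset()
    option, done, t = None, False, 0
    while not done:
        if t % option_horizon == 0:                      # high-level decision epoch
            option = commander.select_option(obs)
        actions = [w.act(o, option) for w, o in zip(workers, obs)]
        obs, reward, done, info = env.step(actions)      # joint primitive step
        t += 1
```

Frameworks with dynamic termination replace the fixed `option_horizon` with a learned or penalized termination decision (see Section 2).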
Policy Assignment and Specialized Coordination
Many frameworks implement parameter sharing within agent types or sub-policies to leverage symmetries and accelerate learning, as seen in air combat and warehouse logistics domains (Selmonaj et al., 13 May 2025, Krnjaic et al., 2022). Others explicitly segregate learning signal propagation: high-level rewards are decomposed and delivered as auxiliary or intrinsic signals to guide low-level policy optimization (e.g., advantage-based rewards in HAVEN (Xu et al., 2021)). Clusters or subgroups of agents are increasingly assigned via task- or topology-driven dynamic grouping (e.g., self-clustering in ECG (Fu et al., 26 Mar 2024) or plan grouping in HRCL (Qin et al., 22 Sep 2025)).
2. Mathematical and Algorithmic Foundations
Markov Game and Hierarchical Options
The formal foundation of HMARL is typically a hierarchical Markov game or a Partially Observable Semi-Markov Game (POSMG) (Selmonaj et al., 13 Oct 2025), where each agent's policy is factored as a composition of temporally extended decision-making and primitive-level action selection:

$$\pi_i(a_t \mid s_t) = \pi_i^{\mathrm{hi}}(z_k \mid s_k)\,\pi_i^{\mathrm{lo}}(a_t \mid s_t, z_k),$$

where the high-level policy $\pi_i^{\mathrm{hi}}$ selects an option/skill $z_k$ at decision epoch $k$, and the low-level policy $\pi_i^{\mathrm{lo}}$ outputs primitive actions $a_t$ according to the active option.
Training Algorithms
Common training strategies invoke Centralized Training with Decentralized Execution (CTDE), often using actor-critic variants with mixing networks (e.g., QMIX, VDN) at both levels (Xu et al., 2021, Yang et al., 2019). Losses are constructed to factor value functions and propagate credit appropriately across agents and time scales; for instance, the HAVEN framework uses a high-level advantage function distributed as an intrinsic reward to low-level learners (see equations and pseudocode in (Xu et al., 2021)).
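As a condensed illustration of the advantage-as-intrinsic-reward idea attributed to HAVEN above, the sketch below redistributes a high-level advantage, computed once per decision epoch, to every low-level step inside that epoch. The mixing coefficient `beta` and the broadcast form are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def mixed_low_level_rewards(high_q, high_v, low_extrinsic_r, beta=0.5):
    """Distribute a high-level advantage as an intrinsic reward to low-level learners.

    high_q:          Q(s_k, z_k) of the selected option at each decision epoch, shape (K,)
    high_v:          V(s_k) at each decision epoch, shape (K,)
    low_extrinsic_r: per-step environment rewards within each epoch, shape (K, H)
    """
    advantage = high_q - high_v                 # A(s_k, z_k): one scalar per epoch
    # Broadcast the epoch-level advantage to every low-level step of that epoch
    # and mix it with the extrinsic reward signal.
    return low_extrinsic_r + beta * advantage[:, None]
```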
More recent works leverage trust-region policy optimization (e.g., MA-SPO (Selmonaj et al., 13 Oct 2025)), policy regularization techniques, or deep deterministic policy gradients (e.g., HMADDPG for market arbitrage (Zhang et al., 22 Jul 2025)). HMARL approaches for constraint-heavy or safety-critical domains, such as (Ahmad et al., 20 Jul 2025), solve constrained quadratic programs (CBF-QPs) at each low-level action step to guarantee pointwise safety.
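To illustrate the per-step safety filtering described above, the following sketch solves a CBF-QP that projects a nominal (RL-produced) action onto the safe set for a single agent with single-integrator dynamics. The dynamics, barrier function, and use of cvxpy are simplifying assumptions; the cited work handles the multi-agent, hierarchical case.

```python
import numpy as np
import cvxpy as cp

def cbf_qp_filter(x, u_nominal, obstacle, radius, alpha=1.0):
    """Project a nominal low-level action onto the CBF-defined safe set.

    Single-integrator dynamics x_dot = u; barrier h(x) = ||x - obstacle||^2 - radius^2.
    Pointwise safety condition: grad_h(x)^T u + alpha * h(x) >= 0.
    """
    h = float(np.sum((x - obstacle) ** 2) - radius ** 2)
    grad_h = 2.0 * (x - obstacle)

    u = cp.Variable(len(u_nominal))
    objective = cp.Minimize(cp.sum_squares(u - u_nominal))   # stay close to the RL action
    constraints = [grad_h @ u + alpha * h >= 0]              # CBF constraint
    cp.Problem(objective, constraints).solve()
    return u.value

# The nominal action drives the agent toward the obstacle; the QP attenuates it.
safe_u = cbf_qp_filter(x=np.array([0.0, 0.0]),
                       u_nominal=np.array([1.0, 0.0]),
                       obstacle=np.array([1.5, 0.0]),
                       radius=1.0)
```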
Dynamic Termination
A key challenge is the termination of options in a multi-agent setting. The dynamic termination Bellman equation extends the Q-function to include explicit termination decisions and penalties, balancing the trade-off between flexibility and predictability in intent broadcast (Han et al., 2019):

$$Q(s_t, \omega_t) = \mathbb{E}\!\left[\, r_t + \gamma \max\!\big( Q(s_{t+1}, \omega_t),\; \max_{\omega'} Q(s_{t+1}, \omega') - \delta \big) \right],$$

where $\omega_t$ is the currently active option and the termination penalty $\delta$ trades off the flexibility of switching options against the predictability of committed, broadcast intentions.
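A tabular sketch of this continue-vs-switch target is given below; the penalty `delta`, the learning rate, and the tabular representation are illustrative simplifications of the cited method.

```python
import numpy as np

def dynamic_termination_target(Q, s_next, option, reward, gamma=0.99, delta=0.1):
    """One-step target: either continue the active option or switch and pay delta.

    Q: value table of shape (n_states, n_options).
    """
    continue_value = Q[s_next, option]           # keep broadcasting the same intention
    switch_value = np.max(Q[s_next]) - delta     # best alternative option, minus penalty
    return reward + gamma * max(continue_value, switch_value)

def q_update(Q, s, option, reward, s_next, lr=0.1, gamma=0.99, delta=0.1):
    target = dynamic_termination_target(Q, s_next, option, reward, gamma, delta)
    Q[s, option] += lr * (target - Q[s, option])
    return Q
```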
Hierarchy of Reward Machines
In cooperative domains with high event interdependence, the hierarchy of Reward Machines (MAHRM) (Zheng et al., 8 Mar 2024) decomposes global objectives into hierarchical RM states, where each sub-proposition or high-level event is handled by a policy operating over a subset of agents; learning is recursive and policies for composite events coordinate via assignment options at each RM state.
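A minimal, flat reward machine is sketched below to make the construct concrete: RM states advance on high-level events (propositions) and emit rewards on transitions. In MAHRM, each such proposition can itself be delegated to a lower-level policy handled by a subset of agents; the event names and the assignment mapping here are purely illustrative.

```python
class RewardMachine:
    """Minimal reward machine: states u0..u2, event-triggered transitions, transition rewards."""
    def __init__(self):
        # (state, event) -> (next_state, reward); event names are illustrative.
        self.delta = {
            ("u0", "button_pressed"): ("u1", 0.0),
            ("u1", "door_opened"):    ("u2", 1.0),   # task complete
        }
        self.state = "u0"

    def step(self, events):
        reward = 0.0
        for e in events:                              # propositions detected this step
            if (self.state, e) in self.delta:
                self.state, r = self.delta[(self.state, e)]
                reward += r
        return reward

# Hierarchical (MAHRM-style) use: each proposition is handled by a policy over a subset
# of agents, coordinated via assignment options at the corresponding RM state.
subtask_assignment = {"button_pressed": ["agent_0"], "door_opened": ["agent_1", "agent_2"]}
```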
3. Core Applications and Empirical Evidence
Tactical and Strategic Decision-Making
The hierarchical decomposition is especially impactful in multi-agent tactical environments. In 3D air combat simulation (Selmonaj et al., 13 Oct 2025, Selmonaj et al., 13 May 2025), agents are grouped into heterogeneous teams (e.g., F16/A4), with low-level policies trained via curricula on tracking/adversarial engagement, and high-level commanders learning to sequence tactical options via league play. Hierarchical agents attain high win rates in 10-vs-10 combat, whereas non-hierarchical agents fail to coordinate.
Distributed Logistics and Resource Management
In warehouse order-picking, hierarchical MARL managers assign spatial workloads (zones) to workers (robots/humans), reducing the worker action space and easing congestion—yielding higher pick rates compared to classical heuristics or flat MARL (Krnjaic et al., 2022).
In local energy markets, HMARL enables aggregators to synchronize bidding strategies across electricity and flexibility markets, achieving higher profits than independent MARL baselines via coordinated arbitrage (see (Zhang et al., 22 Jul 2025), Table 1).
Safety-Critical Systems
HMARL with integrated CBFs rigorously guarantees pointwise safety in multi-agent navigation tasks, outperforming alternative methods by achieving near-perfect safety/success rates in congested environments (Ahmad et al., 20 Jul 2025). In cyber-physical system security, hierarchical defenders coordinated by a central agent exhibit superior detection F1-scores and operational continuity under adaptive attack scenarios compared to flat MARL or rule-based baselines (Alqithami, 12 Jun 2025).
Scalability, Transfer, and Knowledge Integration
Self-clustering via explicit cooperation graphs (Fu et al., 26 Mar 2024) enables fault-tolerant, interpretable, and knowledge-augmented HMARL for large-scale (hundreds of agents) sparse-reward tasks. The explicit hierarchical structure supports zero-shot transfer and rapid curriculum scaling to larger swarm sizes. In combinatorial optimization (assignment, plan selection), hybrid frameworks combine high-level MARL for abstract partitioning and low-level decentralized collective learning for coordinated selection, facilitating tractable learning in high-dimensional plan spaces (Qin et al., 22 Sep 2025).
4. Coordination, Communication, and Reward Design
Explicit and Implicit Coordination
Hierarchical frameworks promote coordination both via explicit mechanisms—such as meta-controller guidance in scheduling (controller pairs and constraints (Kumar et al., 2017)), or inter-agent communication channels in multi-organ medical decision support (Tan et al., 6 Sep 2024)—and implicit reward design (e.g., dual advantage-based rewards (Xu et al., 2021), mutual information regularization for strategy identifiability (Ibrahim et al., 2022)).
Reward machine (RM) hierarchies encode both option/subtask completions and event dependencies, while fairness and throughput are jointly optimized in WLAN coordination via log-inverse historic-throughput terms and multi-objective critics (Yu et al., 17 Jun 2025).
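One way to read the WLAN fairness term mentioned above is as a per-agent weighting of instantaneous throughput by the log of the inverse of historically accumulated throughput, so that previously starved stations are up-weighted. The exact functional form, the handling of the history, and the epsilon term below are assumptions for illustration.

```python
import numpy as np

def fairness_weighted_rewards(instant_tput, historic_tput, eps=1e-6):
    """Per-agent reward combining current throughput with a log-inverse-history fairness weight.

    instant_tput, historic_tput: arrays of shape (n_agents,), non-negative.
    """
    weights = np.log(1.0 + 1.0 / (historic_tput + eps))   # large when an agent has been starved
    return instant_tput * weights
```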
Interpretability and Knowledge Integration
By making the hierarchical structure, node assignments, and reward decomposition explicit, frameworks such as ECG (Extensible Cooperation Graph) facilitate direct integration of domain knowledge via cooperatively programmed macro-actions, support interpretability (graph visualization), and enable efficient exploration even under reward sparsity (Fu et al., 26 Mar 2024).
5. Exploiting Hierarchy for Scalability, Sample Efficiency, and Robustness
Sample Efficiency and Convergence
Hierarchical decomposition consistently leads to orders-of-magnitude improvements in sample efficiency and convergence rates in empirical studies. In air combat, curriculum and hierarchy accelerate learning of combat skills, allowing scale-up to 15-vs-15 teams (Selmonaj et al., 13 May 2025); in warehouse logistics, hierarchical MARL achieves higher pick rates faster than flat MARL (Krnjaic et al., 2022).
Scalability
Hierarchical structures decouple agent decisions, reduce joint action/state space complexity, and permit parallel/distributed policy learning (e.g., block-decentralized LQR via per-group RL (Bai et al., 2020)). Transfer learning is drastically improved: ECG-based HMARL policies are directly portable to systems several times larger than their training environment, with minor performance degradation and rapid fine-tuning recovery (Fu et al., 26 Mar 2024).
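The block-decentralized LQR idea can be illustrated with a closed-form sketch: when the dynamics and costs decompose into per-group blocks, each group synthesizes (or learns) its own gain on its block rather than on the joint system. The cited work learns these controllers via per-group RL; the Riccati-based solution below only illustrates the decoupling structure and assumes discrete-time, block-diagonal dynamics.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def blockwise_lqr_gains(A_blocks, B_blocks, Q_blocks, R_blocks):
    """Compute one LQR gain per agent group under block-diagonal dynamics and costs."""
    gains = []
    for A, B, Q, R in zip(A_blocks, B_blocks, Q_blocks, R_blocks):
        P = solve_discrete_are(A, B, Q, R)                       # per-block Riccati solution
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)        # u_g = -K x_g for group g
        gains.append(K)
    return gains
```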
Robustness and Adaptation
Hierarchical frameworks, especially those incorporating adversarial elements and meta-controllers, foster resilience against evolving threats (CPS security (Alqithami, 12 Jun 2025)), high-dimensional plan drift (HRCL (Qin et al., 22 Sep 2025)), and dynamic resource constraints (OpenRAN handover (Giarrè et al., 11 Mar 2025)).
6. Limitations and Active Research Frontiers
Despite empirical successes, current HMARL frameworks exhibit several open challenges:
- Termination Synchronization and Predictability: Balancing option switching flexibility and commitment for intention sharing, especially under partial observability, remains a nuanced issue (Han et al., 2019).
- Communication Bottlenecks: Explicit communication among agents, while enhancing coordination, can incur overhead in large-scale systems; research is ongoing on selective or attention-based message-passing (Ryu et al., 2019).
- Safety Assurance: While CBF-based approaches offer guarantees, integrating robust, real-time constraint enforcement in highly dynamic or non-linear settings is an open direction (Ahmad et al., 20 Jul 2025).
- Reward Shaping and Transferability: Necessity for domain-specific reward engineering persists in some applications, although mechanisms such as reward machine hierarchies and knowledge-augmented actions offer promising flexibility.
- Generalization Beyond Training Distribution: Policies trained under fixed topologies or agent sets may face difficulties in rapid adaptation to new tasks; however, curriculum learning and explicit graph-based hierarchies have shown promise for scalable generalization (Fu et al., 26 Mar 2024, Selmonaj et al., 13 Oct 2025).
7. Representative Frameworks and Comparisons
| Paper | Hierarchical Structure | Domain/Task | Notable Metric/Gain |
|---|---|---|---|
| (Selmonaj et al., 13 Oct 2025) | Two-level (commander/maneuver) | Heterogeneous air combat | High win rate vs. non-hierarchical baselines |
| (Zhang et al., 22 Jul 2025) | Aggregator: primary/secondary sub-agents | Local energy market arbitrage | Increased profit over independent MARL |
| (Krnjaic et al., 2022) | Manager/worker agents | Warehouse logistics (PTG/GTP) | Highest pick rates |
| (Fu et al., 26 Mar 2024) | Explicit cooperation graph, operators | Sparse-reward swarm interception | 0.97 vs 0 success rate |
| (Ahmad et al., 20 Jul 2025) | Skill hierarchy + CBFs | Multi-agent safety-critical navigation | Near-perfect safety rate |
For an expanded taxonomy, see also (Qin et al., 22 Sep 2025) (HRCL for urban optimization), (Zheng et al., 8 Mar 2024) (reward machine hierarchies), (Yang et al., 2019, Xu et al., 2021), and (Han et al., 2019) for skill-based and dynamic termination variants.
In summary, hierarchical multi-agent reinforcement learning (HMARL) advances the state of the art in scalable, robust, and interpretable multi-agent coordination by introducing principled structures that separate command from control, exploit temporal and structural abstraction, and enable effective credit assignment, sample-efficient learning, and robust policy generalization across a broad spectrum of challenging environments. The ongoing evolution centers on deeper theory of hierarchy formation, integration of domain knowledge and safety assurances, and further reduction of engineering overhead through policy modularity, transfer, and interpretability.