Causality-Driven Hierarchical RL (CDHRL)
- CDHRL is a framework in hierarchical reinforcement learning that uses causal structure discovery to define subgoal dependencies.
- It leverages Structural Causal Models and directed acyclic graphs to drive targeted interventions, enhancing exploration and efficient credit assignment.
- Empirical results show that CDHRL can achieve up to 82% success in complex tasks while significantly reducing errors in learned causal graphs.
Causality-Driven Hierarchical Reinforcement Learning (CDHRL) refers to a class of hierarchical RL frameworks in which causality, typically in the form of learned causal graphs over environment variables, subgoals, or effects, shapes structure discovery, exploration, credit assignment, and ultimately the policies themselves. In contrast to random or heuristic subgoal discovery, CDHRL explicitly seeks and exploits the causal dependencies underlying RL tasks, which yields more sample-efficient and robust solutions in complex, sparse-reward, long-horizon environments. This article synthesizes the core CDHRL principles, algorithms, and experimental findings presented in recent literature (Peng et al., 2022, Corcoll et al., 2020, Zhao et al., 4 May 2025, Khorasani et al., 6 Jul 2025).
1. Formalization and Foundations
CDHRL builds upon the hierarchical RL paradigm, but its defining feature is the formal integration of causal structure discovery, using Structural Causal Models (SCMs) or related graphical models, into the construction of the hierarchy. The MDP setting is often reparameterized as a subgoal-based process $\mathcal{M} = (\mathcal{S}, \mathcal{G}, \mathcal{A}, P, R, \gamma)$, where:
- $s \in \mathcal{S}$ is the agent state, often interpreted or extracted as a vector of disentangled environment variables $X = (X_1, \dots, X_n)$.
- $\mathcal{G}$ denotes subgoals, whose achievement is typically tied to certain configurations of $X$.
- $\mathcal{A}$ is the set of primitive actions; $P$ is the transition kernel; $R$ encodes (sub)goal-based rewards; and $\gamma$ is the discount factor.
Subgoal hierarchies are then constructed not heuristically but by identifying the directed acyclic graph (DAG) $G$ (or equivalent) over variables/subgoals, where an edge $X_i \to X_j$ (or $g_i \to g_j$) implies a causal dependency: $X_j$ is not achievable unless $X_i$ has first been achieved. The structural equations typically have the form $X_i = f_i(\mathrm{Pa}_G(X_i), \epsilon_i)$, where $\mathrm{Pa}_G(X_i)$ denotes the parent variables of $X_i$ in $G$, and the $\epsilon_i$ are noise terms.
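The prerequisite reading of the DAG can be made concrete with a minimal Python sketch. The subgoal names and the `PARENTS` mapping are hypothetical, chosen only to resemble a crafting task; they are not taken from the cited papers. The sketch checks whether a subgoal is attemptable and computes each subgoal's depth in the DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical subgoal DAG: each key maps to the set of its parent subgoals.
PARENTS = {
    "wood": set(),
    "stone": set(),
    "stick": {"wood"},
    "pickaxe": {"stick", "stone"},
    "iron": {"pickaxe"},
}


def is_achievable(subgoal, achieved, parents=PARENTS):
    """A subgoal can be attempted only once all of its parents are achieved."""
    return parents[subgoal] <= set(achieved)


def depths(parents=PARENTS):
    """Depth of each subgoal: 0 for roots, else 1 + the max depth of its parents."""
    depth = {}
    # static_order() yields every node after all of its predecessors,
    # and raises graphlib.CycleError if the graph is not acyclic.
    for node in TopologicalSorter(parents).static_order():
        depth[node] = 1 + max((depth[p] for p in parents[node]), default=-1)
    return depth


if __name__ == "__main__":
    print(is_achievable("pickaxe", {"stick"}))  # stone is still missing
    print(depths())
```

Representing the graph as a child-to-parents mapping keeps the prerequisite check to a single subset test, and the depth values give one natural way to group subgoals into hierarchy levels.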
Causal credit assignment and exploration mechanisms arise naturally in this setting, as the DAG structure directly encodes the minimal prerequisites and causal delegation between different subgoals and options (Peng et al., 2022, Khorasani et al., 6 Jul 2025).
2. Causal Structure Learning and Discovery
Central to CDHRL is the discovery of the causal hierarchy—the set of subgoal dependencies—by learning a parameterized SCM from agent experience. Several approaches have been developed:
- Two-phase soft-parameter learning: Structural parameters control edge probabilities. The parameters of the structural equations are updated via sampled subgraphs and the data log-likelihood, alternating with updates of the structural parameters that encourage sparse, interpretable DAGs (Peng et al., 2022).
- Sparse parent-set recovery: For each variable, a linear classifier with sparsity-inducing regularization is fit to intervention data to recover its direct parents (Khorasani et al., 6 Jul 2025).
- Delay-aware distributed discovery: For environments with delayed effects, distributed processes (ranks) learn causal adjacency matrices for different delays, coordinated and filtered via conditional mutual information to eliminate spurious or indirect links (Zhao et al., 4 May 2025).
The data needed for effective SCM fitting is obtained through targeted interventions: subgoal policies are trained for "controllable" variables and then used to generate counterfactual data by driving specific variables to desired values. This focused sampling, in contrast to naive action or random subgoal perturbations, yields denser, more informative gradient signals for causal graph learning (Peng et al., 2022, Khorasani et al., 6 Jul 2025).
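Why targeted interventions give denser data can be illustrated with a toy binary SCM. This is a sketch under stated assumptions: the three variables, the structural equations, and the noise probability `p` are all hypothetical. Forcing an intermediate variable to a desired value makes its downstream effect appear far more often than under passive sampling.

```python
import random


# Hypothetical structural equations: each maps (values so far, noise) to a value.
def f_wood(values, eps):
    return eps  # root variable, driven purely by noise


def f_stick(values, eps):
    return values["wood"] and eps  # requires wood, then succeeds with prob. p


def f_axe(values, eps):
    return values["stick"] and eps  # requires stick, then succeeds with prob. p


# Variables listed in a causal (topological) order.
EQUATIONS = [("wood", f_wood), ("stick", f_stick), ("axe", f_axe)]


def sample(rng, do=None, p=0.2):
    """Draw one sample; variables named in `do` are clamped (an intervention)."""
    do = do or {}
    values = {}
    for name, f in EQUATIONS:
        eps = rng.random() < p
        values[name] = do[name] if name in do else f(values, eps)
    return values


def frequency(var, n, rng, do=None):
    """Fraction of n samples in which `var` is true."""
    return sum(sample(rng, do)[var] for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    print("observational  P(axe):", frequency("axe", 2000, rng))
    print("do(stick=True) P(axe):", frequency("axe", 2000, rng, do={"stick": True}))
```

In the frameworks described above the clamping would be done by a trained subgoal policy rather than by overwriting a variable, but the effect on the data distribution is the same in spirit.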
Conditional independence and CMI-based filters are employed to prune indirect or spurious edges, especially when working with high-dimensional or delayed-effect environments (Zhao et al., 4 May 2025).
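A CMI-based filter can be sketched for discrete data as follows. This is a simplified illustration, not the procedure of the cited work: the plug-in estimator, the `keep_edge` rule, the threshold value, and the toy chain data are all assumptions. In a chain X0 → X1 → X2, the indirect link X0 → X2 shows positive mutual information but vanishes once X1 is conditioned on.

```python
from collections import Counter
from math import log


def cmi(samples, i, j, cond=()):
    """Plug-in estimate of I(X_i; X_j | X_cond) in nats for discrete samples."""
    n = len(samples)
    c_ijz, c_iz, c_jz, c_z = Counter(), Counter(), Counter(), Counter()
    for s in samples:
        z = tuple(s[k] for k in cond)
        c_ijz[(s[i], s[j], z)] += 1
        c_iz[(s[i], z)] += 1
        c_jz[(s[j], z)] += 1
        c_z[z] += 1
    total = 0.0
    for (a, b, z), c in c_ijz.items():
        # p(a,b,z) * log[ p(a,b,z) p(z) / (p(a,z) p(b,z)) ], written in counts.
        total += (c / n) * log(c * c_z[z] / (c_iz[(a, z)] * c_jz[(b, z)]))
    return total


def keep_edge(samples, i, j, others, threshold=0.01):
    """Keep i -> j only if dependence survives conditioning on each other variable."""
    if cmi(samples, i, j) < threshold:
        return False
    return all(cmi(samples, i, j, (k,)) >= threshold for k in others)


# Toy chain: x1 copies x0 and x2 copies x1, each flipped one time in four.
SAMPLES = [
    (x, x ^ n1, x ^ n1 ^ n2)
    for x in (0, 1)
    for n1 in (0, 0, 0, 1)
    for n2 in (0, 0, 0, 1)
]

if __name__ == "__main__":
    print("I(X0;X2)      =", cmi(SAMPLES, 0, 2))
    print("I(X0;X2 | X1) =", cmi(SAMPLES, 0, 2, cond=(1,)))
    print("keep X0->X1:", keep_edge(SAMPLES, 0, 1, others=[2]))
    print("keep X0->X2:", keep_edge(SAMPLES, 0, 2, others=[1]))
```

Conditioning on single variables only is a deliberate simplification; larger conditioning sets catch more indirect paths at the cost of needing more data per cell.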
3. Hierarchical Policy Architecture Informed by Causality
Once the causal hierarchy is established, CDHRL constructs hierarchical goal-conditioned policies or value functions, strictly adhering to the discovered DAG structure. General features include:
- Multi-level policies $\pi_k$: Each level $k$ specializes in achieving subgoals at DAG depth $k$, with higher levels invoking lower-level policies for prerequisite subgoals as defined by the DAG (Peng et al., 2022, Khorasani et al., 6 Jul 2025).
- Option chaining and temporal abstraction: Policies dynamically choose between executing primitive actions and temporally extended "options" that correspond to lower-level subgoal-achievement policies. Hierarchies arise naturally from level-by-level training, in which a new subgoal is trained only once all of its parents are controllable.
- Effect-conditioned or subgoal-conditioned architectures: Some approaches model the low-level policy directly to effectuate a specified "controlled effect," with the high-level policy selecting which effect to target next (Corcoll et al., 2020). In delayed-effect settings, DQN heads are indexed not only by subgoal but also by specific delay and parent set (Zhao et al., 4 May 2025).
- Hindsight Experience Replay (HER): Used pervasively to augment transition tuples, both for reward shaping and for addressing sparse-reward training (Peng et al., 2022, Zhao et al., 4 May 2025).
Reward decomposition schemes are aligned with the causal graph: each subgoal policy maximizes its local (typically indicator) reward, with credit assignment propagating along the effect hierarchy, not the flat action time-scale (Corcoll et al., 2020).
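The way a higher-level policy delegates to lower-level ones can be sketched as a recursive expansion over the DAG. This is an illustrative skeleton only: the `PARENTS` mapping is hypothetical, and the body of `achieve` stands in for running a learned goal-conditioned policy. Each subgoal receives a local indicator reward once it is achieved.

```python
# Hypothetical subgoal DAG: each key maps to the set of its parent subgoals.
PARENTS = {
    "wood": set(),
    "stone": set(),
    "stick": {"wood"},
    "pickaxe": {"stick", "stone"},
    "iron": {"pickaxe"},
}


def indicator_reward(subgoal, achieved):
    """Local reward for a subgoal policy: 1 once the subgoal holds, else 0."""
    return 1.0 if subgoal in achieved else 0.0


def achieve(subgoal, achieved, trace, parents=PARENTS):
    """Achieve all prerequisites (recursively), then the subgoal itself."""
    if subgoal in achieved:
        return
    for parent in sorted(parents[subgoal]):
        achieve(parent, achieved, trace, parents)
    # Stand-in for executing this subgoal's own low-level policy to completion.
    trace.append(subgoal)
    achieved.add(subgoal)


if __name__ == "__main__":
    achieved, trace = set(), []
    achieve("iron", achieved, trace)
    print(trace)
    print(indicator_reward("iron", achieved))
```

Because each call only handles its own subgoal after its parents are satisfied, credit for success or failure attaches to one node of the hierarchy rather than to a long flat sequence of primitive actions.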
4. Causality-Guided Exploration and Intervention Design
A defining innovation in CDHRL is the use of the learned causal hierarchy to drive exploration and data collection:
- Intervention scheduling: Rather than $\epsilon$-greedy or random exploration at the primitive level, interventions are targeted toward subgoals/variables whose achievement is causally relevant for higher-level progress, with scheduling prioritized by:
  - Estimated expected causal effect (ECE) on the final goal (Khorasani et al., 6 Jul 2025).
  - Shortest-path heuristics in the causal DAG to the final subgoal (Khorasani et al., 6 Jul 2025).
- Reverse data collection and delay awareness: In distributed settings, "reverse segments" of variable length are extracted once interventions reveal sharp state changes, inverted over candidate delays to robustly attribute cause and effect (Zhao et al., 4 May 2025).
- Curriculum shaping: The exploration process naturally reflects the “causal curriculum,” focusing first on foundational subgoals and their combinations before higher-level composite achievements (Peng et al., 2022).
These strategies yield intervention trajectories that improve both the efficiency and accuracy of causal discovery and accelerate subgoal skill acquisition.
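A shortest-path scheduling rule of this kind can be sketched in a few lines. This is a simplified illustration and not the algorithm of the cited work: the DAG, the function names, and the tie-breaking rule are assumptions. Candidates are subgoals that are not yet controllable, whose parents are all controllable, and that lie on some path to the final goal; they are ranked by their distance to the goal.

```python
from collections import deque

# Hypothetical subgoal DAG; "flower" and "bouquet" are irrelevant to the goal.
PARENTS = {
    "wood": set(),
    "stone": set(),
    "stick": {"wood"},
    "pickaxe": {"stick", "stone"},
    "iron": {"pickaxe"},
    "flower": set(),
    "bouquet": {"flower"},
}


def distance_to_goal(parents, goal):
    """BFS over reversed edges: number of causal edges from each ancestor to goal."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def next_interventions(parents, goal, controllable):
    """Trainable, goal-relevant subgoals, nearest to the goal first."""
    dist = distance_to_goal(parents, goal)
    frontier = [
        g for g in dist
        if g not in controllable and parents[g] <= controllable
    ]
    return sorted(frontier, key=lambda g: (dist[g], g))


if __name__ == "__main__":
    print(next_interventions(PARENTS, "iron", set()))
    print(next_interventions(PARENTS, "iron", {"wood", "stone"}))
```

Subgoals with no path to the goal never enter `dist`, so they are never scheduled, which is the main saving over random subgoal selection in this toy setting.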
5. Theoretical and Empirical Performance
CDHRL frameworks have been analyzed both theoretically and empirically:
- Sample complexity improvements: Theoretical results (e.g., for trees with a fixed branching factor and sparse Erdős–Rényi DAGs) show that causality-driven targeted interventions reduce the subgoal training cost relative to random intervention selection in trees, with similarly sharp improvements in random graphs (Khorasani et al., 6 Jul 2025).
- Superiority in sparse, compositional tasks: Empirical studies in environments such as 2D-Minecraft and Eden demonstrate that CDHRL achieves markedly higher sample efficiency and final performance compared to HRL approaches lacking causal guidance, particularly in long-horizon, sparse-reward settings (Peng et al., 2022, Zhao et al., 4 May 2025).
- Robustness to delayed effects: Distributed CDHRL with delay modeling (e.g., D³HRL) shows high sensitivity and accuracy in discovering true causal links, even when effects and rewards are substantially delayed in time (Zhao et al., 4 May 2025).
- DAG quality: Targeted interventions and hierarchical policy-based intervention design both reduce the structural Hamming distance (SHD) and structural intervention distance (SID) of the learned DAGs relative to ground truth, and also reduce overfitting to spurious or indirect edges (Peng et al., 2022, Zhao et al., 4 May 2025).
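For reference, the SHD metric mentioned above can be computed as in the following sketch. It assumes graphs given as sets of `(parent, child)` pairs and uses the convention that a reversed edge counts as one error; some papers count a reversal as two, so reported values depend on the convention. The example graphs are hypothetical.

```python
def shd(true_edges, learned_edges):
    """Structural Hamming distance between two DAGs given as (parent, child) sets.

    Counts missing edges, extra edges, and reversed edges, one error each.
    """
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    diff = true_edges ^ learned_edges  # edges present in exactly one graph
    # A reversal appears in diff twice, as (a, b) and (b, a); count it once.
    reversals = {frozenset(e) for e in diff if (e[1], e[0]) in diff}
    return len(diff) - len(reversals)


if __name__ == "__main__":
    truth = {("wood", "stick"), ("stick", "pickaxe"), ("stone", "pickaxe")}
    learned = {("wood", "stick"), ("pickaxe", "stick"), ("wood", "pickaxe")}
    # One reversed edge, one missing edge, one spurious edge.
    print(shd(truth, learned))
```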
A table summarizing select experimental results is given below (numbers are as reported):
| Method | 2D-Minecraft success at a fixed step budget (%) | Eden mean survival (steps) | SHD (GetIron) |
|---|---|---|---|
| OHRL | 35 | — | — |
| HAC | 42 | 80 | — |
| MEGA | 40 | 88 | — |
| CDHRL | 82 | 120 | 24.0 |
| D³HRL | — | — | 2.0 |
6. Distinctive Methodological Features and Variants
The literature presents several complementary instantiations of CDHRL, including:
- Controlled-effect HRL: Focuses on disentangling agent-controlled effects from exogenous environmental dynamics. Variational autoencoders over effect vectors induce a compositional latent space for temporally-abstract subgoal selection (Corcoll et al., 2020).
- SSD-based subgoal discovery: Employs sparse subset discovery for high-fidelity parent-set recovery and intervention-based refinement, with tight sample complexity guarantees (Khorasani et al., 6 Jul 2025).
- Distributed, delay-aware CDHRL: Adopts a factored SMDP model with distributed causal discovery and explicit filtering of spurious dependencies via CMI, achieving better scalability and resilience to multi-step delays (Zhao et al., 4 May 2025).
This diversity highlights both methodological robustness and adaptability to domains with complex, nontrivial causal-temporal dynamics.
7. Limitations and Open Challenges
While CDHRL offers significant advantages in RL environments with compositional, structured dependencies and sparse/long-delayed rewards, several challenges and limitations remain:
- Assumptions on environment representations: Many approaches require access to disentangled, entity-centric state spaces; performance in high-dimensional, raw-pixel domains still depends on representation learning (Peng et al., 2022, Corcoll et al., 2020).
- Determinism and normality heuristics: Certain CDHRL variants assume deterministic transitions or employ heuristic “normality” criteria for effect isolation. This places constraints on applicability in stochastic, partially observable, or multi-agent domains (Corcoll et al., 2020).
- Interventional limitations: Causal discovery and curriculum learning effectiveness may degrade if full or accurate interventions are not possible due to environment design or constraints.
Future directions identified in the literature include extending CDHRL to richer state representations, robustness to confounding and partial observability, and applications to multi-agent cooperation where joint effects and causation must be discovered (Peng et al., 2022, Corcoll et al., 2020, Zhao et al., 4 May 2025).
Principal references:
- "Causality-driven Hierarchical Structure Discovery for Reinforcement Learning" (Peng et al., 2022)
- "Disentangling causal effects for hierarchical reinforcement learning" (Corcoll et al., 2020)
- "Hierarchical Reinforcement Learning with Targeted Causal Interventions" (Khorasani et al., 6 Jul 2025)
- "D3HRL: A Distributed Hierarchical Reinforcement Learning Approach Based on Causal Discovery and Spurious Correlation Detection" (Zhao et al., 4 May 2025)