Causality-Driven Hierarchical RL (CDHRL)

Updated 5 March 2026

CDHRL is a framework in hierarchical reinforcement learning that uses causal structure discovery to define subgoal dependencies.
It leverages Structural Causal Models and directed acyclic graphs to drive targeted interventions, enhancing exploration and efficient credit assignment.
Reported results include 82% success on 2D-Minecraft after $5 \times 10^6$ steps, and targeted interventions reduce structural errors in the learned causal graphs.

Causality-Driven Hierarchical Reinforcement Learning (CDHRL) refers to a class of hierarchical RL frameworks in which causality, typically in the form of learned causal graphs over environment variables, subgoals, or effects, shapes structure discovery, exploration, credit assignment, and ultimately the policies themselves. In contrast to random or heuristic subgoal discovery, CDHRL explicitly seeks and exploits the causal dependencies present in RL tasks, producing more sample-efficient and robust solutions in complex, sparse-reward, and long-horizon environments. This article synthesizes the core CDHRL principles, algorithms, and experimental findings presented in recent literature (Peng et al., 2022, Corcoll et al., 2020, Zhao et al., 4 May 2025, Khorasani et al., 6 Jul 2025).

1. Formalization and Foundations

CDHRL builds upon the hierarchical RL paradigm, but its defining feature is the formal integration of causal structure discovery—using Structural Causal Models (SCMs) or related graphical models—into the construction of the hierarchy. The MDP setting is often reparameterized as a subgoal-based process: $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{G}, T, R, \gamma)$ where:

$\mathcal{S}$ is the state space; each state is often interpreted or extracted as a vector of $M$ disentangled environment variables $X = (X_1, \dots, X_M)$.
$\mathcal{G}$ denotes the set of subgoals, whose achievement is typically tied to certain configurations of $X$.
$\mathcal{A}$ is the set of primitive actions; $T$ is the transition kernel; $R$ encodes (sub)goal-based rewards; and $\gamma$ is the discount factor.

Subgoal hierarchies are then constructed not heuristically but by identifying a directed acyclic graph (DAG) $C$ (or an equivalent structure) over variables or subgoals, where an edge $g_i \to g_j$ (or $X_i \to X_j$) implies a causal dependency: $g_j$ is not achievable unless $g_i$ has first been achieved. The structural equations typically have the form $X_{i, t+1} = f_i(X_{\mathrm{pa}(i), t}, a_t, N_i)$, where $\mathrm{pa}(i)$ denotes the parents of $X_i$ in $C$ and $N_i$ is an exogenous noise term.
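The prerequisite semantics of these structural equations can be illustrated with a minimal sketch. The three-variable crafting world, the variable names, and the `get_<variable>` action names are hypothetical and do not come from the cited papers; the noise term is omitted for clarity, and each variable is assumed to persist once achieved.

```python
from typing import Callable, Dict, Tuple

# Hypothetical parent sets pa(i) for a toy crafting world (illustrative only).
PARENTS: Dict[str, Tuple[str, ...]] = {
    "wood": (),
    "stick": ("wood",),
    "pickaxe": ("wood", "stick"),
}

def make_mechanism(var: str) -> Callable[[Dict[str, int], str], int]:
    """Build f_i: the variable becomes 1 only if the matching action is taken
    while every parent in pa(i) is already 1 (noise N_i omitted)."""
    def f(state: Dict[str, int], action: str) -> int:
        if state[var] == 1:
            return 1  # assumption: an achieved variable stays achieved
        parents_ready = all(state[p] == 1 for p in PARENTS[var])
        return int(action == f"get_{var}" and parents_ready)
    return f

MECHANISMS = {v: make_mechanism(v) for v in PARENTS}

def step(state: Dict[str, int], action: str) -> Dict[str, int]:
    """Apply every structural equation once to obtain the next state."""
    return {v: MECHANISMS[v](state, action) for v in PARENTS}

s0 = {"wood": 0, "stick": 0, "pickaxe": 0}
blocked = step(s0, "get_pickaxe")  # parents missing, so nothing changes
s1 = step(s0, "get_wood")
s2 = step(s1, "get_stick")
s3 = step(s2, "get_pickaxe")       # all prerequisites are now met
```

In this toy setting the edge `stick -> pickaxe` has exactly the meaning given above: the pickaxe action has no effect until the stick variable is already set.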

Causal credit assignment and exploration mechanisms arise naturally in this setting, as the DAG structure directly encodes the minimal prerequisites and causal delegation between different subgoals and options (Peng et al., 2022, Khorasani et al., 6 Jul 2025).

2. Causal Structure Learning and Discovery

Central to CDHRL is the discovery of the causal hierarchy—the set of subgoal dependencies—by learning a parameterized SCM from agent experience. Several approaches have been developed:

Two-phase soft-parameter learning: Structural parameters $\eta \in \mathbb{R}^{M \times M}$ control edge probabilities. The functional parameters $\theta_i$ of the structural equations $f_i(\cdot; \theta_i)$ are fit by maximizing data log-likelihood under sampled subgraphs, alternating with updates on $\eta$ that encourage sparse, interpretable DAGs (Peng et al., 2022).
Sparse parent-set recovery: For each variable, a linear classifier with $\ell_1$ or $\ell_0$ regularization is fit to intervention data to recover direct parents (Khorasani et al., 6 Jul 2025).
Delay-aware distributed discovery: For environments with delayed effects, distributed processes (ranks) learn causal adjacency matrices for different delays, coordinated and filtered via conditional mutual information to eliminate spurious or indirect links (Zhao et al., 4 May 2025).
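As an illustration of the soft-parameter idea in the first item, the sketch below maps structural parameters to edge probabilities and samples candidate subgraphs. It assumes, without confirmation from the source, that `eta[i][j]` is a logit for the edge $X_i \to X_j$ and that edges are drawn as independent Bernoulli variables; the function names and the numeric values are hypothetical.

```python
import math
import random

def edge_probabilities(eta):
    """Map each off-diagonal eta[i][j] to sigmoid(eta[i][j]); the diagonal is
    forced to 0 so that no variable is its own parent."""
    m = len(eta)
    return [[0.0 if i == j else 1.0 / (1.0 + math.exp(-eta[i][j]))
             for j in range(m)] for i in range(m)]

def sample_graph(probs, rng):
    """Draw one candidate adjacency matrix with independent Bernoulli edges.
    Note: independent sampling does not by itself guarantee acyclicity."""
    return [[int(rng.random() < p) for p in row] for row in probs]

# Hypothetical logits that strongly favour the chain X_0 -> X_1 -> X_2.
eta = [[0.0, 4.0, -4.0],
       [-4.0, 0.0, 4.0],
       [-4.0, -4.0, 0.0]]
probs = edge_probabilities(eta)
rng = random.Random(0)
graphs = [sample_graph(probs, rng) for _ in range(200)]
freq_01 = sum(g[0][1] for g in graphs) / len(graphs)  # empirical rate of X_0 -> X_1
```

In a full method, each sampled graph would be scored by the likelihood of held-out or interventional data and the scores would drive the update of `eta`; that update step is not shown here.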

The data needed for effective SCM fitting is obtained through targeted interventions: subgoal policies are trained for "controllable" variables, and used to generate counterfactual data by driving specific $X_i$ to desired values. This focused sampling, in contrast to naive action or random subgoal perturbations, yields denser, more informative gradient signals for causal graph learning (Peng et al., 2022, Khorasani et al., 6 Jul 2025).
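The value of interventional data can be seen in a deliberately simplified sketch. It swaps in a plain interventional contrast (force a candidate variable to 0 and to 1, then compare the outcome) in place of the regularized classifiers and likelihood-based updates described above, and it queries a hypothetical ground-truth mechanism directly instead of running trained subgoal policies in an environment. All names are illustrative.

```python
from itertools import product

def next_value(state):
    """Hypothetical ground-truth mechanism for X_2: it needs X_0 and X_1."""
    return int(state[0] == 1 and state[1] == 1)

def do(state, index, value):
    """Return a copy of the state with X_index forced to the given value."""
    forced = list(state)
    forced[index] = value
    return tuple(forced)

def interventional_parents(mechanism, contexts, candidates):
    """Flag X_i as a parent if forcing it to 0 versus 1 changes the outcome
    in at least one context."""
    parents = []
    for i in candidates:
        if any(mechanism(do(c, i, 0)) != mechanism(do(c, i, 1)) for c in contexts):
            parents.append(i)
    return parents

# Contexts over X_0..X_3; X_3 is a distractor with no causal influence on X_2.
contexts = list(product([0, 1], repeat=4))
found = interventional_parents(next_value, contexts, candidates=[0, 1, 3])
```

The distractor is rejected because no setting of it ever changes the outcome, which is the kind of signal that passive observation of correlated variables cannot provide as cleanly.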

Conditional independence and CMI-based filters are employed to prune indirect or spurious edges, especially when working with high-dimensional or delayed-effect environments (Zhao et al., 4 May 2025).
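A small worked example shows how a conditional mutual information (CMI) filter can separate a direct link from an indirect one. The sketch uses a plug-in estimator over discrete samples from a hypothetical chain $X \to Y \to Z$; the sample weights and the pruning threshold are assumptions chosen for illustration, not values from the cited work.

```python
from collections import Counter
from math import log

def conditional_mutual_information(samples):
    """Plug-in estimate of I(A; C | B) in nats from (a, b, c) tuples."""
    n = len(samples)
    c_abc = Counter(samples)
    c_ab = Counter((a, b) for a, b, _ in samples)
    c_bc = Counter((b, c) for _, b, c in samples)
    c_b = Counter(b for _, b, _ in samples)
    total = 0.0
    for (a, b, c), k in c_abc.items():
        # p(a,b,c) * log[ p(a,b,c) p(b) / (p(a,b) p(b,c)) ]
        total += (k / n) * log(k * c_b[b] / (c_ab[(a, b)] * c_bc[(b, c)]))
    return total

# Chain X -> Y -> Z: Y copies X three times out of four, and Z copies Y likewise.
samples = []
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            weight = (3 if y == x else 1) * (3 if z == y else 1)
            samples.extend([(x, y, z)] * weight)

cmi_x_z_given_y = conditional_mutual_information(samples)
cmi_x_y_given_z = conditional_mutual_information([(x, z, y) for x, y, z in samples])

THRESHOLD = 0.01  # hypothetical pruning threshold
keep_x_to_z = cmi_x_z_given_y > THRESHOLD  # indirect link, should be pruned
keep_x_to_y = cmi_x_y_given_z > THRESHOLD  # direct link, should be kept
```

Here $X$ and $Z$ are correlated, yet $I(X; Z \mid Y)$ vanishes because all of the influence flows through $Y$, so the candidate edge $X \to Z$ is dropped while $X \to Y$ survives.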

3. Hierarchical Policy Architecture Informed by Causality

Once the causal hierarchy is established, CDHRL constructs hierarchical goal-conditioned policies or value functions, strictly adhering to the discovered DAG structure. General features include:

Multi-level policies $\{\pi^1, \dots, \pi^L\}$: Each level $\ell$ specializes in achieving subgoals at DAG depth $\ell$, with higher levels invoking lower-level policies for prerequisite subgoals as defined by the DAG (Peng et al., 2022, Khorasani et al., 6 Jul 2025).
Option chaining and temporal abstraction: Policies choose at each decision point between primitive actions and temporally extended "options" that correspond to lower-level subgoal-achievement policies. Hierarchies arise naturally through level-by-level training, in which a new subgoal is trained only once all of its parents are controllable.
Effect-conditioned or subgoal-conditioned architectures: Some approaches train the low-level policy directly to bring about a specified "controlled effect," with the high-level policy selecting which effect to target next (Corcoll et al., 2020). In delayed-effect settings, DQN heads are indexed not only by subgoal but also by specific delay and parent set (Zhao et al., 4 May 2025).
Hindsight Experience Replay (HER): Used pervasively to augment transition tuples, both for reward shaping and for addressing sparse-reward training (Peng et al., 2022, Zhao et al., 4 May 2025).
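The relationship between DAG depth, policy levels, and option chaining can be sketched as follows. The subgoal names and parent sets are hypothetical, and the two helper functions are an illustrative reading of the description above, not the exact procedure of any cited paper. The graph is assumed to be acyclic.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical subgoal DAG, given as parent sets.
PARENTS: Dict[str, Tuple[str, ...]] = {
    "wood": (),
    "stone": (),
    "stick": ("wood",),
    "pickaxe": ("stick", "stone"),
    "iron": ("pickaxe",),
}

def subgoal_depths(parents: Dict[str, Tuple[str, ...]]) -> Dict[str, int]:
    """Depth 1 for root subgoals, otherwise 1 + the depth of the deepest parent."""
    depths: Dict[str, int] = {}
    def visit(g: str) -> int:
        if g not in depths:
            depths[g] = 1 + max((visit(p) for p in parents[g]), default=0)
        return depths[g]
    for g in parents:
        visit(g)
    return depths

def expand(goal: str, parents: Dict[str, Tuple[str, ...]], achieved: Set[str]) -> List[str]:
    """Option chaining: list the subgoal policies to invoke, prerequisites first,
    skipping subgoals that are already achieved."""
    plan: List[str] = []
    def visit(g: str) -> None:
        if g in achieved or g in plan:
            return
        for p in parents[g]:
            visit(p)
        plan.append(g)
    visit(goal)
    return plan

depths = subgoal_depths(PARENTS)                 # level assignment per subgoal
plan = expand("iron", PARENTS, achieved={"wood"})
```

Under this reading, a subgoal at depth $\ell$ would be handled by $\pi^\ell$, and the expanded plan is the sequence of lower-level options a higher-level policy would chain together.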

Reward decomposition schemes are aligned with the causal graph: each subgoal policy maximizes its local (typically indicator) reward, with credit assignment propagating along the effect hierarchy rather than along the flat, per-action time scale (Corcoll et al., 2020).
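A minimal sketch of how an indicator subgoal reward combines with hindsight relabeling is given here. It assumes the common "final" HER strategy, in which the last state reached in an episode is substituted as the goal; the cited papers may use a different relabeling strategy. The state encoding and action names are hypothetical.

```python
def indicator_reward(achieved, goal):
    """Local subgoal reward: 1 when the achieved state matches the goal, else 0."""
    return 1.0 if achieved == goal else 0.0

def her_relabel(episode, goal):
    """Return (original, hindsight) transition lists for one episode.

    episode: list of (state, action, next_state) tuples.
    Hindsight copies replace the goal with the final achieved state.
    """
    hindsight_goal = episode[-1][2]
    original, hindsight = [], []
    for s, a, s2 in episode:
        original.append((s, a, s2, goal, indicator_reward(s2, goal)))
        hindsight.append((s, a, s2, hindsight_goal,
                          indicator_reward(s2, hindsight_goal)))
    return original, hindsight

# The agent wanted (1, 1, 1) but only reached (1, 1, 0).
episode = [
    ((0, 0, 0), "get_wood", (1, 0, 0)),
    ((1, 0, 0), "get_stick", (1, 1, 0)),
]
original, hindsight = her_relabel(episode, goal=(1, 1, 1))
original_rewards = [t[4] for t in original]    # no learning signal
hindsight_rewards = [t[4] for t in hindsight]  # final step is rewarded
```

The failed episode carries no reward under the intended goal, while the relabeled copy turns it into a successful demonstration of reaching `(1, 1, 0)`, which is what makes sparse indicator rewards trainable.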

4. Causality-Guided Exploration and Intervention Design

A defining innovation in CDHRL is the use of the learned causal hierarchy to drive exploration and data collection:

Intervention scheduling: Rather than $\epsilon$-greedy or random exploration at the primitive level, interventions are targeted toward subgoals/variables whose achievement is causally relevant for higher-level progress, with scheduling prioritized by:
- Estimated expected causal effect (ECE) on the final goal (Khorasani et al., 6 Jul 2025).
- Shortest-path heuristics in the causal DAG to the final subgoal (Khorasani et al., 6 Jul 2025).
Reverse data collection and delay awareness: In distributed settings, "reverse segments" of variable length are extracted once interventions reveal sharp state changes, inverted over candidate delays to robustly attribute cause and effect (Zhao et al., 4 May 2025).
Curriculum shaping: The exploration process naturally reflects the “causal curriculum,” focusing first on foundational subgoals and their combinations before higher-level composite achievements (Peng et al., 2022).
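One plausible reading of the shortest-path heuristic is sketched below: among the subgoals currently available for intervention, prefer the one with the fewest causal edges to the final goal, and ignore subgoals that cannot reach it at all. The graph, the candidate set, and the tie-breaking rule are hypothetical; the cited work may rank candidates differently.

```python
from collections import deque

def distance_to_goal(children, goal):
    """Breadth-first search on reversed edges: number of causal edges from
    each node to the goal. Nodes that cannot reach the goal are absent."""
    reverse = {}
    for u, vs in children.items():
        for v in vs:
            reverse.setdefault(v, []).append(u)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        v = queue.popleft()
        for u in reverse.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def next_intervention(children, goal, candidates):
    """Pick the candidate closest to the goal; ties are broken by name."""
    dist = distance_to_goal(children, goal)
    reachable = [c for c in candidates if c in dist]
    return min(reachable, key=lambda c: (dist[c], c)) if reachable else None

# Hypothetical causal DAG as child lists; "flower" is causally irrelevant to "iron".
children = {
    "wood": ["stick"],
    "stick": ["pickaxe"],
    "stone": ["pickaxe"],
    "pickaxe": ["iron"],
    "flower": [],
    "iron": [],
}
dist = distance_to_goal(children, "iron")
choice = next_intervention(children, "iron", ["wood", "stone", "flower"])
```

The irrelevant subgoal is never scheduled because it has no path to the goal, which is the mechanism by which targeted interventions avoid spending samples on causally useless skills.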

These strategies yield intervention trajectories that improve both the efficiency and accuracy of causal discovery and accelerate subgoal skill acquisition.

5. Theoretical and Empirical Performance

CDHRL frameworks have been analyzed both theoretically and empirically:

Sample complexity improvements: Theoretical results (e.g., for $b$-ary trees and sparse Erdős–Rényi DAGs) demonstrate that causality-driven targeted interventions can reduce the subgoal training cost from $\Omega(n^2 b)$ (random) to $O(b \log^2 n)$ (targeted) in trees, with similarly sharp improvements in random graphs (Khorasani et al., 6 Jul 2025).
Superiority in sparse, compositional tasks: Empirical studies in environments such as 2D-Minecraft and Eden demonstrate that CDHRL achieves markedly higher sample efficiency and final performance compared to HRL approaches lacking causal guidance, particularly in long-horizon, sparse-reward settings (Peng et al., 2022, Zhao et al., 4 May 2025).
Robustness to delayed effects: Distributed CDHRL with delay modeling (e.g., D³HRL) shows high sensitivity and accuracy in discovering true causal links, even when effects and rewards are substantially delayed in time (Zhao et al., 4 May 2025).
DAG quality: Targeted interventions and hierarchical policy-based intervention design both reduce the structural Hamming distance (SHD) and structural intervention distance (SID) between the learned DAGs and the ground truth, and both reduce overfitting to spurious or indirect edges (Peng et al., 2022, Zhao et al., 4 May 2025).
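For readers unfamiliar with the SHD metric, a minimal sketch follows. It uses the common convention in which a missing edge, an extra edge, and a reversed edge each count as one error; the cited papers may use a variant that counts a reversal as two. The example graphs are hypothetical.

```python
def structural_hamming_distance(true_adj, learned_adj):
    """Count node pairs whose edge status differs between two directed graphs.
    adj[i][j] == 1 means an edge i -> j; a reversed edge counts once."""
    n = len(true_adj)
    shd = 0
    for i in range(n):
        for j in range(i + 1, n):
            true_pair = (true_adj[i][j], true_adj[j][i])
            learned_pair = (learned_adj[i][j], learned_adj[j][i])
            if true_pair != learned_pair:
                shd += 1
    return shd

# Ground truth: 0 -> 1 -> 2.
true_adj = [[0, 1, 0],
            [0, 0, 1],
            [0, 0, 0]]
# Learned: 0 -> 1 (correct), 2 -> 1 (reversed), 0 -> 2 (spurious shortcut).
learned_adj = [[0, 1, 1],
               [0, 0, 0],
               [0, 1, 0]]
shd = structural_hamming_distance(true_adj, learned_adj)
```

The spurious shortcut `0 -> 2` is exactly the kind of indirect edge that conditional-independence filtering is meant to remove, so pruning it would lower the SHD by one.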

A table summarizing select experimental results is given below (numbers are as reported):

Method	2D-Minecraft Success @ $5 \times 10^6$ steps (%)	Eden Mean Survival (steps)	SHD (GetIron, $\tau_{\max}=4$)
OHRL	35	—	—
HAC	42	80	—
MEGA	40	88	—
CDHRL	82	120	24.0
D³HRL	—	—	2.0

6. Distinctive Methodological Features and Variants

The literature presents several complementary instantiations of CDHRL, including:

Controlled-effect HRL: Focuses on disentangling agent-controlled effects from exogenous environmental dynamics. Variational autoencoders over effect vectors induce a compositional latent space for temporally-abstract subgoal selection (Corcoll et al., 2020).
SSD-based subgoal discovery: Employs sparse subset discovery for high-fidelity parent-set recovery and intervention-based refinement, with tight sample complexity guarantees (Khorasani et al., 6 Jul 2025).
Distributed, delay-aware CDHRL: Adopts a factored SMDP model with distributed causal discovery and explicit filtering of spurious dependencies via CMI, achieving better scalability and resilience to multi-step delays (Zhao et al., 4 May 2025).

This diversity highlights both methodological robustness and adaptability to domains with complex, nontrivial causal-temporal dynamics.

7. Limitations and Open Challenges

While CDHRL offers significant advantages in RL environments with compositional, structured dependencies and sparse/long-delayed rewards, several challenges and limitations remain:

Assumptions on environment representations: Many approaches require access to disentangled, entity-centric state spaces; performance in high-dimensional, raw-pixel domains still depends on representation learning (Peng et al., 2022, Corcoll et al., 2020).
Determinism and normality heuristics: Certain CDHRL variants assume deterministic transitions or employ heuristic “normality” criteria for effect isolation. This places constraints on applicability in stochastic, partially observable, or multi-agent domains (Corcoll et al., 2020).
Interventional limitations: Causal discovery and curriculum learning effectiveness may degrade if full or accurate interventions are not possible due to environment design or constraints.

Future directions identified in the literature include extending CDHRL to richer state representations, robustness to confounding and partial observability, and applications to multi-agent cooperation where joint effects and causation must be discovered (Peng et al., 2022, Corcoll et al., 2020, Zhao et al., 4 May 2025).

Principal references:

"Causality-driven Hierarchical Structure Discovery for Reinforcement Learning" (Peng et al., 2022)
"Disentangling causal effects for hierarchical reinforcement learning" (Corcoll et al., 2020)
"Hierarchical Reinforcement Learning with Targeted Causal Interventions" (Khorasani et al., 6 Jul 2025)
"D3HRL: A Distributed Hierarchical Reinforcement Learning Approach Based on Causal Discovery and Spurious Correlation Detection" (Zhao et al., 4 May 2025)
