
HRC: Hierarchical RL with Causal Interventions

Updated 5 March 2026
  • HRC is a reinforcement learning framework that employs targeted causal interventions to discover subgoal hierarchies, improving data efficiency in sparse-reward settings.
  • It integrates a specialized causal discovery algorithm with intervention ranking methods like Expected Causal Effect and shortest-path ranking to optimize policy training.
  • HRC offers formal cost guarantees and demonstrates superior empirical performance on tasks such as 2D-Minecraft compared to traditional HRL baselines.

Hierarchical Reinforcement Learning with Targeted Causal Interventions (HRC) is a framework for reinforcement learning in settings characterized by sparse rewards and long-horizon dependencies. HRC leverages explicit modeling of causal relationships among subgoals, employs principled algorithms for discovering subgoal hierarchies, and strategically selects interventions to accelerate policy learning. The approach provides both formal cost guarantees and substantial empirical improvements in training efficiency relative to prior HRL and causality-driven baselines (Khorasani et al., 6 Jul 2025).

1. Formalization and Problem Statement

HRC formalizes the environment as a subgoal-based Markov decision process, where each state is a binary vector $X^t = (X^t_1, \ldots, X^t_n)$ corresponding to resource environment variables, and the action space includes primitive operations (e.g., pick, craft). Transitions are dictated by an underlying structural causal model (SCM):

$$X_i^{t+1} = f_i(pa(X_i^{t+1}), A^t, \epsilon_i^{t+1})$$

where $pa(X_i^{t+1})$ denotes causal parents in the SCM, and stochasticity is introduced via $\epsilon_i^{t+1} \sim \text{Bernoulli}(\rho)$ with $\rho < \tfrac{1}{2}$. Policy learning is goal-conditioned: $\pi(a \mid s, g)$ seeks to maximize the value

$$V^{\pi}(s,g) = \mathbb{E}\left[\sum_t \gamma^t R(s_t,a_t,g)\right]$$

with nonzero reward only upon achievement of the final subgoal $g_n$ (i.e., $X_n = 1$). Each subgoal $g_i$ is linked to attaining $X_i = 1$, with the subgoal structure depicted by a summary graph $G$ that encodes causal dependencies via AND/OR logic.

A hierarchical policy architecture is used, with a hierarchy $\mathcal{H} = (H, L)$: $H$ is a DAG that is a subgraph of the estimated causal graph, and $L(g_i)$ gives the level of subgoal $g_i$, enforcing a strict dependency ordering such that policies for child subgoals are trained only after those of their parents.
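The dependency ordering can be illustrated with a small sketch. The source does not spell out how levels are computed, so the rule used here (roots at level 0, every other subgoal one level above its deepest parent) is an assumption, and the subgoal names in `PARENTS` are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy hierarchy: subgoal -> list of parent subgoals (a DAG).
PARENTS = {
    "wood": [],
    "stone": [],
    "stick": ["wood"],
    "pickaxe": ["stick", "stone"],
    "iron": ["pickaxe"],
}

def subgoal_levels(parents):
    """Assumed rule: level 0 for roots, else 1 + the maximum parent level."""
    @lru_cache(maxsize=None)
    def level(g):
        return 0 if not parents[g] else 1 + max(level(p) for p in parents[g])
    return {g: level(g) for g in parents}

def training_order(parents):
    """Order subgoals so that every parent is trained before its children."""
    levels = subgoal_levels(parents)
    # Ties within a level are broken alphabetically for determinism.
    return sorted(parents, key=lambda g: (levels[g], g))

LEVELS = subgoal_levels(PARENTS)
ORDER = training_order(PARENTS)
```

Sorting by level is sufficient here because a parent always sits at a strictly lower level than its child under the assumed rule.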

2. Causal Discovery for Subgoal Hierarchies

Central to HRC is a specialized causal discovery algorithm (SSD). It assumes causal sufficiency and acyclicity, binary and monotonic subgoals, and bounded noise. The method abstracts multi-step SCM effects to a single-step model:

$$X_i^{t+1} = \theta_i(X^t) \oplus \epsilon_i^{t+1}$$

with $\theta_i$ a logical AND/OR function over parent variables. Causal structure is inferred via sparse-regularized empirical risk minimization:

$$L_i(\beta) = \mathbb{E}\left[(\hat{X}_i - X_i)^2\right] + \lambda \|\beta\|_0$$

where $S_i(X) = \beta_{i,0} + \sum_j \beta_{i,j} X_j$ and $\hat{X}_i = \mathbb{I}_{S_i(X) > 0}$. The minimizer supports recovery of the true parents under appropriate regularization. In practice, $\ell_1$ penalization is employed and the system is solved by gradient-based methods.
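To make the objective concrete, the sketch below samples data from the single-step model and then minimizes empirical risk plus a $\lambda$-weighted parent count by brute force. This is not the SSD procedure: it swaps the linear-threshold parameterization and the $\ell_1$ gradient solver for an exhaustive search over small parent sets with pure AND/OR rules, which is only feasible for toy sizes. All function names, the uniform sampling of $X^t$, and the values of `rho` and `lam` are assumptions made for illustration.

```python
import itertools
import random

def theta(kind, values):
    """Logical AND/OR over parent values (values must be non-empty)."""
    return int(all(values)) if kind == "AND" else int(any(values))

def sample_pairs(n_vars, parents, kind, rho, n_samples, rng):
    """Draw (X^t, X_i^{t+1}) pairs from X_i^{t+1} = theta_i(X^t) XOR eps."""
    data = []
    for _ in range(n_samples):
        # Uniform binary states stand in for data gathered by interventions.
        x = [rng.randint(0, 1) for _ in range(n_vars)]
        eps = 1 if rng.random() < rho else 0
        data.append((x, theta(kind, [x[j] for j in parents]) ^ eps))
    return data

def empirical_risk(data, predict):
    """Mean squared error; for binary values this is the error rate."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

def l0_parent_search(data, n_vars, lam, max_parents=3):
    """Minimize empirical risk + lam * |S| over parent sets S and AND/OR rules."""
    # Empty parent set: the better of the two constant predictors.
    const_risk = min(empirical_risk(data, lambda x, c=c: c) for c in (0, 1))
    best = (const_risk, (), "CONST")
    for k in range(1, max_parents + 1):
        for subset in itertools.combinations(range(n_vars), k):
            for kind in ("AND", "OR"):
                risk = empirical_risk(
                    data,
                    lambda x, s=subset, kd=kind: theta(kd, [x[j] for j in s]),
                )
                score = risk + lam * k
                if score < best[0]:
                    best = (score, subset, kind)
    return best

rng = random.Random(0)
DATA = sample_pairs(n_vars=5, parents=[0, 1], kind="AND",
                    rho=0.1, n_samples=2000, rng=rng)
SCORE, PARENTS_HAT, KIND_HAT = l0_parent_search(DATA, n_vars=5, lam=0.02)
```

With noise below one half, the true parent set attains the lowest error rate, and the penalty discourages supersets that fit no better, so `PARENTS_HAT` should coincide with the generating parents in this toy setting.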

Interventions (random and targeted manipulations of controllable subgoals) yield data for causal structure learning. The SSD algorithm iteratively updates the graph estimate $\hat{G}$ as interventions increase controllability in the system.

3. Targeted Subgoal Intervention Strategy

A key innovation is strategic intervention selection, substantially accelerating exploration compared to random subgoal selection. Two core intervention ranking methods are implemented:

  • Expected Causal Effect (ECE) Ranking: For each candidate set $A$ among the current controllable subgoals $\mathcal{CS}_{t-1}$,

$$\text{ECE}(A) = \mathbb{E}\left[X_n \mid do(X_A = 1), do(X_{\mathcal{CS} \setminus A} = 0)\right] - \mathbb{E}\left[X_n \mid do(X_A = 0), do(X_{\mathcal{CS} \setminus A} = 0)\right]$$

The subgoal with maximal ECE is prioritized.

  • Shortest-Path Ranking: The estimated causal graph is interpreted as a weighted graph; A* search is used to rank interventions by the expected minimal cost to reach the final goal from current subgoals, assigning costs based on newly required ancestors.

At each iteration, the most promising candidate subgoal $x^*$, as determined by these ranking rules, is selected for policy training and structure update. This iterative scheme rapidly expands the set of controllable subgoals and enables efficient credit assignment.
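The ECE definition can be approximated by Monte Carlo sampling when a causal model is available. The following is a minimal sketch on a hypothetical four-variable SCM, with hard interventions implemented by overwriting a variable's mechanism. The variable names, the noise level `rho`, the sample count, and the candidate sets are all assumptions, and HRC would use its estimated graph rather than a known one.

```python
import random

# Hypothetical toy SCM listed in topological order: variable -> (rule, parents).
SCM = {
    "wood": ("OR", []),
    "stone": ("OR", []),
    "stick": ("AND", ["wood"]),
    "pickaxe": ("AND", ["stick", "stone"]),  # plays the role of the final goal X_n
}
GOAL = "pickaxe"

def rollout(do, rho, rng):
    """Sample the goal variable under hard interventions do = {var: value}."""
    x = {}
    for var, (rule, pa) in SCM.items():
        if var in do:
            x[var] = do[var]  # do(): replace the mechanism with a constant
            continue
        vals = [x[p] for p in pa]
        base = 0 if not vals else int(all(vals) if rule == "AND" else any(vals))
        eps = 1 if rng.random() < rho else 0
        x[var] = base ^ eps
    return x[GOAL]

def expected_goal(do, rho, rng, n):
    """Monte Carlo estimate of E[X_n | do(...)]."""
    return sum(rollout(do, rho, rng) for _ in range(n)) / n

def ece(candidate, controllable, rho=0.05, n=4000, seed=0):
    """Estimate ECE(A): set A to 1 versus 0, holding the rest of CS at 0."""
    rng = random.Random(seed)
    rest = {v: 0 for v in controllable if v not in candidate}
    on = {**rest, **{v: 1 for v in candidate}}
    off = {**rest, **{v: 0 for v in candidate}}
    return expected_goal(on, rho, rng, n) - expected_goal(off, rho, rng, n)

CONTROLLABLE = ["wood", "stone", "stick"]
CANDIDATES = [("wood",), ("stone",), ("stick",), ("stick", "stone")]
RANKING = sorted(CANDIDATES, key=lambda a: ece(a, CONTROLLABLE), reverse=True)
```

In this toy model only the joint candidate `("stick", "stone")` switches the goal's AND mechanism on, so it ranks first; single-variable candidates leave the other parent clamped at 0 and have an effect close to zero.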

4. Theoretical Analysis of Training Cost

HRC provides formal guarantees on the training cost, defined as the expected number of interventions and subgoal-policy trainings required to achieve the final goal. Let $C_{g_n}(I)$ denote this cost given initial controllable set $I$:

$$C_{g_n}(I) = \sum_{x \in \mathcal{CS}\setminus I} p(I \rightarrow I \cup \{x\}) \left[ C_{trans}(I, I \cup \{x\}) + C_{g_n}(I \cup \{x\}) \right]$$

Main results include:

  • For $b$-ary trees with $n$ subgoals, random exploration has cost $\Omega(n^2 b)$, while targeted HRC achieves $O((\log n)^2 b)$.
  • On Erdős–Rényi graphs $G(n, p)$ with $p = c \frac{\log n}{n-1}$, random exploration costs $\Omega(n^2)$, while targeted HRC achieves $O(n^{4/3+2c/3}\log n)$.

The cost gap arises because targeted strategies traverse only relevant ancestors of the final goal, whereas random exploration exhaustively considers an exponentially larger combinatorial set.
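The recursion for $C_{g_n}(I)$ can be evaluated exactly on a small graph, which makes the gap between random and targeted selection visible. The sketch below rests on strong simplifying assumptions that are not taken from the source: every transition costs 1, the next subgoal is chosen uniformly among those whose parents are all controllable, all dependencies are ANDs, and "targeted" simply means restricting the choice to ancestors of the goal. The graph in `PARENTS` is hypothetical.

```python
from functools import lru_cache

# Hypothetical toy DAG: subgoal -> parents (all parents required, i.e. AND).
PARENTS = {
    "a": [], "b": [], "c": [], "d": [],
    "e": ["a"],
    "goal": ["e"],
}
GOAL = "goal"

def ancestors(node):
    """All strict ancestors of node in the DAG."""
    seen = set()
    stack = list(PARENTS[node])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(PARENTS[v])
    return seen

RELEVANT = ancestors(GOAL) | {GOAL}

def expected_cost(targeted):
    """Evaluate C(I) = sum_x p(I -> I+x) [C_trans + C(I+x)] starting from I = {}."""
    @lru_cache(maxsize=None)
    def cost(controlled):
        if GOAL in controlled:
            return 0.0  # base case: the final subgoal is controllable
        eligible = [
            g for g in PARENTS
            if g not in controlled
            and all(p in controlled for p in PARENTS[g])
            and (not targeted or g in RELEVANT)
        ]
        p = 1.0 / len(eligible)  # assumed uniform choice among eligible subgoals
        # Assumed unit transition cost C_trans = 1.
        return sum(p * (1.0 + cost(controlled | {g})) for g in eligible)
    return cost(frozenset())

COST_TARGETED = expected_cost(targeted=True)
COST_RANDOM = expected_cost(targeted=False)
```

In this toy graph the targeted strategy follows the single path a, e, goal, while the random strategy may spend steps on the irrelevant roots b, c, and d before reaching the goal.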

5. Empirical Evaluation

HRC is benchmarked on both synthetic graph environments and application domains such as 2D-Minecraft. On synthetic data (100 graphs per experiment), training cost (sum of interventions and policy trainings) is substantially reduced for strategies employing ECE and shortest-path ranking, especially in sparser regimes.

In 2D-Minecraft long-horizon tasks, HRC achieves 50% success at approximately $0.3$ million probes, outperforming CDHRL ($2$M), HAC ($3$M), and HER ($4$M). An ablation demonstrates strong performance for causal-effect and shortest-path rules. For causal discovery, SSD attains a lower structural Hamming distance (SHD $\approx 12.3$) than the method of Ke et al. (2019) (SHD $\approx 19.8$), and is robust under $20\%$ missing variables.
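For readers unfamiliar with the metric, structural Hamming distance counts the edge edits needed to turn an estimated graph into the true one. The sketch below uses one common convention, in which a reversed edge counts as a single error; the source does not state which convention was used, so this is an assumption, and the example edges are hypothetical.

```python
def structural_hamming_distance(true_edges, est_edges):
    """Count edge insertions, deletions, and reversals between two DAGs.

    Assumed convention: a reversed edge counts once, not twice.
    Edges are (parent, child) tuples with mutually comparable node labels.
    """
    true_edges, est_edges = set(true_edges), set(est_edges)
    diff = true_edges ^ est_edges  # edges present in exactly one graph
    # A reversal shows up as both (u, v) and (v, u) in diff; keep one per pair.
    reversals = {(u, v) for (u, v) in diff if (v, u) in diff and u < v}
    return len(diff) - len(reversals)

TRUE_EDGES = {("wood", "stick"), ("stick", "pickaxe"), ("stone", "pickaxe")}
EST_EDGES = {("stick", "wood"), ("stick", "pickaxe"), ("wood", "pickaxe")}
SHD = structural_hamming_distance(TRUE_EDGES, EST_EDGES)
```

Here the estimate reverses one edge, misses one, and adds a spurious one, giving three errors in total.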

6. Comparison with Prior Work

HRC advances over prior causal HRL frameworks in several respects:

  • CDHRL (Peng et al., 2022): Employs off-the-shelf causal discovery and uniformly samples interventions. HRC tailors SSD for resource subgoal graphs with provable identifiability, incorporates targeted ranking, and provides end-to-end cost bounds.
  • Nguyen et al. (2024): Requires accidental final-goal achievement to infer structure and lacks theoretical guarantees.
  • Corcoll & Vicente (2020): Explores subgoals randomly, defining them by observed action effects.

Distinctive features of HRC include principled causal discovery, targeted intervention selection, and theoretical efficiency guarantees, validated by significant empirical acceleration in sparse-reward hierarchical RL tasks (Khorasani et al., 6 Jul 2025).

7. Implications and Extensions

The HRC framework demonstrates that targeted causal interventions—rooted in accurate discovery of causal subgoal hierarchies—yield both significant data efficiency and structural clarity in policy learning for hierarchical RL. The approach is scalable and robust, and a plausible implication is that it can be extended to broader classes of problems by:

  • Learning representations of environment variables from pixel-level input (e.g., CausalVAE);
  • Accommodating continuous-valued variables and soft interventions;
  • Integrating alternative RL algorithms (e.g., off-policy SAC) for greater flexibility.

These extensions would facilitate application in more complex domains, and further bridge advances in causal discovery and hierarchical control (Peng et al., 2022).
