HRC: Hierarchical RL with Causal Interventions
- HRC is a reinforcement learning framework that employs targeted causal interventions to discover subgoal hierarchies, improving data efficiency in sparse-reward settings.
- It integrates a specialized causal discovery algorithm with intervention ranking methods like Expected Causal Effect and shortest-path ranking to optimize policy training.
- HRC offers formal cost guarantees and demonstrates superior empirical performance on tasks such as 2D-Minecraft compared to traditional HRL baselines.
Hierarchical Reinforcement Learning with Targeted Causal Interventions (HRC) is a framework for reinforcement learning in settings characterized by sparse rewards and long-horizon dependencies. HRC leverages explicit modeling of causal relationships among subgoals, employs principled algorithms for discovering subgoal hierarchies, and strategically selects interventions to accelerate policy learning. The approach provides both formal cost guarantees and substantial empirical improvements in training efficiency relative to prior HRL and causality-driven baselines (Khorasani et al., 6 Jul 2025).
1. Formalization and Problem Statement
HRC formalizes the environment as a subgoal-based Markov decision process in which each state is a binary vector of resource environment variables and the action space consists of primitive operations (e.g., pick, craft). Transitions are governed by an underlying structural causal model (SCM): the next value of each variable is a function of its causal parents in the SCM, with stochasticity introduced through a noise term.
Policy learning is goal-conditioned: the policy seeks to maximize its goal-conditioned value, with nonzero reward only upon achievement of the final subgoal. Each subgoal corresponds to attaining its associated environment variable, and the subgoal structure is depicted by a summary graph that encodes causal dependencies via AND/OR logic.
A hierarchical policy architecture is used. The hierarchy pairs a DAG, taken as a subgraph of the estimated causal graph, with a level function that assigns each subgoal its level. This enforces a strict dependency ordering: the policy for a child subgoal is trained only after the policies for its parents.
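The AND/OR summary graph and the level-based training order can be sketched as follows. This is a minimal illustration under stated assumptions: the subgoal names, the toy graph, and the rule "level = 1 + maximum parent level" are hypothetical choices for the example, not definitions taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy summary graph: subgoal -> (gate, parents).
# Names and structure are illustrative only.
SUBGOALS: Dict[str, Tuple[str, List[str]]] = {
    "wood": ("AND", []),
    "stone": ("AND", []),
    "trade": ("AND", []),
    "stick": ("AND", ["wood"]),
    "pickaxe": ("AND", ["stick", "stone"]),
    "iron": ("OR", ["pickaxe", "trade"]),
}


def attainable(subgoal: str, state: Dict[str, int]) -> bool:
    """Check whether the AND/OR gate of `subgoal` is satisfied in `state`."""
    gate, parents = SUBGOALS[subgoal]
    if not parents:  # root subgoals have no prerequisites
        return True
    values = [state[p] == 1 for p in parents]
    return all(values) if gate == "AND" else any(values)


def hierarchy_levels(graph: Dict[str, Tuple[str, List[str]]]) -> Dict[str, int]:
    """Assumed level rule: roots are level 0, others 1 + max parent level."""
    memo: Dict[str, int] = {}

    def level(node: str) -> int:
        if node not in memo:
            parents = graph[node][1]
            memo[node] = 0 if not parents else 1 + max(level(p) for p in parents)
        return memo[node]

    return {node: level(node) for node in graph}


def training_order(graph: Dict[str, Tuple[str, List[str]]]) -> List[str]:
    """Sort subgoals by level so every parent precedes its children."""
    lv = hierarchy_levels(graph)
    return sorted(graph, key=lambda node: (lv[node], node))
```

Sorting by level guarantees that a subgoal policy is never scheduled before the policies of its parents, which is the ordering constraint described above.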
2. Causal Discovery for Subgoal Hierarchies
Central to HRC is a specialized causal discovery algorithm (SSD). It assumes causal sufficiency and acyclicity, binary and monotonic subgoals, and bounded noise. The method abstracts multi-step SCM effects into a single-step model in which each subgoal is a logical AND/OR function of its parent variables. Causal structure is then inferred via sparse-regularized empirical risk minimization; under appropriate regularization, the minimizer recovers the true parents of each subgoal. In practice, a sparsity penalty is employed and the problem is solved by gradient-based methods.
Interventions, whether random or targeted manipulations of controllable subgoals, yield the data for causal structure learning. The SSD algorithm iteratively updates the graph estimate as further interventions enlarge the set of controllable subgoals.
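The idea of recovering parents through sparse-regularized risk minimization can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the logistic loss, the L1 penalty, the proximal-gradient solver, and the function names `make_data` and `recover_parents` are stand-ins, not the SSD objective or implementation from the paper.

```python
import numpy as np


def make_data(seed: int = 0, n: int = 2000, d: int = 6):
    """Toy interventional data: y = X0 AND X2, with 2% label noise."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, d)).astype(float)
    y = X[:, 0] * X[:, 2]
    flip = rng.random(n) < 0.02
    y = np.where(flip, 1.0 - y, y)
    return X, y


def recover_parents(X, y, lam=0.02, lr=1.0, steps=1000, rel_tol=0.1):
    """L1-penalized logistic regression solved by proximal gradient descent.

    Returns the indices whose weights survive a relative threshold,
    together with the weight vector.
    """
    n, d = X.shape
    Xc = X - X.mean(axis=0)  # centering; the intercept absorbs the shift
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xc @ w + b)))
        grad_w = Xc.T @ (p - y) / n
        grad_b = float(np.mean(p - y))
        w = w - lr * grad_w
        # Soft-thresholding is the proximal step for the L1 penalty.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        b -= lr * grad_b
    keep = np.abs(w) > rel_tol * np.abs(w).max()
    return sorted(np.flatnonzero(keep).tolist()), w


if __name__ == "__main__":
    X, y = make_data()
    parents, weights = recover_parents(X, y)
    print("estimated parents:", parents)
    print("weights:", np.round(weights, 2))
```

The penalty drives the weights of non-parent variables toward zero, so the surviving indices serve as the estimated parent set. A linear-logistic model is only a surrogate for an AND gate, which is why the sketch thresholds weights rather than interpreting their values.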
3. Targeted Subgoal Intervention Strategy
A key innovation is strategic intervention selection, substantially accelerating exploration compared to random subgoal selection. Two core intervention ranking methods are implemented:
- Expected Causal Effect (ECE) Ranking: For each candidate set among the current controllable subgoals, the expected causal effect of intervening on that set is estimated. The candidate with maximal ECE is prioritized.
- Shortest-Path Ranking: The estimated causal graph is interpreted as a weighted graph; A* search is used to rank interventions by the expected minimal cost to reach the final goal from current subgoals, assigning costs based on newly required ancestors.
At each iteration, the most promising candidate subgoal, as determined by these ranking rules, is selected for policy training and structure update. This iterative scheme rapidly expands the set of controllable subgoals and enables efficient credit assignment.
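A Monte Carlo sketch of ECE-style ranking is given below. It rests on several assumptions that are not from the paper: ECE is taken here as the difference in the probability of attaining a target between forcing a subgoal to 1 and forcing it to 0, candidates are single subgoals rather than sets, and the toy graph, attainment probabilities, and function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy SCM: name -> (gate, parents, per-step attain probability
# once the gate is satisfied). Values are illustrative only.
GRAPH: Dict[str, Tuple[str, List[str], float]] = {
    "wood": ("AND", [], 0.05),
    "stone": ("AND", [], 0.30),
    "flower": ("AND", [], 0.30),
    "stick": ("AND", ["wood"], 0.80),
    "pickaxe": ("AND", ["stick", "stone"], 0.80),
}


def rollout(do: Dict[str, int], target: str, horizon: int, rng: random.Random) -> int:
    """Simulate monotonic binary dynamics with the variables in `do` held fixed."""
    state = {v: 0 for v in GRAPH}
    state.update(do)
    for _ in range(horizon):
        nxt = dict(state)
        for v, (gate, parents, p) in GRAPH.items():
            if v in do or state[v] == 1:
                continue  # intervened or already attained
            vals = [state[u] == 1 for u in parents]
            satisfied = all(vals) if gate == "AND" else any(vals)
            if satisfied and rng.random() < p:
                nxt[v] = 1
        state = nxt
    return state[target]


def expected_causal_effect(var: str, target: str, horizon: int = 6,
                           n: int = 2000, seed: int = 0) -> float:
    """Assumed ECE: P(target | do(var=1)) - P(target | do(var=0))."""
    rng = random.Random(seed)
    on = sum(rollout({var: 1}, target, horizon, rng) for _ in range(n)) / n
    off = sum(rollout({var: 0}, target, horizon, rng) for _ in range(n)) / n
    return on - off


def rank_by_ece(controllable: List[str], target: str):
    """Return candidates sorted by decreasing estimated effect, plus the scores."""
    scores = {v: expected_causal_effect(v, target) for v in controllable}
    return sorted(scores, key=scores.get, reverse=True), scores


if __name__ == "__main__":
    order, scores = rank_by_ece(["wood", "stone", "flower"], "pickaxe")
    print(order, scores)
```

In this toy graph, forcing `wood` removes the slowest prerequisite, so it receives the largest estimated effect, while `flower` is causally unrelated to the target and scores near zero.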
4. Theoretical Analysis of Training Cost
HRC provides formal guarantees on the training cost, defined as the expected number of interventions and subgoal-policy trainings required to achieve the final goal, as a function of the initial set of controllable subgoals.
Main results include:
- For b-ary trees over the subgoals, targeted HRC achieves a lower cost bound than random exploration.
- On Erdős–Rényi random graphs, targeted HRC likewise achieves a lower cost bound than random exploration.
The cost gap arises because targeted strategies traverse only relevant ancestors of the final goal, whereas random exploration exhaustively considers an exponentially larger combinatorial set.
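The intuition that only the goal's ancestors matter can be made concrete with a small count. The sketch assumes a complete b-ary tree whose edges point from parent to child and whose final goal is the last leaf; both the orientation and the node indexing are assumptions for illustration, and the count is not the training cost analyzed in the paper.

```python
from typing import List


def ancestors_in_bary_tree(node: int, b: int) -> List[int]:
    """Ancestors of `node` in a complete b-ary tree stored in array order.

    Node 0 is the root and the parent of node i > 0 is (i - 1) // b.
    """
    chain = []
    while node > 0:
        node = (node - 1) // b
        chain.append(node)
    return chain


def relevant_fraction(n: int, b: int) -> float:
    """Share of the other n - 1 subgoals that are ancestors of the last leaf."""
    return len(ancestors_in_bary_tree(n - 1, b)) / (n - 1)


if __name__ == "__main__":
    for n in (15, 255, 4095):
        anc = ancestors_in_bary_tree(n - 1, 2)
        print(n, len(anc), round(relevant_fraction(n, 2), 4))
```

Under these assumptions the number of relevant ancestors grows with the depth of the tree, while the number of subgoals that an undirected strategy may touch grows with the total number of nodes.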
5. Empirical Evaluation
HRC is benchmarked on both synthetic graph environments and application domains such as 2D-Minecraft. On synthetic data (100 graphs per experiment), training cost (sum of interventions and policy trainings) is substantially reduced for strategies employing ECE and shortest-path ranking, especially in sparser regimes.
In 2D-Minecraft long-horizon tasks, HRC reaches a 50% success rate at approximately 0.3 million probes, outperforming CDHRL (2M), HAC (3M), and HER (4M). An ablation shows strong performance for both the causal-effect and shortest-path ranking rules. For causal discovery, SSD attains a lower structural Hamming distance (SHD) than the method of Ke et al. (2019) and is robust under missing variables.
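For reference, the structural Hamming distance between a true and an estimated graph can be computed from adjacency matrices as sketched below. The convention used here, in which a reversed edge counts as a single error, is one common choice and is an assumption; the evaluation in the paper may count reversals differently.

```python
import numpy as np


def structural_hamming_distance(true_adj, est_adj) -> int:
    """Count node pairs whose edge status differs between two directed graphs.

    adj[i, j] == 1 means an edge i -> j. A missing, extra, or reversed edge
    each contributes 1. Self-loops are assumed absent.
    """
    a = np.asarray(true_adj, dtype=int)
    b = np.asarray(est_adj, dtype=int)
    if a.shape != b.shape or a.shape[0] != a.shape[1]:
        raise ValueError("adjacency matrices must be square and equally sized")
    diff = np.abs(a - b)
    diff = diff + diff.T  # fold both directions onto each unordered pair
    return int(np.count_nonzero(np.triu(diff, k=1)))


if __name__ == "__main__":
    true_adj = [[0, 1, 0],
                [0, 0, 1],
                [0, 0, 0]]  # edges 0 -> 1 and 1 -> 2
    est_adj = [[0, 0, 1],
               [1, 0, 1],
               [0, 0, 0]]   # 0 -> 1 reversed, 0 -> 2 added
    print(structural_hamming_distance(true_adj, est_adj))
```

In the example, the estimate reverses one edge and adds one spurious edge, so two node pairs differ.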
6. Comparison to Related Work
HRC advances over prior causal HRL frameworks in several respects:
- CDHRL (Peng et al., 2022): Employs off-the-shelf causal discovery and uniformly samples interventions. HRC tailors SSD for resource subgoal graphs with provable identifiability, incorporates targeted ranking, and provides end-to-end cost bounds.
- Nguyen et al. (2024): Requires accidental final-goal achievement to infer structure and lacks theoretical guarantees.
- Corcoll & Vicente (2020): Explores subgoals randomly, defining them by observed action effects.
Distinctive features of HRC include principled causal discovery, targeted intervention selection, and theoretical efficiency guarantees, validated by significant empirical acceleration in sparse-reward hierarchical RL tasks (Khorasani et al., 6 Jul 2025).
7. Implications and Extensions
The HRC framework demonstrates that targeted causal interventions—rooted in accurate discovery of causal subgoal hierarchies—yield both significant data efficiency and structural clarity in policy learning for hierarchical RL. The approach is scalable and robust, and a plausible implication is that it can be extended to broader classes of problems by:
- Learning representations of environment variables from pixel-level input (e.g., CausalVAE);
- Accommodating continuous-valued variables and soft interventions;
- Integrating alternative RL algorithms (e.g., off-policy SAC) for greater flexibility.
These extensions would facilitate application in more complex domains, and further bridge advances in causal discovery and hierarchical control (Peng et al., 2022).