Strict Subgoal Execution in Hierarchical RL

Updated 2 July 2025
  • Strict Subgoal Execution (SSE) is a hierarchical reinforcement learning framework that enforces strictly reachable subgoals for reliable, single-step low-level transitions.
  • The method employs graph-based planning and decoupled exploration to systematically cover the goal space while avoiding high-failure transitions.
  • SSE enhances sample efficiency and robustness on long-horizon, sparse-reward tasks, demonstrating superior convergence and success rates in complex benchmarks.

Strict Subgoal Execution (SSE) is a hierarchical reinforcement learning framework that structurally constrains the high-level agent to select only reliably reachable subgoals, ensuring that each high-level decision corresponds to a feasible, single-step transition for the low-level policy. SSE addresses the fundamental challenge of long-horizon, goal-conditioned tasks in sparse-reward environments, where previous approaches often suffer from subgoal infeasibility and brittle or inefficient planning. The framework systematically integrates strict reachability enforcement, decoupled exploration, and dynamic path refinement to enhance both reliability and sample efficiency.

1. Mechanistic Foundations of Strict Subgoal Execution

In SSE, the agent operates over a graph-based representation of the goal space, with nodes representing landmark states and edges encoding transitions that can be directly achieved by the low-level policy. The high-level policy $\pi^h$ selects a subgoal $\tilde{g}_t$ at each episode step, but, crucially, the transition is only permitted if the low-level policy $\pi^l$ successfully attains $\tilde{g}_t$ within a single invocation, without a fixed time horizon. Failure to reach a subgoal (defined by a tight distance threshold) results in immediate episode termination and a zero reward for the high-level transition.

Mathematically, the high-level selection process is

$$
\pi^h(\tilde{g}_t \mid \tilde{s}_t, g) =
\begin{cases}
\tilde{g}_{\max,t} := \arg\max_{\tilde{g} \in \mathcal{G}} Q^h(\tilde{s}_t, \tilde{g}) & \text{with probability } 1 - \epsilon \\
\tilde{g}_{\text{rand}} \in \mathcal{G} & \text{with probability } \epsilon
\end{cases}
$$

with high-level replay transitions recorded as

$$
(\tilde{s}_t, g, \tilde{g}_t, \textstyle\sum_{j=t}^{t+k_t-1} r_j, \tilde{s}_{t+k_t}) \quad \text{if the subgoal was reached,}
$$

$$
(\tilde{s}_t, g, \tilde{g}_t, 0, \tilde{s}_T) \quad \text{otherwise,}
$$

where $k_t$ is the number of low-level steps used to complete the subgoal and $\tilde{s}_T$ is the state at episode termination.

This explicit structural constraint prevents the accumulation of high-level errors stemming from unreached subgoals, fundamentally shifting policy optimization toward robust and reliable task decomposition.
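The resulting control flow can be illustrated with a short Python sketch. This is a minimal, schematic rendering rather than the authors' implementation: `Qh`, `low_level_policy`, `env.step`, `env.distance`, the reach tolerance, and the step cap are all placeholder interfaces assumed for the example.

```python
import random

def run_episode(env, Qh, low_level_policy, goal_set, final_goal,
                eps=0.1, reach_tol=0.5, max_low_steps=500):
    """Strict subgoal execution: an episode terminates as soon as the
    low-level policy fails to reach the currently selected subgoal."""
    state = env.reset()
    high_level_buffer = []   # (s, g, subgoal, high-level reward, s_next)
    done = False
    while not done:
        # epsilon-greedy high-level selection over the candidate subgoal set
        if random.random() < eps:
            subgoal = random.choice(goal_set)
        else:
            subgoal = max(goal_set, key=lambda sg: Qh(state, sg))

        # single low-level invocation: act until the subgoal is reached.
        # (max_low_steps is only a practical cap for this sketch; the
        # formulation above does not impose a fixed low-level horizon.)
        start_state, cum_reward, reached = state, 0.0, False
        for _ in range(max_low_steps):
            action = low_level_policy(state, subgoal)
            state, reward, done, _ = env.step(action)
            cum_reward += reward
            if env.distance(state, subgoal) < reach_tol:   # tight threshold
                reached = True
                break
            if done:
                break

        if reached:
            high_level_buffer.append((start_state, final_goal, subgoal, cum_reward, state))
        else:
            # strict execution: an unreached subgoal yields zero high-level
            # reward and immediately ends the episode
            high_level_buffer.append((start_state, final_goal, subgoal, 0.0, state))
            done = True
    return high_level_buffer
```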

2. Graph Construction, Edge Costs, and Planning

SSE constructs a discrete goal-space graph $G = (V, E)$, where $V$ contains a (potentially grid-structured) set of abstracted states, and $E$ encompasses feasible edges corresponding to the low-level agent's reach.

Edge costs between nodes $v_1$ and $v_2$ capture the empirical difficulty of the transition under the low-level policy:

$$
d(v_1 \rightarrow v_2) := \log_{\gamma}\!\big(1 + (1-\gamma)\, Q^G(v_1, v_2, \pi^l)\big)
$$

Here, $Q^G$ is the expected value of transitioning given the current low-level skill.

The high-level planning problem becomes that of shortest-path search on this graph. Due to the strict reachability constraint, the set of available edges dynamically reflects which subgoals are reliably attainable at any point in training.
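As an illustration, the edge-cost conversion and the shortest-path search can be sketched as follows; the dictionary-based graph, the `edge_cost` helper, and the discount value are assumptions made for this sketch, not the paper's code. Under a $-1$-per-step reward convention, $1 + (1-\gamma)Q^G = \gamma^k$, so the cost evaluates to $k$, the expected number of low-level steps for the transition.

```python
import heapq
import math

def edge_cost(q_value, gamma=0.99):
    """d(v1 -> v2) = log_gamma(1 + (1 - gamma) * Q^G(v1, v2, pi^l)).
    With a -1-per-step reward, Q = -(1 - gamma**k) / (1 - gamma), so the
    cost recovers k, the expected number of low-level steps."""
    inner = max(1.0 + (1.0 - gamma) * q_value, 1e-8)   # guard against underflow
    return math.log(inner) / math.log(gamma)           # log with base gamma

def shortest_path(edges, start, goal):
    """Dijkstra over the goal graph; `edges` maps node -> [(neighbour, cost), ...]."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, c in edges.get(u, []):
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal != start and goal not in prev:
        return None   # goal not reachable through currently feasible edges
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path))
```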

3. Decoupled Exploration Policy

To offset the increased termination frequency resulting from strict subgoal enforcement, SSE employs a decoupled exploration policy, $\pi^{\text{exp}}$, designed to systematically cover the goal space, including rarely visited or hard-to-reach regions, thereby improving exploration efficiency:

$$
\pi^{\text{exp}}(\tilde{g}_t \mid \tilde{s}_t, g) =
\begin{cases}
g & \text{with probability } 1/3 \\
\tilde{g}_{\max,t} & \text{with probability } 1/3 \\
\tilde{g}_{\text{novel}} \in C_{\mathcal{G}^m} & \text{with probability } 1/3
\end{cases}
$$

Here:

  • $g$: the final task goal.
  • $\tilde{g}_{\max,t}$: the current highest-value subgoal.
  • $\tilde{g}_{\text{novel}}$: a subgoal sampled from the least-visited (most novel) grid cell $C_{\mathcal{G}^m}$.

This mixed-mode approach drives both reward seeking and systematic spatial exploration, with the blending ratio adapting over training to give more weight to the learned policy as experience accumulates.
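A sketch of how such a mixed sampling rule might look in Python; the visit-count and cell-to-subgoal structures are illustrative assumptions, and the fixed $1/3$ weights could in practice be annealed toward the learned policy as described above.

```python
import random

def exploration_subgoal(final_goal, best_subgoal, visit_counts, cell_to_subgoals):
    """Decoupled exploration: sample the final goal, the current best subgoal,
    or a subgoal from the least-visited grid cell, each with probability 1/3.
    `visit_counts` (cell -> count) and `cell_to_subgoals` (cell -> candidate
    subgoals) are illustrative data structures, not the paper's API."""
    u = random.random()
    if u < 1.0 / 3.0:
        return final_goal                      # go straight for the task goal
    elif u < 2.0 / 3.0:
        return best_subgoal                    # highest-value subgoal under Q^h
    novel_cell = min(visit_counts, key=visit_counts.get)
    return random.choice(cell_to_subgoals[novel_cell])   # novelty-driven pick
```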

4. Failure-Aware Path Refinement and Reliability

Recognizing that some goal-space regions may be inherently unreliable (i.e., transitions frequently fail under the current low-level policy), SSE incorporates a failure-aware refinement mechanism that inflates the planning cost of subgoals in high-failure cells:

$$
\mathrm{ratio}_{\mathrm{fail}}(C_{\mathcal{G}^m}) = \frac{N_{\mathrm{fail}}(C_{\mathcal{G}^m})}{N(C_{\mathcal{G}^m})}
$$

$$
\tilde{d}(v_1 \rightarrow v_2) = d(v_1 \rightarrow v_2) \times \max\!\big(1,\; c_{\mathrm{dist}} \cdot \mathrm{ratio}_{\mathrm{fail}}(C_{\mathcal{G}^m})\big) \quad \forall\, v_2 \in C_{\mathcal{G}^m}
$$

By updating costs in the goal graph according to observed failure rates, the planner dynamically avoids unreliable regions, favoring robust, high-success paths at test time.
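The refinement step amounts to rescaling edge costs before planning. The sketch below assumes the same dictionary-based graph as in the earlier planning example and a placeholder value for $c_{\mathrm{dist}}$.

```python
def inflate_edge_costs(edges, node_cell, fail_counts, visit_counts, c_dist=5.0):
    """Failure-aware refinement: scale each edge's cost by the empirical
    failure ratio of the grid cell containing its target node,
    d_tilde = d * max(1, c_dist * ratio_fail).  `c_dist` is a placeholder."""
    inflated = {}
    for u, neighbours in edges.items():
        inflated[u] = []
        for v, cost in neighbours:
            cell = node_cell[v]
            visits = visit_counts.get(cell, 0)
            ratio_fail = fail_counts.get(cell, 0) / visits if visits else 0.0
            inflated[u].append((v, cost * max(1.0, c_dist * ratio_fail)))
    return inflated
```

Running the shortest-path planner from the previous section on the inflated graph then biases paths away from cells where the low-level policy has historically failed.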

5. Empirical Evaluation and Performance Analysis

SSE has been evaluated on a collection of nine complex long-horizon RL benchmarks, including AntMaze navigation tasks (U-Maze, Bottleneck, Double Bottleneck), compositional tasks (AntKeyChest, DoubleKeyChest) and control environments (ReacherWall). Empirical findings demonstrate:

  • Superior convergence speed: Faster learning of high success rates than classical HRL and graph-based baselines.
  • Increased task success: Consistently higher final task and episode success rates, especially in compositional, sequential-reward, and bottlenecked environments.
  • Reduced high-level planning horizon: SSE policies typically require only a few reliably completed high-level subgoals per episode.
  • Improved robustness: Dynamic edge cost inflation avoids low-level policy failure zones, as shown by trajectory analyses in bottleneck environments.

6. Applications and Broader Implications

SSE is particularly effective in domains characterized by:

  • Long-horizon composition and strict task sequencing requirements (e.g., multi-stage robotic manipulation, modular assembly, complex navigation).
  • Sparse rewards with significant delayed feedback.
  • Environments where reliable, interpretable policy decomposition is essential for safety or debugging.

More broadly, SSE shifts the focus of hierarchical RL research toward feasibility-constrained subgoal selection and structurally enforced decompositions. The framework's use of decoupled exploration and dynamic path refinement introduces strategic elements that facilitate curriculum-like behavior and practical failure avoidance, suggesting new directions for compositional RL and robust planning in real-world tasks.

Summary Table

Aspect | SSE Contribution
Strict Subgoal Execution | Enforces single-step reachability; subgoals must be reliably completed for transition validity
Decoupled Exploration | Systematically covers the goal space; balances reward seeking and novelty-driven exploration
Failure-Aware Path Refinement | Adjusts planning dynamically to avoid unreliable (high-failure) regions; improves test-time robustness
Experimental Superiority | Demonstrated faster convergence and higher success rates on complex long-horizon RL benchmarks
Applications & Implications | Enables reliable, sample-efficient, and interpretable HRL; applicable in robotics, navigation, and compositional tasks

In summary, Strict Subgoal Execution provides a principled, effective framework for reliable hierarchical RL in long-horizon, sparse-reward settings, structurally addressing subgoal infeasibility, inefficient exploration, and brittleness to low-level failures by combining strict completion constraints with adaptive exploration and planning mechanisms.