- The paper introduces a discrete reachability-based reward scheme that overcomes the challenges of continuous distance metrics in hierarchical planning.
- The methodology decomposes tasks into binary subtask trees, trains the planner with min-based tree returns, and uses conditional VAEs to propose efficient subgoals.
- Empirical results show DHP achieves a 99% success rate and efficient inference, outperforming state-of-the-art benchmarks in long-horizon navigation.
This paper introduces Discrete Hierarchical Planning (DHP), a novel method for training Hierarchical Reinforcement Learning (HRL) agents on long-horizon visual planning tasks. DHP addresses the limitations of traditional hierarchical planning methods that rely on learning continuous distance metrics between states, which can be difficult to learn accurately, especially with self-generated data where suboptimal policies can lead to erroneous distance estimates.
Instead of distance metrics, DHP proposes using a discrete, reachability-based reward scheme. The core idea is to determine if a subgoal is directly achievable by a low-level "worker" policy within a fixed number of steps (K). This binary reachability check provides a clearer, more robust learning signal compared to continuous distances.
Methodology: Discrete Hierarchical Planning
- Hierarchical Planning Policy: A goal-conditioned planning policy $\pi_\theta(s_{g_1} \mid s_t, s_g)$ learns to predict an intermediate subgoal $s_{g_1}$ given the current state $s_t$ and the final goal $s_g$.
- Plan Unrolling: This policy is applied recursively, decomposing the original task $(s_t, s_g)$ into $(s_t, s_{g_1})$ and $(s_{g_1}, s_g)$, and further decomposing these subtasks. This creates a binary tree $\tau$ in which each node $n_i$ represents a subtask $(s_{t_i}, s_{g_i})$. The tree is unrolled up to a maximum depth $D$.
- Discrete Rewards: For each node $n_i$ in the tree, the worker policy is simulated for $K$ steps starting from $s_{t_i}$ with $s_{g_i}$ as its goal. If the worker's final state $s_{f_i}$ is sufficiently close to the subgoal $s_{g_i}$ (`cosine_max` similarity above a threshold $\Delta_R$), the node is marked reachable (terminal, $T_i = \text{True}$) and receives a reward $R_i = 1$; otherwise $R_i = 0$. Terminality propagates down the tree: a node is also marked terminal if its parent was already terminal.
- Advantage Estimation for Trees: To encourage finding reachable subgoals quickly (shorter plans and shallower trees), a novel return formulation is proposed. Instead of summing the child returns, it takes their minimum, combined with discounting:
$$G_i = \min\left(R_{2i+1} + \gamma G_{2i+1},\; R_{2i+2} + \gamma G_{2i+2}\right)$$
This ensures high returns only if both branches eventually lead to reachable subgoals, and the discount factor $\gamma$ penalizes deeper trees. Lambda returns $G_i^\lambda$ are used for variance reduction, incorporating a learned value function $v_\phi$:
$$G_i^\lambda = \min\left(R_{2i+1} + \gamma\big((1-\lambda)\,v_\phi(n_{2i+1}) + \lambda G_{2i+1}^\lambda\big),\; R_{2i+2} + \gamma\big((1-\lambda)\,v_\phi(n_{2i+2}) + \lambda G_{2i+2}^\lambda\big)\right)$$
The value function $v_\phi$ is trained to predict these returns and, importantly, allows bootstrapping at non-terminal leaf nodes, enabling generalization beyond the training depth $D$. A sketch of this bottom-up return computation appears after this list.
- Policy Gradients: Policy gradients are derived for the tree structure, allowing the planning policy $\pi_\theta$ to be updated with standard policy-gradient methods (such as SAC) using the calculated tree advantages $A_i = G_i^\lambda - v_\phi(n_i)$:
$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\left[\sum_{i=0}^{2^D-2} A_i(\tau)\, \nabla_\theta \log \pi_\theta(a_i \mid n_i)\right]$$
The policy loss includes an entropy term for exploration.
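Below is a minimal Python sketch of the bottom-up return and advantage computation on the subtask tree. It is not the authors' code: the heap-style node indexing, the zero return at terminal leaves, and the `tree_lambda_returns` name and signature are illustrative assumptions; only the min/discount/bootstrapping structure follows the formulas above.

```python
import numpy as np

def tree_lambda_returns(rewards, values, terminal, depth, gamma=0.95, lam=0.95):
    """Bottom-up min-based lambda returns over a full binary subtask tree.

    Heap indexing is assumed: node i has children 2i+1 and 2i+2, and a tree of
    depth `depth` has 2**depth - 1 nodes (indices 0 .. 2**depth - 2).
    `rewards[i]` is the binary reachability reward R_i, `values[i]` is v_phi(n_i),
    and `terminal[i]` marks nodes whose subtask the worker can reach directly.
    """
    values = np.asarray(values, dtype=float)
    n_nodes = 2 ** depth - 1
    first_leaf = 2 ** (depth - 1) - 1
    returns = np.zeros(n_nodes)

    # Leaf nodes: bootstrap from the learned value function unless already terminal
    # (this bootstrapping is what lets the planner generalize beyond the training depth).
    for i in range(first_leaf, n_nodes):
        returns[i] = 0.0 if terminal[i] else values[i]

    # Internal nodes: discounted lambda-return of each child, combined with min().
    for i in range(first_leaf - 1, -1, -1):
        left, right = 2 * i + 1, 2 * i + 2
        g_left = rewards[left] + gamma * ((1 - lam) * values[left] + lam * returns[left])
        g_right = rewards[right] + gamma * ((1 - lam) * values[right] + lam * returns[right])
        returns[i] = min(g_left, g_right)

    advantages = returns - values        # A_i = G_i^lambda - v_phi(n_i)
    return returns, advantages
```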
Agent Architecture
The agent employs an HRL architecture with three main components:
- Perception: Uses a Recurrent State Space Model (RSSM) \cite{hafner2019learning} to learn compact state representations $s_t$ from sequences of image observations $o_t$. This allows for efficient on-policy training using imagined rollouts. Static state representations are also learned via an MLP for single observations (needed for goal states).
- Worker: A goal-conditioned Soft Actor-Critic (SAC) agent trained to reach a given worker subgoal $s_{wg}$ within $K$ steps. Its reward is the `cosine_max` similarity to $s_{wg}$ (a sketch of this reward follows this list).
- Manager (Planning Policy): The DHP planning policy $\pi_\theta$, also implemented as a goal-conditioned SAC agent, trained using the tree-based returns and advantages described above. It operates in the latent space of the Conditional State Recall modules described below.
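To make the worker reward and the reachability check concrete, here is a small sketch. Plain cosine similarity stands in for the paper's `cosine_max` measure, and `policy.act` / `worker_env.step` are hypothetical interfaces; only the $K$-step rollout and the threshold $\Delta_R$ follow the description above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Plain cosine similarity between two state feature vectors
    (a stand-in for the paper's cosine_max measure)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reachability_reward(worker_env, policy, s_start, s_goal, k_steps, delta_r):
    """Roll the worker out for K steps toward s_goal; return (R_i, T_i).

    `policy.act` and `worker_env.step` are hypothetical hooks for the
    goal-conditioned SAC worker acting in the learned latent space.
    """
    state = s_start
    for _ in range(k_steps):
        action = policy.act(state, s_goal)
        state = worker_env.step(state, action)
    reached = cosine_similarity(state, s_goal) > delta_r
    return (1.0 if reached else 0.0), reached
```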
Key Implementation Details
- Conditional State Recall (CSR): To reduce the search space for subgoals, two Conditional VAEs (CVAEs) are used:
- Goal-CSR (GCSR): Predicts a midway state $s_{t+q/2}$ given start $s_t$ and end $s_{t+q}$ states. The manager predicts a latent variable $z$, and the GCSR decoder generates the subgoal state: $\hat{s}_{t+q/2} = \mathrm{Dec}_G(s_t, s_{t+q}, z)$. Trained on replay-data triplets at multiple time scales $q \in \{2K, 4K, \ldots\}$.
- Initial-CSR (ICSR): Predicts a nearby state $s_{t+K}$ given the start state $s_t$. The decoder generates the state: $\hat{s}_{t+K} = \mathrm{Dec}_I(s_t, z)$. Used for efficient reachability checks during inference and for goal generation during exploration.
- Plan Inference: At test time, only the first branch of the tree is unrolled recursively until a reachable subgoal is found (using the ICSR for fast checking). The first reachable subgoal along this path is handed to the worker. This takes logarithmic time ($O(\log N)$ for a horizon of $N$), and the agent can plan deeper at inference ($D_{\text{Inf}}$) than the depth $D$ it was trained with. A sketch of this loop appears after this list.
- Memory Augmented Exploration: A novel exploration strategy is proposed to collect relevant data. Instead of rewarding visits to novel states (which led to suboptimal trajectories), the agent is rewarded for traversing novel path segments and state transitions. Novelty is measured by the reconstruction error of the GCSR and ICSR modules.
$$R_t^E = \left\lVert s_t - \mathrm{Dec}_I(s_{t-K}, z_I) \right\rVert^2 + \sum_{q \in Q} \left\lVert s_{t-q/2} - \mathrm{Dec}_G(s_{t-q}, s_t, z_G) \right\rVert^2$$
Since these rewards depend on past states, the exploratory policy $\pi_E$ is augmented with a memory input $\mathrm{mem}_t = \{s_{t-K}, s_{t-2K}, \ldots\}$ containing the relevant past states.
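A sketch of the first-branch plan inference follows. `manager.propose_subgoal` (a GCSR-decoded midway prediction) and `icsr_reachable` (the fast ICSR-based check) are hypothetical helper names; only the control flow follows the description above.

```python
def plan_first_subgoal(state, goal, manager, icsr_reachable, max_depth):
    """Unroll only the first branch of the plan tree until a reachable subgoal is found.

    Each iteration halves the remaining task, so the number of subgoal proposals
    grows with max_depth, i.e. O(log N) for a horizon of N steps. max_depth may
    exceed the training depth D (D_Inf > D).
    """
    subgoal = goal
    for _ in range(max_depth):
        if icsr_reachable(state, subgoal):                  # fast ICSR-based check
            break                                           # hand this subgoal to the worker
        subgoal = manager.propose_subgoal(state, subgoal)   # GCSR-decoded midway state
    return subgoal
```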
Results
- Evaluated on a challenging 25-room visual navigation task requiring long-horizon planning (>150 steps average for previous SOTA).
- DHP significantly outperforms previous benchmarks (GC BC, Visual Foresight, GCP \cite{pertsch2020long}), achieving a 99% success rate (vs. 82% for GCP) and much shorter average path lengths (71.37 vs. 158.06 for GCP).
- Ablation studies confirm the benefits of the proposed memory-augmented exploration strategy over vanilla exploration and over using expert data. An ablation without memory augmentation performs worse than the full model but still better than vanilla exploration or expert data.
- The agent generalizes beyond the maximum training tree depth: an agent trained with $D=3$ performs as well as one trained with $D=5$.
- Alternative reward schemes show that a negative reward variant performs similarly well, while a distance-based sum approach performs poorly, validating the discrete reachability approach.
- Inference is efficient, running at >200 Hz on an NVIDIA RTX 4090.
Practical Implications
DHP provides a practical alternative to distance-based hierarchical planning. Key takeaways for implementation include:
- Using binary reachability checks simplifies reward design and improves learning stability compared to distance metrics.
- The min-based tree return formulation effectively encourages shorter, successful plans.
- Bootstrapping with a value function is crucial for generalizing beyond fixed training horizons.
- Conditional VAEs (CSR modules) can effectively constrain the action space (subgoal prediction) for the manager.
- Targeted exploration rewarding novel transitions/paths (using CSR errors) generates more useful data than generic novelty rewards. Memory augmentation is needed for policies using time-lagged rewards.
- The approach works with self-generated data, removing the need for expert demonstrations.
- The resulting planner is efficient at inference time, suitable for real-time applications.