DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents (2502.01956v2)

Published 4 Feb 2025 in cs.RO, cs.AI, and cs.LG

Abstract: Hierarchical Reinforcement Learning (HRL) agents often struggle with long-horizon visual planning due to their reliance on error-prone distance metrics. We propose Discrete Hierarchical Planning (DHP), a method that replaces continuous distance estimates with discrete reachability checks to evaluate subgoal feasibility. DHP recursively constructs tree-structured plans by decomposing long-term goals into sequences of simpler subtasks, using a novel advantage estimation strategy that inherently rewards shorter plans and generalizes beyond training depths. In addition, to address the data efficiency challenge, we introduce an exploration strategy that generates targeted training examples for the planning modules without needing expert data. Experiments in 25-room navigation environments demonstrate $100\%$ success rate (vs $82\%$ baseline) and $73$-step average episode length (vs $158$-step baseline). The method also generalizes to momentum-based control tasks and requires only $\log N$ steps for replanning. Theoretical analysis and ablations validate our design choices.

Summary

  • The paper introduces a discrete reachability-based reward scheme that overcomes the challenges of continuous distance metrics in hierarchical planning.
  • The methodology decomposes tasks into binary trees using min-based returns and conditional VAEs to predict efficient subgoals.
  • Empirical results show DHP achieves a 99% success rate and efficient inference, outperforming state-of-the-art benchmarks in long-horizon navigation.

This paper introduces Discrete Hierarchical Planning (DHP), a novel method for training Hierarchical Reinforcement Learning (HRL) agents on long-horizon visual planning tasks. DHP addresses the limitations of traditional hierarchical planning methods that rely on learning continuous distance metrics between states, which can be difficult to learn accurately, especially with self-generated data where suboptimal policies can lead to erroneous distance estimates.

Instead of distance metrics, DHP proposes using a discrete, reachability-based reward scheme. The core idea is to determine whether a subgoal is directly achievable by a low-level "worker" policy within a fixed number of steps ($K$). This binary reachability check provides a clearer, more robust learning signal compared to continuous distances.
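
As a rough illustration, here is a minimal sketch of such a reachability check (the $K$-step worker rollout and the $\Delta_R$ threshold come from the paper; the helper names are hypothetical, and plain cosine similarity stands in for the paper's cosine_max measure):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two latent state vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reachability_reward(worker_rollout, s_start, s_goal, delta_r=0.9):
    """Binary reachability check for one tree node (s_start, s_goal).

    `worker_rollout(s_start, s_goal)` is assumed to return the worker's final
    latent state after K imagined steps; `delta_r` plays the role of the
    paper's threshold Delta_R. This is a sketch, not the authors' code.
    """
    s_final = worker_rollout(s_start, s_goal)        # state after K worker steps
    reachable = cosine_similarity(s_final, s_goal) > delta_r
    return (1.0 if reachable else 0.0), reachable    # reward R_i, terminal flag T_i
```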

Methodology: Discrete Hierarchical Planning

  1. Hierarchical Planning Policy: A goal-conditioned planning policy $\pi_\theta(s_{g1} \mid s_t, s_g)$ learns to predict an intermediate subgoal $s_{g1}$ given the current state $s_t$ and the final goal $s_g$.
  2. Plan Unrolling: This policy is applied recursively, decomposing the original task $(s_t, s_g)$ into $(s_t, s_{g1})$ and $(s_{g1}, s_g)$, and further decomposing these subtasks. This creates a binary tree $\tau$ where each node $n_i$ represents a subtask $(s_{ti}, s_{gi})$. The tree is unrolled up to a maximum depth $D$.
  3. Discrete Rewards: For each node $n_i$ in the tree, the worker policy is simulated for $K$ steps starting from $s_{ti}$ with $s_{gi}$ as its goal. If the worker's final state $s_{fi}$ is sufficiently close to the subgoal $s_{gi}$ (cosine_max similarity $> \Delta_R$), the node is marked as reachable (terminal, $T_i = \text{True}$) and receives a reward $R_i = 1$; otherwise $R_i = 0$. Reachability also propagates down the tree: $T_i$ is true if the node's parent was already terminal.
  4. Advantage Estimation for Trees: To encourage finding reachable subgoals quickly (shorter plans/shallower trees), a novel return formulation is proposed. Instead of summing child returns, it takes the minimum, combined with discounting:

    $G_i = \min\left(R_{2i+1} + \gamma G_{2i+1},\ R_{2i+2} + \gamma G_{2i+2}\right)$

    This ensures high returns only if both branches eventually lead to reachable subgoals, and the discount factor $\gamma$ penalizes deeper trees. Lambda returns ($G_i^\lambda$) are used for variance reduction, incorporating a learned value function $v_\phi$ (a worked sketch of this computation follows the list):

    $G_i^\lambda = \min\left(R_{2i+1} + \gamma\bigl((1-\lambda)\, v_\phi(n_{2i+1}) + \lambda G_{2i+1}^\lambda\bigr),\ R_{2i+2} + \gamma\bigl((1-\lambda)\, v_\phi(n_{2i+2}) + \lambda G_{2i+2}^\lambda\bigr)\right)$

    The value function $v_\phi$ is trained to predict these returns and, importantly, allows bootstrapping at non-terminal leaf nodes, enabling generalization beyond the training depth $D$.

  5. Policy Gradients: Policy gradients are derived for the tree structure, allowing the planning policy $\pi_\theta$ to be updated using standard policy gradient methods (like SAC) with the calculated tree advantages $A_i = G_i^\lambda - v_\phi(n_i)$:

    $\nabla_\theta J(\theta) = \mathbb{E}_\tau \left[ \sum_{i=0}^{2^D-2} A_i(\tau)\, \nabla_\theta \log \pi_\theta(a_i \mid n_i) \right]$

    The policy loss includes an entropy term for exploration.
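
As referenced in item 4, here is a minimal sketch of the min-based lambda-return and advantage computation over a heap-ordered plan tree (children of node $i$ at indices $2i+1$ and $2i+2$, matching the equations above). Function and argument names are illustrative, not the authors' implementation:

```python
import numpy as np

def tree_lambda_returns(rewards, values, terminals, depth, gamma=0.95, lam=0.95):
    """Min-based lambda returns and advantages over a binary subgoal tree.

    The tree is stored in heap order with 2**(depth+1) - 1 nodes. `rewards[i]`
    is the binary reachability reward R_i, `values[i]` the critic estimate
    v(n_i), and `terminals[i]` the reachability flag T_i. Non-terminal leaves
    bootstrap from the value function, which is what lets the planner
    generalize past the training depth. A sketch of the paper's equations,
    not the authors' code.
    """
    n_nodes = 2 ** (depth + 1) - 1
    first_leaf = 2 ** depth - 1
    returns = np.zeros(n_nodes)

    # Traverse bottom-up so children are finished before their parent.
    for i in reversed(range(n_nodes)):
        if i >= first_leaf:
            # Leaf: zero continuation if already reachable, else bootstrap.
            returns[i] = 0.0 if terminals[i] else values[i]
            continue
        left, right = 2 * i + 1, 2 * i + 2
        g_left = rewards[left] + gamma * ((1 - lam) * values[left] + lam * returns[left])
        g_right = rewards[right] + gamma * ((1 - lam) * values[right] + lam * returns[right])
        returns[i] = min(g_left, g_right)   # both branches must lead to reachable subgoals

    advantages = returns - values           # A_i = G_i^lambda - v(n_i)
    return returns, advantages
```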

Agent Architecture

The agent employs an HRL architecture with three main components:

  1. Perception: Uses a Recurrent State Space Model (RSSM) \cite{hafner2019learning} to learn compact state representations $s_t$ from sequences of image observations $o_t$. This allows for efficient on-policy training using imagined rollouts. Static state representations are also learned via an MLP for single observations (needed for goal states).
  2. Worker: A goal-conditioned Soft Actor-Critic (SAC) agent trained to reach a given worker subgoal $s_{wg}$ within $K$ steps. Its reward is the cosine_max similarity to $s_{wg}$.
  3. Manager (Planning Policy): The DHP planning policy $\pi_\theta$, also implemented as a goal-conditioned SAC agent, trained using the tree-based returns and advantages described above. It operates in the latent space of the Conditional State Recall modules.
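
Putting the three components together, here is a minimal, hypothetical sketch of the execution-time control flow (the paper does not prescribe this exact interface; `perception`, `manager_plan`, and `worker_act` stand in for the RSSM, the DHP planner with its CSR decoders, and the SAC worker, and plain cosine similarity stands in for cosine_max):

```python
import numpy as np

class DHPAgent:
    """Structural sketch of the three-component HRL agent (illustrative only)."""

    def __init__(self, perception, manager_plan, worker_act, k_steps: int):
        self.perception = perception      # obs -> latent state s_t (RSSM)
        self.manager_plan = manager_plan  # (s_t, s_goal) -> worker subgoal
        self.worker_act = worker_act      # (s_t, subgoal) -> low-level action
        self.k = k_steps                  # worker horizon K
        self.subgoal = None
        self.t = 0

    def act(self, obs, s_goal):
        s_t = self.perception(obs)
        # Replan every K steps: the manager returns the first reachable
        # subgoal along its unrolled plan (see Plan Inference below).
        if self.subgoal is None or self.t % self.k == 0:
            self.subgoal = self.manager_plan(s_t, s_goal)
        self.t += 1
        return self.worker_act(s_t, self.subgoal)

def worker_reward(s_t: np.ndarray, subgoal: np.ndarray) -> float:
    """Worker reward: similarity of the current state to its subgoal
    (plain cosine similarity as a stand-in for the paper's cosine_max)."""
    return float(s_t @ subgoal /
                 (np.linalg.norm(s_t) * np.linalg.norm(subgoal) + 1e-8))
```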

Key Implementation Details

  • Conditional State Recall (CSR): To reduce the search space for subgoals, two Conditional VAEs (CVAEs) are used:
    • Goal-CSR (GCSR): Predicts a midway state $s_{t+q/2}$ given start $s_t$ and end $s_{t+q}$ states. The manager predicts a latent variable $z$, and the GCSR decoder generates the subgoal state: $\hat{s}_{t+q/2} = \text{Dec}_G(s_t, s_{t+q}, z)$. Trained on replay-data triplets at multiple time scales $q \in \{2K, 4K, \ldots\}$.
    • Initial-CSR (ICSR): Predicts a nearby state $s_{t+K}$ given the start state $s_t$. The decoder generates the state: $\hat{s}_{t+K} = \text{Dec}_I(s_t, z)$. Used for efficient reachability checks during inference and goal generation during exploration.
  • Plan Inference: At test time, only the first branch of the tree is unrolled recursively until a reachable subgoal is found (using the ICSR for fast checking). The first reachable subgoal along this path is given to the worker. This takes logarithmic time ($O(\log N)$ for horizon $N$), and the agent can plan deeper at inference ($D_{\text{Inf}}$) than the depth it was trained on ($D$); see the first sketch after this list.
  • Memory-Augmented Exploration: A novel exploration strategy is proposed to collect relevant data. Instead of rewarding visits to novel states (which led to suboptimal trajectories), the agent is rewarded for traversing novel path segments and state transitions. Novelty is measured by the reconstruction error of the GCSR and ICSR modules (see the second sketch after this list):

    $R^E_t = \Vert s_{t} - \text{Dec}_I(s_{t-K}, z_I)\Vert^2 + \sum_{q \in Q} \Vert s_{t-q/2} - \text{Dec}_G(s_{t-q}, s_{t}, z_G) \Vert^2$

    Since these rewards depend on past states, the exploratory policy $\pi_E$ is augmented with a memory input $\text{mem}_t = \{s_{t-K}, s_{t-2K}, \ldots\}$ containing the relevant past states.
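
First, a minimal sketch of the plan-inference recursion referenced above; `propose_subgoal` and `is_reachable` are hypothetical wrappers around the manager plus GCSR decoder and the fast ICSR check, not the authors' API:

```python
def infer_subgoal(s_t, s_goal, propose_subgoal, is_reachable, depth_left):
    """Plan-inference sketch: unroll only the first (left) branch of the plan
    tree until a directly reachable subgoal is found, visiting O(log N) nodes
    for a horizon of N steps.
    """
    if depth_left == 0 or is_reachable(s_t, s_goal):
        return s_goal                           # hand this subgoal to the worker
    midpoint = propose_subgoal(s_t, s_goal)     # predicted midway state of the task
    # Recurse only into the first half of the plan (the branch starting at s_t).
    return infer_subgoal(s_t, midpoint, propose_subgoal, is_reachable, depth_left - 1)
```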
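
Second, a sketch of the exploration reward $R^E_t$ above, again with assumed decoder interfaces (`dec_I`, `dec_G`, `z_I`, `z_G` stand in for the CSR decoders and their latents):

```python
import numpy as np

def exploration_reward(states, t, K, Q, dec_I, dec_G, z_I, z_G):
    """Memory-augmented exploration reward sketch: reconstruction error of the
    ICSR and GCSR modules on recently traversed path segments. `states[i]` is
    the latent state at timestep i. Not the authors' implementation.
    """
    s_t = states[t]
    # ICSR term: how surprising is the K-step transition ending at s_t?
    reward = float(np.sum((s_t - dec_I(states[t - K], z_I)) ** 2))
    # GCSR terms: how surprising are the traversed segments at each time scale?
    for q in Q:                                    # e.g. Q = {2K, 4K, ...}
        midpoint = dec_G(states[t - q], s_t, z_G)  # predicted midway state
        reward += float(np.sum((states[t - q // 2] - midpoint) ** 2))
    return reward
```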

Results

  • Evaluated on a challenging 25-room visual navigation task requiring long-horizon planning (>150 steps average for previous SOTA).
  • DHP significantly outperforms previous benchmarks (GC BC, Visual Foresight, GCP \cite{pertsch2020long}), achieving a 99% success rate (vs. 82% for GCP) and much shorter average path lengths (71.37 vs. 158.06 for GCP).
  • Ablation studies confirm the benefits of the proposed memory-augmented exploration strategy over vanilla exploration and using expert data. The agent without memory performs worse than the full model but better than vanilla/expert data.
  • The agent demonstrates generalization beyond the maximum training tree depth ($D=3$ performs as well as $D=5$).
  • Alternative reward schemes show that a negative reward variant performs similarly well, while a distance-based sum approach performs poorly, validating the discrete reachability approach.
  • Inference is efficient, running at >200 Hz on an NVIDIA RTX 4090.

Practical Implications

DHP provides a practical alternative to distance-based hierarchical planning. Key takeaways for implementation include:

  • Using binary reachability checks simplifies reward design and improves learning stability compared to distance metrics.
  • The min-based tree return formulation effectively encourages shorter, successful plans.
  • Bootstrapping with a value function is crucial for generalizing beyond fixed training horizons.
  • Conditional VAEs (CSR modules) can effectively constrain the action space (subgoal prediction) for the manager.
  • Targeted exploration rewarding novel transitions/paths (using CSR errors) generates more useful data than generic novelty rewards. Memory augmentation is needed for policies using time-lagged rewards.
  • The approach works with self-generated data, removing the need for expert demonstrations.
  • The resulting planner is efficient at inference time, suitable for real-time applications.