- The paper introduces a discrete reachability-based reward scheme that overcomes the challenges of continuous distance metrics in hierarchical planning.
- The methodology decomposes tasks into binary subtask trees, trains the planner with min-based tree returns, and uses conditional VAEs to propose efficient subgoals.
- Empirical results show DHP achieves a 99% success rate and efficient inference, outperforming state-of-the-art benchmarks in long-horizon navigation.
This paper introduces Discrete Hierarchical Planning (DHP), a novel method for training Hierarchical Reinforcement Learning (HRL) agents on long-horizon visual planning tasks. DHP addresses the limitations of traditional hierarchical planning methods that rely on learning continuous distance metrics between states, which can be difficult to learn accurately, especially with self-generated data where suboptimal policies can lead to erroneous distance estimates.
Instead of distance metrics, DHP proposes using a discrete, reachability-based reward scheme. The core idea is to determine if a subgoal is directly achievable by a low-level "worker" policy within a fixed number of steps (K). This binary reachability check provides a clearer, more robust learning signal compared to continuous distances.
Methodology: Discrete Hierarchical Planning
- Hierarchical Planning Policy: A goal-conditioned planning policy $\pi_\theta(s_{g_1} \mid s_t, s_g)$ learns to predict an intermediate subgoal $s_{g_1}$ given the current state $s_t$ and the final goal $s_g$.
- Plan Unrolling: This policy is applied recursively, decomposing the original task $(s_t, s_g)$ into $(s_t, s_{g_1})$ and $(s_{g_1}, s_g)$, and further decomposing these subtasks. This creates a binary tree $\tau$ in which each node $n_i$ represents a subtask $(s_{t_i}, s_{g_i})$. The tree is unrolled up to a maximum depth $D$.
- Discrete Rewards: For each node $n_i$ in the tree, the worker policy is simulated for $K$ steps starting from $s_{t_i}$ with $s_{g_i}$ as its goal. If the worker's final state $s_{f_i}$ is sufficiently close to the subgoal $s_{g_i}$ (`cosine_max` similarity above a threshold $\Delta_R$), the node is marked reachable (terminal, $T_i = \text{True}$) and receives a reward $R_i = 1$; otherwise $R_i = 0$. Terminality propagates down the tree: a node is also marked terminal if its parent was already terminal.
- Advantage Estimation for Trees: To encourage finding reachable subgoals quickly (shorter plans and shallower trees), a novel return formulation is proposed. Instead of summing the child returns, it takes their minimum, combined with discounting:
$$G_i = \min\left(R_{2i+1} + \gamma G_{2i+1},\; R_{2i+2} + \gamma G_{2i+2}\right)$$
This ensures high returns only if both branches eventually lead to reachable subgoals, and the discount factor $\gamma$ penalizes deeper trees. Lambda returns $G_i^\lambda$ are used for variance reduction, incorporating a learned value function $v_\phi$:
$$G_i^\lambda = \min\left(R_{2i+1} + \gamma\big((1-\lambda)\,v_\phi(n_{2i+1}) + \lambda G_{2i+1}^\lambda\big),\; R_{2i+2} + \gamma\big((1-\lambda)\,v_\phi(n_{2i+2}) + \lambda G_{2i+2}^\lambda\big)\right)$$
The value function $v_\phi$ is trained to predict these returns and, importantly, allows bootstrapping at non-terminal leaf nodes, enabling generalization beyond the training depth $D$. A sketch of this bottom-up return computation appears after this list.
- Policy Gradients: Policy gradients are derived for the tree structure, allowing the planning policy $\pi_\theta$ to be updated with standard policy-gradient methods (such as SAC) using the calculated tree advantages $A_i = G_i^\lambda - v_\phi(n_i)$:
$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\left[\sum_{i=0}^{2^D-2} A_i(\tau)\, \nabla_\theta \log \pi_\theta(a_i \mid n_i)\right]$$
The policy loss includes an entropy term for exploration.
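Below is a minimal Python sketch of the bottom-up return and advantage computation on the subtask tree. It is not the authors' code: the heap-style node indexing, the zero return at terminal leaves, and the `tree_lambda_returns` name and signature are illustrative assumptions; only the min/discount/bootstrapping structure follows the formulas above.

```python
import numpy as np

def tree_lambda_returns(rewards, values, terminal, depth, gamma=0.95, lam=0.95):
    """Bottom-up min-based lambda returns over a full binary subtask tree.

    Heap indexing is assumed: node i has children 2i+1 and 2i+2, and a tree of
    depth `depth` has 2**depth - 1 nodes (indices 0 .. 2**depth - 2).
    `rewards[i]` is the binary reachability reward R_i, `values[i]` is v_phi(n_i),
    and `terminal[i]` marks nodes whose subtask the worker can reach directly.
    """
    values = np.asarray(values, dtype=float)
    n_nodes = 2 ** depth - 1
    first_leaf = 2 ** (depth - 1) - 1
    returns = np.zeros(n_nodes)

    # Leaf nodes: bootstrap from the learned value function unless already terminal
    # (this bootstrapping is what lets the planner generalize beyond the training depth).
    for i in range(first_leaf, n_nodes):
        returns[i] = 0.0 if terminal[i] else values[i]

    # Internal nodes: discounted lambda-return of each child, combined with min().
    for i in range(first_leaf - 1, -1, -1):
        left, right = 2 * i + 1, 2 * i + 2
        g_left = rewards[left] + gamma * ((1 - lam) * values[left] + lam * returns[left])
        g_right = rewards[right] + gamma * ((1 - lam) * values[right] + lam * returns[right])
        returns[i] = min(g_left, g_right)

    advantages = returns - values        # A_i = G_i^lambda - v_phi(n_i)
    return returns, advantages
```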
Agent Architecture
The agent employs an HRL architecture with three main components:
- Perception: Uses a Recurrent State Space Model (RSSM) \cite{hafner2019learning} to learn compact state representations $s_t$ from sequences of image observations $o_t$. This allows for efficient on-policy training using imagined rollouts. Static state representations are also learned via an MLP for single observations (needed for goal states).
- Worker: A goal-conditioned Soft Actor-Critic (SAC) agent trained to reach a given worker subgoal $s_{wg}$ within $K$ steps. Its reward is the `cosine_max` similarity to $s_{wg}$ (a sketch of this reward follows this list).
- Manager (Planning Policy): The DHP planning policy $\pi_\theta$, also implemented as a goal-conditioned SAC agent, trained using the tree-based returns and advantages described above. It operates in the latent space of the Conditional State Recall modules described below.
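To make the worker reward and the reachability check concrete, here is a small sketch. Plain cosine similarity stands in for the paper's `cosine_max` measure, and `policy.act` / `worker_env.step` are hypothetical interfaces; only the $K$-step rollout and the threshold $\Delta_R$ follow the description above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Plain cosine similarity between two state feature vectors
    (a stand-in for the paper's cosine_max measure)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def reachability_reward(worker_env, policy, s_start, s_goal, k_steps, delta_r):
    """Roll the worker out for K steps toward s_goal; return (R_i, T_i).

    `policy.act` and `worker_env.step` are hypothetical hooks for the
    goal-conditioned SAC worker acting in the learned latent space.
    """
    state = s_start
    for _ in range(k_steps):
        action = policy.act(state, s_goal)
        state = worker_env.step(state, action)
    reached = cosine_similarity(state, s_goal) > delta_r
    return (1.0 if reached else 0.0), reached
```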
Key Implementation Details
- Conditional State Recall (CSR): To reduce the search space for subgoals, two Conditional VAEs (CVAEs) are used:
- Goal-CSR (GCSR): Predicts a midway state $s_{t+q/2}$ given start $s_t$ and end $s_{t+q}$ states. The manager predicts a latent variable $z$, and the GCSR decoder generates the subgoal state: $\hat{s}_{t+q/2} = \mathrm{Dec}_G(s_t, s_{t+q}, z)$. Trained on replay-data triplets at multiple time scales $q \in \{2K, 4K, \ldots\}$.
- Initial-CSR (ICSR): Predicts a nearby state $s_{t+K}$ given the start state $s_t$. The decoder generates the state: $\hat{s}_{t+K} = \mathrm{Dec}_I(s_t, z)$. Used for efficient reachability checks during inference and for goal generation during exploration.
- Plan Inference: At test time, only the first branch of the tree is unrolled recursively until a reachable subgoal is found (using the ICSR for fast checking). The first reachable subgoal along this path is handed to the worker. This takes logarithmic time ($O(\log N)$ for a horizon of $N$), and the agent can plan deeper at inference ($D_{\text{Inf}}$) than the depth $D$ it was trained with. A sketch of this loop appears after this list.
- Memory Augmented Exploration: A novel exploration strategy is proposed to collect relevant data. Instead of rewarding visits to novel states (which led to suboptimal trajectories), the agent is rewarded for traversing novel path segments and state transitions. Novelty is measured by the reconstruction error of the GCSR and ICSR modules.
$$R_t^E = \left\lVert s_t - \mathrm{Dec}_I(s_{t-K}, z_I) \right\rVert^2 + \sum_{q \in Q} \left\lVert s_{t-q/2} - \mathrm{Dec}_G(s_{t-q}, s_t, z_G) \right\rVert^2$$
Since these rewards depend on past states, the exploratory policy $\pi_E$ is augmented with a memory input $\mathrm{mem}_t = \{s_{t-K}, s_{t-2K}, \ldots\}$ containing the relevant past states.
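A sketch of the first-branch plan inference follows. `manager.propose_subgoal` (a GCSR-decoded midway prediction) and `icsr_reachable` (the fast ICSR-based check) are hypothetical helper names; only the control flow follows the description above.

```python
def plan_first_subgoal(state, goal, manager, icsr_reachable, max_depth):
    """Unroll only the first branch of the plan tree until a reachable subgoal is found.

    Each iteration halves the remaining task, so the number of subgoal proposals
    grows with max_depth, i.e. O(log N) for a horizon of N steps. max_depth may
    exceed the training depth D (D_Inf > D).
    """
    subgoal = goal
    for _ in range(max_depth):
        if icsr_reachable(state, subgoal):                  # fast ICSR-based check
            break                                           # hand this subgoal to the worker
        subgoal = manager.propose_subgoal(state, subgoal)   # GCSR-decoded midway state
    return subgoal
```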
Results
- Evaluated on a challenging 25-room visual navigation task requiring long-horizon planning (>150 steps average for previous SOTA).
- DHP significantly outperforms previous benchmarks (GC BC, Visual Foresight, GCP \cite{pertsch2020long}), achieving a 99% success rate (vs. 82% for GCP) and much shorter average path lengths (71.37 vs. 158.06 for GCP).
- Ablation studies confirm the benefits of the proposed memory-augmented exploration strategy over vanilla exploration and over using expert data. An ablation without memory augmentation performs worse than the full model but still better than vanilla exploration or expert data.
- The agent generalizes beyond the maximum training tree depth: an agent trained with $D=3$ performs as well as one trained with $D=5$.
- Alternative reward schemes show that a negative reward variant performs similarly well, while a distance-based sum approach performs poorly, validating the discrete reachability approach.
- Inference is efficient, running at >200 Hz on an NVIDIA RTX 4090.
Practical Implications
DHP provides a practical alternative to distance-based hierarchical planning. Key takeaways for implementation include:
- Using binary reachability checks simplifies reward design and improves learning stability compared to distance metrics.
- The min-based tree return formulation effectively encourages shorter, successful plans.
- Bootstrapping with a value function is crucial for generalizing beyond fixed training horizons.
- Conditional VAEs (CSR modules) can effectively constrain the action space (subgoal prediction) for the manager.
- Targeted exploration rewarding novel transitions/paths (using CSR errors) generates more useful data than generic novelty rewards. Memory augmentation is needed for policies using time-lagged rewards.
- The approach works with self-generated data, removing the need for expert demonstrations.
- The resulting planner is efficient at inference time, suitable for real-time applications.