Recursive Tree Planners
- Recursive Tree Planners (RTPs) form a hierarchical framework that combines explicit forward search with policy learning through recursive, Dijkstra-style tree search.
- They interpolate between exhaustive planning and greedy execution, using generalized actions and imitation-based policy updates to improve solution speed and robustness.
- Experimental results show significant gains in planning efficiency, zero-shot transfer, and adaptability across varied domains such as Box2D and MuJoCo.
A Recursive Tree Planner (RTP) is a hierarchical planning and control framework designed to unify forward search and policy optimization within a recursive, generalized Dijkstra-style search tree. RTPs interpolate between pure planning (full forward search) and pure greedy execution (policy-only), while supporting hierarchical composition, imitation-based policy learning, and robust zero-shot transfer by leveraging previously solved sub-tasks as generalized actions. The design integrates both classical planning and modern policy learning paradigms through iterative refinement and recursive structure (Redlich, 2024).
1. Formal Framework of Recursive Tree Planners
An RTP operates in a formal world model $W = (S, A, T, \mathcal{C})$:
- $S$: state space (finite or continuous)
- $A(s)$: primitive action set in state $s$
- $T : S \times A \to S$: deterministic transition function
- $\mathcal{C}$: a family of problem classes sharing $(S, A, T)$
Each planning problem class $C \in \mathcal{C}$ is given by $C = (R, G)$, where $R(s, a)$ is an immediate reward function and $G \subseteq S$ is the goal set. A plan is a sequence of pairs $(s_t, a_t)$ with $s_{t+1} = T(s_t, a_t)$, and cumulative reward $\sum_t R(s_t, a_t)$. A policy $\pi(a \mid s)$ is a (possibly stochastic) mapping from states to actions, including higher-level "generalized actions" (GAs).
RTP supports pure planning—an exhaustive search over $S$ and $A(s)$ without any learned policy—and pure greedy execution, where $a_t = \arg\max_a \pi(a \mid s_t)$ at each time step. Between these extremes, RTP interpolates using near-greedy search, guided by partial policies at all levels.
For each leaf or intermediate search node $n$ in the tree, the planner expands pairs $(n, a)$ with $a \in U(n)$, where $U(n)$ is the set of unexpanded actions at $n$. In the absence of a policy, uniform or heuristic priorities are used.
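The formal objects above can be sketched as plain data structures. This is a minimal illustration; the names (`World`, `ProblemClass`, `cumulative_reward`) and the integer/string state and action types are assumptions for the example, not part of the source framework:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = int   # placeholder state type; any hashable type works
Action = str  # placeholder action type

@dataclass(frozen=True)
class World:
    """World model W = (S, A, T): per-state actions and a deterministic transition."""
    actions: Callable[[State], List[Action]]      # A(s)
    transition: Callable[[State, Action], State]  # T(s, a)

@dataclass(frozen=True)
class ProblemClass:
    """Problem class C = (R, G) over a shared world model."""
    reward: Callable[[State, Action], float]      # R(s, a)
    goal: Callable[[State], bool]                 # membership test for G

def cumulative_reward(problem: ProblemClass,
                      plan: List[Tuple[State, Action]]) -> float:
    """Cumulative reward of a plan: sum of R(s_t, a_t)."""
    return sum(problem.reward(s, a) for s, a in plan)
```

A toy instance on the integer line (actions "+1"/"-1", unit step cost) then has cumulative reward equal to minus the plan length.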
2. Hierarchical and Recursive Decomposition
RTP is fundamentally recursive, supporting hierarchical planning by composing primitive and generalized actions at every level. Each planner invocation operates at a level $\ell$ with parameters:
- $A_\ell$: allowed actions (primitive and GA)
- $G_\ell$: local sub-goal set
- $\pi_\ell$: level-$\ell$ policy (learned or externally specified)
- $g_\ell$: parameterizes $G_\ell$ as a function of state
- $B_\ell$: resource budget (max depth, node expansions)
A GA is a reusable bundle $(A_\ell, G_\ell, \pi_\ell, B_\ell)$, allowing recursive invocation of RTP for subproblems. At any level, the planner may intermix primitive steps with sub-searches (GAs), refine GAs by supplementing them with further primitive actions, and process multiple sub-goals or boundary states returned from lower levels.
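The GA bundle and its recursive invocation can be sketched as follows. This is a hypothetical illustration of the control flow only: `GeneralizedAction`, `apply_action`, and the toy `primitive_step` are invented names, and the recursive planner is abstracted as a callable `plan_fn`:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

State = int
Action = str

def primitive_step(state: State, act: Action) -> State:
    # Toy deterministic transition on an integer state (illustrative only).
    return state + (1 if act == "+1" else -1)

@dataclass
class GeneralizedAction:
    """A GA bundles its own action set, sub-goal test, policy, and budget;
    executing it means recursively invoking the planner on the sub-problem."""
    actions: List[object]                       # primitives and/or nested GAs
    is_subgoal: Callable[[State], bool]         # local goal set G_l
    policy: Optional[Callable]                  # level-l policy (may be None)
    budget: int                                 # max expansions for the sub-search

def apply_action(plan_fn, state: State, act) -> List[State]:
    """Primitive actions yield one successor; GAs recursively call the
    planner and may return several sub-goal / boundary states."""
    if isinstance(act, GeneralizedAction):
        return plan_fn(state, act.actions, act.is_subgoal,
                       act.policy, act.budget)
    return [primitive_step(state, act)]
```

The key point the sketch captures is asymmetry of return values: a primitive step returns one state, while a GA sub-search may hand several states back up to the parent tree.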
3. Near-Greedy Generalized Dijkstra Search
The flat tree planner (TP) and its recursive extension (RTP) perform forward search by maintaining, for each node $n$, the cumulative reward $r(n)$ and an optional heuristic $h(n)$, forming the priority $p(n) = r(n) + h(n)$. In practice, many RTP experiments set $h = 0$ (uniform-cost search), but admissible heuristics are permitted.
At each expansion, the planner selects the next pair $(n, a)$ maximizing $p(n)$, removes $a$ from $U(n)$, simulates the transition $s' = T(s_n, a)$, updates rewards, and possibly rewires ancestors if a higher-reward path is found. In RTP, $a$ may be a primitive or generalized action; GAs trigger recursive calls, possibly returning multiple sub-goals and boundary states, which are added to the tree for higher-level exploration.
Near-greedy enumeration ensures that search first follows the greedy policy, incrementally exploring less likely alternatives, yielding efficiency as policies improve.
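The search loop above can be sketched as a flat planner over a max-priority queue. This is a minimal reading of the scheme, not the reference implementation: ancestor rewiring is omitted, the policy enters only as an additive priority bonus (so the greedy action is expanded first), and all names are illustrative:

```python
import heapq
from typing import Dict

def tree_plan(start, actions, transition, reward, is_goal,
              policy=None, heuristic=None, max_expansions=10_000):
    """Near-greedy generalized Dijkstra search (flat TP sketch).
    Expands (node, action) pairs in order of priority p = r + h + policy
    bonus, maximizing cumulative reward; returns the plan to the first
    goal reached, or None if the budget is exhausted."""
    if is_goal(start):
        return []
    h = heuristic or (lambda s: 0.0)
    frontier = []   # entries: (-priority, tiebreak, state, cum_reward, plan, action)
    tie = 0
    best: Dict[object, float] = {start: 0.0}  # best cumulative reward per state

    def push(s, r, plan):
        nonlocal tie
        for a in actions(s):
            bonus = policy(s, a) if policy else 0.0
            heapq.heappush(frontier, (-(r + h(s) + bonus), tie, s, r, plan, a))
            tie += 1

    push(start, 0.0, [])
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, s, r, plan, a = heapq.heappop(frontier)  # next (n, a) maximizing p
        expansions += 1
        s2 = transition(s, a)
        r2 = r + reward(s, a)
        if best.get(s2, float("-inf")) >= r2:
            continue  # already reached s2 with at least this reward
        best[s2] = r2
        plan2 = plan + [(s, a)]
        if is_goal(s2):
            return plan2
        push(s2, r2, plan2)
    return None
```

With $h = 0$ and uniform step cost this behaves as uniform-cost search; a learned policy reorders expansions toward the greedy trajectory without changing which plans are reachable.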
4. Policy Learning and Virtuous Cycle
Policy learning in RTP hinges on a plan–learn–plan loop (PLP). The process involves:
- Running RTP (with the current policies $\pi_\ell$) to solve multiple training instances.
- Logging all $(s, a)$ pairs encountered in each call, including at all levels of recursion.
- Training each $\pi_\ell$ using cross-entropy loss or, alternatively, a Gumbel-Softmax policy network with MSE on the one-hot action index.
The process iterates: policies are initialized randomly or uniformly, then improved through imitation of planner-generated solutions. As sub-policies become better, planning speed and solution quality increase, eventually enabling greedy execution (policy-only) for tractable instances.
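The PLP loop can be sketched with a tabular policy standing in for the cross-entropy-trained network (empirical action frequencies are exactly what cross-entropy imitation converges to in the tabular case). `plan_learn_plan` and its interface are illustrative assumptions:

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

def plan_learn_plan(instances, run_planner, iterations=3):
    """Plan-learn-plan (PLP) loop sketch with a tabular imitation policy.
    run_planner(start, policy) must return a plan as a list of (s, a) pairs."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)

    def policy(state, action) -> float:
        # Empirical action frequency at this state (0.0 if unseen).
        c = counts[state]
        total = sum(c.values())
        return c[action] / total if total else 0.0

    for _ in range(iterations):
        demos: List[Tuple[Hashable, str]] = []
        for start in instances:
            plan = run_planner(start, policy)   # plan with current policy
            if plan:
                demos.extend(plan)              # log all (s, a) pairs
        for s, a in demos:                      # imitate planner solutions
            counts[s][a] += 1
    return policy
```

Each iteration both uses the improving policy (faster planning) and refits it on fresh planner solutions, which is the virtuous cycle the text describes.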
5. Sub-goal and Boundary-State Generation
Each generalized action operates with its own goal set $G_\ell$. When RTP solves for any $g \in G_\ell$, all such sub-goals are returned to the parent. If infeasible transitions (e.g., to forbidden states) are detected, the planner records the last valid state as a boundary state and backtracks, affording the higher-level planner the ability to expand around newly identified frontiers. The return of multiple sub-goals and boundary states at every level enhances coverage efficiency and search flexibility.
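The sub-goal/boundary-state mechanism can be sketched as follows. The depth-first strategy and all names here are illustrative assumptions; the point is the return signature, which hands both reached sub-goals and boundary states back to the parent:

```python
def sub_search(start, actions, transition, is_valid, is_subgoal, budget=100):
    """Sub-search sketch: collects every sub-goal reached and, whenever a
    transition would land in a forbidden state, records the last valid
    state as a boundary state for the parent planner to expand around."""
    subgoals, boundaries = [], []
    stack, seen, steps = [start], {start}, 0
    while stack and steps < budget:
        s = stack.pop()
        steps += 1
        if is_subgoal(s):
            subgoals.append(s)   # return to parent; do not expand further
            continue
        for a in actions(s):
            s2 = transition(s, a)
            if not is_valid(s2):
                boundaries.append(s)   # last valid state before the frontier
            elif s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return subgoals, boundaries
```

On a blocked toy instance the search finds no sub-goal but still returns the frontier state, which is exactly what lets the higher level replan around the obstacle.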
6. Invariance Mechanisms and Policy Composition
RTP enables architectural invariances that generalize policies across object count and clutter:
- Background invariance: Applied by filtering state inputs prior to policy evaluation, making policies for actions like "grab" invariant to distractor clutter.
- Object-number invariance: Achieved by encoding states and actions via 2D image grids and learning policies over softmax heatmaps; this generalizes a policy trained on a fixed number of objects to arbitrary object counts.
- Policy splitting: For compositional tasks (e.g., "grab" then "place"), two networks are trained: $\pi_{\text{task}}$ for the task and $\pi_{\text{entity}}$ for the subtask entity (e.g., which object), with $\pi = \pi_{\text{task}} \cdot \pi_{\text{entity}}$.
These mechanisms enable compact and adaptable sub-policies, facilitating transfer across different task complexities and layouts.
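The policy-splitting factorization can be sketched numerically. The logit tables below stand in for the two networks, and all names are illustrative; the sketch only demonstrates that the product of two softmax distributions is itself a valid joint distribution over (task, entity) pairs:

```python
import math
from typing import Dict, List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def factored_policy(task_logits: Dict[str, float],
                    entity_logits: Dict[str, Dict[str, float]]):
    """Policy splitting: P(task, entity | s) =
    P_task(task | s) * P_entity(entity | s, task)."""
    tasks = list(task_logits)
    p_task = dict(zip(tasks, softmax([task_logits[t] for t in tasks])))
    joint = {}
    for t in tasks:
        ents = list(entity_logits[t])
        p_ent = dict(zip(ents, softmax([entity_logits[t][e] for e in ents])))
        for e in ents:
            joint[(t, e)] = p_task[t] * p_ent[e]
    return joint
```

Because the entity head is shared in structure across tasks, only the small task head grows with the number of compositional stages, which is what makes these sub-policies compact.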
7. Training Regimes, Transfer, and Empirical Results
Policies can be trained in parallel (batch learning at all hierarchical levels) or staged (bottom-up, low-level first); both approaches are empirically supported. RTP achieves zero-shot transfer: policies trained on a class $C$ can be reused for a novel class $C'$, as the hierarchy and boundary-state logic allow real-time adaptation to new obstacles or tasks without retraining.
Experimental evaluation on Box2D (block stacking, grab/place, lunar lander) and MuJoCo (inverted pendulum) domains demonstrates:
- Exact or rapidly converging solution quality as policies bootstrap from pure planning to pure greedy execution.
- Orders-of-magnitude reduction in planning time from hierarchical structuring and near-greedy search.
- Natural zero-shot transfer between task variants (e.g., stacking more blocks, changing obstacles).
- Robustness and generalization via background and object-number invariant policy architectures.
Representative Experimental Results
| Task | Planner | Policy? | Accuracy | Time (s) |
|---|---|---|---|---|
| Place 3 boxes (no obs) | Flat | No | 15/25 | 64.0 |
| Place 3 boxes | Flat | Yes | 25/25 | 0.4 |
| Place 9 boxes w/ obs | Flat w/ 3-box Policy | Yes | 25/25 | 1.2 |
| Stack 5 boxes | 2-Level (L1+L2) | Yes | 10/10 | 7.2 |
| Inverted Pendulum (300 ex) | 2-Level RTP | Yes | 10/10 | 2.5 |
In the "zero-shot" regime, RTP trained on one variant (e.g., 3-box stack) successfully generalized to harder variants (e.g., 5-box stack) with minimal degradation in efficiency or success rate. Four-Rooms and Lunar Lander tasks further confirmed these capabilities (Redlich, 2024).
8. Connections to Differentiable Recursive Tree Models
TreeQN and ATreeC instantiate RTPs within deep RL via end-to-end differentiable tree expansions in abstract latent spaces. The components comprise:
- State encoder $z = f_{\text{enc}}(x)$
- Action-conditioned transition model $z_a = f_{\text{trans}}(z, a)$
- Immediate reward predictor $\hat{r}(z, a)$
- Leaf value head $V(z)$
The tree expansion is fully recursive and aggregates predicted rewards and values:

$$Q(z, a) = \hat{r}(z, a) + \gamma \, b(z_a), \qquad b(z) = (1 - \lambda)\, V(z) + \lambda \max_{a'} Q(z, a'),$$

with $b(z) = V(z)$ at the leaves.
ATreeC converts root $Q$-values to policies via a softmax, and all parameters are trained via RL objectives ($n$-step $Q$-learning, actor-critic losses), plus an auxiliary reward-prediction loss (Farquhar et al., 2017).
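The recursive backup can be sketched in plain Python (a non-differentiable stand-in for the latent-space modules; the $\lambda$ value-mixing follows the backup described above, and all function names are assumptions):

```python
from typing import Callable, List

def tree_backup(z, depth: int,
                transition: Callable, reward: Callable,
                value: Callable, num_actions: int,
                gamma: float = 0.99, lam: float = 0.8) -> List[float]:
    """TreeQN-style recursive tree backup sketch: expand every action to
    `depth`, then mix learned leaf values with backed-up Q-values via lam."""
    q = []
    for a in range(num_actions):
        z2 = transition(z, a)          # predicted latent successor z_a
        r = reward(z, a)               # predicted immediate reward r_hat(z, a)
        if depth <= 1:
            backup = value(z2)         # leaf: value head only
        else:
            child_q = tree_backup(z2, depth - 1, transition, reward,
                                  value, num_actions, gamma, lam)
            backup = (1 - lam) * value(z2) + lam * max(child_q)
        q.append(r + gamma * backup)
    return q
```

In the actual models this whole recursion is built from differentiable network modules, so gradients flow through every expanded branch, which is the property the bullet points below emphasize.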
Distinctive aspects of these differentiable RTPs include:
- Abstract latent-space planning (not observation or real state)
- Gradients propagated through entire planning tree
- Elimination of rollout mismatch between model learning and planning
- Explicit horizon/capacity trade-off enforced by recursive depth
Empirical evaluations of TreeQN/ATreeC show consistent improvements over n-step DQN/A2C baselines across a range of discrete-action benchmarks, confirming the efficacy of recursive tree structures in deep RL (Farquhar et al., 2017).
Recursive Tree Planners, through their combined explicit hierarchical planning, recursive invocation, policy learning, and invariance mechanisms, constitute a general framework that subsumes both classic forward search and policy-based control, enabling scalable planning, rapid transfer, and efficient operation across diverse domains (Redlich, 2024, Farquhar et al., 2017).