
Recursive Tree Planners

Updated 25 February 2026
  • Recursive Tree Planners (RTPs) form a hierarchical framework that combines explicit forward search with policy learning through recursive, Dijkstra-style tree traversal.
  • They interpolate between exhaustive planning and greedy execution, using generalized actions and imitation-based policy updates to improve solution speed and robustness.
  • Experimental results show significant gains in planning efficiency, zero-shot transfer, and adaptability across domains such as Box2D and MuJoCo.

A Recursive Tree Planner (RTP) is a hierarchical planning and control framework designed to unify forward search and policy optimization within a recursive, generalized Dijkstra-style search tree. RTPs interpolate between pure planning (full forward search) and pure greedy execution (policy-only), while supporting hierarchical composition, imitation-based policy learning, and robust zero-shot transfer by leveraging previously solved sub-tasks as generalized actions. The design integrates both classical planning and modern policy learning paradigms through iterative refinement and recursive structure (Redlich, 2024).

1. Formal Framework of Recursive Tree Planners

An RTP operates in a formal world $W = \{S, U, f, C_W\}$:

  • $S$: state space (finite or continuous)
  • $U(s) \subseteq U$: primitive action set available in state $s$
  • $f: U \times S \to S$: deterministic transition function
  • $C_W$: a family of problem classes sharing $S$, $U$, $f$

Each planning problem class $C \in C_W$ is given by $(r, S_G)$, where $r: U \times S \to \mathbb{R}$ is an immediate reward function and $S_G \subseteq S$ is the goal set. A plan $\pi$ is a sequence of $(u_i, s_i)$ pairs with $s_{i+1} = f(u_i, s_i)$ and cumulative reward $R(\pi) = \sum_i r(u_i, s_i)$. A policy $p(a|s)$ is a (possibly stochastic) mapping from states to actions, including higher-level "generalized actions" (GAs), $a \in U \cup \widetilde{A}$.
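As an illustration, the world and problem-class objects above can be written as plain data structures; all names here (`World`, `ProblemClass`, `cumulative_reward`) are hypothetical and not taken from any RTP implementation:

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

State = int      # stand-in for an element of S
Action = str     # stand-in for an element of U

@dataclass
class World:
    """W = {S, U, f, C_W}: U(s) and f, with S implicit in the State type."""
    actions: Callable[[State], Set[Action]]   # U(s) ⊆ U, actions legal in s
    f: Callable[[Action, State], State]       # deterministic transition f(u, s)

@dataclass
class ProblemClass:
    """A problem class C = (r, S_G) within C_W."""
    r: Callable[[Action, State], float]       # immediate reward r(u, s)
    goal: Callable[[State], bool]             # membership test for S_G

def cumulative_reward(problem: ProblemClass,
                      plan: List[Tuple[Action, State]]) -> float:
    """R(pi) = sum_i r(u_i, s_i) for a plan of (u_i, s_i) pairs."""
    return sum(problem.r(u, s) for u, s in plan)
```

Because $f$ is deterministic, a plan is fully determined by its action sequence and start state, which is what makes the Dijkstra-style search below applicable.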

RTP supports pure planning (an exhaustive search over $f$ and $r$ without any learned policy) and pure greedy execution, where $u = \arg\max_u p(u|s)$ at each time step. Between these extremes, RTP interpolates using near-greedy search guided by partial policies at all levels.

For each leaf or intermediate node $s$ in the search tree $T$, the planner expands the pair $(s^*, u^*) = \arg\max_{s \in T,\, u \in V(s)} p(u|s)$, where $V(s)$ is the set of unexpanded actions at $s$. In the absence of a policy, uniform or heuristic priorities are used.
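The near-greedy expansion rule can be sketched as a selection over the unexpanded frontier; the function name and data layout are illustrative assumptions, not the paper's implementation:

```python
def select_expansion(frontier, policy=None):
    """Near-greedy expansion: pick (s*, u*) maximizing p(u|s) over all
    unexpanded (node, action) pairs in the tree.

    frontier: dict mapping node id -> list of unexpanded actions V(s)
    policy:   callable p(s, u) -> float, or None (uniform fallback)
    """
    candidates = [(s, u) for s, acts in frontier.items() for u in acts]
    if policy is None:
        s, u = candidates[0]                      # uniform/heuristic fallback
    else:
        s, u = max(candidates, key=lambda su: policy(su[0], su[1]))
    frontier[s].remove(u)                         # u is removed from V(s)
    return s, u
```

Note that the maximization runs over the whole tree, not only the current node, which is what lets the search back off to less likely alternatives anywhere in $T$.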

2. Hierarchical and Recursive Decomposition

RTP is fundamentally recursive, supporting hierarchical planning by composing primitive and generalized actions at every level. Each planner invocation $\mathrm{RTP}(a_{\mathrm{init}}, s_{\mathrm{init}})$ operates at a level $\ell$ with parameters:

  • $A_\ell \subseteq U \cup \widetilde{A}_\ell$: allowed actions (primitive and GA)
  • $S_{g_\ell}$: local sub-goal set
  • $p_\ell(\cdot|\cdot)$: level-$\ell$ policy (learned or externally specified)
  • $\mathrm{init}_\ell(\cdot)$: parameterizes $A_\ell$ as a function of state
  • $\gamma_\ell$: resource budget (maximum depth, node expansions)

A GA $a \in \widetilde{A}_\ell$ is a reusable bundle $(A_{\ell-1}, S_{g_{\ell-1}}, p_{\ell-1}, \mathrm{init}_{\ell-1}, \gamma_{\ell-1})$, allowing recursive invocation of RTP on subproblems. At any level, the planner may intermix primitive steps with sub-searches (GAs), refine GAs by supplementing them with further primitive actions, and process multiple sub-goals or boundary states returned from lower levels.
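A hedged sketch of a generalized action as a parameter bundle, with a toy recursive invocation; the class and function names are hypothetical, and the planner's policy-guided, priority-ordered search is reduced here to a plain breadth-first sweep:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class GeneralizedAction:
    actions: List[Any]      # A_{l-1}: primitive actions and/or nested GAs
    is_goal: Callable       # membership test for the sub-goal set S_{g_{l-1}}
    policy: Any             # p_{l-1}(u|s); unused in this minimal sketch
    budget: int             # gamma_{l-1}: node-expansion budget

def rtp(action, state, step):
    """Primitives apply the transition directly; GAs launch a bounded
    sub-search and return the sub-goal states they reach."""
    if not isinstance(action, GeneralizedAction):
        return [step(action, state)]             # primitive: one successor
    reached, frontier = [], [state]
    for _ in range(action.budget):               # bounded by gamma_{l-1}
        if not frontier:
            break
        s = frontier.pop(0)
        for a in action.actions:
            for s2 in rtp(a, s, step):           # recurse over nested GAs
                (reached if action.is_goal(s2) else frontier).append(s2)
    return reached
```

Because a `GeneralizedAction` may itself contain `GeneralizedAction`s, the hierarchy nests to arbitrary depth, mirroring the recursive structure described above.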

3. Search Procedure and Node Expansion

The flat tree planner (TP) and its recursive extension (RTP) perform forward search by maintaining, for each node $s$, the cumulative reward $R(s)$ and an optional heuristic $h(s) \geq 0$, forming the priority $f(s) = c(s) + h(s)$ with $c(s) = -R(s)$. In practice, many RTP experiments set $h(s) = 0$ (uniform-cost search), but admissible heuristics are permitted.

At each expansion, the planner selects the pair $(s, u)$ maximizing $p(u|s)$, removes $u$ from $V(s)$, simulates the transition $s' = f(u, s)$, updates rewards, and possibly rewires ancestors if a higher-reward path is found. In RTP, $u$ may be a primitive or a generalized action; GAs trigger recursive calls that may return multiple sub-goals and boundary states, which are added to the tree for higher-level exploration.

Near-greedy enumeration ensures that the search first follows the greedy policy and then incrementally explores less likely alternatives, which becomes increasingly efficient as policies improve.
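The priority scheme above can be sketched as a best-first search over a heap; this is a minimal illustration (no ancestor rewiring, no duplicate detection, no learned policy), not the paper's implementation:

```python
import heapq

def tree_plan(start, actions, step, reward, is_goal,
              h=lambda s: 0.0, max_expansions=10_000):
    """Best-first forward search with priority f(s) = c(s) + h(s),
    c(s) = -R(s). With h = 0 this reduces to the uniform-cost search
    used in many RTP experiments."""
    # frontier entries: (priority, tie-breaker, state, R(s), plan so far)
    frontier = [(h(start), 0, start, 0.0, [])]
    tie = 1
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, s, R, plan = heapq.heappop(frontier)
        if is_goal(s):
            return plan, R
        for u in actions(s):
            s2 = step(u, s)
            R2 = R + reward(u, s)
            heapq.heappush(frontier, (-R2 + h(s2), tie, s2, R2, plan + [(u, s)]))
            tie += 1
    return None, float("-inf")
```

The negated cumulative reward plays the role of Dijkstra's path cost, so the node popped first is always the one on the currently best-rewarded partial plan.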

4. Policy Learning and Virtuous Cycle

Policy learning in RTP hinges on a plan–learn–plan loop (PLP). The process involves:

  1. Running RTP (with current policies $P^{(k)}$) to solve multiple training instances.
  2. Logging all $(s, a)$ pairs encountered in each call, at all levels of recursion.
  3. Training $p(a|s)$ using cross-entropy or, alternatively, a Gumbel-Softmax policy network with MSE loss on the one-hot action index.

The process iterates: policies are initialized randomly or uniformly, then improved through imitation of planner-generated solutions. As sub-policies become better, planning speed and solution quality increase, eventually enabling greedy execution (policy-only) for tractable instances.
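One PLP iteration can be sketched in tabular form, replacing the cross-entropy network update with empirical action frequencies; all names here are illustrative:

```python
from collections import Counter, defaultdict

def plp_iteration(instances, plan_with, policy_table):
    """One plan-learn-plan (PLP) step: run the planner with the current
    policy, log the (s, a) pairs from its solutions, and refit p(a|s) as
    empirical action frequencies (a tabular stand-in for the paper's
    cross-entropy network update)."""
    counts = defaultdict(Counter)
    for inst in instances:
        # plan_with is assumed to return the (s, a) trace of the solution,
        # including pairs logged at every level of recursion
        for s, a in plan_with(inst, policy_table):
            counts[s][a] += 1
    return {s: {a: n / sum(c.values()) for a, n in c.items()}
            for s, c in counts.items()}
```

Iterating this loop tightens the virtuous cycle: a better `policy_table` makes the planner's near-greedy search faster, which yields more solved instances to imitate.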

5. Sub-goal and Boundary-State Generation

Each generalized action operates with its own goal set $S_g$. When $\mathrm{RTP}(a(g), s)$ reaches any $s' \in S_g$, all such sub-goals are returned to the parent. If infeasible transitions (e.g., into forbidden states) are detected, the planner records the last valid state as a boundary state and backtracks, allowing the higher-level planner to expand around newly identified frontiers. Returning multiple sub-goals and boundary states at every level improves coverage and search flexibility.
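A minimal sketch of a sub-call that returns both reached sub-goals and boundary states, assuming simple callbacks for the goal and forbidden-state tests (all names are hypothetical):

```python
def sub_search(start, actions, step, is_goal, is_forbidden, budget):
    """Sketch of a generalized-action sub-call that returns BOTH the reached
    sub-goals and the boundary states (the last valid state before each
    detected infeasible transition)."""
    goals, boundary, frontier = [], [], [start]
    for _ in range(budget):
        if not frontier:
            break
        s = frontier.pop(0)
        for u in actions(s):
            s2 = step(u, s)
            if is_forbidden(s2):
                boundary.append(s)       # record last valid state, backtrack
            elif is_goal(s2):
                goals.append(s2)         # sub-goal reached: return to parent
            else:
                frontier.append(s2)
    return goals, boundary
```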

6. Invariance Mechanisms and Policy Composition

RTP enables architectural invariances that generalize policies across object count and clutter:

  • Background invariance: a filter $f(s, n) \to s_f$ is applied to state inputs prior to policy evaluation, making policies for actions like $\mathrm{grab}(n)$ invariant to distractor clutter.
  • Object-number invariance: states and actions are encoded as 2D image grids $I(x, y)$ and policies are learned over softmax heatmaps; this generalizes a policy trained on $N$ objects to arbitrary object counts.
  • Policy splitting: for compositional tasks (e.g., "grab" then "place"), two networks are trained: $p_{\text{task}}(j|s)$ for the task and $p_j(k_j|s)$ for the subtask entity (e.g., which object), combined as $p(j, k_j|s) = p_{\text{task}}(j|s)\, p_j(k_j|s) / Z(j)$.

These mechanisms enable compact and adaptable sub-policies, facilitating transfer across different task complexities and layouts.
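The policy-splitting factorization can be illustrated directly; with both factors normalized by a softmax, the per-task constant $Z(j)$ is 1 in this toy version, and all names and shapes are illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def factored_policy(task_logits, entity_logits):
    """Policy splitting: p(j, k_j | s) = p_task(j|s) * p_j(k_j|s).

    task_logits:   scores over tasks j (e.g., grab vs. place)
    entity_logits: per-task rows of scores over entities k_j (which object)
    Returns the joint distribution as a list of rows, one per task."""
    p_task = softmax(task_logits)
    return [[pj * pk for pk in softmax(row)]
            for pj, row in zip(p_task, entity_logits)]
```

Factoring the policy this way lets the entity-selection head be swapped or retrained per task without disturbing the task-selection head.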

7. Training Regimes, Transfer, and Empirical Results

Policies can be trained in parallel (batch learning at all hierarchical levels) or staged (bottom-up, low-level first); both approaches are empirically supported. RTP achieves zero-shot transfer: policies $P^{(k)}$ trained on class $C_i$ can be reused on a novel class $C_j$, as the hierarchy and boundary-state logic allow real-time adaptation to new obstacles or tasks without retraining.

Experimental evaluation on Box2D (block stacking, grab/place, lunar lander) and MuJoCo (inverted pendulum) domains demonstrates:

  • Exact or rapidly converging solution quality as policies bootstrap from pure planning to pure greedy execution.
  • Orders-of-magnitude reduction in planning time from hierarchical structuring and near-greedy search.
  • Natural zero-shot transfer between task variants (e.g., stacking more blocks, changing obstacles).
  • Robustness and generalization via background and object-number invariant policy architectures.

Representative Experimental Results

| Task | Planner | Policy? | Accuracy | Time (s) |
|------|---------|---------|----------|----------|
| Place 3 boxes (no obstacles) | Flat | No | 15/25 | 64.0 |
| Place 3 boxes | Flat | Yes | 25/25 | 0.4 |
| Place 9 boxes w/ obstacles | Flat w/ 3-box policy | Yes | 25/25 | 1.2 |
| Stack 5 boxes | 2-level (L1+L2) | Yes | 10/10 | 7.2 |
| Inverted pendulum (300 ex.) | 2-level RTP | Yes | 10/10 | 2.5 |

In the "zero-shot" regime, RTP trained on one variant (e.g., 3-box stack) successfully generalized to harder variants (e.g., 5-box stack) with minimal degradation in efficiency or success rate. Four-Rooms and Lunar Lander tasks further confirmed these capabilities (Redlich, 2024).

8. Connections to Differentiable Recursive Tree Models

TreeQN and ATreeC instantiate RTPs within deep RL via end-to-end differentiable tree expansions in abstract latent spaces. The components comprise:

  • State encoder $\phi: \mathcal{S} \to \mathbb{R}^d$
  • Action-conditioned transition model $f(\phi, a)$
  • Immediate reward predictor $\hat{r}(\phi, a)$
  • Leaf value head $V(\phi)$

The tree expansion is fully recursive and aggregates predicted rewards and values:

$$Q^{(D)}(\phi_0, a) = \hat{r}(\phi_0, a) + \gamma \max_{a'} Q^{(D-1)}\big(f(\phi_0, a), a'\big),$$

with the recursion grounded at the leaves by the value head $V(\phi)$.

ATreeC converts root $Q$-values into a policy via softmax, and all parameters are trained with RL objectives ($n$-step $Q$-learning, actor-critic losses) plus an auxiliary reward-prediction loss (Farquhar et al., 2017).
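The depth-$D$ backup can be written as a short recursion over stand-in model components; the function arguments are placeholders for the learned encoder, transition, reward, and value networks, not TreeQN's actual API:

```python
def tree_q(phi, a, depth, trans, rew, value, actions, gamma=0.99):
    """Recursive TreeQN-style backup:
        Q^(D)(phi, a) = r_hat(phi, a) + gamma * max_a' Q^(D-1)(f(phi, a), a'),
    grounded at the leaves by the value head: Q^(1) = r_hat + gamma * V.
    trans/rew/value stand in for the learned transition model, reward
    predictor, and value head operating on latent states phi."""
    phi2 = trans(phi, a)
    if depth == 1:
        return rew(phi, a) + gamma * value(phi2)
    return rew(phi, a) + gamma * max(
        tree_q(phi2, a2, depth - 1, trans, rew, value, actions, gamma)
        for a2 in actions)
```

In TreeQN every call here is differentiable, so gradients from the RL loss flow through the entire expanded tree.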

Distinctive aspects of these differentiable RTPs include:

  • Abstract latent-space planning (not observation or real state)
  • Gradients propagated through entire planning tree
  • Elimination of rollout mismatch between model learning and planning
  • Explicit horizon/capacity trade-off enforced by recursive depth

Empirical evaluations of TreeQN/ATreeC show consistent improvements over n-step DQN/A2C baselines across a range of discrete-action benchmarks, confirming the efficacy of recursive tree structures in deep RL (Farquhar et al., 2017).


Through their combination of explicit hierarchical planning, recursive invocation, policy learning, and invariance mechanisms, Recursive Tree Planners constitute a general framework that subsumes both classic forward search and policy-based control, enabling scalable planning, rapid transfer, and efficient operation across diverse domains (Redlich, 2024; Farquhar et al., 2017).
