Recursive Tree Planners
- Recursive Tree Planners (RTPs) form a hierarchical framework that combines explicit forward search with policy learning through recursive, Dijkstra-style tree search.
- They interpolate between exhaustive planning and greedy execution, using generalized actions and imitation-based policy updates to improve solution speed and robustness.
- Experimental results show significant gains in planning efficiency, zero-shot transfer, and adaptability across varied domains such as Box2D and MuJoCo.
A Recursive Tree Planner (RTP) is a hierarchical planning and control framework designed to unify forward search and policy optimization within a recursive, generalized Dijkstra-style search tree. RTPs interpolate between pure planning (full forward search) and pure greedy execution (policy-only), while supporting hierarchical composition, imitation-based policy learning, and robust zero-shot transfer by leveraging previously solved sub-tasks as generalized actions. The design integrates both classical planning and modern policy learning paradigms through iterative refinement and recursive structure (Redlich, 2024).
1. Formal Framework of Recursive Tree Planners
An RTP operates in a formal world model $W = (S, A, T, \mathcal{C})$:
- $S$: state space (finite or continuous)
- $A(s)$: primitive action set in state $s$
- $T : S \times A \to S$: deterministic transition function
- $\mathcal{C}$: a family of problem classes sharing $(S, A, T)$
Each planning problem class $C \in \mathcal{C}$ is given by $C = (R, G)$, where $R(s, a)$ is an immediate reward function and $G \subseteq S$ is the goal set. A plan is a sequence of pairs $(s_t, a_t)$ with $s_{t+1} = T(s_t, a_t)$, and cumulative reward $\sum_t R(s_t, a_t)$. A policy $\pi(a \mid s)$ is a (possibly stochastic) mapping from states to actions, including higher-level "generalized actions" (GAs).
RTP supports pure planning—an exhaustive search over $S$ and $A(s)$ without any learned policy—and pure greedy execution, where $a_t = \arg\max_a \pi(a \mid s_t)$ at each time step. Between these extremes, RTP interpolates using near-greedy search, guided by partial policies at all levels.
For each leaf or intermediate search node $n$ in the tree, the planner expands pairs $(n, a)$ with $a \in U(n)$, where $U(n)$ is the set of unexpanded actions at $n$. In the absence of a policy, uniform or heuristic priorities are used.
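The formal objects above can be sketched as plain data structures. This is a minimal illustration; the names (`World`, `ProblemClass`, `cumulative_reward`) and the integer/string state and action types are assumptions for the example, not part of the source framework:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = int   # placeholder state type; any hashable type works
Action = str  # placeholder action type

@dataclass(frozen=True)
class World:
    """World model W = (S, A, T): per-state actions and a deterministic transition."""
    actions: Callable[[State], List[Action]]      # A(s)
    transition: Callable[[State, Action], State]  # T(s, a)

@dataclass(frozen=True)
class ProblemClass:
    """Problem class C = (R, G) over a shared world model."""
    reward: Callable[[State, Action], float]      # R(s, a)
    goal: Callable[[State], bool]                 # membership test for G

def cumulative_reward(problem: ProblemClass,
                      plan: List[Tuple[State, Action]]) -> float:
    """Cumulative reward of a plan: sum of R(s_t, a_t)."""
    return sum(problem.reward(s, a) for s, a in plan)
```

A toy instance on the integer line (actions "+1"/"-1", unit step cost) then has cumulative reward equal to minus the plan length.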
2. Hierarchical and Recursive Decomposition
RTP is fundamentally recursive, supporting hierarchical planning by composing primitive and generalized actions at every level. Each planner invocation operates at a level $\ell$ with parameters:
- $A_\ell$: allowed actions (primitive and GA)
- $G_\ell$: local sub-goal set
- $\pi_\ell$: level-$\ell$ policy (learned or externally specified)
- $g_\ell$: parameterizes $G_\ell$ as a function of state
- $B_\ell$: resource budget (max depth, node expansions)
A GA is a reusable bundle $(A_\ell, G_\ell, \pi_\ell, B_\ell)$, allowing recursive invocation of RTP for subproblems. At any level, the planner may intermix primitive steps with sub-searches (GAs), refine GAs by supplementing them with further primitive actions, and process multiple sub-goals or boundary states returned from lower levels.
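The GA bundle and its recursive invocation can be sketched as follows. This is a hypothetical illustration of the control flow only: `GeneralizedAction`, `apply_action`, and the toy `primitive_step` are invented names, and the recursive planner is abstracted as a callable `plan_fn`:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

State = int
Action = str

def primitive_step(state: State, act: Action) -> State:
    # Toy deterministic transition on an integer state (illustrative only).
    return state + (1 if act == "+1" else -1)

@dataclass
class GeneralizedAction:
    """A GA bundles its own action set, sub-goal test, policy, and budget;
    executing it means recursively invoking the planner on the sub-problem."""
    actions: List[object]                       # primitives and/or nested GAs
    is_subgoal: Callable[[State], bool]         # local goal set G_l
    policy: Optional[Callable]                  # level-l policy (may be None)
    budget: int                                 # max expansions for the sub-search

def apply_action(plan_fn, state: State, act) -> List[State]:
    """Primitive actions yield one successor; GAs recursively call the
    planner and may return several sub-goal / boundary states."""
    if isinstance(act, GeneralizedAction):
        return plan_fn(state, act.actions, act.is_subgoal,
                       act.policy, act.budget)
    return [primitive_step(state, act)]
```

The key point the sketch captures is asymmetry of return values: a primitive step returns one state, while a GA sub-search may hand several states back up to the parent tree.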
3. Near-Greedy Generalized Dijkstra Search
The flat tree planner (TP) and its recursive extension (RTP) perform forward search by maintaining, for each node $n$, the cumulative reward $r(n)$ and an optional heuristic $h(n)$, forming the priority $p(n) = r(n) + h(n)$. In practice, many RTP experiments set $h = 0$ (uniform-cost search), but admissible heuristics are permitted.
At each expansion, the planner selects the next pair $(n, a)$ maximizing $p(n)$, removes $a$ from $U(n)$, simulates the transition $s' = T(s_n, a)$, updates rewards, and possibly rewires ancestors if a higher-reward path is found. In RTP, $a$ may be a primitive or generalized action; GAs trigger recursive calls, possibly returning multiple sub-goals and boundary states, which are added to the tree for higher-level exploration.
Near-greedy enumeration ensures that search first follows the greedy policy, incrementally exploring less likely alternatives, yielding efficiency as policies improve.
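The search loop above can be sketched as a flat planner over a max-priority queue. This is a minimal reading of the scheme, not the reference implementation: ancestor rewiring is omitted, the policy enters only as an additive priority bonus (so the greedy action is expanded first), and all names are illustrative:

```python
import heapq
from typing import Dict

def tree_plan(start, actions, transition, reward, is_goal,
              policy=None, heuristic=None, max_expansions=10_000):
    """Near-greedy generalized Dijkstra search (flat TP sketch).
    Expands (node, action) pairs in order of priority p = r + h + policy
    bonus, maximizing cumulative reward; returns the plan to the first
    goal reached, or None if the budget is exhausted."""
    if is_goal(start):
        return []
    h = heuristic or (lambda s: 0.0)
    frontier = []   # entries: (-priority, tiebreak, state, cum_reward, plan, action)
    tie = 0
    best: Dict[object, float] = {start: 0.0}  # best cumulative reward per state

    def push(s, r, plan):
        nonlocal tie
        for a in actions(s):
            bonus = policy(s, a) if policy else 0.0
            heapq.heappush(frontier, (-(r + h(s) + bonus), tie, s, r, plan, a))
            tie += 1

    push(start, 0.0, [])
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, s, r, plan, a = heapq.heappop(frontier)  # next (n, a) maximizing p
        expansions += 1
        s2 = transition(s, a)
        r2 = r + reward(s, a)
        if best.get(s2, float("-inf")) >= r2:
            continue  # already reached s2 with at least this reward
        best[s2] = r2
        plan2 = plan + [(s, a)]
        if is_goal(s2):
            return plan2
        push(s2, r2, plan2)
    return None
```

With $h = 0$ and uniform step cost this behaves as uniform-cost search; a learned policy reorders expansions toward the greedy trajectory without changing which plans are reachable.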
4. Policy Learning and Virtuous Cycle
Policy learning in RTP hinges on a plan–learn–plan loop (PLP). The process involves:
- Running RTP (with the current policies $\pi_\ell$) to solve multiple training instances.
- Logging all $(s, a)$ pairs encountered in each call, including at all levels of recursion.
- Training each $\pi_\ell$ using cross-entropy loss or, alternatively, a Gumbel-Softmax policy network with MSE on the one-hot action index.
The process iterates: policies are initialized randomly or uniformly, then improved through imitation of planner-generated solutions. As sub-policies become better, planning speed and solution quality increase, eventually enabling greedy execution (policy-only) for tractable instances.
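The PLP loop can be sketched with a tabular policy standing in for the cross-entropy-trained network (empirical action frequencies are exactly what cross-entropy imitation converges to in the tabular case). `plan_learn_plan` and its interface are illustrative assumptions:

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

def plan_learn_plan(instances, run_planner, iterations=3):
    """Plan-learn-plan (PLP) loop sketch with a tabular imitation policy.
    run_planner(start, policy) must return a plan as a list of (s, a) pairs."""
    counts: Dict[Hashable, Counter] = defaultdict(Counter)

    def policy(state, action) -> float:
        # Empirical action frequency at this state (0.0 if unseen).
        c = counts[state]
        total = sum(c.values())
        return c[action] / total if total else 0.0

    for _ in range(iterations):
        demos: List[Tuple[Hashable, str]] = []
        for start in instances:
            plan = run_planner(start, policy)   # plan with current policy
            if plan:
                demos.extend(plan)              # log all (s, a) pairs
        for s, a in demos:                      # imitate planner solutions
            counts[s][a] += 1
    return policy
```

Each iteration both uses the improving policy (faster planning) and refits it on fresh planner solutions, which is the virtuous cycle the text describes.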
5. Sub-goal and Boundary-State Generation
Each generalized action operates with its own goal set $G_\ell$. When RTP solves for any $g \in G_\ell$, all such sub-goals are returned to the parent. If infeasible transitions (e.g., to forbidden states) are detected, the planner records the last valid state as a boundary state and backtracks, affording the higher-level planner the ability to expand around newly identified frontiers. The return of multiple sub-goals and boundary states at every level enhances coverage efficiency and search flexibility.
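The sub-goal/boundary-state mechanism can be sketched as follows. The depth-first strategy and all names here are illustrative assumptions; the point is the return signature, which hands both reached sub-goals and boundary states back to the parent:

```python
def sub_search(start, actions, transition, is_valid, is_subgoal, budget=100):
    """Sub-search sketch: collects every sub-goal reached and, whenever a
    transition would land in a forbidden state, records the last valid
    state as a boundary state for the parent planner to expand around."""
    subgoals, boundaries = [], []
    stack, seen, steps = [start], {start}, 0
    while stack and steps < budget:
        s = stack.pop()
        steps += 1
        if is_subgoal(s):
            subgoals.append(s)   # return to parent; do not expand further
            continue
        for a in actions(s):
            s2 = transition(s, a)
            if not is_valid(s2):
                boundaries.append(s)   # last valid state before the frontier
            elif s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return subgoals, boundaries
```

On a blocked toy instance the search finds no sub-goal but still returns the frontier state, which is exactly what lets the higher level replan around the obstacle.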
6. Invariance Mechanisms and Policy Composition
RTP enables architectural invariances that generalize policies across object count and clutter:
- Background invariance: Applied by filtering state inputs prior to policy evaluation, making policies for actions like "grab" invariant to distractor clutter.
- Object-number invariance: Achieved by encoding states and actions via 2D image grids and learning policies over softmax heatmaps; this generalizes a policy trained on a fixed number of objects to arbitrary object counts.
- Policy splitting: For compositional tasks (e.g., "grab" then "place"), two networks are trained: $\pi_{\text{task}}$ for the task and $\pi_{\text{entity}}$ for the subtask entity (e.g., which object), with $\pi = \pi_{\text{task}} \cdot \pi_{\text{entity}}$.
These mechanisms enable compact and adaptable sub-policies, facilitating transfer across different task complexities and layouts.
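The policy-splitting factorization can be sketched numerically. The logit tables below stand in for the two networks, and all names are illustrative; the sketch only demonstrates that the product of two softmax distributions is itself a valid joint distribution over (task, entity) pairs:

```python
import math
from typing import Dict, List

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def factored_policy(task_logits: Dict[str, float],
                    entity_logits: Dict[str, Dict[str, float]]):
    """Policy splitting: P(task, entity | s) =
    P_task(task | s) * P_entity(entity | s, task)."""
    tasks = list(task_logits)
    p_task = dict(zip(tasks, softmax([task_logits[t] for t in tasks])))
    joint = {}
    for t in tasks:
        ents = list(entity_logits[t])
        p_ent = dict(zip(ents, softmax([entity_logits[t][e] for e in ents])))
        for e in ents:
            joint[(t, e)] = p_task[t] * p_ent[e]
    return joint
```

Because the entity head is shared in structure across tasks, only the small task head grows with the number of compositional stages, which is what makes these sub-policies compact.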
7. Training Regimes, Transfer, and Empirical Results
Policies can be trained in parallel (batch learning at all hierarchical levels) or staged (bottom-up, low-level first); both approaches are empirically supported. RTP achieves zero-shot transfer: policies trained on a class $C$ can be reused for a novel class $C'$, as the hierarchy and boundary-state logic allow real-time adaptation to new obstacles or tasks without retraining.
Experimental evaluation on Box2D (block stacking, grab/place, lunar lander) and MuJoCo (inverted pendulum) domains demonstrates:
- Exact or rapidly converging solution quality as policies bootstrap from pure planning to pure greedy execution.
- Orders-of-magnitude reduction in planning time from hierarchical structuring and near-greedy search.
- Natural zero-shot transfer between task variants (e.g., stacking more blocks, changing obstacles).
- Robustness and generalization via background and object-number invariant policy architectures.
Representative Experimental Results
| Task | Planner | Policy? | Accuracy | Time (s) |
|---|---|---|---|---|
| Place 3 boxes (no obs) | Flat | No | 15/25 | 64.0 |
| Place 3 boxes | Flat | Yes | 25/25 | 0.4 |
| Place 9 boxes w/ obs | Flat w/ 3-box Policy | Yes | 25/25 | 1.2 |
| Stack 5 boxes | 2-Level (L1+L2) | Yes | 10/10 | 7.2 |
| Inverted Pendulum (300 ex) | 2-Level RTP | Yes | 10/10 | 2.5 |
In the "zero-shot" regime, RTP trained on one variant (e.g., 3-box stack) successfully generalized to harder variants (e.g., 5-box stack) with minimal degradation in efficiency or success rate. Four-Rooms and Lunar Lander tasks further confirmed these capabilities (Redlich, 2024).
8. Connections to Differentiable Recursive Tree Models
TreeQN and ATreeC instantiate RTPs within deep RL via end-to-end differentiable tree expansions in abstract latent spaces. The components comprise:
- State encoder $z = f_{\text{enc}}(x)$
- Action-conditioned transition model $z_a = f_{\text{trans}}(z, a)$
- Immediate reward predictor $\hat{r}(z, a)$
- Leaf value head $V(z)$
The tree expansion is fully recursive and aggregates predicted rewards and values:

$$Q(z, a) = \hat{r}(z, a) + \gamma \, b(z_a), \qquad b(z) = (1 - \lambda)\, V(z) + \lambda \max_{a'} Q(z, a'),$$

with $b(z) = V(z)$ at the leaves.
ATreeC converts root $Q$-values to policies via a softmax, and all parameters are trained via RL objectives ($n$-step $Q$-learning, actor-critic losses), plus an auxiliary reward-prediction loss (Farquhar et al., 2017).
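The recursive backup can be sketched in plain Python (a non-differentiable stand-in for the latent-space modules; the $\lambda$ value-mixing follows the backup described above, and all function names are assumptions):

```python
from typing import Callable, List

def tree_backup(z, depth: int,
                transition: Callable, reward: Callable,
                value: Callable, num_actions: int,
                gamma: float = 0.99, lam: float = 0.8) -> List[float]:
    """TreeQN-style recursive tree backup sketch: expand every action to
    `depth`, then mix learned leaf values with backed-up Q-values via lam."""
    q = []
    for a in range(num_actions):
        z2 = transition(z, a)          # predicted latent successor z_a
        r = reward(z, a)               # predicted immediate reward r_hat(z, a)
        if depth <= 1:
            backup = value(z2)         # leaf: value head only
        else:
            child_q = tree_backup(z2, depth - 1, transition, reward,
                                  value, num_actions, gamma, lam)
            backup = (1 - lam) * value(z2) + lam * max(child_q)
        q.append(r + gamma * backup)
    return q
```

In the actual models this whole recursion is built from differentiable network modules, so gradients flow through every expanded branch, which is the property the bullet points below emphasize.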
Distinctive aspects of these differentiable RTPs include:
- Abstract latent-space planning (not observation or real state)
- Gradients propagated through entire planning tree
- Elimination of rollout mismatch between model learning and planning
- Explicit horizon/capacity trade-off enforced by recursive depth
Empirical evaluations of TreeQN/ATreeC show consistent improvements over n-step DQN/A2C baselines across a range of discrete-action benchmarks, confirming the efficacy of recursive tree structures in deep RL (Farquhar et al., 2017).
Recursive Tree Planners, through their combined explicit hierarchical planning, recursive invocation, policy learning, and invariance mechanisms, constitute a general framework that subsumes both classic forward search and policy-based control, enabling scalable planning, rapid transfer, and efficient operation across diverse domains (Redlich, 2024, Farquhar et al., 2017).