
Shortcut Learning for Abstract Planning (SLAP)

Updated 9 November 2025
  • SLAP is a framework that combines task and motion planning abstractions with reinforcement learning to discover multi-step shortcut options in complex robotic domains.
  • It integrates human-engineered abstract actions with automatically learned policies to minimize plan lengths and boost success rates in deterministic and continuous MDP settings.
  • Empirical evaluations in various simulated environments demonstrate that SLAP consistently achieves 100% task success while significantly reducing planning times compared to pure planning methods.

Shortcut Learning for Abstract Planning (SLAP) refers to a family of frameworks and algorithmic strategies that leverage the abstraction capabilities of Task and Motion Planning (TAMP) and temporally-extended actions (options) to introduce learned "shortcut" skills into model-based or model-free long-horizon decision-making. By augmenting a fixed set of human-engineered abstract actions with automatically discovered multi-step policies, SLAP aims to minimize plan length, improve task success rate, and bridge the gap between combinatorial planning and reinforcement learning in high-dimensional, continuous, and sparse-reward robotic domains.

1. Formal Foundation: Abstractions and Task Decomposition

SLAP operates in the setting of deterministic, continuous-state, continuous-action Markov Decision Processes (MDPs) $\mathcal{M} = (\mathcal{X}, \mathcal{U}, f, R, \gamma, p_0)$, or their stochastic variants, with temporally extended actions (options) that provide hierarchical structure. The key ingredients are:

  • State space: $x \in \mathcal{X} \subset \mathbb{R}^n$
  • Action space: $u \in \mathcal{U} \subset \mathbb{R}^m$
  • Deterministic transitions: $x_{t+1} = f(x_t, u_t)$
  • Sparse reward: $R(x_t, u_t) = -1$ per time step; reaching the goal region $g \subset \mathcal{X}$ incurs no further penalty
  • Objective: Minimize episode length (cumulative cost)

In TAMP, the state space is abstracted using predicates so that $s = \operatorname{abstract}(x)$, providing symbolic or geometrically meaningful state representations. Options are tuples $a = \langle s^{a}_{\text{init}}, \pi^a, s^{a}_{\text{term}} \rangle$ mapping initiation and termination abstract states to a low-level policy $\pi^a: \mathcal{X} \rightarrow \mathcal{U}$.

The abstract planning problem builds a directed two-level graph $G = (V_{\text{low}}, V_{\text{high}}, E_{\text{low}}, E_{\text{high}})$, capturing the compositionality and reachability of hand-engineered as well as learned skills.
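To make these definitions concrete, the following is a minimal Python sketch of options and the planning graph, collapsed to the abstract level for brevity; the names (Option, AbstractGraph, abstract states as frozensets of predicate strings) are illustrative assumptions rather than the paper's actual implementation.

from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple

AbstractState = FrozenSet[str]          # e.g. frozenset({"On(a,b)", "Clear(a)"})
LowLevelState = Tuple[float, ...]       # x in X ⊂ R^n
Action = Tuple[float, ...]              # u in U ⊂ R^m


@dataclass(frozen=True)
class Option:
    """An option a = <s_init, pi, s_term> over the abstract state space."""
    s_init: AbstractState
    s_term: AbstractState
    policy: Callable[[LowLevelState], Action]   # pi^a : X -> U


@dataclass
class AbstractGraph:
    """Planning graph: abstract states as nodes, options as directed edges."""
    nodes: List[AbstractState] = field(default_factory=list)
    edges: Dict[Tuple[AbstractState, AbstractState], Option] = field(default_factory=dict)

    def add_option(self, option: Option) -> None:
        for s in (option.s_init, option.s_term):
            if s not in self.nodes:
                self.nodes.append(s)
        self.edges[(option.s_init, option.s_term)] = option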

2. The Shortcut Learning Mechanism

SLAP introduces an algorithmic process for identifying and integrating new shortcut options:

  1. Candidate Extraction: Search the abstract graph for pairs $(s_{\text{init}}, s_{\text{term}})$ that are not already directly connected by an option but are reachable via a multi-step plan.
  2. Pruning via Stochastic Rollouts: For each candidate, estimate empirical reachability with random rollouts; a candidate is retained only if enough rollouts reach the target abstract state, indicating that the shortcut is physically achievable and therefore learnable.
  3. Shortcut Policy Learning: For retained candidates, instantiate a new MDP whose goal is to reach $s_{\text{term}}$ from $s_{\text{init}}$. Use model-free RL (typically PPO) to learn a policy $\pi_\theta$ under the same low-level environment dynamics and reward structure (a minimal sketch of such a shortcut MDP follows this list):

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T-1} \gamma^t r(x_t, u_t) \right]$$

Each training episode terminates when $\operatorname{abstract}(x) = s_{\text{term}}$.

  4. Integration: Each learned shortcut $\hat{a} = \langle s_{\text{init}}, \pi_\theta, s_{\text{term}} \rangle$ is added to the set of available options, updating the abstract graph for planning.
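As a sketch of step 3, the shortcut MDP can be framed roughly as below, assuming the environment supplies a state sampler `reset_to`, deterministic dynamics `f`, and the predicate-based `abstract` function; the interface and names are hypothetical, and any model-free RL algorithm (PPO in the paper) could be trained against it.

import numpy as np


class ShortcutMDP:
    """Episodic MDP for learning one shortcut: start with abstract(x) == s_init,
    succeed when abstract(x) == s_term. Reward is -1 per step (minimize length);
    the step that reaches the goal is not penalized."""

    def __init__(self, reset_to, f, abstract, s_init, s_term, horizon=200):
        self.reset_to = reset_to      # samples a low-level state x with abstract(x) == s_init
        self.f = f                    # deterministic dynamics x' = f(x, u)
        self.abstract = abstract      # predicate-based abstraction X -> S
        self.s_init, self.s_term = s_init, s_term
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.x = self.reset_to(self.s_init)
        return np.asarray(self.x)

    def step(self, u):
        self.x = self.f(self.x, u)
        self.t += 1
        reached = self.abstract(self.x) == self.s_term
        truncated = self.t >= self.horizon
        reward = 0.0 if reached else -1.0
        return np.asarray(self.x), reward, reached or truncated, {"success": reached}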

This process is algorithmically captured as follows (pseudocode excerpt):

Algorithm: SLAP Training
Input: training tasks {(x₀, g)}, transition f, options 𝒜, N_rollout, T_rollout, K_rollout
1. For each training task:
   a. Build abstract graph G using 𝒜
   b. Extract candidate state pairs (s_i, s_j) not directly connected by an option in 𝒜 but reachable in G
   c. For each candidate, run N_rollout random rollouts of length ≤ T_rollout;
      if at least K_rollout of them reach s_j, retain the candidate
2. For each retained candidate, create a shortcut MDP and train a PPO policy π_{i,j}
3. Add each new shortcut ⟨s_i, π_{i,j}, s_j⟩ to 𝒜
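A rough Python translation of steps 1b and 1c, reusing the AbstractGraph sketch from Section 1 and assuming the environment exposes the hypothetical helpers reset_to, f, abstract, and sample_random_action:

from collections import deque


def reachable_from(graph, source):
    """Breadth-first search over option edges: abstract states reachable from `source`."""
    adjacency = {}
    for (u, v) in graph.edges:
        adjacency.setdefault(u, []).append(v)
    seen, frontier = {source}, deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen - {source}


def extract_candidates(graph):
    """Step 1b: pairs reachable via a multi-step plan but not directly connected."""
    return [(s_i, s_j)
            for s_i in graph.nodes
            for s_j in reachable_from(graph, s_i)
            if (s_i, s_j) not in graph.edges]


def prune_by_rollouts(candidates, env, n_rollout=100, t_rollout=50, k_rollout=1):
    """Step 1c: keep a candidate only if at least k_rollout of n_rollout random
    rollouts from s_init reach s_term within t_rollout steps."""
    retained = []
    for s_init, s_term in candidates:
        hits = 0
        for _ in range(n_rollout):
            x = env.reset_to(s_init)
            for _ in range(t_rollout):
                x = env.f(x, env.sample_random_action())
                if env.abstract(x) == s_term:
                    hits += 1
                    break
        if hits >= k_rollout:
            retained.append((s_init, s_term))
    return retained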

3. Planning with Shortcuts: Integration and Execution

At test time, SLAP uses the augmented option set (hand-crafted and learned shortcuts) to construct the planning graph. Dijkstra’s algorithm is run on the low-level graph to search for the minimal-cost trajectory (in terms of time steps). Execution proceeds by sequentially triggering policies corresponding to the selected options. If a learned shortcut fails to reach its declared $s_{\text{term}}$ within a timeout $T_{\text{eval}}$, the planner prunes the edge and replans on the fly.
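A minimal sketch of this planning-and-replanning loop, assuming edge costs are estimated option durations in time steps and reusing the hypothetical graph and environment interfaces from the earlier sketches:

import heapq


def dijkstra_plan(graph, costs, s_start, s_goal):
    """Dijkstra over the abstract graph; costs[(s_i, s_j)] is the estimated
    number of low-level time steps the corresponding option takes."""
    dist, prev = {s_start: 0.0}, {}
    queue = [(0.0, id(s_start), s_start)]       # id() breaks ties without comparing states
    while queue:
        d, _, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue
        if u == s_goal:
            break
        for (a, b), option in graph.edges.items():
            if a != u:
                continue
            nd = d + costs[(a, b)]
            if nd < dist.get(b, float("inf")):
                dist[b], prev[b] = nd, (u, option)
                heapq.heappush(queue, (nd, id(b), b))
    if s_goal != s_start and s_goal not in prev:
        return None                              # no abstract plan exists
    plan, s = [], s_goal
    while s != s_start:
        u, option = prev[s]
        plan.append(option)
        s = u
    return list(reversed(plan))


def execute(plan, graph, costs, env, x, s_goal, t_eval=200):
    """Run options in sequence; if an option misses its s_term within t_eval
    steps, prune that edge and replan from the current state."""
    for option in plan:
        for _ in range(t_eval):
            x = env.f(x, option.policy(x))
            if env.abstract(x) == option.s_term:
                break
        else:                                    # timeout: prune and replan on the fly
            del graph.edges[(option.s_init, option.s_term)]
            new_plan = dijkstra_plan(graph, costs, env.abstract(x), s_goal)
            return execute(new_plan, graph, costs, env, x, s_goal, t_eval) if new_plan else x
    return x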

The computational complexity is:

  • Candidate shortcut pairs: $O(|\mathcal{S}|^2)$
  • Graph construction: $O(|\text{Ops}| \cdot |\text{Objects}|^{\text{arity}})$
  • Planning: $O(|E| + |V|\log|V|)$ per Dijkstra search
  • RL training is parallelized over candidates; empirical pruning removes approximately 99% of candidates

A key property of SLAP is completeness: the system falls back to pure planning if no learned shortcuts prove useful, and it reduces to end-to-end RL when the task is trivial at the abstract level.

4. Empirical Behavior: Examples and Quantitative Evaluation

SLAP is evaluated in four simulated, sparse-reward robotic environments:

  • Obstacle 2D: Planar robot that must clear a target region blocked by a single obstacle; 11 learned shortcuts.
  • Obstacle Tower: PyBullet Panda arm; stack of obstacles; 92 shortcuts.
  • Cluttered Drawer: Manipulator in a drawer with multiple obstacles; 74 shortcuts.
  • Cleanup Table: Manipulator, irregular toys, wiper tool; 54 shortcuts.

Learned shortcuts include "slap" (pushing stacks of objects), "wiggle" (oscillating the gripper to clear adjacent objects), and "wipe" (sweeping multiple objects in one motion), violating the STRIPS “single-object” assumption and enabling multi-object contacts.

A summary of plan length reduction is tabulated:

Environment      Method         Success   Plan Length (steps)   Reduction
Obstacle 2D      SLAP           100%      17.8 ± 2.0            ↓ 31%
                 Pure Planning  100%      25.8 ± 2.2            0%
                 PPO            0%        100 (max)             N/A
Obstacle Tower   SLAP           100%      73.8 ± 4.3            ↓ 69%
                 Pure Planning  100%      238.6 ± 12.8          0%
Cleanup Table    SLAP           100%      113.7 ± 17.0          ↓ 66%
                 Pure Planning  100%      446.3 ± 34.9          0%
                 RL baselines   0%        500 (max)             N/A

SLAP consistently achieves 100% success and substantially reduced plan lengths compared to pure planning or RL baselines. Hierarchical RL using the same options does not realize the plan optimality of SLAP.

Additional findings:

  • As training proceeds (up to 500K environment steps), more shortcuts are mastered and plan lengths decrease monotonically.
  • Generalization: SLAP trained on 3-block stacks achieves comparable plan-length reductions on unseen task instances (4–6 block towers, novel object sets), owing to the reuse of abstract predicates and object substitution in shortcut definitions.

5. Theoretical Underpinnings and Relation to Abstract Model Learning

SLAP can be linked to frameworks that learn abstract world models over options (Rodriguez-Sanchez et al., 22 Jun 2024), where abstraction functions $\phi: S \rightarrow \tilde{S}$ are learned to preserve the option-induced transition and initiation structure. Exact dynamics-preserving abstractions enable value preservation; approximate versions guarantee bounded value loss:

$$|Q^\pi(s, o) - Q^\pi(\phi(s), o)| \leq \frac{\sqrt{\epsilon_R} + \gamma V_{\text{max}}\sqrt{\epsilon_T}}{1-\gamma}$$

This suggests that SLAP's methodology for learning and integrating shortcut options complements model-learning approaches by expanding the skill set over which abstractions can be constructed and planned.

Practical model-learning involves maximizing contrastive objectives (InfoNCE) to induce embeddings that capture option effects while compressing irrelevant state details. Planning then proceeds in the abstract MDP equipped with both hand-crafted and learned shortcut options, providing sample-efficient and generalizable policies even in continuous, high-dimensional environments.
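As one concrete illustration of such a contrastive objective, the sketch below computes a standard InfoNCE loss in PyTorch over batches of predicted and actual next-state embeddings; the encoder, the pairing scheme, and the temperature value are assumptions and not the cited work's exact formulation.

import torch
import torch.nn.functional as F


def info_nce_loss(z_pred, z_next, temperature=0.1):
    """Standard InfoNCE: each predicted next-state embedding z_pred[i] should be
    closest to its true next-state embedding z_next[i] among the batch.

    z_pred, z_next: (batch, dim) tensors, e.g. z_pred = g(phi(s), option)
    and z_next = phi(s').
    """
    z_pred = F.normalize(z_pred, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_pred @ z_next.t() / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(z_pred.size(0), device=z_pred.device)
    return F.cross_entropy(logits, labels)               # positives lie on the diagonal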

6. Limitations and Future Research

Key limitations and open directions for SLAP include:

  • Dependence on fixed, user-provided abstractions (predicate definitions and high-level TAMP structure). Learning or refining abstractions jointly with options is a prospective extension.
  • Quadratic growth in the number of candidate shortcut pairs; aggressive empirical pruning alleviates this, but scaling to very large abstract state spaces may require additional heuristics or learned prioritization schemes.
  • Applicability to stochastic or partially observable environments is not yet fully established, though preliminary findings indicate improved robustness when shortcut options are present.
  • Environments where skill effects overlap extensively may defeat abstraction, requiring more expressive or hierarchical representation mechanisms.
  • Integration of automated option discovery, continual learning of abstractions, and multi-modal perception (e.g., vision and tactile sensing) are promising future directions.

SLAP demonstrates that leveraging existing TAMP abstractions to guide reinforcement learning enables the discovery of dynamic, multi-object skills that drastically shorten plans and improve long-horizon task success, moving toward integrated planning-learning systems combining generalization with physical improvisation.
