
Shortcut Learning for Abstract Planning (SLAP)

Updated 9 November 2025
  • SLAP is a framework that combines task and motion planning abstractions with reinforcement learning to discover multi-step shortcut options in complex robotic domains.
  • It integrates human-engineered abstract actions with automatically learned policies to minimize plan lengths and boost success rates in deterministic and continuous MDP settings.
  • Empirical evaluations in various simulated environments demonstrate that SLAP consistently achieves 100% task success while significantly reducing planning times compared to pure planning methods.

Shortcut Learning for Abstract Planning (SLAP) refers to a family of frameworks and algorithmic strategies that leverage the abstraction capabilities of Task and Motion Planning (TAMP) and temporally-extended actions (options) to introduce learned "shortcut" skills into model-based or model-free long-horizon decision-making. By augmenting a fixed set of human-engineered abstract actions with automatically discovered multi-step policies, SLAP aims to minimize plan length, improve task success rate, and bridge the gap between combinatorial planning and reinforcement learning in high-dimensional, continuous, and sparse-reward robotic domains.

1. Formal Foundation: Abstractions and Task Decomposition

SLAP operates in the setting of deterministic, continuous-state, continuous-action Markov Decision Processes (MDPs) $\mathcal{M} = (\mathcal{X}, \mathcal{U}, f, R, \gamma, p_0)$, or their stochastic variants, with temporally extended actions (options) that provide hierarchical structure. The key ingredients are:

  • State space: $x \in \mathcal{X} \subset \mathbb{R}^n$
  • Action space: $u \in \mathcal{U} \subset \mathbb{R}^m$
  • Deterministic transitions: $x_{t+1} = f(x_t, u_t)$
  • Sparse reward: $R(x_t, u_t) = -1$ per time step; reaching the goal region $g \subset \mathcal{X}$ incurs no further penalty
  • Objective: Minimize episode length (cumulative cost)

In TAMP, the state space is abstracted using predicates so that $s = \operatorname{abstract}(x)$, providing symbolic or geometrically meaningful state representations. Options are tuples $a = \langle s^{a}_{\text{init}}, \pi^a, s^{a}_{\text{term}} \rangle$ mapping initiation and termination abstract states to a low-level policy $\pi^a: \mathcal{X} \rightarrow \mathcal{U}$.

The abstract planning problem builds a directed two-level graph $G = (V_{\text{low}}, V_{\text{high}}, E_{\text{low}}, E_{\text{high}})$, capturing the compositionality and reachability of hand-engineered as well as learned skills.
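To make these definitions concrete, the following is a minimal Python sketch of options and the planning graph, collapsed to the abstract level for brevity; the names (Option, AbstractGraph, abstract states as frozensets of predicate strings) are illustrative assumptions rather than the paper's actual implementation.

from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple

AbstractState = FrozenSet[str]          # e.g. frozenset({"On(a,b)", "Clear(a)"})
LowLevelState = Tuple[float, ...]       # x in X ⊂ R^n
Action = Tuple[float, ...]              # u in U ⊂ R^m


@dataclass(frozen=True)
class Option:
    """An option a = <s_init, pi, s_term> over the abstract state space."""
    s_init: AbstractState
    s_term: AbstractState
    policy: Callable[[LowLevelState], Action]   # pi^a : X -> U


@dataclass
class AbstractGraph:
    """Planning graph: abstract states as nodes, options as directed edges."""
    nodes: List[AbstractState] = field(default_factory=list)
    edges: Dict[Tuple[AbstractState, AbstractState], Option] = field(default_factory=dict)

    def add_option(self, option: Option) -> None:
        for s in (option.s_init, option.s_term):
            if s not in self.nodes:
                self.nodes.append(s)
        self.edges[(option.s_init, option.s_term)] = option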

2. The Shortcut Learning Mechanism

SLAP introduces an algorithmic process for identifying and integrating new shortcut options:

  1. Candidate Extraction: Search the abstract graph for pairs $(s_{\text{init}}, s_{\text{term}})$ that are not already directly connected by an option but are reachable via a multi-step plan.
  2. Pruning via Stochastic Rollouts: For each candidate, estimate empirical reachability with random rollouts; a candidate is retained only if enough rollouts reach the target abstract state, indicating that the shortcut is physically achievable and therefore learnable.
  3. Shortcut Policy Learning: For retained candidates, instantiate a new MDP whose goal is to reach $s_{\text{term}}$ from $s_{\text{init}}$. Use model-free RL (typically PPO) to learn a policy $\pi_\theta$ under the same low-level environment dynamics and reward structure (a minimal sketch of such a shortcut MDP follows this list):

$$J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{T-1} \gamma^t r(x_t, u_t) \right]$$

Each training episode terminates when $\operatorname{abstract}(x) = s_{\text{term}}$.

  4. Integration: Each learned shortcut $\hat{a} = \langle s_{\text{init}}, \pi_\theta, s_{\text{term}} \rangle$ is added to the set of available options, updating the abstract graph for planning.
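As a sketch of step 3, the shortcut MDP can be framed roughly as below, assuming the environment supplies a state sampler `reset_to`, deterministic dynamics `f`, and the predicate-based `abstract` function; the interface and names are hypothetical, and any model-free RL algorithm (PPO in the paper) could be trained against it.

import numpy as np


class ShortcutMDP:
    """Episodic MDP for learning one shortcut: start with abstract(x) == s_init,
    succeed when abstract(x) == s_term. Reward is -1 per step (minimize length);
    the step that reaches the goal is not penalized."""

    def __init__(self, reset_to, f, abstract, s_init, s_term, horizon=200):
        self.reset_to = reset_to      # samples a low-level state x with abstract(x) == s_init
        self.f = f                    # deterministic dynamics x' = f(x, u)
        self.abstract = abstract      # predicate-based abstraction X -> S
        self.s_init, self.s_term = s_init, s_term
        self.horizon = horizon

    def reset(self):
        self.t = 0
        self.x = self.reset_to(self.s_init)
        return np.asarray(self.x)

    def step(self, u):
        self.x = self.f(self.x, u)
        self.t += 1
        reached = self.abstract(self.x) == self.s_term
        truncated = self.t >= self.horizon
        reward = 0.0 if reached else -1.0
        return np.asarray(self.x), reward, reached or truncated, {"success": reached}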

This process is algorithmically captured as follows (pseudocode excerpt):

Algorithm: SLAP Training
Input: training tasks {(x₀, g)}, transition f, options 𝒜, N_rollout, T_rollout, K_rollout
1. For each training task:
   a. Build abstract graph G using 𝒜
   b. Extract candidate state pairs (s_i, s_j) not directly connected by an option in 𝒜 but reachable in G
   c. For each candidate, run N_rollout random rollouts of length ≤ T_rollout;
      if at least K_rollout of them reach s_j, retain the candidate
2. For each retained candidate, create a shortcut MDP and train a PPO policy π_{i,j}
3. Add each new shortcut ⟨s_i, π_{i,j}, s_j⟩ to 𝒜
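A rough Python translation of steps 1b and 1c, reusing the AbstractGraph sketch from Section 1 and assuming the environment exposes the hypothetical helpers reset_to, f, abstract, and sample_random_action:

from collections import deque


def reachable_from(graph, source):
    """Breadth-first search over option edges: abstract states reachable from `source`."""
    adjacency = {}
    for (u, v) in graph.edges:
        adjacency.setdefault(u, []).append(v)
    seen, frontier = {source}, deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen - {source}


def extract_candidates(graph):
    """Step 1b: pairs reachable via a multi-step plan but not directly connected."""
    return [(s_i, s_j)
            for s_i in graph.nodes
            for s_j in reachable_from(graph, s_i)
            if (s_i, s_j) not in graph.edges]


def prune_by_rollouts(candidates, env, n_rollout=100, t_rollout=50, k_rollout=1):
    """Step 1c: keep a candidate only if at least k_rollout of n_rollout random
    rollouts from s_init reach s_term within t_rollout steps."""
    retained = []
    for s_init, s_term in candidates:
        hits = 0
        for _ in range(n_rollout):
            x = env.reset_to(s_init)
            for _ in range(t_rollout):
                x = env.f(x, env.sample_random_action())
                if env.abstract(x) == s_term:
                    hits += 1
                    break
        if hits >= k_rollout:
            retained.append((s_init, s_term))
    return retained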

3. Planning with Shortcuts: Integration and Execution

At test time, SLAP uses the augmented option set (hand-crafted and learned shortcuts) to construct the planning graph. Dijkstra’s algorithm is run on the low-level graph to search for the minimal-cost trajectory (in terms of time steps). Execution proceeds by sequentially triggering policies corresponding to the selected options. If a learned shortcut fails to reach its declared $s_{\text{term}}$ within a timeout $T_{\text{eval}}$, the planner prunes the edge and replans on the fly.
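A minimal sketch of this planning-and-replanning loop, assuming edge costs are estimated option durations in time steps and reusing the hypothetical graph and environment interfaces from the earlier sketches:

import heapq


def dijkstra_plan(graph, costs, s_start, s_goal):
    """Dijkstra over the abstract graph; costs[(s_i, s_j)] is the estimated
    number of low-level time steps the corresponding option takes."""
    dist, prev = {s_start: 0.0}, {}
    queue = [(0.0, id(s_start), s_start)]       # id() breaks ties without comparing states
    while queue:
        d, _, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue
        if u == s_goal:
            break
        for (a, b), option in graph.edges.items():
            if a != u:
                continue
            nd = d + costs[(a, b)]
            if nd < dist.get(b, float("inf")):
                dist[b], prev[b] = nd, (u, option)
                heapq.heappush(queue, (nd, id(b), b))
    if s_goal != s_start and s_goal not in prev:
        return None                              # no abstract plan exists
    plan, s = [], s_goal
    while s != s_start:
        u, option = prev[s]
        plan.append(option)
        s = u
    return list(reversed(plan))


def execute(plan, graph, costs, env, x, s_goal, t_eval=200):
    """Run options in sequence; if an option misses its s_term within t_eval
    steps, prune that edge and replan from the current state."""
    for option in plan:
        for _ in range(t_eval):
            x = env.f(x, option.policy(x))
            if env.abstract(x) == option.s_term:
                break
        else:                                    # timeout: prune and replan on the fly
            del graph.edges[(option.s_init, option.s_term)]
            new_plan = dijkstra_plan(graph, costs, env.abstract(x), s_goal)
            return execute(new_plan, graph, costs, env, x, s_goal, t_eval) if new_plan else x
    return x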

The computational complexity is:

  • Candidate shortcut pairs: $O(|\mathcal{S}|^2)$
  • Graph construction: $O(|\text{Ops}| \cdot |\text{Objects}|^{\text{arity}})$
  • Planning: $O(|E| + |V|\log|V|)$ per Dijkstra search
  • RL training is parallelized over candidates; empirical pruning removes approximately 99% of candidates

A key property of SLAP is completeness: the system falls back to pure planning if no learned shortcuts prove useful, and it reduces to end-to-end RL when the task is trivial at the abstract level.

4. Empirical Behavior: Examples and Quantitative Evaluation

SLAP is evaluated in four simulated, sparse-reward robotic environments:

  • Obstacle 2D: Planar robot that must clear a target region blocked by a single obstacle; 11 learned shortcuts.
  • Obstacle Tower: PyBullet Panda arm; stack of obstacles; 92 shortcuts.
  • Cluttered Drawer: Manipulator in a drawer with multiple obstacles; 74 shortcuts.
  • Cleanup Table: Manipulator, irregular toys, wiper tool; 54 shortcuts.

Learned shortcuts include "slap" (pushing stacks of objects), "wiggle" (oscillating the gripper to clear adjacent objects), and "wipe" (sweeping multiple objects in one motion), violating the STRIPS “single-object” assumption and enabling multi-object contacts.

A summary of plan length reduction is tabulated:

Environment      Method         Success   Plan Length (steps)   Reduction
Obstacle 2D      SLAP           100%      17.8 ± 2.0            ↓ 31%
                 Pure Planning  100%      25.8 ± 2.2            0%
                 PPO            0%        100 (max)             N/A
Obstacle Tower   SLAP           100%      73.8 ± 4.3            ↓ 69%
                 Pure Planning  100%      238.6 ± 12.8          0%
Cleanup Table    SLAP           100%      113.7 ± 17.0          ↓ 66%
                 Pure Planning  100%      446.3 ± 34.9          0%
                 RL baselines   0%        500 (max)             N/A

SLAP consistently achieves 100% success and substantially reduced plan lengths compared to pure planning or RL baselines. Hierarchical RL using the same options does not realize the plan optimality of SLAP.

Additional findings:

  • As training proceeds (up to 500K environment steps), more shortcuts are mastered and plan lengths decrease monotonically.
  • Generalization: SLAP trained on 3-block stacks achieves comparable plan-length reductions on unseen task instances (4–6 block towers, novel object sets), owing to the reuse of abstract predicates and object substitution in shortcut definitions.

5. Theoretical Underpinnings and Relation to Abstract Model Learning

SLAP can be linked to frameworks that learn abstract world models over options (Rodriguez-Sanchez et al., 22 Jun 2024), where abstraction functions $\phi: S \rightarrow \tilde{S}$ are learned to preserve the option-induced transition and initiation structure. Exact dynamics-preserving abstractions enable value preservation; approximate versions guarantee bounded value loss:

$$|Q^\pi(s, o) - Q^\pi(\phi(s), o)| \leq \frac{\sqrt{\epsilon_R} + \gamma V_{\text{max}}\sqrt{\epsilon_T}}{1-\gamma}$$

This suggests that SLAP's methodology for learning and integrating shortcut options complements model-learning approaches by expanding the skill set over which abstractions can be constructed and planned.

Practical model-learning involves maximizing contrastive objectives (InfoNCE) to induce embeddings that capture option effects while compressing irrelevant state details. Planning then proceeds in the abstract MDP equipped with both hand-crafted and learned shortcut options, providing sample-efficient and generalizable policies even in continuous, high-dimensional environments.
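As one concrete illustration of such a contrastive objective, the sketch below computes a standard InfoNCE loss in PyTorch over batches of predicted and actual next-state embeddings; the encoder, the pairing scheme, and the temperature value are assumptions and not the cited work's exact formulation.

import torch
import torch.nn.functional as F


def info_nce_loss(z_pred, z_next, temperature=0.1):
    """Standard InfoNCE: each predicted next-state embedding z_pred[i] should be
    closest to its true next-state embedding z_next[i] among the batch.

    z_pred, z_next: (batch, dim) tensors, e.g. z_pred = g(phi(s), option)
    and z_next = phi(s').
    """
    z_pred = F.normalize(z_pred, dim=-1)
    z_next = F.normalize(z_next, dim=-1)
    logits = z_pred @ z_next.t() / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(z_pred.size(0), device=z_pred.device)
    return F.cross_entropy(logits, labels)               # positives lie on the diagonal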

6. Limitations and Future Research

Key limitations and open directions for SLAP include:

  • Dependence on fixed, user-provided abstractions (predicate definitions and high-level TAMP structure). Learning or refining abstractions jointly with options is a prospective extension.
  • Quadratic growth in the number of candidate shortcut pairs; aggressive empirical pruning alleviates this, but scaling to very large abstract state spaces may require additional heuristics or learned prioritization schemes.
  • Applicability to stochastic or partially observable environments is not yet fully established, though preliminary findings indicate improved robustness when shortcut options are present.
  • Environments where skill effects overlap extensively may defeat abstraction, requiring more expressive or hierarchical representation mechanisms.
  • Integration of automated option discovery, continual learning of abstractions, and multi-modal perception (e.g., vision and tactile sensing) are promising future directions.

SLAP demonstrates that leveraging existing TAMP abstractions to guide reinforcement learning enables the discovery of dynamic, multi-object skills that drastically shorten plans and improve long-horizon task success, moving toward integrated planning-learning systems combining generalization with physical improvisation.
