Shortcut Learning for Abstract Planning (SLAP)
- SLAP is a framework that combines task and motion planning abstractions with reinforcement learning to discover multi-step shortcut options in complex robotic domains.
- It integrates human-engineered abstract actions with automatically learned policies to minimize plan lengths and boost success rates in deterministic and continuous MDP settings.
- Empirical evaluations in four simulated, sparse-reward robotic environments demonstrate that SLAP consistently achieves 100% task success while substantially reducing plan lengths compared to pure planning methods.
Shortcut Learning for Abstract Planning (SLAP) refers to a family of frameworks and algorithmic strategies that leverage the abstraction capabilities of Task and Motion Planning (TAMP) and temporally-extended actions (options) to introduce learned "shortcut" skills into model-based or model-free long-horizon decision-making. By augmenting a fixed set of human-engineered abstract actions with automatically discovered multi-step policies, SLAP aims to minimize plan length, improve task success rate, and bridge the gap between combinatorial planning and reinforcement learning in high-dimensional, continuous, and sparse-reward robotic domains.
1. Formal Foundation: Abstractions and Task Decomposition
SLAP operates in the setting of deterministic, continuous-state, continuous-action Markov Decision Processes (MDPs), or their stochastic variants, with temporally extended actions (options) that provide hierarchical structure. The key ingredients are:
- State space: continuous low-level states $x \in \mathcal{X}$
- Action space: continuous low-level actions $u \in \mathcal{U}$
- Deterministic transitions: $x_{t+1} = f(x_t, u_t)$
- Sparse reward: $-1$ per time step; reaching a goal state is non-penalized
- Objective: minimize episode length (cumulative cost)
In TAMP, the state space is abstracted using predicates so that each low-level state $x$ maps to an abstract state $s$, providing symbolic or geometrically meaningful state representations. Options are tuples $(s_i, s_j, \pi)$ mapping an initiation abstract state $s_i$ and a termination abstract state $s_j$ to a low-level policy $\pi$.
The abstract planning problem builds a directed two-level graph $G$, highlighting the compositionality and reachability of hand-engineered as well as learned skills.
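To make the option structure concrete, the following is a minimal Python sketch of how a set of options induces a directed graph over abstract states (a simplified, single-level view of $G$; the class and function names are illustrative assumptions, not identifiers from the paper):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

AbstractState = Hashable  # e.g., a frozenset of ground predicates

@dataclass(frozen=True)
class Option:
    """A hand-engineered or learned skill: applicable in abstract state s_init,
    it executes the low-level policy(x) -> u and terminates in s_term."""
    s_init: AbstractState
    s_term: AbstractState
    policy: Callable

def build_abstract_graph(
    options: List[Option],
) -> Dict[AbstractState, List[Tuple[AbstractState, Option]]]:
    """Directed graph over abstract states: each option contributes one edge."""
    graph: Dict[AbstractState, List[Tuple[AbstractState, Option]]] = {}
    for opt in options:
        graph.setdefault(opt.s_init, []).append((opt.s_term, opt))
        graph.setdefault(opt.s_term, [])  # ensure sink nodes appear as keys
    return graph
```

In this view, learned shortcuts are simply additional `Option` edges inserted into the same graph, as described in the next section.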
2. The Shortcut Learning Mechanism
SLAP introduces an algorithmic process for identifying and integrating new shortcut options:
- Candidate Extraction: Search the abstract graph for state pairs $(s_i, s_j)$ not already directly connected by an option but reachable via a multi-step plan.
- Pruning via Stochastic Rollouts: For each candidate, estimate empirical reachability via random rollouts; retain only candidates whose target abstract state is reached at least $K_{\text{rollout}}$ times out of $N_{\text{rollout}}$ attempts (step 1c below), pruning transitions that naïve exploration never discovers and that sparse-reward RL is therefore unlikely to learn.
- Shortcut Policy Learning: For each retained candidate $(s_i, s_j)$, instantiate a new MDP whose goal is to reach $s_j$ from low-level states abstracting to $s_i$, under the same low-level dynamics and sparse reward structure ($-1$ per time step until the abstract state matches $s_j$); see the sketch after this list. Use model-free RL (typically PPO) to learn a policy $\pi_{i,j}$. Training terminates when $\pi_{i,j}$ reaches $s_j$ with a sufficiently high empirical success rate.
- Integration: Each learned shortcut is added to the set of available options, updating the abstract graph for planning.
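The shortcut MDP in the Shortcut Policy Learning step can be pictured as a thin wrapper around the low-level environment. Below is a hedged sketch assuming a gymnasium-style interface and a hypothetical low-level API (`reset_to`, `step`, `abstract`); it illustrates the construction and is not the authors' code:

```python
import gymnasium as gym

class ShortcutEnv(gym.Env):
    """MDP for one candidate (s_i, s_j): episodes start from low-level states
    abstracting to s_i, pay -1 per step, and terminate once the abstract state
    equals s_j. Any model-free learner (e.g., PPO) can be trained on it."""

    def __init__(self, base_env, s_i, s_j, max_steps=200):
        self.base_env, self.s_i, self.s_j = base_env, s_i, s_j
        self.max_steps = max_steps
        self.observation_space = base_env.observation_space  # assumed attributes
        self.action_space = base_env.action_space

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        self._x = self.base_env.reset_to(self.s_i)  # sample a state in s_i
        return self._x, {}

    def step(self, action):
        self._x = self.base_env.step(self._x, action)  # deterministic f(x, u)
        self._t += 1
        terminated = self.base_env.abstract(self._x) == self.s_j
        truncated = self._t >= self.max_steps
        return self._x, -1.0, terminated, truncated, {}
```

A standard PPO implementation (e.g., Stable-Baselines3) could then be trained on `ShortcutEnv` until the candidate's success criterion is met.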
This process is algorithmically captured as follows (pseudocode excerpt):
```
Algorithm: SLAP Training
Input: training tasks {(x₀, g)}, transition f, options 𝒜,
       N_rollout, T_rollout, K_rollout
1. For each training task:
   a. Build abstract graph G using 𝒜
   b. Extract candidate state pairs (s_i, s_j) not directly connected in 𝒜
      but reachable in G
   c. For each candidate, run N_rollout random rollouts of length ≤ T_rollout;
      if at least K_rollout reach s_j, retain the candidate
2. For each retained candidate, create a shortcut MDP and train a PPO policy π_{i,j}
3. Add each learned shortcut (s_i, s_j, π_{i,j}) to 𝒜
```
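As a concrete rendering of steps 1b–1c, the sketch below finds candidate pairs by breadth-first reachability and prunes them with random rollouts; the low-level interface (`reset_to`, `step`, `sample_action`, `abstract`) is the same hypothetical one used above and is an assumption, not the paper's API:

```python
from collections import deque

def extract_candidates(graph, direct_edges):
    """Pairs (s_i, s_j) reachable via a multi-step abstract plan but not
    already connected by a single option (an edge in `direct_edges`)."""
    candidates = []
    for s_i in graph:
        seen, queue = {s_i}, deque([s_i])
        while queue:  # BFS over the abstract graph from s_i
            s = queue.popleft()
            for s_next, _ in graph.get(s, []):
                if s_next not in seen:
                    seen.add(s_next)
                    queue.append(s_next)
        candidates += [(s_i, s_j) for s_j in seen - {s_i}
                       if (s_i, s_j) not in direct_edges]
    return candidates

def prune_by_rollouts(env, candidates, n_rollout=100, t_rollout=50, k_rollout=1):
    """Retain candidates that random exploration reaches at least k_rollout times."""
    retained = []
    for s_i, s_j in candidates:
        successes = 0
        for _ in range(n_rollout):
            x = env.reset_to(s_i)
            for _ in range(t_rollout):
                x = env.step(x, env.sample_action())
                if env.abstract(x) == s_j:
                    successes += 1
                    break
        if successes >= k_rollout:
            retained.append((s_i, s_j))
    return retained
```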
3. Planning with Shortcuts: Integration and Execution
At test time, SLAP uses the augmented option set (hand-crafted actions plus learned shortcuts) to construct the planning graph. Dijkstra's algorithm is run on the low-level graph to search for the minimal-cost trajectory (in terms of time steps). Execution proceeds by sequentially triggering the policies corresponding to the selected options. If a learned shortcut fails to reach its declared termination abstract state within a fixed timeout, the planner prunes the corresponding edge and replans on the fly.
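The test-time behavior described above, Dijkstra search over the augmented option graph plus on-the-fly pruning and replanning, can be sketched as follows (per-option cost estimates and the environment interface are assumptions for illustration, not the authors' implementation):

```python
import heapq

def dijkstra_plan(graph, cost, s_start, s_goal):
    """Minimum-cost (expected time steps) option sequence from s_start to s_goal.
    graph[s] yields (s_next, option) edges; cost[option] is its estimated duration."""
    frontier = [(0.0, 0, s_start, [])]
    best, tie = {s_start: 0.0}, 1
    while frontier:
        c, _, s, plan = heapq.heappop(frontier)
        if s == s_goal:
            return plan
        if c > best.get(s, float("inf")):
            continue
        for s_next, opt in graph.get(s, []):
            c_next = c + cost[opt]
            if c_next < best.get(s_next, float("inf")):
                best[s_next] = c_next
                heapq.heappush(frontier, (c_next, tie, s_next, plan + [opt]))
                tie += 1
    return None  # goal currently unreachable with the available options

def execute(env, graph, cost, x, s_goal, timeout):
    """Run the next planned option; if it times out, prune its edge and replan."""
    while env.abstract(x) != s_goal:
        plan = dijkstra_plan(graph, cost, env.abstract(x), s_goal)
        if plan is None:
            return False
        opt = plan[0]
        for _ in range(timeout):
            x = env.step(x, opt.policy(x))
            if env.abstract(x) == opt.s_term:
                break
        else:  # shortcut failed to reach its termination state: prune and replan
            graph[opt.s_init] = [(s, o) for s, o in graph[opt.s_init] if o is not opt]
    return True
```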
The computational complexity is:
- Candidate shortcut pairs: $O(|\mathcal{S}|^2)$ in the number of abstract states
- Graph construction: scales with the number of abstract states and options
- Planning: $O(|E| + |V|\log|V|)$ per Dijkstra search on the graph $G = (V, E)$
- RL training is parallelized over candidates; empirical pruning removes 99% of candidates
A notable property of SLAP is completeness: the system falls back to pure planning if no learned shortcuts prove useful, and reduces to end-to-end RL in the degenerate case of trivial tasks.
4. Empirical Behavior: Examples and Quantitative Evaluation
SLAP is evaluated in four simulated, sparse-reward robotic environments:
- Obstacle 2D: Planar robot; the task is to clear a target region blocked by a single obstacle; 11 learned shortcuts.
- Obstacle Tower: PyBullet Panda arm; stack of obstacles; 92 shortcuts.
- Cluttered Drawer: Manipulator in a drawer with multiple obstacles; 74 shortcuts.
- Cleanup Table: Manipulator, irregular toys, wiper tool; 54 shortcuts.
Learned shortcuts include "slap" (pushing stacks of objects), "wiggle" (oscillating the gripper to clear adjacent objects), and "wipe" (sweeping multiple objects in one motion), violating the STRIPS “single-object” assumption and enabling multi-object contacts.
A summary of plan length reduction is tabulated:
| Environment | Method | Success | Plan Length (time steps) | Reduction vs. Pure Planning |
|---|---|---|---|---|
| Obstacle 2D | SLAP | 100% | 17.8 ± 2.0 | ↓ 31% |
| Obstacle 2D | Pure Planning | 100% | 25.8 ± 2.2 | 0% |
| Obstacle 2D | PPO | 0% | 100 (max) | N/A |
| Obstacle Tower | SLAP | 100% | 73.8 ± 4.3 | ↓ 69% |
| Obstacle Tower | Pure Planning | 100% | 238.6 ± 12.8 | 0% |
| Cleanup Table | SLAP | 100% | 113.7 ± 17.0 | ↓ 66% |
| Cleanup Table | Pure Planning | 100% | 446.3 ± 34.9 | 0% |
| Cleanup Table | RL baselines | 0% | 500 | N/A |
SLAP consistently achieves 100% success with substantially shorter plans than pure planning or RL baselines. Hierarchical RL using the same option set does not match the plan quality achieved by SLAP.
Additional findings:
- As training proceeds (up to 500K environment steps), more shortcuts are mastered and plan lengths decrease monotonically.
- Generalization: SLAP trained on 3-block stacks transfers its plan-length reductions to unseen task instances (4–6 block towers, novel object sets), owing to the reuse of abstract predicates and object-substitution schemes for shortcuts.
5. Theoretical Underpinnings and Relation to Abstract Model Learning
SLAP can be linked to frameworks that learn abstract world models over options (Rodriguez-Sanchez et al., 22 Jun 2024), where abstraction functions are learned to preserve the option-induced transition and initiation structure. Exact dynamics-preserving abstractions enable value preservation, while approximate abstractions guarantee a bounded value loss.
This suggests that SLAP's methodology for learning and integrating shortcut options complements model-learning approaches by expanding the skill set over which abstractions can be constructed and planned.
Practical model-learning involves maximizing contrastive objectives (InfoNCE) to induce embeddings that capture option effects while compressing irrelevant state details. Planning then proceeds in the abstract MDP equipped with both hand-crafted and learned shortcut options, providing sample-efficient and generalizable policies even in continuous, high-dimensional environments.
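As an illustration of the contrastive objective mentioned above, the following sketch computes a standard InfoNCE loss over a batch of option-conditioned transition embeddings, treating each row's true successor as the positive and the other batch rows as negatives; the tensor shapes and encoder conventions are assumptions for illustration, not the referenced model's exact formulation:

```python
import torch
import torch.nn.functional as F

def infonce_loss(pred_next: torch.Tensor, enc_next: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """pred_next: (B, D) predicted next-state embeddings, e.g. g(phi(x_t), option).
    enc_next:  (B, D) encoded actual next states phi(x_{t+1}).
    Maximizing agreement with the true successor while contrasting against the
    rest of the batch captures option effects and compresses irrelevant detail."""
    pred_next = F.normalize(pred_next, dim=-1)
    enc_next = F.normalize(enc_next, dim=-1)
    logits = pred_next @ enc_next.t() / temperature  # (B, B) cosine similarities
    labels = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, labels)
```

Minimizing such a loss jointly over the encoder and per-option predictors yields the abstract MDP in which planning with both hand-crafted and shortcut options then proceeds.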
6. Limitations and Future Research
Key limitations and open directions for SLAP include:
- Dependence on fixed, user-provided abstractions (predicate definitions and high-level TAMP structure). Learning or refining abstractions jointly with options is a prospective extension.
- The quadratic growth in candidate shortcut pairs; while aggressive empirical pruning alleviates this, scaling to very large abstract state spaces may require additional heuristics or learned prioritization schemes.
- Applicability to stochastic or partially observable environments is not yet fully established, though preliminary findings indicate improved robustness when shortcut options are present.
- Environments where skill effects overlap extensively may defeat abstraction, requiring more expressive or hierarchical representation mechanisms.
- Integration of automated option discovery, continual learning of abstractions, and multi-modal perception (e.g., vision and tactile sensing) are promising future directions.
SLAP demonstrates that leveraging existing TAMP abstractions to guide reinforcement learning enables the discovery of dynamic, multi-object skills that drastically shorten plans and improve long-horizon task success, moving toward integrated planning-learning systems combining generalization with physical improvisation.