SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Real-Time Long-Horizon Manipulation (2510.01661v1)

Published 2 Oct 2025 in cs.RO

Abstract: Multi-step manipulation in dynamic environments remains challenging. Two major families of methods fail in distinct ways: (i) imitation learning (IL) is reactive but lacks compositional generalization, as monolithic policies do not decide which skill to reuse when scenes change; (ii) classical task-and-motion planning (TAMP) offers compositionality but has prohibitive planning latency, preventing real-time failure recovery. We introduce SymSkill, a unified learning framework that combines the benefits of IL and TAMP, allowing compositional generalization and failure recovery in real-time. Offline, SymSkill jointly learns predicates, operators, and skills directly from unlabeled and unsegmented demonstrations. At execution time, upon specifying a conjunction of one or more learned predicates, SymSkill uses a symbolic planner to compose and reorder learned skills to achieve the symbolic goals, while performing recovery at both the motion and symbolic levels in real time. Coupled with a compliant controller, SymSkill enables safe and uninterrupted execution under human and environmental disturbances. In RoboCasa simulation, SymSkill can execute 12 single-step tasks with 85% success rate. Without additional data, it composes these skills into multi-step plans requiring up to 6 skill recompositions, recovering robustly from execution failures. On a real Franka robot, we demonstrate SymSkill, learning from 5 minutes of unsegmented and unlabeled play data, is capable of performing multiple tasks simply by goal specifications. The source code and additional analysis can be found on https://sites.google.com/view/symskill.

Summary

The paper demonstrates a novel framework that co-invents symbols and skills from raw demonstrations to enable efficient real-time planning in long-horizon tasks.
The methodology segments demonstrations into motion phases and learns predicates and operators using Gaussian distributions to capture spatial relationships.
Experimental results in both simulation and real-world tests show improved task success rates and robust recovery with passive impedance control.

Introduction

The paper "SymSkill: Symbol and Skill Co-Invention for Data-Efficient and Real-Time Long-Horizon Manipulation" addresses robotic manipulation tasks that require multi-step planning and real-time failure recovery. The two established families of methods fall short in different ways: imitation learning (IL) is reactive but lacks compositional generalization, while task-and-motion planning (TAMP) is compositional but too slow to replan in real time. The proposed SymSkill framework combines the strengths of both, learning data-efficiently from unlabeled, unsegmented demonstrations and supporting real-time execution with compositional generalization.

Methodology

SymSkill is a unified, unsupervised framework that jointly learns predicates, operators, and skills from raw demonstration data. Demonstrations are segmented into premotion and motion segments, and the robot's trajectory in each segment is expressed in a relative frame attached to a key object of interest.

1. Demo Segmentation and Reference-Frame Selection:

Each demonstration is divided into segments based on changes in object motion, and specific objects are designated as reference frames for skill learning. Splitting the data into gripper-only and gripper-object segments focuses learning on meaningful interactions.
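The segmentation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's procedure: the function name `segment_by_object_motion`, the speed threshold, and the minimum segment length are all hypothetical, and the input is assumed to be a precomputed per-timestep speed of the manipulated object.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (label, start index, end index exclusive)


def segment_by_object_motion(
    object_speeds: List[float],
    speed_threshold: float = 0.01,  # assumed value; would need tuning on real data
    min_length: int = 3,            # assumed value; short runs are treated as noise
) -> List[Segment]:
    """Split a demonstration wherever the object starts or stops moving."""
    if not object_speeds:
        return []
    moving = [speed > speed_threshold for speed in object_speeds]

    # Pass 1: cut the timeline at every change of the moving/not-moving flag.
    raw: List[Segment] = []
    start = 0
    for t in range(1, len(moving) + 1):
        if t == len(moving) or moving[t] != moving[start]:
            label = "gripper-object" if moving[start] else "gripper-only"
            raw.append((label, start, t))
            start = t

    # Pass 2: absorb very short runs, and runs with a repeated label,
    # into the preceding segment.
    merged: List[Segment] = []
    for label, seg_start, seg_end in raw:
        too_short = (seg_end - seg_start) < min_length
        if merged and (too_short or merged[-1][0] == label):
            prev_label, prev_start, _ = merged.pop()
            merged.append((prev_label, prev_start, seg_end))
        else:
            merged.append((label, seg_start, seg_end))
    return merged
```

Calling `segment_by_object_motion` on a speed profile that is zero, then positive, then zero again yields a gripper-only, a gripper-object, and a second gripper-only segment; a single noisy spike is absorbed into its neighbor.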

Figure 1: VLM prompting procedure used in the real-world learning-from-play experiment. First, the initial image is used to obtain text descriptions of all objects in view. Next, four equally spaced images from each motion segment are provided to Gemini together with the required output enumeration, using the structured output feature. The returned text is then mapped back to the corresponding object name.

2. Relative Pose Predicate Learning:

The framework uses Gaussian distributions to learn relative pose predicates, capturing meaningful spatial relationships between objects and the robot's end effector. These predicates serve as essential symbolic abstractions for planning and execution.
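As a rough illustration of a Gaussian relative-pose predicate, the sketch below fits a Gaussian to the gripper position expressed in an object's frame at the end of several segments, then treats the predicate as true when a new relative position lies within a Mahalanobis distance threshold. Several simplifications are assumptions made here, not details from the paper: only position is modeled (no orientation), the covariance is diagonal, and the names `GaussianPredicate` and `fit_predicate` as well as the threshold of 3.0 are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GaussianPredicate:
    name: str
    mean: List[float]
    var: List[float]         # per-axis variance (diagonal covariance, an assumption)
    threshold: float = 3.0   # assumed Mahalanobis cutoff

    def holds(self, rel_pos: Sequence[float]) -> bool:
        """True if the relative position is close enough to the learned mean."""
        d2 = sum((x - m) ** 2 / v for x, m, v in zip(rel_pos, self.mean, self.var))
        return math.sqrt(d2) <= self.threshold


def fit_predicate(
    name: str,
    samples: Sequence[Sequence[float]],
    min_var: float = 1e-6,   # variance floor so a constant axis does not divide by zero
) -> GaussianPredicate:
    """Fit mean and per-axis variance to relative positions from demonstrations."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    var = [
        max(sum((s[i] - mean[i]) ** 2 for s in samples) / n, min_var)
        for i in range(dim)
    ]
    return GaussianPredicate(name, mean, var)


# Hypothetical data: gripper position relative to a door handle, in metres.
demo_samples = [(0.0, 0.0, 0.10), (0.01, 0.0, 0.11), (-0.01, 0.0, 0.09)]
near_handle = fit_predicate("gripper_near_handle", demo_samples)
```

Evaluating `near_handle.holds(...)` on the current relative position turns a continuous state into a Boolean symbol that a planner can reason over.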

Figure 2: Illustration of the SymSkill predicate and skill co-invention process on a DoorOpen task.

3. Operator Learning and Skill Integration:

Operators are built by associating learned predicates with symbolic transitions derived from demonstration trajectories. Skills, represented as SE(3) dynamical system policies, are trained for operators to ensure stable and efficient execution.
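One common way to derive an operator from symbolic transitions is sketched below; it is a generic STRIPS-style scheme and should be read as an assumption, not as the paper's exact algorithm. Effects are the predicates that change between the state before and after a segment, and preconditions are the predicates that held before every observed instance. The names `Operator` and `learn_operator` and the door example are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

State = FrozenSet[str]  # set of predicate names that are currently true


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]

    def applicable(self, state: State) -> bool:
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        return (state - self.delete_effects) | self.add_effects


def learn_operator(name: str, transitions: List[Tuple[State, State]]) -> Operator:
    """Build one operator from (before, after) symbolic states of similar segments."""
    first_before, first_after = transitions[0]
    add = first_after - first_before
    delete = first_before - first_after
    preconditions = set(first_before)
    for before, after in transitions:
        # Segments with different effects would belong to different operators.
        if after - before != add or before - after != delete:
            raise ValueError("transitions do not share the same effects")
        # Keep only predicates that were true before every instance.
        preconditions &= before
    return Operator(name, frozenset(preconditions), frozenset(add), frozenset(delete))


# Hypothetical transitions from two demonstrations of opening a door.
open_door = learn_operator(
    "open_door",
    [
        (frozenset({"near_handle", "door_closed"}),
         frozenset({"near_handle", "door_open"})),
        (frozenset({"near_handle", "door_closed", "cup_on_table"}),
         frozenset({"near_handle", "door_open", "cup_on_table"})),
    ],
)
```

In this example the incidental predicate `cup_on_table` drops out of the preconditions because it was not true in every demonstration, which is the behavior the intersection is meant to provide.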

4. Real-Time Execution and Monitoring:

SymSkill enables real-time execution by monitoring state transitions and re-planning in response to deviations or failures. It uses a passive impedance controller to ensure continuous and safe task execution.
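The monitor-and-replan loop described above can be sketched at the symbolic level as follows. This is a toy illustration under assumptions: the planner is a plain breadth-first search, the operators are written by hand, and `run_skill` is a hypothetical callback standing in for skill execution plus predicate evaluation. The impedance controller and the motion-level recovery are not modeled.

```python
from collections import deque
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

State = FrozenSet[str]
# operator name -> (preconditions, add effects, delete effects)
Operators = Dict[str, Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]]


def apply_op(ops: Operators, name: str, state: State) -> State:
    _, add, delete = ops[name]
    return (state - delete) | add


def plan(ops: Operators, state: State, goal: FrozenSet[str]) -> Optional[List[str]]:
    """Breadth-first search for a shortest operator sequence reaching the goal."""
    queue = deque([(state, [])])
    seen = {state}
    while queue:
        current, path = queue.popleft()
        if goal <= current:
            return path
        for name, (pre, _, _) in ops.items():
            if pre <= current:
                nxt = apply_op(ops, name, current)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None


def execute_with_monitoring(
    ops: Operators,
    state: State,
    goal: FrozenSet[str],
    run_skill: Callable[[str, State], State],
    max_steps: int = 20,
) -> Tuple[bool, List[str]]:
    """Run the plan, comparing each observed state with the expected one."""
    executed: List[str] = []
    steps = plan(ops, state, goal)
    while steps and len(executed) < max_steps:
        name = steps.pop(0)
        expected = apply_op(ops, name, state)
        state = run_skill(name, state)       # observed symbolic state
        executed.append(name)
        if state != expected:                # deviation detected: replan from here
            steps = plan(ops, state, goal)
    return goal <= state, executed


ops: Operators = {
    "reach": (frozenset(), frozenset({"near_handle"}), frozenset()),
    "open": (frozenset({"near_handle", "door_closed"}),
             frozenset({"door_open"}), frozenset({"door_closed"})),
}
remaining_failures = {"open": 1}  # the first "open" attempt is disturbed


def flaky_skill(name: str, state: State) -> State:
    if remaining_failures.get(name, 0) > 0:
        remaining_failures[name] -= 1
        return state - {"near_handle"}       # gripper pushed away, door still closed
    return apply_op(ops, name, state)


success, executed = execute_with_monitoring(
    ops, frozenset({"door_closed"}), frozenset({"door_open"}), flaky_skill
)
```

In this run the first `open` attempt fails, the loop notices that the observed state differs from the expected one, and the replanned sequence reaches for the handle again before retrying.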

Figure 3: Real-world execution of SymSkill.

Experimental Results

Simulated Environment:

In RoboCasa simulation, SymSkill executed 12 single-step manipulation tasks with an 85% success rate. Without additional data, it composed these skills into multi-step plans requiring up to 6 skill recompositions and recovered from execution failures. It also outperformed baselines such as diffusion policies, especially in data-limited scenarios.

Real-World Application:

On a real Franka robot, SymSkill learned from 5 minutes of unsegmented, unlabeled play data captured with a motion capture system and webcams. The learned operators and skills were then used to execute tasks such as object sorting and storage from goal specifications alone, supporting the framework's robustness and adaptability in dynamic environments.

Figure 4: Real-world data collection pipeline. We use a motion capture system to record object interactions in the workspace.

Conclusion

The paper highlights SymSkill's contributions to advancing robotic manipulation through efficient symbol and skill co-invention. Its sample-efficient learning and real-time planning capabilities make it a compelling solution for long-horizon tasks in dynamic environments. Future work could explore extending the framework to mobile manipulation and egocentric video input, broadening the scope of applications.

SymSkill offers a foundation for real-world robotic systems that require adaptability, compositionality, and efficiency, and shows one way to integrate IL and TAMP methodologies.