- The paper’s main contribution is its explicit representation of manipulation demonstrations as atomic skill-action pairs to enable zero-shot generalization.
- It introduces a dual-library retrieval system that pairs a dynamic, task-adaptive library (visual similarity plus skill-sequence alignment) with a static, coverage-aware library for demonstration selection.
- Empirical evaluations on the AGNOSTOS benchmark and real-robot experiments show improved success rates and enhanced interpretability over conventional baselines.
Skill-Based Reasoning for Zero-Shot Cross-Task Robotic Manipulation
Motivation and Context
Cross-task generalization in robotic manipulation remains an open problem, particularly when robots must transfer learned skills to tasks featuring novel object categories, unseen goals, or new combinations of primitive actions. The prevailing paradigms, modular Vision-Language-Action (VLA) systems and end-to-end policy learning, have made significant advances in within-task generalization. However, their capacity for zero-shot adaptation to entirely novel task configurations is limited by their reliance on low-level action trajectories as in-context examples, since such trajectories lack explicit skill semantics and composability.
Framework Overview
The paper introduces Decompose and Recompose, a compositional skill reasoning framework designed to bridge this gap. The central proposition is the explicit representation of manipulation demonstrations as sequences of atomic skill-action pairs, rather than numerical action trajectories. This intermediate structure facilitates both skill composability and interpretability, enabling LLMs to perform task-level reasoning over skill compositions and execution order.
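To make the skill-action representation concrete, here is a minimal sketch of one way such pairs could be stored and rendered into a prompt. The class name `SkillActionPair`, the helper `render_demonstration`, the prompt layout, and the four-token action format are all hypothetical choices made for illustration; the summary does not specify the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SkillActionPair:
    """One atomic skill paired with the low-level action that realizes it."""
    verb: str                 # skill verb, e.g. "Grasp"
    objects: Tuple[str, ...]  # one or two object arguments
    action: Tuple[int, ...]   # discretized action tokens (layout is an assumption)

    @property
    def label(self) -> str:
        # Renders the Verb[obj] / Verb[obj1, obj2] label form.
        return f"{self.verb}[{', '.join(self.objects)}]"


def render_demonstration(task: str, pairs: List[SkillActionPair]) -> str:
    """Format one demonstration as skill-action aligned lines (hypothetical layout)."""
    lines = [f"Task: {task}"]
    for step, pair in enumerate(pairs, start=1):
        tokens = " ".join(str(t) for t in pair.action)
        lines.append(f"{step}. {pair.label} -> {tokens}")
    return "\n".join(lines)


# Toy demonstration with made-up action tokens.
demo = [
    SkillActionPair("Grasp", ("lid",), (52, 40, 31, 0)),
    SkillActionPair("Place", ("lid", "pot"), (60, 44, 28, 1)),
]
prompt = render_demonstration("put the lid on the pot", demo)
print(prompt)
```

Keeping the label and the action in a single record means every action line in the prompt carries its skill semantics with it, which is the property the framework relies on for compositional reasoning.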
The pipeline consists of four main components:
- Atomic Skill Collection: Demonstrations from seen tasks are segmented via keyframe extraction, then annotated with skill labels (e.g., Verb[obj], Verb[obj1, obj2]) using vision-LLMs. Gripper state transitions are imposed as unary constraints on skill annotation, ensuring physical consistency and semantic clarity.
- Dual-Library Demonstration Retrieval:
  - Dynamic Library: Retrieves task-adaptive demonstrations based on a fusion of visual-similarity (via DINOv3-based global features) and plan-based skill-sequence alignment (using Jaccard similarity over verb sets and bigram skill chains).
  - Static Coverage-Aware Library: Uses IDF-weighted skill tokens abstracted from object identities, with a greedy selection algorithm to supplement missing skill patterns based on coverage gaps identified in the predicted skill sequence.
- Skill-Augmented In-Context Learning: Demonstration prompts are constructed as skill-action aligned pairs, serving as explicit compositional guidance for LLMs. The LLM outputs only executable action sequences, which are compatible with robot controllers, while implicitly leveraging skill-level reasoning.
- Closed-Loop Execution: Low-level control commands are discretized for LLM tokenization and then reconstructed for execution, maintaining fidelity between symbolic skill guidance and physical manipulation.
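The dual-library retrieval step can be sketched roughly as follows. This is an illustrative approximation, not the paper's implementation: the equal weighting of verb-set and bigram Jaccard terms, the fusion weight `alpha`, the smoothed IDF formula, and all function names are assumptions, and the visual similarities are passed in as precomputed numbers rather than derived from DINOv3 features.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def bigrams(seq: Sequence[str]) -> Set[Tuple[str, str]]:
    """Set of adjacent verb pairs, i.e. the skill chain of a sequence."""
    return set(zip(seq, seq[1:]))


def plan_similarity(plan: Sequence[str], demo: Sequence[str]) -> float:
    """Mean of verb-set and bigram-chain Jaccard (equal weights assumed)."""
    return 0.5 * jaccard(set(plan), set(demo)) + 0.5 * jaccard(bigrams(plan), bigrams(demo))


def dynamic_scores(plan: Sequence[str], demos: Dict[str, Sequence[str]],
                   visual_sim: Dict[str, float], alpha: float = 0.5) -> Dict[str, float]:
    """Fuse precomputed visual similarity with plan alignment (alpha is assumed)."""
    return {name: alpha * visual_sim[name] + (1 - alpha) * plan_similarity(plan, verbs)
            for name, verbs in demos.items()}


def idf_weights(demos: Dict[str, Sequence[str]]) -> Dict[str, float]:
    """Smoothed IDF per verb; rarer skills get larger weights."""
    doc_freq = Counter(v for verbs in demos.values() for v in set(verbs))
    return {v: math.log(len(demos) / c) + 1.0 for v, c in doc_freq.items()}


def greedy_cover(plan: Sequence[str], selected: List[str],
                 demos: Dict[str, Sequence[str]], weights: Dict[str, float],
                 budget: int) -> List[str]:
    """Greedily add demos that cover the most IDF-weighted missing skills."""
    missing = set(plan) - {v for name in selected for v in demos[name]}
    candidates = [name for name in demos if name not in selected]
    chosen: List[str] = []

    def gain(name: str) -> float:
        return sum(weights.get(v, 0.0) for v in missing & set(demos[name]))

    while missing and candidates and len(chosen) < budget:
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # no remaining demo covers any missing skill
        chosen.append(best)
        candidates.remove(best)
        missing -= set(demos[best])
    return chosen


# Toy library: verb sequences with object identities already abstracted away.
demos = {
    "open_drawer": ["Grasp", "Pull", "Release"],
    "stack_block": ["Grasp", "Lift", "Place", "Release"],
    "press_button": ["Push"],
}
visual_sim = {"open_drawer": 0.9, "stack_block": 0.4, "press_button": 0.2}
plan = ["Grasp", "Pull", "Release", "Push"]  # predicted skill sequence

scores = dynamic_scores(plan, demos, visual_sim)
top = max(scores, key=scores.get)
extra = greedy_cover(plan, [top], demos, idf_weights(demos), budget=2)
print(top, extra)
```

In this toy case the dynamic library picks the demonstration that best matches the scene and the predicted plan, and the static pass then adds a demonstration only because one planned skill (`Push`) is still uncovered.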
Empirical Evaluation
AGNOSTOS Benchmark
Comprehensive evaluation on the AGNOSTOS benchmark—comprising 23 unseen tasks partitioned into two difficulty tiers—demonstrates the efficacy of the proposed method. Compared to diverse baselines spanning Foundation, Human-Video Pretrained, In-Domain Trained, and In-Context Learning approaches, the skill reasoning framework achieves:
- The highest overall success rates across both difficulty levels.
- Success rates exceeding 60% on Microwave, Seat, LampOff, and USB tasks, outperforming all other individual baselines.
- Average success rates of 32.5% (Level-1), 18.5% (Level-2), and 26.4% (overall).
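As a consistency check, the overall figure can be recovered as a task-weighted mean of the two tier averages. The split of 13 Level-1 and 10 Level-2 tasks used below is an assumption not stated in this summary; only the total of 23 tasks is given here.

```latex
\frac{13 \times 32.5 + 10 \times 18.5}{23}
  = \frac{422.5 + 185.0}{23}
  = \frac{607.5}{23}
  \approx 26.4
```

Under that assumed split, the weighted mean matches the reported overall success rate of 26.4%.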
Ablation studies further attribute performance gains to each core component:
- Task-adaptive retrieval via the dynamic library raises the average success rate from the 21.6% baseline to 23.3%.
- Adding the coverage-aware static library raises it further to 24.9%.
- Skill-action alignment in the in-context prompts yields the final 26.4%.
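The per-component gains implied by the ablation figures above can be worked out directly. The $\Delta$ symbols are notation introduced here for illustration.

```latex
\begin{aligned}
\Delta_{\text{dynamic}} &= 23.3 - 21.6 = 1.7 \\
\Delta_{\text{static}}  &= 24.9 - 23.3 = 1.6 \\
\Delta_{\text{align}}   &= 26.4 - 24.9 = 1.5 \\
\Delta_{\text{total}}   &= 26.4 - 21.6 = 4.8
\end{aligned}
```

Each component contributes a gain of similar size, between 1.5 and 1.7 percentage points, so no single component accounts for most of the 4.8-point improvement.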
An analysis of visual encoder and LLM backbone choices indicates that both matter for cross-task generalization: for extracting static scene semantics the encoders rank DINOv3 > CLIP > DINOv2, and for reasoning ability the LLM backbones rank Qwen2.5-7B > InternLM3-8B > Llama3.0-8B.
Real-World Experiments
Tests conducted on a UFACTORY xArm6 robot across five manipulation tasks confirm transferability in real-world settings. Success rates mirror simulation trends, with the framework consistently generating valid compositional skill sequences and executing precise actions.
Failure analyses identify limitations in 6-DoF pose understanding, reasoning about spatial relationships, and object affordance modeling, which remain open challenges for future research.
Theoretical and Practical Implications
The theoretical contribution lies in elevating the generalization paradigm from trajectory similarity to skill composability. By aligning atomic skill labels with low-level actions, the framework exposes causal and procedural information to LLMs, thus activating structured reasoning over manipulation sequences. Practically, the dual-library design ensures that demonstration selection is both task-relevant and skill-complete, directly addressing coverage gaps that undermine zero-shot adaptation.
The findings suggest that large-scale pretraining and in-context trajectory matching are not sufficient on their own for compositional robotic generalization. Explicit intermediate representations grounded in manipulation semantics appear to be needed to elicit robust cross-task reasoning, which points to limits of purely end-to-end learning for open-world deployment.
Future Directions
- Integration of explicit 3D spatial reasoning and object affordance modeling may further mitigate failure cases and enhance performance in highly unconstrained real-world environments.
- The framework is amenable to expansion with richer skill vocabularies, hierarchical skill decomposition, and multimodal context fusion, enabling scalability to increasingly complex manipulation scenarios.
- Development of adaptive demonstration selection mechanisms that dynamically balance relevance and coverage, avoiding prompt over-saturation, remains imperative.
Conclusion
The Decompose and Recompose framework presents a principled methodology for zero-shot cross-task robotic manipulation by introducing compositional, interpretable skill-action pairs as the driver of LLM-based skill reasoning. Experimental results support the advantages of skill-centric intermediate representations and dual-library retrieval strategies, with the highest reported success rates among the compared baselines on AGNOSTOS under cross-task distribution shifts (2605.01448).