RoboTwin-OD: Object Dataset for Robotic Manipulation
- RoboTwin-OD is a large-scale, semantically and functionally annotated object library and framework, featuring 731 objects and rich labels to support robust bimanual robotic manipulation training and simulation.
- It integrates with an automated data generation pipeline that uses MLLMs and VLMs in a simulation-in-the-loop setup, enabling scalable synthesis of expert robot manipulation data with minimal human intervention.
- Extensive domain randomization over RoboTwin-OD assets, coupled with high-fidelity object annotations, yields substantial sim-to-real transfer gains and improved policy generalization for robotic tasks.
RoboTwin-OD is the foundational object dataset and annotation library within RoboTwin 2.0, built to support robust bimanual (dual-arm) robotic manipulation. Its methodology, scale, and breadth of annotation establish a strong reference point for scalable data generation, domain randomization, and model evaluation in robot learning.
1. Scope and Structure of RoboTwin-OD
RoboTwin-OD is a large-scale, semantically and functionally annotated object library, designed to support the synthesis of expert robotic manipulation data and enhance sim-to-real transfer for dual-arm robots. The library consists of:
- 731 object instances spanning 147 categories in total, composed of:
  - 534 instances (111 categories): in-house RGB-to-3D reconstructions built with the Rodin platform, using convex decomposition and mesh merging for simulation accuracy.
  - 153 instances (27 categories): imported from Objaverse, augmenting visual and semantic diversity and serving as distractors in cluttered scenes.
  - 44 articulated instances (9 categories): sourced from SAPIEN PartNet-Mobility, enabling benchmarks involving articulated, multi-part objects.
Each object is annotated with both semantic information and manipulation-relevant labels. Semantic annotations include 15 language descriptions per object (automatically generated and human-validated), covering geometry, shape, material and texture, category, function, and seen/unseen descriptors. Manipulation-relevant annotations encompass grasp points, placement points, functional points, grasp axis directions, and intra-class similarity clusters, supporting precise and adaptable robot-object interactions.
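To make the annotation format concrete, the sketch below shows one plausible shape for a per-object record; the field names and types are illustrative assumptions, not RoboTwin-OD's published schema.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names and types are assumptions,
# not RoboTwin-OD's published schema.
@dataclass
class ObjectAnnotation:
    object_id: str                        # unique asset identifier
    category: str                         # one of the 147 categories
    descriptions: list[str]               # 15 language descriptions (geometry, material, function, ...)
    grasp_points: list[list[float]]       # candidate grasp positions in the object frame
    grasp_axes: list[list[float]]         # preferred approach/closing direction per grasp point
    placement_points: list[list[float]]   # stable placement locations
    functional_points: list[list[float]]  # task-relevant points (e.g., a spout or handle tip)
    similarity_cluster: int               # intra-class similarity group index

def first_grasp(ann: ObjectAnnotation) -> tuple[list[float], list[float]]:
    """Return one candidate grasp (position, approach axis) from the record."""
    return ann.grasp_points[0], ann.grasp_axes[0]
```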
A curated texture library of 12,000+ images (generated via Stable Diffusion v2 with LLM-guided prompts) enables visual randomization for backgrounds and surfaces.
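A minimal sketch of how such a texture library could be generated with an off-the-shelf Stable Diffusion v2 checkpoint through the Hugging Face diffusers library; the checkpoint ID and hand-written prompts below are assumptions for illustration, not the authors' exact setup (which uses LLM-guided prompts).

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the actual generation setup may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# In RoboTwin 2.0 the prompts are LLM-guided; these are hand-written stand-ins.
prompts = [
    "seamless wood grain tabletop texture, top-down view, photorealistic",
    "brushed stainless steel surface, uniform lighting, tileable",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"texture_{i:05d}.png")
```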
2. Automated Expert Data Synthesis and Closed-Loop Refinement
RoboTwin-OD's object assets are tightly integrated into an automated data generation pipeline for large-scale, simulation-based training data:
- Multimodal LLMs (MLLMs): Serve as code-generating agents (e.g., DeepSeek-V3) that output robot manipulation programs from natural language instructions.
- Vision-LLMs (VLMs) (e.g., moonshot-v1): Provide multimodal feedback by "watching" the robot's execution in simulation frame-by-frame, diagnosing failures such as incorrect grasps or placements.
- Simulation-in-the-Loop operates as follows:
  - A human or an LLM specifies a task in natural language.
  - The LLM, referencing the object-centric API and manipulation annotations in RoboTwin-OD, generates executable task-level code.
  - In simulation, the VLM observes execution outcomes. If a problem is detected (syntax error, failed grasp, misplacement), it provides targeted feedback.
  - The LLM iteratively revises the code based on the feedback until a minimum success rate (≥50%) is reached or a maximum number of revisions is exceeded.
This architecture allows minimal human intervention, producing robust, expert-aligned demonstration data across a spectrum of robot embodiments and manipulation skill APIs.
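In pseudocode, the loop might look as follows; `generate_task_code`, `run_in_simulation`, and `vlm_diagnose` are hypothetical placeholders for the MLLM, simulator, and VLM calls rather than RoboTwin's actual API.

```python
MAX_REVISIONS = 5        # assumed cap; the pipeline uses a maximum revision budget
MIN_SUCCESS_RATE = 0.5   # acceptance threshold from the text (>= 50%)

def synthesize_expert_program(task_description: str,
                              generate_task_code,   # MLLM code generator (e.g., DeepSeek-V3)
                              run_in_simulation,    # executes code, returns (success_rate, rollouts)
                              vlm_diagnose):        # VLM feedback on rollout frames
    """Hypothetical sketch of the simulation-in-the-loop refinement cycle."""
    feedback = None
    for _ in range(MAX_REVISIONS):
        code = generate_task_code(task_description, feedback)   # (re)generate task-level code
        success_rate, rollouts = run_in_simulation(code)         # roll out in simulation
        if success_rate >= MIN_SUCCESS_RATE:
            return code                                          # accept the program
        feedback = vlm_diagnose(rollouts)                        # e.g., "grasp missed the handle"
    return None                                                  # revision budget exhausted
```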
3. Structured Domain Randomization for Robustness
To improve generalization and sim-to-real transfer, RoboTwin 2.0 employs structured domain randomization over RoboTwin-OD assets along five axes:
- Clutter: Distractor objects from RoboTwin-OD are sampled using intra-class similarity groups to ensure both visual and functional realism, with physical plausibility enforced via collision checks.
- Lighting: Light color, type, intensity, and placement are randomized per episode to expose policies to varying illumination conditions.
- Background: Surfaces and backgrounds are varied using the extensive texture library, suppressing environmental overfitting.
- Tabletop Height: Workspace height is sampled from a physical range, modeling differences across platforms or setups.
- Language Instructions: Tasks are instantiated using >60 templates per task and 15 object descriptions each, generating compositional linguistic diversity for each training episode.
This combinatorial approach yields a large and diverse synthetic data corpus, critical for robust policy learning.
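A hedged sketch of how a per-episode configuration along these five axes might be sampled; the ranges, field names, and distributions below are illustrative assumptions, not RoboTwin 2.0's actual parameters.

```python
import random

# Illustrative ranges only; the real pipeline's parameters are not reproduced here.
def sample_episode_config(distractor_pool, texture_library,
                          instruction_templates, object_descriptions):
    return {
        # Clutter: distractors drawn from intra-class similarity groups,
        # kept only if they pass a collision check (not shown here).
        "distractors": random.sample(distractor_pool, k=random.randint(2, 6)),
        # Lighting: color, type, intensity, and placement vary per episode.
        "light": {
            "color": [random.uniform(0.7, 1.0) for _ in range(3)],
            "type": random.choice(["point", "directional", "area"]),
            "intensity": random.uniform(0.3, 1.5),
            "position": [random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(1, 2)],
        },
        # Background: a texture sampled from the 12k+ image library.
        "background_texture": random.choice(texture_library),
        # Tabletop height sampled from a plausible physical range (meters).
        "table_height": random.uniform(0.70, 0.85),
        # Language: one of >60 templates filled with one of 15 object descriptions.
        "instruction": random.choice(instruction_templates).format(
            object=random.choice(object_descriptions)),
    }
```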
4. Quantitative Performance and Policy Generalization
Extensive evaluation demonstrates the effectiveness of RoboTwin-OD and its downstream pipeline:
- Code Generation Success: Closed-loop code synthesis with multimodal feedback increases the average program success rate (ASR) by 10.9% over prior versions, while also requiring fewer revision iterations and less token usage.
- Vision-Language-Action Model Transfer:
- Models fine-tuned on RoboTwin 2.0 data (with domain randomization) deliver a 367% relative improvement on unseen real-world tasks, with success rising from 9.0% for a baseline trained only on real examples to 42.0%.
- Zero-shot models (trained exclusively on synthetic data) achieve a 228% relative gain.
- The benefit is pronounced for low-DoF arms, indicating that rich, object-centric annotations enable models to generalize effective grasping across hardware variants.
- Sim-to-Real Gap: RoboTwin-OD’s annotation fidelity and randomization enable policies to maintain high performance under hard (randomized, cluttered) conditions, where non-pretrained models fail.
| Aspect | RoboTwin-OD & 2.0 Capabilities |
|---|---|
| Objects / categories | 731 objects / 147 categories |
| Semantic labels | 15 auto-generated, human-verified descriptions per object |
| Manipulation labels | Grasp points, placement points, grasp axes, functional points |
| Randomization axes | Clutter, lighting, background, table height, language |
| Sim-to-real gain | 367% relative VLA improvement (synthetic + real), 228% zero-shot |
| Releases | Data generator, benchmark, dataset, code (open source) |
5. Applications and Benchmarking Utility
RoboTwin-OD directly supports:
- Training and benchmarking vision-language-action and imitation learning policies for dual-arm and general robotic manipulation across varying environments and robot types.
- Flexible deployment across diverse hardware (Franka, UR5, Aloha-AgileX, ARX-X5, Piper) and gripper types, facilitated by detailed, object-centric affordance annotations and a unified interaction API (see the usage sketch after this list).
- Unified evaluation protocols for closed-loop code generation, policy learning, and cross-domain generalization.
- Rapid research and development by providing public access to the dataset, generation tools, and benchmark suite. This accelerates progress in sim-to-real transfer and policy robustness by removing the bottleneck of manual object digitization and annotation.
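As an illustration of how object-centric grasp annotations can decouple task specification from a particular end-effector, the sketch below converts an annotated grasp point and approach axis into gripper targets; the gripper names, standoff values, and helper function are invented for illustration and are not part of RoboTwin's API.

```python
import numpy as np

# Hypothetical per-gripper standoff distances along the approach axis (meters);
# values are invented for illustration only.
GRIPPER_STANDOFF = {"franka_hand": 0.10, "ur5_2f85": 0.12, "aloha_gripper": 0.08}

def grasp_targets_for_gripper(grasp_point, grasp_axis, gripper: str):
    """Turn an annotated grasp point + approach axis into gripper target positions.

    grasp_point, grasp_axis: 3-vectors in the object (or world) frame.
    Orientation handling is omitted; a full implementation would also build a
    rotation aligning the gripper's approach direction with grasp_axis.
    """
    p = np.asarray(grasp_point, dtype=float)
    a = np.asarray(grasp_axis, dtype=float)
    a = a / np.linalg.norm(a)              # normalize the approach axis
    pre_grasp = p - GRIPPER_STANDOFF[gripper] * a  # back off along the approach axis
    return pre_grasp, p                    # pre-grasp and grasp positions

# Usage: the same annotation serves different hardware by swapping the standoff.
pre, goal = grasp_targets_for_gripper([0.42, 0.03, 0.81], [0.0, 0.0, -1.0], "franka_hand")
```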
6. Significance and Future Directions
RoboTwin-OD’s scale and structure—coupled with automated, simulation-in-the-loop data generation and strong domain randomization—set a new reference for data-driven robotic learning, particularly for bimanual and complex coordinated tasks. The open sourcing of the dataset and code democratizes access to benchmark-grade assets and methodologies.
Potential future directions indicated in the data include:
- Extending annotation to deformable or more complex articulated objects.
- Further scaling of the expert data synthesis pipeline and randomization dimensions.
- Benchmarking and training more generalist or foundation vision-language-action models.
- Broader cross-embodiment and cross-domain experiments leveraging the API and annotation style.
7. Summary
RoboTwin-OD, as realized in RoboTwin 2.0, is a foundational object diversity and annotation resource for scalable and robust robotic learning. Its combination of high-fidelity 3D reconstructions, comprehensive semantic and manipulation-relevant labeling, and integration into automated, domain-randomized simulation pipelines produces demonstrable gains in code synthesis, sim-to-real transfer, and policy generalization for dual-arm robotic manipulation. Its open release positions it as base infrastructure for scalable research, benchmarking, and deployment in advanced robotic manipulation domains.