RoboTwin-OD: Object Dataset for Robotic Manipulation
- RoboTwin-OD is a large-scale, semantically and functionally annotated object library and framework, featuring 731 objects and rich labels to support robust bimanual robotic manipulation training and simulation.
- It integrates with an automated data generation pipeline that uses MLLMs and VLMs in a simulation-in-the-loop setup, enabling scalable synthesis of expert robot manipulation data with minimal human intervention.
- Extensive domain randomization over RoboTwin-OD assets, coupled with high-fidelity object annotations, yields substantial sim-to-real transfer gains and improved policy generalization for robotic tasks.
RoboTwin-OD is the foundational object dataset and annotation library within RoboTwin 2.0, built to support robust bimanual (dual-arm) robotic manipulation. Its methodology, scale, and breadth of annotation establish a strong reference point for scalable data generation, domain randomization, and model evaluation in robot learning.
1. Scope and Structure of RoboTwin-OD
RoboTwin-OD is a large-scale, semantically and functionally annotated object library, designed to support the synthesis of expert robotic manipulation data and enhance sim-to-real transfer for dual-arm robots. The library consists of:
- 731 object instances spanning 147 categories in total, composed of:
  - 534 instances (111 categories): in-house RGB-to-3D reconstructions built with the Rodin platform, using convex decomposition and mesh merging for simulation accuracy.
  - 153 instances (27 categories): imported from Objaverse, augmenting visual and semantic diversity and serving as distractors in cluttered scenes.
  - 44 articulated instances (9 categories): sourced from SAPIEN PartNet-Mobility, enabling benchmarks involving articulated, multi-part objects.
Each object is annotated with both semantic information and manipulation-relevant labels. Semantic annotations include 15 language descriptions per object (automatically generated and human-validated), covering geometry, shape, material and texture, category, function, and seen/unseen descriptors. Manipulation-relevant annotations encompass grasp points, placement points, functional points, grasp axis directions, and intra-class similarity clusters, supporting precise and adaptable robot-object interactions.
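To make the annotation format concrete, the sketch below shows one plausible shape for a per-object record; the field names and types are illustrative assumptions, not RoboTwin-OD's published schema.

```python
from dataclasses import dataclass

# Illustrative sketch only: field names and types are assumptions,
# not RoboTwin-OD's published schema.
@dataclass
class ObjectAnnotation:
    object_id: str                        # unique asset identifier
    category: str                         # one of the 147 categories
    descriptions: list[str]               # 15 language descriptions (geometry, material, function, ...)
    grasp_points: list[list[float]]       # candidate grasp positions in the object frame
    grasp_axes: list[list[float]]         # preferred approach/closing direction per grasp point
    placement_points: list[list[float]]   # stable placement locations
    functional_points: list[list[float]]  # task-relevant points (e.g., a spout or handle tip)
    similarity_cluster: int               # intra-class similarity group index

def first_grasp(ann: ObjectAnnotation) -> tuple[list[float], list[float]]:
    """Return one candidate grasp (position, approach axis) from the record."""
    return ann.grasp_points[0], ann.grasp_axes[0]
```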
A curated texture library of 12,000+ images (generated via Stable Diffusion v2 with LLM-guided prompts) enables visual randomization for backgrounds and surfaces.
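A minimal sketch of how such a texture library could be generated with an off-the-shelf Stable Diffusion v2 checkpoint through the Hugging Face diffusers library; the checkpoint ID and hand-written prompts below are assumptions for illustration, not the authors' exact setup (which uses LLM-guided prompts).

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint; the actual generation setup may differ.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# In RoboTwin 2.0 the prompts are LLM-guided; these are hand-written stand-ins.
prompts = [
    "seamless wood grain tabletop texture, top-down view, photorealistic",
    "brushed stainless steel surface, uniform lighting, tileable",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"texture_{i:05d}.png")
```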
2. Automated Expert Data Synthesis and Closed-Loop Refinement
RoboTwin-OD's object assets are tightly integrated into an automated data generation pipeline for large-scale, simulation-based training data:
- Multimodal LLMs (MLLMs): Serve as code-generating agents (e.g., DeepSeek-V3) that output robot manipulation programs from natural language instructions.
- Vision-LLMs (VLMs) (e.g., moonshot-v1): Provide multimodal feedback by "watching" the robot's execution in simulation frame-by-frame, diagnosing failures such as incorrect grasps or placements.
- Simulation-in-the-Loop operates as follows:
  - A human or an LLM specifies a task in natural language.
  - The LLM, referencing the object-centric API and manipulation annotations in RoboTwin-OD, generates executable task-level code.
  - In simulation, the VLM observes execution outcomes. If a problem is detected (syntax error, failed grasp, misplacement), it provides targeted feedback.
  - The LLM iteratively revises the code based on the feedback until a minimum success rate (≥50%) is reached or a maximum number of revisions is exceeded.
This architecture allows minimal human intervention, producing robust, expert-aligned demonstration data across a spectrum of robot embodiments and manipulation skill APIs.
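In pseudocode, the loop might look as follows; `generate_task_code`, `run_in_simulation`, and `vlm_diagnose` are hypothetical placeholders for the MLLM, simulator, and VLM calls rather than RoboTwin's actual API.

```python
MAX_REVISIONS = 5        # assumed cap; the pipeline uses a maximum revision budget
MIN_SUCCESS_RATE = 0.5   # acceptance threshold from the text (>= 50%)

def synthesize_expert_program(task_description: str,
                              generate_task_code,   # MLLM code generator (e.g., DeepSeek-V3)
                              run_in_simulation,    # executes code, returns (success_rate, rollouts)
                              vlm_diagnose):        # VLM feedback on rollout frames
    """Hypothetical sketch of the simulation-in-the-loop refinement cycle."""
    feedback = None
    for _ in range(MAX_REVISIONS):
        code = generate_task_code(task_description, feedback)   # (re)generate task-level code
        success_rate, rollouts = run_in_simulation(code)         # roll out in simulation
        if success_rate >= MIN_SUCCESS_RATE:
            return code                                          # accept the program
        feedback = vlm_diagnose(rollouts)                        # e.g., "grasp missed the handle"
    return None                                                  # revision budget exhausted
```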
3. Structured Domain Randomization for Robustness
To improve generalization and sim-to-real transfer, RoboTwin 2.0 employs structured domain randomization over RoboTwin-OD assets along five axes:
- Clutter: Distractor objects from RoboTwin-OD are sampled using intra-class similarity groups to ensure both visual and functional realism, with physical plausibility enforced via collision checks.
- Lighting: Light color, type, intensity, and placement are randomized per episode to expose policies to varying illumination conditions.
- Background: Surfaces and backgrounds are varied using the extensive texture library, suppressing environmental overfitting.
- Tabletop Height: Workspace height is sampled from a physical range, modeling differences across platforms or setups.
- Language Instructions: Tasks are instantiated using >60 templates per task and 15 object descriptions each, generating compositional linguistic diversity for each training episode.
This combinatorial approach yields a large and diverse synthetic data corpus, critical for robust policy learning.
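A hedged sketch of how a per-episode configuration along these five axes might be sampled; the ranges, field names, and distributions below are illustrative assumptions, not RoboTwin 2.0's actual parameters.

```python
import random

# Illustrative ranges only; the real pipeline's parameters are not reproduced here.
def sample_episode_config(distractor_pool, texture_library,
                          instruction_templates, object_descriptions):
    return {
        # Clutter: distractors drawn from intra-class similarity groups,
        # kept only if they pass a collision check (not shown here).
        "distractors": random.sample(distractor_pool, k=random.randint(2, 6)),
        # Lighting: color, type, intensity, and placement vary per episode.
        "light": {
            "color": [random.uniform(0.7, 1.0) for _ in range(3)],
            "type": random.choice(["point", "directional", "area"]),
            "intensity": random.uniform(0.3, 1.5),
            "position": [random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(1, 2)],
        },
        # Background: a texture sampled from the 12k+ image library.
        "background_texture": random.choice(texture_library),
        # Tabletop height sampled from a plausible physical range (meters).
        "table_height": random.uniform(0.70, 0.85),
        # Language: one of >60 templates filled with one of 15 object descriptions.
        "instruction": random.choice(instruction_templates).format(
            object=random.choice(object_descriptions)),
    }
```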
4. Quantitative Performance and Policy Generalization
Extensive evaluation demonstrates the effectiveness of RoboTwin-OD and its downstream pipeline:
- Code Generation Success: Closed-loop code synthesis with multimodal feedback increases the average program success rate (ASR) by 10.9% over prior versions, while also requiring fewer revision iterations and less token usage.
- Vision-Language-Action Model Transfer:
- Models fine-tuned on RoboTwin 2.0 data (with domain randomization) deliver a 367% relative improvement on unseen real-world tasks, with success rising from 9.0% for a baseline trained only on real examples to 42.0%.
- Zero-shot models (trained exclusively on synthetic data) achieve a 228% relative gain.
- The benefit is pronounced for low-DoF arms, indicating that rich, object-centric annotations enable models to generalize effective grasping across hardware variants.
- Sim-to-Real Gap: RoboTwin-OD’s annotation fidelity and randomization enable policies to maintain high performance under hard (randomized, cluttered) conditions, where non-pretrained models fail.
| Aspect | RoboTwin-OD & 2.0 Capabilities |
|---|---|
| Objects / categories | 731 objects / 147 categories |
| Semantic labels | 15 auto-generated, human-verified descriptions per object |
| Manipulation labels | Grasp points, placement points, grasp axes, functional points |
| Randomization axes | Clutter, lighting, background, table height, language |
| Sim-to-real gain | 367% relative VLA improvement (synthetic + real), 228% zero-shot |
| Releases | Data generator, benchmark, dataset, code (open source) |
5. Applications and Benchmarking Utility
RoboTwin-OD directly supports:
- Training and benchmarking vision-language-action and imitation learning policies for dual-arm and general robotic manipulation across varying environments and robot types.
- Flexible deployment across diverse hardware (Franka, UR5, Aloha-AgileX, ARX-X5, Piper) and gripper types, facilitated by detailed, object-centric affordance annotations and a unified interaction API (see the usage sketch after this list).
- Unified evaluation protocols for closed-loop code generation, policy learning, and cross-domain generalization.
- Rapid research and development by providing public access to the dataset, generation tools, and benchmark suite. This accelerates progress in sim-to-real transfer and policy robustness by removing the bottleneck of manual object digitization and annotation.
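As an illustration of how object-centric grasp annotations can decouple task specification from a particular end-effector, the sketch below converts an annotated grasp point and approach axis into gripper targets; the gripper names, standoff values, and helper function are invented for illustration and are not part of RoboTwin's API.

```python
import numpy as np

# Hypothetical per-gripper standoff distances along the approach axis (meters);
# values are invented for illustration only.
GRIPPER_STANDOFF = {"franka_hand": 0.10, "ur5_2f85": 0.12, "aloha_gripper": 0.08}

def grasp_targets_for_gripper(grasp_point, grasp_axis, gripper: str):
    """Turn an annotated grasp point + approach axis into gripper target positions.

    grasp_point, grasp_axis: 3-vectors in the object (or world) frame.
    Orientation handling is omitted; a full implementation would also build a
    rotation aligning the gripper's approach direction with grasp_axis.
    """
    p = np.asarray(grasp_point, dtype=float)
    a = np.asarray(grasp_axis, dtype=float)
    a = a / np.linalg.norm(a)              # normalize the approach axis
    pre_grasp = p - GRIPPER_STANDOFF[gripper] * a  # back off along the approach axis
    return pre_grasp, p                    # pre-grasp and grasp positions

# Usage: the same annotation serves different hardware by swapping the standoff.
pre, goal = grasp_targets_for_gripper([0.42, 0.03, 0.81], [0.0, 0.0, -1.0], "franka_hand")
```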
6. Significance and Future Directions
RoboTwin-OD’s scale and structure—coupled with automated, simulation-in-the-loop data generation and strong domain randomization—set a new reference for data-driven robotic learning, particularly for bimanual and complex coordinated tasks. The open sourcing of the dataset and code democratizes access to benchmark-grade assets and methodologies.
Potential future directions indicated in the data include:
- Extending annotation to deformable or more complex articulated objects.
- Further scaling of the expert data synthesis pipeline and randomization dimensions.
- Benchmarking and training more generalist or foundation vision-language-action models.
- Broader cross-embodiment and cross-domain experiments leveraging the API and annotation style.
7. Summary
RoboTwin-OD, as realized in RoboTwin 2.0, is a foundational object diversity and annotation resource for scalable and robust robotic learning. Its combination of high-fidelity 3D reconstructions, comprehensive semantic and manipulation-relevant labeling, and integration into automated, domain-randomized simulation pipelines produces demonstrable gains in code synthesis, sim-to-real transfer, and policy generalization for dual-arm robotic manipulation. Its open release positions it as base infrastructure for scalable research, benchmarking, and deployment in advanced robotic manipulation domains.