RoboTwin-OD: Object Dataset for Robotic Manipulation

Updated 1 July 2025
  • RoboTwin-OD is a large-scale, semantically and functionally annotated object library and framework, featuring 731 objects and rich labels to support robust bimanual robotic manipulation training and simulation.
  • It integrates with an automated, simulation-in-the-loop data generation pipeline that uses multimodal LLMs (MLLMs) and vision-language models (VLMs), enabling scalable synthesis of expert robot manipulation data with minimal human intervention.
  • Extensive domain randomization built on RoboTwin-OD assets, coupled with its high-fidelity data, leads to substantial sim-to-real transfer gains and improved policy generalization for robotic tasks.

The name RoboTwin-OD appears in several digital-twin and object-diversity efforts, but it most prominently denotes the foundational object dataset and annotation library within RoboTwin 2.0 for robust bimanual (dual-arm) robotic manipulation. Its methodology, scale, and breadth of annotation set a new standard for scalable data generation, domain randomization, and model evaluation in robot learning.

1. Scope and Structure of RoboTwin-OD

RoboTwin-OD is a large-scale, semantically and functionally annotated object library, designed to support the synthesis of expert robotic manipulation data and enhance sim-to-real transfer for dual-arm robots. The library consists of:

  • 731 object instances spanning 147 categories
    • 534 instances (111 categories): In-house RGB-to-3D reconstructions (using the Rodin platform with convex decomposition and mesh merging for simulation accuracy).
    • 153 objects (27 categories): Imported from Objaverse, augmenting visual and semantic diversity and serving as distractors in cluttered scenes.
    • 44 articulated instances (9 categories): Sourced from SAPIEN PartNet-Mobility, enabling benchmarks involving articulated/multi-part objects.

Each object is annotated with both semantic information and manipulation-relevant labels. Semantic annotations include 15 language descriptions per object (automated and human-validated), covering geometry, shape, material texture, category, function, and seen/unseen descriptors. Manipulation-relevant annotations encompass grasp points, placement points, functional points, grasp axis directions, and intra-class similarity clusters, supporting precise and adaptable robot-object interactions.

A curated texture library of 12,000+ images (generated via Stable Diffusion v2 with LLM-guided prompts) enables visual randomization for backgrounds and surfaces.
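
To make the annotation scheme concrete, the sketch below shows what a single per-object record could look like in Python. It is a hypothetical illustration: the field names (grasp_points, grasp_axes, functional_points, similarity_cluster) and all values are invented here to mirror the labels described above, and are not the released schema.

```python
from dataclasses import dataclass

@dataclass
class ObjectAnnotation:
    """Hypothetical per-object record mirroring the labels described above."""
    object_id: str
    category: str                                         # one of the 147 categories
    source: str                                           # "rgb-to-3d", "objaverse", or "partnet-mobility"
    descriptions: list[str]                               # the 15 language descriptions
    grasp_points: list[tuple[float, float, float]]        # candidate grasp locations (object frame)
    grasp_axes: list[tuple[float, float, float]]          # preferred grasp-axis directions
    placement_points: list[tuple[float, float, float]]    # stable placement locations
    functional_points: list[tuple[float, float, float]]   # task-relevant points (e.g. a spout or handle tip)
    similarity_cluster: int                               # intra-class group used for distractor sampling

# Illustrative instance (all values invented for the example)
mug = ObjectAnnotation(
    object_id="mug_032",
    category="mug",
    source="rgb-to-3d",
    descriptions=["a short ceramic mug with a curved handle"],  # 14 more in practice
    grasp_points=[(0.04, 0.00, 0.05)],
    grasp_axes=[(0.0, 0.0, 1.0)],
    placement_points=[(0.0, 0.0, 0.0)],
    functional_points=[(0.05, 0.00, 0.06)],
    similarity_cluster=3,
)
```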

2. Automated Expert Data Synthesis and Closed-Loop Refinement

RoboTwin-OD's object assets are tightly integrated into an automated data generation pipeline for large-scale, simulation-based training data:

  • Multimodal LLMs (MLLMs): Serve as code-generating agents (e.g., DeepSeek-V3) that output robot manipulation programs from natural language instructions.
  • Vision-language models (VLMs) (e.g., moonshot-v1): Provide multimodal feedback by "watching" the robot's execution in simulation frame-by-frame, diagnosing failures such as incorrect grasps or placements.
  • Simulation-in-the-Loop operates as follows:

    1. A human or LLM specifies a task in natural language.
    2. The LLM, referencing the object-centric API and manipulation annotations in RoboTwin-OD, generates executable task-level code.
    3. In simulation, the VLM observes execution outcomes. If a problem is detected (syntax error, failed grasp, misplacement), it provides targeted feedback.
    4. The LLM iteratively revises the code based on feedback until a minimum success rate (≥50%) is reached or a maximum number of revisions is exceeded.

This architecture allows minimal human intervention, producing robust, expert-aligned demonstration data across a spectrum of robot embodiments and manipulation skill APIs.
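
The control flow of this loop can be sketched in Python. The snippet below is schematic only: generate_task_code, run_in_simulation, and vlm_diagnose are illustrative stubs standing in for the MLLM, the simulator, and the VLM feedback interface, and are not part of any released RoboTwin API.

```python
import random
from dataclasses import dataclass

# Illustrative stubs for the MLLM, simulator, and VLM interfaces.
@dataclass
class SimResult:
    success_rate: float   # fraction of rollouts that completed the task
    frames: list          # rendered frames of the executed rollouts

def generate_task_code(task: str, previous_code: str = "", feedback: str = "") -> str:
    """Stand-in for the code-generating MLLM agent (e.g. DeepSeek-V3)."""
    return f"# program for: {task}\n# revision notes: {feedback}\n"

def run_in_simulation(code: str, episodes: int = 20) -> SimResult:
    """Stand-in for executing the generated program against RoboTwin-OD assets."""
    return SimResult(success_rate=random.random(), frames=[])

def vlm_diagnose(frames: list) -> str:
    """Stand-in for the VLM that inspects rollouts frame-by-frame."""
    return "grasp missed the handle; approach along the annotated grasp axis"

def synthesize_expert_task(task: str, max_revisions: int = 5, target: float = 0.5):
    """Closed-loop refinement: generate, execute, diagnose, revise."""
    code = generate_task_code(task)                        # step 2: MLLM writes task-level code
    result = run_in_simulation(code)                       # step 3: execute in simulation
    for _ in range(max_revisions):
        if result.success_rate >= target:                  # accept once >= 50% of rollouts succeed
            break
        feedback = vlm_diagnose(result.frames)             # step 3: VLM flags failures
        code = generate_task_code(task, code, feedback)    # step 4: MLLM revises the program
        result = run_in_simulation(code)
    return code, result

code, result = synthesize_expert_task("place the mug on the coaster")
```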

3. Structured Domain Randomization for Robustness

To improve generalization and sim-to-real transfer, RoboTwin 2.0 employs structured domain randomization over RoboTwin-OD assets along five axes:

  1. Clutter: Distractor objects from RoboTwin-OD are sampled using intra-class similarity groups to ensure both visual and functional realism, with physical plausibility enforced via collision checks.
  2. Lighting: Light color, type, intensity, and placement are randomized per episode to expose policies to varying illumination conditions.
  3. Background: Surfaces and backgrounds are varied using the extensive texture library, suppressing environmental overfitting.
  4. Tabletop Height: Workspace height is sampled from a physical range, modeling differences across platforms or setups.
  5. Language Instructions: Tasks are instantiated from more than 60 instruction templates per task combined with the 15 language descriptions per object, generating compositional linguistic diversity for every training episode.

This combinatorial approach yields a large and diverse synthetic data corpus, critical for robust policy learning.
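
As an illustration of how such per-episode randomization might be drawn, the sketch below samples a configuration over the five axes. Parameter ranges, texture names, and templates are invented for the example and do not reflect the release's actual values.

```python
import random

# Illustrative pools; the real texture library holds 12,000+ images and
# each task has more than 60 instruction templates.
TEXTURES = ["wood_0412.png", "marble_0027.png", "fabric_0109.png"]
LIGHT_TYPES = ["point", "area", "directional"]
TEMPLATES = ["put the {obj} on the coaster", "move the {obj} next to the plate"]

def sample_episode_config(object_descriptions, distractor_pool):
    """Draw one episode's randomization over the five axes described above."""
    return {
        # 1. Clutter: distractors drawn from intra-class similarity groups
        #    (collision checks would be applied after placement)
        "distractors": random.sample(distractor_pool, k=random.randint(2, 5)),
        # 2. Lighting: colour, type, intensity vary per episode
        "light": {
            "type": random.choice(LIGHT_TYPES),
            "color": [random.uniform(0.8, 1.0) for _ in range(3)],
            "intensity": random.uniform(0.5, 2.0),
        },
        # 3. Background / surface texture from the texture library
        "table_texture": random.choice(TEXTURES),
        # 4. Tabletop height in metres (range is illustrative)
        "table_height": random.uniform(0.70, 0.85),
        # 5. Language instruction composed from a template and an object description
        "instruction": random.choice(TEMPLATES).format(
            obj=random.choice(object_descriptions)),
    }

config = sample_episode_config(
    ["short ceramic mug", "blue mug with a curved handle"],
    distractor_pool=["bowl", "plate", "spoon", "fork", "bottle"],
)
```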

4. Quantitative Performance and Policy Generalization

Extensive evaluation demonstrates the effectiveness of RoboTwin-OD and its downstream pipeline:

  • Code Generation Success: Closed-loop code synthesis with multimodal feedback raises the average success rate (ASR) of generated task programs by 10.9% over the prior version, while also requiring fewer refinement iterations and less token usage.
  • Vision-Language-Action Model Transfer:
    • Models fine-tuned on RoboTwin 2.0 data (with domain randomization) achieve a 367% relative improvement on unseen real-world tasks (42.0% vs. 9.0% success) over a baseline trained only on real examples (see the check after the table below).
    • Zero-shot models (trained exclusively on synthetic data) achieve a 228% relative gain.
    • The benefit is pronounced for low-DoF arms, indicating that rich, object-centric annotations enable models to generalize effective grasping across hardware variants.
  • Sim-to-Real Gap: RoboTwin-OD’s annotation fidelity and randomization enable policies to maintain high performance under hard (randomized, cluttered) conditions, where non-pretrained models fail.

Aspect              | RoboTwin-OD & 2.0 Capabilities
--------------------|--------------------------------------------------------------
Objects/Categories  | 731 objects / 147 categories
Semantic labels     | 15 auto-generated, human-verified descriptions per object
Manipulation labels | Grasp points, placement points, grasp axes, functional points
Randomization       | Clutter, lighting, background, table height, language
Sim-to-real gain    | 367% VLA improvement (synthetic + real), 228% zero-shot
Releases            | Data generator, benchmark, dataset, code (open source)
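
For reference, the relative-gain percentages quoted above follow directly from the reported success rates; a quick check of the fine-tuned case, assuming the standard relative-improvement formula:

```python
# Check of the 367% figure quoted above
baseline = 9.0      # success rate (%) of the baseline trained only on real examples
finetuned = 42.0    # success rate (%) after fine-tuning on RoboTwin 2.0 data
relative_gain = (finetuned - baseline) / baseline
print(f"{relative_gain:.0%}")   # -> 367%
```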

5. Applications and Benchmarking Utility

RoboTwin-OD directly supports:

  • Training and benchmarking vision-language-action and imitation learning policies for dual-arm and general robotic manipulation across varying environments and robot types.
  • Flexible deployment across diverse hardware (Franka, UR5, Aloha-AgileX, ARX-X5, Piper) and gripper types, facilitated by detailed, object-centric affordance annotations and a unified interaction API.
  • Unified evaluation protocols for closed-loop code generation, policy learning, and cross-domain generalization.
  • Rapid research and development by providing public access to the dataset, generation tools, and benchmark suite. This accelerates progress in sim-to-real transfer and policy robustness by removing the bottleneck of manual object digitization and annotation.

6. Significance and Future Directions

RoboTwin-OD’s scale and structure—coupled with automated, simulation-in-the-loop data generation and strong domain randomization—set a new reference for data-driven robotic learning, particularly for bimanual and complex coordinated tasks. The open sourcing of the dataset and code democratizes access to benchmark-grade assets and methodologies.

Potential future directions indicated by this work include:

  • Extending annotation to deformable or more complex articulated objects.
  • Further scaling of the expert data synthesis pipeline and randomization dimensions.
  • Benchmarking and training more generalist or foundation vision-language-action models.
  • Broader cross-embodiment and cross-domain experiments leveraging the API and annotation style.

7. Summary

RoboTwin-OD, as realized in RoboTwin 2.0, is a foundational object-diversity and annotation resource for scalable, robust robotic learning. Its combination of high-fidelity 3D reconstructions, comprehensive semantic and manipulation-relevant labeling, and integration into automated simulation pipelines with domain randomization produces demonstrable gains in code synthesis, sim-to-real transfer, and policy generalization for dual-arm manipulation. Its open release positions it to serve as base infrastructure for scalable research, benchmarking, and deployment in advanced robotic manipulation.