RoboTwin Benchmark

Updated 23 March 2026

RoboTwin Benchmark is a dual-arm, real-to-sim manipulation benchmark that leverages digital twin generation and LLM-driven task synthesis for efficient, generalizable robotic skill acquisition.
It integrates advanced simulation-in-the-loop testing with structured domain randomization and heterogeneous hardware evaluations to ensure robust performance.
Key performance metrics include task success rates, completion times, and sample efficiency, enabling standardized and reproducible comparisons across diverse manipulation tasks.

The RoboTwin Benchmark is a dual-arm, real-to-sim manipulation benchmark and data-generation suite designed to facilitate robust, generalizable, and sample-efficient bimanual robotic skill acquisition. It provides a unified, scalable platform spanning synthetic data generation, closed-loop task-code synthesis, full-stack simulation, and real-world robotic evaluation. RoboTwin’s defining features are its use of generative digital twins for asset creation, LLM-driven expert-code generation for trajectory synthesis, structured domain randomization, heterogeneous dual-arm task suites, explicit success metrics, and validation in both simulation and hardware settings (Mu et al., 17 Apr 2025, Chen et al., 22 Jun 2025, Mu et al., 2024, Chen et al., 29 Jun 2025, Fang et al., 19 Dec 2025, Zhao et al., 24 Jun 2025).

1. Core Components: Digital Twin Generation and Automated Task Synthesis

RoboTwin centers on translating a single RGB image of an object into a textured, physically plausible 3D mesh using a 3D generative foundation model (typically Rodin, fine-tuned on ShapeNet-like collections), producing digital twins annotated with grasp, functional, and placement affordances (Mu et al., 17 Apr 2025, Mu et al., 2024, Chen et al., 22 Jun 2025). Each mesh is further processed by a spatial annotation module to extract:

Functional points and axes (e.g., hammer strike points)
Contact/grasp points and approach vectors
Lateral axis, used to complete a right-handed coordinate frame

These annotations seed an LLM-driven code-generation pipeline that decomposes natural language task descriptions into subtasks, infers spatial constraints, and synthesizes Pythonic dual-arm motion code using task-specific APIs. The core algorithm iteratively refines programs via simulation-in-the-loop testing and vision-language model (VLM) diagnostics until a threshold success rate is reached in simulated rollouts (Chen et al., 22 Jun 2025).
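The refinement loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RoboTwin implementation: `refine_program`, `RolloutReport`, the default threshold, and the iteration cap are all hypothetical, and the LLM and the simulator-plus-VLM evaluator are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RolloutReport:
    success_rate: float  # fraction of simulated rollouts that succeeded
    diagnosis: str       # failure description (stand-in for VLM feedback)


def refine_program(
    task_description: str,
    generate: Callable[[str, str], str],       # hypothetical LLM call
    evaluate: Callable[[str], RolloutReport],  # hypothetical sim + VLM check
    threshold: float = 0.5,                    # assumed value, not from the source
    max_iters: int = 10,                       # assumed value, not from the source
) -> Tuple[Optional[str], float]:
    """Regenerate a task program until simulated success reaches the threshold."""
    feedback = ""
    best_program, best_rate = None, -1.0
    for _ in range(max_iters):
        program = generate(task_description, feedback)
        report = evaluate(program)
        if report.success_rate > best_rate:
            best_program, best_rate = program, report.success_rate
        if report.success_rate >= threshold:
            break
        feedback = report.diagnosis  # feed the diagnosis into the next attempt
    return best_program, best_rate


# Toy stand-ins so the loop can be run without an LLM or a simulator.
def toy_generate(task: str, feedback: str) -> str:
    return feedback + "+fix" if feedback else "draft"


def toy_evaluate(program: str) -> RolloutReport:
    rate = min(1.0, 0.25 * (program.count("+fix") + 1))
    return RolloutReport(success_rate=rate, diagnosis=program)


program, rate = refine_program("stack blocks", toy_generate, toy_evaluate, threshold=0.75)
```

Keeping the best program seen so far means the loop still returns something usable if the iteration cap is hit before the threshold is met.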

Tasks addressed range from rigid-body manipulation (stack, place, handover, tool use) to more complex scenarios involving articulated or multi-object coordination (Mu et al., 17 Apr 2025, Chen et al., 22 Jun 2025, Mu et al., 2024).

2. Dataset Structure, Domain Randomization, and Embodiment Diversity

RoboTwin’s asset library, RoboTwin-OD, spans 731 unique meshes across 147 semantic categories, integrated with affordance labels, 15 language descriptions per asset, and convex-decomposed collision meshes (Chen et al., 22 Jun 2025). Task programs are generated for 50 (RoboTwin 2.0) or 15 (RoboTwin 1.0) dual-arm tasks, designed to capture a diverse manipulation landscape.

Structured domain randomization is systematically applied across five axes:

Clutter: number and placement of distractor objects (N∼Uniform{0,1,…,10})
Background texture: random selection from 12,000 textures or images
Lighting: color temperature in [2000 K,8000 K], intensity in [0.5,2.0], and type/position randomization
Table height: sampled from [0.7 m,0.8 m]
Language instruction: selection of up to 900 distinct instruction templates per task

This high-dimensional randomization space drives policy robustness to sim-to-real shifts (Chen et al., 22 Jun 2025, Chen et al., 29 Jun 2025, Zhao et al., 24 Jun 2025).
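A minimal sketch of sampling one scene from the five randomization axes listed above. The ranges come from the text; everything else is an assumption: `SceneConfig` and `sample_scene` are hypothetical names, continuous ranges are sampled uniformly (the source gives only the intervals), and light type/position randomization is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneConfig:
    num_distractors: int        # clutter
    background_id: int          # index into the texture/image pool
    color_temperature_k: float  # lighting color temperature
    light_intensity: float      # lighting intensity
    table_height_m: float       # table height
    instruction_id: int         # index of the language instruction template


def sample_scene(
    rng: random.Random,
    num_backgrounds: int = 12_000,
    num_instructions: int = 900,  # "up to 900" per task; may be smaller
) -> SceneConfig:
    """Draw one randomized scene configuration."""
    return SceneConfig(
        num_distractors=rng.randint(0, 10),  # inclusive on both ends
        background_id=rng.randrange(num_backgrounds),
        color_temperature_k=rng.uniform(2000.0, 8000.0),  # uniform is assumed
        light_intensity=rng.uniform(0.5, 2.0),            # uniform is assumed
        table_height_m=rng.uniform(0.7, 0.8),             # uniform is assumed
        instruction_id=rng.randrange(num_instructions),
    )


rng = random.Random(0)  # fixed seed so a batch of scenes is reproducible
scenes = [sample_scene(rng) for _ in range(5)]
```

Passing an explicit `random.Random` instance rather than using the module-level generator keeps each batch of randomized scenes reproducible from its seed.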

Task programs are instantiated on five robot platforms (e.g., Franka Panda 7-DOF arm, UR5, ARX-X5, Piper, Aloha-AgileX), spanning different kinematic and hardware configurations (Chen et al., 22 Jun 2025, Fang et al., 19 Dec 2025, Chen et al., 29 Jun 2025). Over 100,000 dual-arm expert trajectories are provided, captured at high frequency (e.g., 20 Hz), and each trajectory records synchronized visual sensory streams, proprioceptive state, force/torque readings, and action call data (Chen et al., 22 Jun 2025).
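To make the per-trajectory contents concrete, the sketch below shows one way such a record could be laid out in memory. It is illustrative only: the class and field names, the placeholder vector sizes, and the identifiers in the example are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    timestamp_s: float
    rgb_frame_ids: List[str]      # one frame id per camera (hypothetical)
    joint_positions: List[float]  # proprioceptive state for both arms
    wrench: List[float]           # force/torque reading
    action: List[float]           # commanded action at this step


@dataclass
class Trajectory:
    task: str
    embodiment: str
    sample_rate_hz: float = 20.0  # the example frequency quoted in the text
    steps: List[Step] = field(default_factory=list)

    def duration_s(self) -> float:
        """Length of the recording implied by step count and sample rate."""
        return len(self.steps) / self.sample_rate_hz

    def is_uniformly_sampled(self, tol: float = 1e-6) -> bool:
        """Check that consecutive timestamps are one sample period apart."""
        dt = 1.0 / self.sample_rate_hz
        pairs = zip(self.steps, self.steps[1:])
        return all(abs((b.timestamp_s - a.timestamp_s) - dt) <= tol for a, b in pairs)


# Placeholder values: 40 steps at 20 Hz, with made-up vector sizes.
traj = Trajectory(task="stack_blocks", embodiment="aloha_agilex")
for i in range(40):
    traj.steps.append(Step(i / 20.0, ["head_0"], [0.0] * 14, [0.0] * 12, [0.0] * 14))
```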

3. Benchmark Evaluation Protocols and Metrics

The RoboTwin benchmark formalizes evaluation with statistically robust metrics:

Success rate per task:

$S = \frac{N_\text{success}}{N_\text{total}}$

where $N_\text{success}$ is the number of episodes in which the final object pose satisfies task-specific affordance and goal constraints (e.g., position error ≤5 cm and orientation error ≤15°), and $N_\text{total}$ is the total number of evaluation episodes.

Average completion time:

$\bar T = \frac{1}{N_\text{success}} \sum_{i:\,\text{success}} t_i$

Sample efficiency: the minimum number of demonstrations $n$ needed to reach $S_n \geq 0.8$.
Sim-to-real gap: $S_\text{sim} - S_\text{real}$, often reported with 95% confidence intervals via bootstrapping (Mu et al., 17 Apr 2025, Chen et al., 29 Jun 2025, Mu et al., 2024).
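The four metrics above can be computed from per-episode outcomes roughly as follows. This is a sketch under assumptions rather than the benchmark's evaluation code: the function names are hypothetical, the learning curve is assumed to be a mapping from demonstration count $n$ to $S_n$, and the confidence interval uses a plain percentile bootstrap with an assumed 2,000 resamples.

```python
import random
from typing import Mapping, Optional, Sequence, Tuple


def success_rate(outcomes: Sequence[bool]) -> float:
    """S = N_success / N_total."""
    return sum(outcomes) / len(outcomes)


def mean_completion_time(outcomes: Sequence[bool], times: Sequence[float]) -> float:
    """Average time over successful episodes only."""
    successful = [t for ok, t in zip(outcomes, times) if ok]
    if not successful:
        raise ValueError("no successful episodes to average over")
    return sum(successful) / len(successful)


def sample_efficiency(curve: Mapping[int, float], target: float = 0.8) -> Optional[int]:
    """Smallest n with S_n >= target, or None if the target is never reached."""
    qualifying = [n for n, s in curve.items() if s >= target]
    return min(qualifying) if qualifying else None


def bootstrap_gap_ci(
    sim: Sequence[bool],
    real: Sequence[bool],
    n_boot: int = 2000,  # assumed resample count
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Sim-to-real gap with a percentile-bootstrap (1 - alpha) interval."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        s = rng.choices(sim, k=len(sim))    # resample with replacement
        r = rng.choices(real, k=len(real))
        gaps.append(success_rate(s) - success_rate(r))
    gaps.sort()
    lo = gaps[int(round(alpha / 2 * n_boot))]
    hi = gaps[int(round((1 - alpha / 2) * n_boot)) - 1]
    return success_rate(sim) - success_rate(real), lo, hi


outcomes = [True, True, False, True]
times = [10.0, 12.0, 99.0, 14.0]
print(success_rate(outcomes), mean_completion_time(outcomes, times))
```

Sim and real episodes are resampled independently because they come from separate evaluation runs; a paired bootstrap would only be appropriate if each simulated episode had a matched real counterpart.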

For challenge and competition settings (e.g., CVPR 2025 MEIS RoboTwin Challenge), multi-round scoring schemes are deployed, with per-task weighting, time-penalty normalization, and composite mixture-of-metrics for performance benchmarking (Chen et al., 29 Jun 2025).

4. Notable Benchmarks, Leaderboards, and Comparative Performance

AnchorDP3 established the state of the art in the simulation tracks of the RoboTwin benchmark, achieving an average task success rate of 98.7% under extreme domain randomization and exceeding prior diffusion-based dual-arm approaches by several percentage points. The method’s key advantages arose from simulator-supervised semantic segmentation, task-conditioned encoders, affordance-anchored keypose action diffusion (a policy output of 10–30 waypoints), and full state supervision (joint angles + end-effector pose), yielding fast convergence and geometric consistency (Zhao et al., 24 Jun 2025, Chen et al., 29 Jun 2025).

A representative leaderboard:

Method	Sim R1 Avg S (%)	Sim R2 Avg S (%)	Real Avg Score (out of 100)
AnchorDP3	98.7	96.5	–
SEM	94.3	89.7	26.4
Next Best	92.1	84.5	22.1

(Chen et al., 29 Jun 2025)

On heterogeneous single- and dual-arm tasks, the 3D Diffusion Policy (DP3) and point-cloud-based imitation learners trained on RoboTwin synthetic data consistently surpassed those trained on real-only datasets. For example, after pre-training on 300 simulated demonstrations and fine-tuning on 20 real ones, success rose from ≈1% to 72% on single-arm tasks and from ≈20% to 62% on dual-arm tasks (Mu et al., 17 Apr 2025, Mu et al., 2024).

Large-scale Vision-Language-Action (VLA) models, when fine-tuned on RoboTwin 2.0 synthetic data, showed a 367% relative gain in zero-shot generalization on unseen real-world tasks compared to base models (from 9.0% to 42.0%), and a 228% gain for zero-shot, synthetic-only policies (Chen et al., 22 Jun 2025, Fang et al., 19 Dec 2025). In a direct head-to-head comparison on seven core RoboTwin 2.0 tasks, joint motion-image diffusion models pushed average “easy mode” success from 45.1% to 58.0%, with the largest gains in long-horizon, dual-arm coordination (Fang et al., 19 Dec 2025).
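As a check on the arithmetic, the 367% figure matches a relative gain computed from the two quoted success rates, assuming the gain is measured against the base model's 9.0%:

$\text{relative gain} = \frac{42.0 - 9.0}{9.0} \times 100\% \approx 366.7\% \approx 367\%$

The same pair of numbers is an absolute gain of $42.0 - 9.0 = 33.0$ percentage points. For the seven-task comparison, the move from 45.1% to 58.0% is $58.0 - 45.1 = 12.9$ percentage points in absolute terms, or $12.9 / 45.1 \approx 28.6\%$ in relative terms. Relative and absolute gains differ sharply when the baseline is low, so it helps to note which one a reported figure uses.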

5. Competition Structure, Task Diversity, and Real-to-Sim Pipeline

The RoboTwin Dual-Arm Collaboration Challenge comprises three rounds:

Simulation Round 1: six tasks (5 rigid, 1 tactile), single-task policies, no domain randomization at test.
Simulation Round 2: six rigid tasks, unified models, language-conditioned task-switching, strong domain randomization (backgrounds, clutter, height ±3 cm, lighting).
Real-World Round: five tasks, single policy across all, evaluation under controlled variations (camera pose, background, table height), limited demonstration budget (Chen et al., 29 Jun 2025).

Task types include rigid-body manipulation (e.g., cap pen, place phone stand, stack blocks), deformable-object manipulation (towel folding using FEA meshes), and tactile-based classification (bin sorting using GelSight fingertip signals).

The pipeline for real-to-sim transfer involves digital twin generation, automatic spatial annotation, closed-loop LLM program synthesis, domain-randomized demonstration collection, and hardware deployment, with camera and affordance calibration but no need for human demonstration or real-world fine-tuning for sim-only policies (Mu et al., 17 Apr 2025, Zhao et al., 24 Jun 2025).

6. Limitations, Design Choices, and Future Directions

Reported limitations include dependence on the quality of 3D digital twin reconstruction (failure modes for thin/transparent objects), LLM code generation hallucinations or incorrect collision-avoidance in dense scenes, and the current benchmark’s focus on rigid objects—deformable and articulated object support is limited (Mu et al., 2024, Mu et al., 17 Apr 2025).

Dual-arm coordination remains bottlenecked for imitation learning policies, with hard task variants yielding <20% success even in simulation in earlier iterations (Mu et al., 2024). Future aims include differentiable contact/surface physics, deeper LLM planning–motion primitive integration, autonomous code safety verification, and benchmarks for collaborative multi-object or nonprehensile tasks (Mu et al., 17 Apr 2025, Chen et al., 29 Jun 2025).

Lessons from prior benchmarks such as DeepClaw suggest that modular mechanical cell design, standardized calibration, modular software with strict ROS interface/format standards, and comprehensive, function-wise plus task-wise metric tracking are critical for reproducibility and cross-laboratory comparability (Wan et al., 2020).

7. Impact, Usage, and Community Integration

RoboTwin lowers the barrier to scalable, high-diversity, real-aligned data synthesis for dual-arm manipulation and enables standardized, like-for-like evaluation of multimodal, vision-language-conditioned, and diffusion-based manipulation policies. Its methodological innovations have translated to practical gains in policy robustness, sim-to-real transfer, and sample efficiency, as evidenced by published leaderboards (Chen et al., 29 Jun 2025, Zhao et al., 24 Jun 2025, Chen et al., 22 Jun 2025, Fang et al., 19 Dec 2025, Mu et al., 2024). All major datasets, expert code generation tooling, and evaluation infrastructure have been made publicly available, supporting reproducible progress in the field.

The benchmark continues to evolve toward more complex physics, higher-level language instruction following, deformable object handling, and robust uncertainty-aware perception, aligning with community-driven research directions highlighted in interdisciplinary competitions and challenge reports (Chen et al., 29 Jun 2025, Mu et al., 17 Apr 2025, Chen et al., 22 Jun 2025).