RoboTwin 2.0 Benchmark for Dual-Arm Robotics

Updated 4 May 2026

The paper introduces RoboTwin 2.0, a high-fidelity simulation benchmark integrating large-scale realistic data synthesis with structured domain randomization for dual-arm robotic manipulation.
It provides a unified evaluation platform with standardized protocols, diverse sensory feedback, and dynamic scene generation to ensure robust sim-to-real transfer.
The benchmark supports dual-arm task development across five robots using key metrics like success rate, mean episode reward, and sample efficiency to assess performance.

The RoboTwin 2.0 benchmark is a simulation-based data generator and evaluation suite for robust, generalizable bimanual robotic manipulation. It combines large-scale realistic data synthesis, structured domain randomization, and unified evaluation protocols so that dual-arm policies can be trained and tested for generalization to complex real-world scenarios. It supports systematic development and assessment of both single-task and multi-task bimanual policies, and serves as a research platform and as the foundation of a competition on dual-arm manipulation in simulation and on real hardware (Chen et al., 22 Jun 2025, Chen et al., 29 Jun 2025).

1. Simulation Architecture and Data Generation

The RoboTwin 2.0 platform is structured around a closed-loop simulation framework powered by a MuJoCo-compatible physics engine, with extensions for soft-body finite element analysis (cloth and cable), customized fluid solvers, and high-fidelity contact models with Coulomb friction and soft contacts. Scene generation is performed via collision-aware placement and randomization of 3D objects, while multi-view high-resolution RGB-D rendering and simulated tactile/force-torque sensing are synchronized at 25 Hz via GPU-accelerated pipelines.

Expert data is synthesized by pairing a code-generation agent (a multimodal large language model, MLLM) with a vision-language observer in simulation. The agent, conditioned on natural-language task instructions, robot API libraries, and few-shot code exemplars, autonomously generates task-level Python programs that are executed multiple times in the simulator. Symbolic execution logs and rendered visual feedback are then processed to diagnose and localize execution failures, and the diagnosis drives the next round of code refinement. The loop ends when the program exceeds a target task success rate (more than 50% over 10 trials) or a cap on repair iterations is reached. Compared with RoboTwin 1.0, this pipeline lowers the average number of repair iterations per task from 2.42 to 1.76 and token usage from 1465 to 840 (Chen et al., 22 Jun 2025).
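The generate, execute, diagnose, repair cycle described above can be sketched as a small control loop. This is a minimal illustration, not the benchmark's implementation: `refine_program`, `RefineResult`, the three callables, and the default cap of 5 repairs are hypothetical names and values chosen for the example; only the 10-trial, >50% stopping rule comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RefineResult:
    program: str
    success_rate: float
    repair_iterations: int


def refine_program(
    generate: Callable[[str, str], str],        # (instruction, feedback) -> program
    run_trial: Callable[[str], bool],           # program -> did this rollout succeed?
    diagnose: Callable[[str, List[bool]], str],  # (program, outcomes) -> feedback text
    instruction: str,
    trials: int = 10,            # from the text: 10 trials per evaluation
    target_rate: float = 0.5,    # from the text: stop once success rate > 50%
    max_repairs: int = 5,        # hypothetical cap; the source does not state a value
) -> RefineResult:
    """Regenerate the task program until it is good enough or the cap is hit."""
    program = generate(instruction, "")
    repairs = 0
    while True:
        outcomes = [run_trial(program) for _ in range(trials)]
        rate = sum(outcomes) / trials
        if rate > target_rate or repairs >= max_repairs:
            return RefineResult(program, rate, repairs)
        # Turn the failed rollouts into feedback and ask for a repaired program.
        feedback = diagnose(program, outcomes)
        program = generate(instruction, feedback)
        repairs += 1


# Toy stand-ins for the MLLM agent, simulator, and observer (all hypothetical).
def _generate(instruction: str, feedback: str) -> str:
    return "v1" if feedback else "v0"


def _run_trial(program: str) -> bool:
    return program == "v1"


def _diagnose(program: str, outcomes: List[bool]) -> str:
    return f"{outcomes.count(False)} of {len(outcomes)} trials failed"


result = refine_program(_generate, _run_trial, _diagnose, "stack the blocks")
print(result)
```

With these stand-ins the first program fails every trial, one repair is requested, and the repaired program passes, so the loop stops after a single repair iteration.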

2. Object Library and Scene Diversity

A prerequisite for RoboTwin 2.0’s diverse data generation is the RoboTwin-OD object library: 731 unique instances across 147 semantic categories (spanning rigid, articulated, and deformable objects), with explicit semantic, linguistic (15 instance-level descriptions), and manipulation-relevant affordance annotations (placement points, grasping keypoints, axis directions, convex-collision decompositions). Of these, 534 objects were captured in-house via RGB-to-3D pipelines, while 153 and 44 were sourced from Objaverse and SAPIEN PartNet-Mobility, respectively.

To promote domain diversity and enhance sim-to-real transfer, RoboTwin 2.0 implements structured domain randomization across five axes:

Axis	Implementation Highlights	Range/Details
Clutter	Up to N distractor objects, collision-aware, semantically distinct	Sampled per trajectory
Background Textures	Tabletop/background randomization; 12,000 curated textures	Stable Diffusion–synthesized, human-filtered
Lighting	Type, color temperature (2700–6500 K), intensity, and 3D position	Drawn from uniform/physically plausible distributions
Table Height	Uniform perturbation of tabletop	h_min, h_max
Language	Randomized natural-language instructions; 50 templates × 15 descs	ℓ_j(object_descriptions, arm) with j ∼ Uniform

Objects for clutter are selected by visual/semantic grouping, ensuring distractor diversity without excessive ambiguity (Chen et al., 22 Jun 2025).
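As a rough illustration of per-trajectory sampling over the five axes in the table, the sketch below draws one scene configuration. The texture count, color-temperature range, and template/description counts are taken from the table; `MAX_DISTRACTORS`, `TABLE_HEIGHT_M`, `LIGHT_TYPES`, and the `SceneSample` structure are assumptions made for the example, since the source gives only N and h_min, h_max symbolically.

```python
import random
from dataclasses import dataclass

# Values stated in the table above.
NUM_TEXTURES = 12_000
COLOR_TEMP_K = (2700.0, 6500.0)
NUM_TEMPLATES = 50
NUM_DESCRIPTIONS = 15

# Hypothetical placeholders for quantities the source leaves symbolic.
MAX_DISTRACTORS = 5                  # "N" in the table
TABLE_HEIGHT_M = (0.70, 0.80)        # "h_min, h_max" in the table
LIGHT_TYPES = ("point", "directional", "spot")


@dataclass
class SceneSample:
    num_distractors: int
    texture_id: int
    light_type: str
    color_temp_k: float
    table_height_m: float
    template_id: int
    description_id: int


def sample_scene(rng: random.Random) -> SceneSample:
    """Draw one independent sample per randomization axis."""
    return SceneSample(
        num_distractors=rng.randint(0, MAX_DISTRACTORS),
        texture_id=rng.randrange(NUM_TEXTURES),
        light_type=rng.choice(LIGHT_TYPES),
        color_temp_k=rng.uniform(*COLOR_TEMP_K),
        table_height_m=rng.uniform(*TABLE_HEIGHT_M),
        template_id=rng.randrange(NUM_TEMPLATES),
        description_id=rng.randrange(NUM_DESCRIPTIONS),
    )


scene = sample_scene(random.Random(0))
print(scene)
```

Passing an explicit `random.Random` instance keeps each trajectory's randomization reproducible from its seed.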

3. Benchmark Task Suite and Embodiment Diversity

The benchmark instantiates 50 dual-arm tasks covering pick-and-place, stacking, containment, tool use, object transport, articulated manipulation, and tactile/fluid/deformable handling. Each task is defined by a tuple of goal configurations, ordered subgoals (e.g., grasp, lift, move, release), and a success metric:

$S_\text{task} = \frac{\text{number of successful episodes}}{\text{total episodes}}$

Embodiments span five dual-arm robots (Aloha-AgileX, ARX-X5, Piper, Franka, and UR5), with each arm modeled as 7-DOF (±180° shoulder/elbow, ±135° wrist, max velocity 2.0 rad/s, torque ±80 Nm) and fitted with a two-finger parallel-jaw gripper (20 N/finger). Sensors include joint encoders (0.01° resolution), wrist F/T, and simulated tactile sensing. In the generated dataset, each (task, robot) pair includes 100 clean and 400 strongly domain-randomized expert trajectories, yielding 100,000+ labeled dual-arm episodes (Chen et al., 22 Jun 2025, Chen et al., 29 Jun 2025).
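As a quick consistency check on the dataset size, and assuming every one of the 50 tasks is instantiated on all five embodiments, the per-pair counts give:

$N_\text{episodes} = 50 \times 5 \times (100 + 400) = 125{,}000$

This upper bound agrees with the stated figure of 100,000+ episodes; the actual total would be lower if some (task, robot) pairs are not generated.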

4. Evaluation Methodology and Competition Protocols

Evaluation is standardized via:

Success Rate (SR): $SR = (1/N) \sum_{i=1}^{N} \mathbb{1}\{\text{episode}_i \text{ success}\}$
Mean Episode Reward ($\overline{R}$): Aggregated per-task
Sample Efficiency (SE): Steps to reach threshold SR
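The three metrics can be computed from logged episode outcomes along the following lines. This is a sketch under assumptions: the function names are illustrative, and sample efficiency is interpreted here as the first logged step count at which SR meets a chosen threshold, which the source does not spell out.

```python
from typing import Optional, Sequence, Tuple


def success_rate(successes: Sequence[bool]) -> float:
    """SR = (1/N) * sum of per-episode success indicators."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def mean_episode_reward(rewards: Sequence[float]) -> float:
    """Average return over the episodes of one task."""
    if not rewards:
        raise ValueError("need at least one episode")
    return sum(rewards) / len(rewards)


def sample_efficiency(
    curve: Sequence[Tuple[int, float]], threshold: float
) -> Optional[int]:
    """First step count whose SR reaches the threshold, or None if never reached.

    `curve` is a list of (steps, SR) checkpoints in increasing step order.
    """
    for steps, sr in curve:
        if sr >= threshold:
            return steps
    return None


# Illustrative values only, not benchmark results.
print(success_rate([True, False, True, True]))
print(mean_episode_reward([1.0, 0.0, 0.5]))
print(sample_efficiency([(10_000, 0.2), (20_000, 0.6), (30_000, 0.9)], 0.5))
```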

The Dual-Arm Collaboration Challenge (CVPR 2025, MEIS Workshop) operationalizes the benchmark in a three-stage protocol: two simulation rounds (with domain randomization, language-conditioned tasks, unseen seeds) and a real-robot round (AgileX COBOT-Magic, 15–20 trials per task, language withheld in training). Each stage applies strong domain randomization, especially in table height ($\pm 3$ cm), lighting ($[0.5, 1.5] \times$ base), object mass ($\pm 10\%$), and camera jitter ($\pm 2$ cm, $\pm 2^\circ$). Simulation loop time is 12 ms/step at 25 Hz (Chen et al., 29 Jun 2025).
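To make the challenge perturbation ranges concrete, the sketch below applies them to a nominal scene. The ranges are those quoted above; the `NominalScene` structure, the nominal values, the uniform sampling, and the independent per-axis camera jitter are assumptions for illustration, since the source does not specify the distributions.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NominalScene:
    table_height_m: float
    light_intensity: float
    object_mass_kg: float
    camera_xyz_m: Tuple[float, float, float]
    camera_rpy_deg: Tuple[float, float, float]


def perturb(scene: NominalScene, rng: random.Random) -> NominalScene:
    """Return a copy of the scene with challenge-style randomization applied."""
    return NominalScene(
        # Table height: additive offset of up to 3 cm either way.
        table_height_m=scene.table_height_m + rng.uniform(-0.03, 0.03),
        # Lighting: scale the base intensity by a factor in [0.5, 1.5].
        light_intensity=scene.light_intensity * rng.uniform(0.5, 1.5),
        # Object mass: scale by up to 10% either way.
        object_mass_kg=scene.object_mass_kg * rng.uniform(0.9, 1.1),
        # Camera jitter: up to 2 cm per position axis, 2 degrees per rotation axis.
        camera_xyz_m=tuple(c + rng.uniform(-0.02, 0.02) for c in scene.camera_xyz_m),
        camera_rpy_deg=tuple(a + rng.uniform(-2.0, 2.0) for a in scene.camera_rpy_deg),
    )


# Hypothetical nominal values, chosen only to exercise the function.
nominal = NominalScene(0.75, 1.0, 0.2, (0.0, -0.5, 1.2), (0.0, 45.0, 0.0))
print(perturb(nominal, random.Random(0)))
```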

Baseline policies include generic diffusion, RT-1/RT-2-style VLA transformers, and ManiSkill 2 PPO (joint-velocity). Policy evaluation is unified across simulation and real-world contexts.

5. Policy Learning, Baselines, and Top Solutions

Top-performing solutions from the challenge include AnchorDP3 (simulation) and SEM (real-world):

AnchorDP3: Utilizes sparse keypose action representation (10–30 keyposes/task), lightweight task encoders (~0.28M parameters) with a shared diffusion decoder, full-sequence supervision, and masks out perceptual noise via simulator ground-truth segmentation. Input modalities include RGB-D, masks, and joint states. Training is staged (1M LSMQ-D, then 50K SSHQ-D), optimized using AdamW ($\mathrm{lr}=2{\times}10^{-4}$, 20 diffusion steps). Attains per-task SR >98%, with an overall SR of 99.7±0.2% in simulation round 2.
SEM (Spatial Enhancer Model): Lifts multi-view 2D observations to 3D point embeddings, encodes robot state with graph attention, and decodes actions via a diffusion policy conditioned on fused semantic/geometric features. Training involves simulation pretraining and SSHQ-D real fine-tuning. Inputs span multi-view RGB-D, F/T, and joint encoders. Achieves 29–72% SR on challenging real-world tasks, substantially exceeding the average team (overall 29.3% vs. 9.69%).

Task	Baseline SR (%)	AnchorDP3 SR (%)	SEM SR (%) (real)
Blocks Ranking RGB	12.3	100.0	—
Place Dual Shoes	18.7	98.0	—
Stack Plates	—	—	72.0
Pour Water	—	—	30.0

Statistical significance (paired t-test) is observed: AnchorDP3 vs. baseline $p<0.001$, SEM vs. average $p<0.005$ (Chen et al., 29 Jun 2025).
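For readers unfamiliar with the test, a paired t-test compares two methods on matched samples (presumably per-task success rates here, though the source does not say what is paired). The sketch below computes only the t statistic and degrees of freedom using the standard library; the input numbers are made-up toy values, not benchmark results, and the function name is illustrative.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d holds the pairwise differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Toy data for illustration only.
t_stat, dof = paired_t_statistic([3.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0])
print(t_stat, dof)
```

The p-value is then read from a Student t distribution with the returned degrees of freedom, for example via a statistics library.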

6. Impact, Limitations, and Future Directions

RoboTwin 2.0 establishes a scalable foundation for research in robust bimanual manipulation by providing the community with:

A large-scale, domain-diverse dual-arm dataset (>100,000 labeled trajectories, 50 tasks, 5 robots)
Strongly randomized simulation protocols and standardized evaluation metrics aligned for sim-to-real transfer
Open-source code, pre-trained policies, and reproducible baselines

Empirical results demonstrate substantial improvement in automated code generation (MMFB: 63.9% → 71.3% ASR), generalization (RDT: 14.8% → 25.4%, Pi0: 21.0% → 29.8%, VLA model: 9.0% → 42.0% on the hardest real task), and sample efficiency.

Notable unsolved challenges include fine-grained contact dynamics (“Open Laptop,” “Shake Bottle”), precision articulated manipulation, generalization to dexterous hands or mobile bases, and progress-aware evaluation for partially completed tasks. Ongoing research directions target multi-object/long-horizon tasks, reinforcement-based refinement, richer deformable and fluid simulation, instruction diversity beyond fixed templates, and on-hardware closed-loop policy improvement (Chen et al., 22 Jun 2025, Chen et al., 29 Jun 2025).

7. Usage Guidelines and Reproducibility

RoboTwin 2.0 is publicly available, with installation via conda and modular simulation builds. Researchers can load policies, instantiate tasks with domain randomization, and evaluate agents using a standardized Python API:

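# Import the environment, policy, and trainer classes
from roboTwin2 import TaskEnv, AnchorDP3, Trainer
# Instantiate a task with domain randomization and four camera views
env = TaskEnv('stack_blocks_three', domain_randomization=True, camera_views=4)
# Build the policy from its task-specific configuration file
policy = AnchorDP3(config='configs/anchordp3/stack_blocks_three.yaml')
# Train for up to 500,000 steps, evaluating every 10,000 steps
trainer = Trainer(env, policy, max_steps=500_000, eval_interval=10_000)
trainer.train()
# Evaluate over 100 episodes and report the success rate
scores = trainer.evaluate(num_episodes=100)
print('Success rate:', scores['success_rate'])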

Pre-trained models and challenge checkpoints are available for download. Extensive documentation, task specifications, and dataset details are maintained at the RoboTwin Benchmark Challenge website (Chen et al., 29 Jun 2025).

Key references:

(Chen et al., 22 Jun 2025) RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
(Chen et al., 29 Jun 2025) Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop
