
DexMimicGen: Scalable Data for Bimanual Robotics

Updated 12 January 2026
  • DexMimicGen is a system that transforms a modest set of human teleoperation demos into nearly 21,000 physically-valid synthetic trajectories for bimanual dexterous manipulation.
  • The framework’s modular pipeline—including demonstration pre-processing, motion retargeting, synchronized execution, and rigorous simulation validation—significantly improves task success rates.
  • Empirical results demonstrate over 20-fold performance gains and robust real-to-sim-to-real transfer, establishing DexMimicGen as a key tool for high-DOF visuomotor policy learning.

DexMimicGen is a large-scale automated data generation system designed to address the data bottleneck in imitation learning for bimanual dexterous robot manipulation. The framework amplifies a modest set of human teleoperation demonstrations into tens of thousands of physically valid manipulation trajectories, thereby enabling high-performing visuomotor policy training for complex, coordinated tasks with multi-armed, multi-fingered embodiments. DexMimicGen incorporates a modular pipeline for demonstration pre-processing, motion retargeting, synchronized execution, and policy learning. The system achieves significant improvements in task success rates over manually collected datasets and includes a robust real-to-sim-to-real deployment pipeline for safe real-world transfer on humanoid hardware (Jiang et al., 2024).

1. System Architecture and Pipeline

The DexMimicGen architecture is structured as a multi-stage pipeline enabling scalable generation of synthetic robot manipulation trajectories from limited human demonstration data. The pipeline components and workflow are as follows:

  • Data Acquisition: Collection of a small set (approximately 5–10 per task) of high-quality human teleoperated demonstrations in a simulation environment, covering diverse task variants.
  • Demonstration Pre-processing: Each collected demonstration τ is segmented into object-centric subtasks {τ_i}, each aligned to a specific subgoal S_i(o_i), and further decomposed into per-arm subtasks for bimanual sequences.
  • Motion Retargeting & Trajectory Generation: For each new synthetic trial, the system samples a randomized scene configuration, selects corresponding demonstration segments, and applies either:
    • An SE(3) transform T_new = T^{o'}_W (T^o_W)^{-1} T^C_W mapping subtask end-effector poses into the new scene (for loosely coordinated subtasks).
    • Replay of original trajectories (for highly coordinated subtasks such as hand-offs), maintaining kinematic and temporal feasibility.
  • Asynchronous and Synchronized Execution: Parallel subtasks are managed via action queues per arm, while coordination requires queue alignment, and sequential constraints enforce inter-arm subtask ordering.
  • Validation & Trajectory Filtering: Rollouts are executed open-loop in MuJoCo or digital twins; only demonstrations satisfying collision and task-specific success predicates are retained.
  • Policy Learning: Aggregated demonstrations are used to train behavioral cloning (BC) or diffusion policy models on RGB-D observations.
  • Real-to-Sim-to-Real Option: For real robot deployment, a digital twin is built, real teleoperation data is replayed in simulation, further synthetic data is generated and validated in sim, and only successful synthetic rollouts are transferred to the real robot.

This hierarchical structure ensures that synthetic data closely matches physical and task constraints imposed by real-world manipulation.
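The object-centric SE(3) retargeting step above can be sketched directly from the transform T_new = T^{o'}_W (T^o_W)^{-1} T^C_W: the end-effector pose relative to the manipulated object is held fixed while the object moves to its new sampled pose. A minimal sketch (variable names are illustrative, not the system's actual API):

```python
import numpy as np

def retarget_pose(T_W_o_new, T_W_o, T_W_C):
    """Map a demonstrated end-effector pose into a new scene.

    All arguments are 4x4 homogeneous transforms in the world frame W:
      T_W_o_new -- object pose in the new (randomized) scene
      T_W_o     -- object pose in the source demonstration
      T_W_C     -- end-effector (controller) pose in the source demonstration
    Returns T_new = T_W_o' (T_W_o)^-1 T_W_C, which preserves the
    end-effector pose *relative to the object*.
    """
    return T_W_o_new @ np.linalg.inv(T_W_o) @ T_W_C
```

Applying this per segment is what makes a handful of source demos reusable across arbitrarily randomized scenes, provided the subtask is only loosely coupled between arms.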

2. Simulation Environments and Task Taxonomy

DexMimicGen incorporates an extensive suite of bimanual dexterous manipulation tasks, spanning distinct coordination regimes and physical requirements:

  • Embodiments:
    • Dual-arm Panda robots with parallel-jaw grippers: Piece Assembly, Threading, Transport.
    • Dual-arm Panda robots with 6-DoF anthropomorphic hands: Box Cleanup, Drawer Cleanup, Tray Lift.
    • GR-1 humanoid with dexterous hands: Pouring, Coffee (multi-phase), Can Sorting (vision-based).
  • Coordination Taxonomy:
    • Parallel Subtasks: Manipulation actions can proceed independently (e.g., Piece Assembly).
    • Coordination Subtasks: Arms must simultaneously complete sub-goals (e.g., two-arm lifts, hand-offs).
    • Sequential Subtasks: One arm’s actions are conditional on another arm’s task completion (e.g., drawer opening before object retrieval).
  • Execution Modes:
    • Asynchronous queues for parallelism.
    • Lock-step synchronization for hand-offs or shared-object subtasks.
    • Explicit ordering constraints for tasks with sequential dependencies.

This environment suite is implemented atop RoboSuite and MuJoCo, enabling rigorous evaluation of the pipeline’s ability to synthesize and learn from varied manipulation behaviors.
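The three execution modes can be illustrated with a toy per-arm action-queue scheduler. This is a simplified sketch under assumed semantics (the real executor runs inside the simulator loop and also enforces sequential ordering constraints); the class and method names are hypothetical:

```python
from collections import deque

class ArmScheduler:
    """Toy per-arm action queues with an optional lock-step barrier."""

    def __init__(self):
        self.queues = {"left": deque(), "right": deque()}

    def enqueue(self, arm, actions):
        """Append a list of actions to one arm's queue."""
        self.queues[arm].extend(actions)

    def step(self, synchronized=False):
        """Pop one action per arm for this control step.

        Asynchronous mode: each arm advances independently (an empty
        queue yields None for that arm). Synchronized (lock-step) mode:
        neither arm advances unless both have an action queued, which is
        the barrier needed for hand-offs and shared-object subtasks.
        """
        if synchronized and any(not q for q in self.queues.values()):
            return None  # barrier: wait until both queues are populated
        return {arm: (q.popleft() if q else None)
                for arm, q in self.queues.items()}
```

In this sketch, sequential constraints would be layered on top by only enqueueing a dependent subtask's actions after the prerequisite arm's subtask reports completion.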

3. Data Generation and Diversification Strategy

DexMimicGen systematically amplifies a base set of 60 human demonstrations into approximately 21,000 validated synthetic demonstrations via:

  • Distributional Scene Randomization: Object and scene initializations are sampled from broadened and diversified distributions D_0, with variants D_1, D_2 featuring increased spread or altered object sets.
  • Segment-Level Sampling: For each synthetic rollout, segments are independently sampled per subtask per arm, with attention to ordering constraints when using heterogeneous sources.
  • Trajectory Filtering: Only physically feasible, collision-free demonstrations that pass task-specific success checks are included.
  • Coordination Method Selection: Transform schemes are prioritized for subtasks with weak kinematic coupling, while direct replay is preferred for strong inter-arm coordination tasks to preserve feasibility.

This results in scalable, high-diversity datasets while implicitly enforcing object-centric SE(3) consistency and temporal alignment for coordinated actions.
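The amplify-and-filter strategy above reduces to a simple rejection-sampling loop: sample a randomized scene, generate a candidate trajectory from source segments, and keep it only if the open-loop rollout succeeds. A minimal sketch, where `sample_scene`, `generate_trajectory`, and `is_success` are hypothetical callbacks standing in for the pipeline stages:

```python
import random

def generate_dataset(source_demos, sample_scene, generate_trajectory,
                     is_success, target_count=1000, max_attempts=100_000):
    """Amplify-and-filter loop (illustrative names, not the real API).

    sample_scene        -- draws a randomized initial scene (D_0/D_1/D_2)
    generate_trajectory -- retargets/replays source segments in that scene
    is_success          -- open-loop rollout check (collision-free + task
                           success predicate)
    Because only validated rollouts are kept, every output trajectory is
    physically feasible under the simulator by construction.
    """
    dataset = []
    for _ in range(max_attempts):
        if len(dataset) >= target_count:
            break
        scene = sample_scene()
        demo = random.choice(source_demos)
        traj = generate_trajectory(demo, scene)
        if is_success(traj, scene):
            dataset.append(traj)
    return dataset
```

The per-task yield of this loop is what turns roughly 60 source demonstrations into ~21,000 validated ones at about 1,000 per task.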

| Parameter | Value/Approach | Scale/Counts |
| --- | --- | --- |
| Source human demos | 5/task (dexterous), 10/task (gripper) | 60 total |
| Output synthetic demos | ~1,000/task | ≈21,000 total |
| Scene randomization | Distributions D_0, D_1, D_2 | >1,000 scene variants |

4. Policy Learning, Model Architectures, and Ablation

Visuomotor policies are trained using the generated data through the following schemes:

  • Policy Architectures:
    • BC-RNN: CNN visual encoder, LSTM temporal backbone, MLP action head.
    • BC-RNN-GMM: As above, with a Gaussian Mixture Model output.
    • Diffusion Policy: Score-based denoising diffusion models, conditioned on visual observations.
  • Training Details:
    • Datasets: 1,000 demos/task, 3 seeds, 200k gradient steps to convergence.
    • Hyperparameters: Learning rate 10^{-4}, batch size 64.
  • Learning Objectives:
    • Behavioral cloning: L_BC(θ) = E_{(o,a)∼D}[−log π_θ(a|o)].
    • Diffusion as per Chi et al. (2023).
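The behavioral-cloning objective L_BC(θ) = E_{(o,a)∼D}[−log π_θ(a|o)] can be made concrete for one common policy head. The sketch below assumes a diagonal-Gaussian action head (the paper's policies use GMM and diffusion heads as well; this is the simplest instance of the same negative log-likelihood objective):

```python
import numpy as np

def bc_nll(actions, means, log_stds):
    """Behavioral-cloning loss L_BC = E[-log pi(a|o)] for a diagonal
    Gaussian policy head.

    actions, means, log_stds -- arrays of shape (batch, action_dim);
    means/log_stds would come from the policy network given observations.
    Returns the mean per-sample negative log-likelihood.
    """
    var = np.exp(2.0 * log_stds)
    nll = 0.5 * ((actions - means) ** 2 / var
                 + 2.0 * log_stds
                 + np.log(2.0 * np.pi))
    return nll.sum(axis=-1).mean()
```

A GMM head replaces the single Gaussian with a log-sum-exp over mixture components, and a diffusion head replaces the likelihood with a denoising score-matching objective, but the training signal in all cases is fit-the-demonstrated-actions.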

Empirical ablation reveals:

  • Use of DexMimicGen data increases performance dramatically (e.g., Piece Assembly rises from 3.3% to 80.7% success with diffusion policy).
  • Replay schemes outperform SE(3) transforms on strongly coordinated subtasks, while ordering constraints have a +12% effect in sequential tasks.
  • Diffusion Policies consistently outperform BC-RNN and BC-RNN-GMM on most tasks.
  • GMM heads underperform relative to single-arm manipulation findings, suggesting architectural-task dependence.
  • Dataset scaling shows diminishing returns beyond ~1,000 demos for many tasks.
| Task | Source Demo SR | BC-RNN-GMM SR | BC-RNN SR | Diffusion SR |
| --- | --- | --- | --- | --- |
| Piece Assembly | 3.3% | 74.0% | 74.7% | 80.7% |
| Threading | 1.3% | 54.0% | 55.3% | 69.3% |
| Transport | 52.7% | 64.0% | 60.0% | 83.3% |
| Can Sorting | 0.7% | 75.3% | 96.0% | 97.3% |

5. Real-to-Sim-to-Real Transfer

DexMimicGen supports direct transfer to physical platforms through a robust sim-to-real pipeline:

  • Digital Twin Construction: Head-mounted RGB-D and GroundingDINO estimate object poses, initializing scene in MuJoCo.
  • Simulation Validation: All candidate trajectories are tested in simulation; only those passing success checks are attempted on the real robot, guaranteeing safety.
  • Hardware: Experiments employ a Fourier GR-1 humanoid with two Inspire 6-DoF dexterous hands, and multi-camera setup.
  • Evaluation: On the can sorting task, policies trained on 40 DexMimicGen-validated synthetic rollouts (from 4 real demos) achieved 90% real-world success, outperforming the 0% baseline of real-demos-only models.

This approach mitigates the sim-to-real gap via domain randomization and validation, ensuring only physically plausible and task-relevant behaviors are deployed.
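The safety property of the pipeline above is a simple gate: no trajectory reaches the hardware without first passing the success predicate in the digital twin. A minimal sketch, with `simulate_success` and `execute_on_robot` as hypothetical callbacks:

```python
def filter_and_deploy(candidates, simulate_success, execute_on_robot):
    """Real-to-sim-to-real safety gate (illustrative sketch).

    Each candidate trajectory is replayed in the digital twin first;
    only rollouts that pass the simulated success predicate are executed
    on the physical robot, so the hardware never sees an unvalidated
    trajectory.
    """
    deployed = []
    for traj in candidates:
        if simulate_success(traj):
            execute_on_robot(traj)
            deployed.append(traj)
    return deployed
```

This is the same filtering used during synthetic data generation, reused as a deployment-time safety check.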

6. Empirical Results, Limitations, and Insights

Key findings and limitations elucidated by DexMimicGen include:

  • Performance Gains: Automated synthetic data yields >20-fold increases in some task success rates compared to human demonstration baselines.
  • Coordination Sensitivity: Subtask policy performance is highly sensitive to the motion retargeting scheme; handoff and synchronized actions require replay, while looser subtasks tolerate transforms.
  • Dataset Scaling: Performance plateaus at 1,000–5,000 demos for many tasks, indicating cost-effective returns for synthetic data generation.
  • Manual Steps: Object pose estimation and subtask segmentation remain partially manual; the pipeline does not yet automate segmentation end-to-end.
  • Architectural Observations: Unexpected underperformance of GMM heads (vs. BC or diffusion) on dexterous, coordinated tasks, highlighting the need for further architectural analysis.
| Insight | Outcome/Observation |
| --- | --- |
| Coordination subtasks | Replay required for success; transforms can reduce feasibility |
| GMM action heads | Underperform relative to BC and diffusion for bimanual dexterous work |
| Scene diversity | Scene randomization crucial for effective policy generalization |
| Subtask taxonomy | Explicit parallel, coordination, sequential segmentation is essential |
| Segment alignment | Strict timing alignment needed for lockstep coordination |

DexMimicGen extends prior approaches by coupling fine-grained subtask decomposition, object-centric motion retargeting, and large-scale simulation, specifically targeting bimanual dexterous manipulation (Jiang et al., 2024). While single-arm frameworks also benefit from synthetic data, DexMimicGen’s subtask-centric, multi-arm synchronization and transform-vs-replay decision mechanisms address unique coordination complexities not present in prior work such as MimicGen. Recent diffusion policy advances for dexterous control (e.g., mimic-one (Nava et al., 13 Jun 2025)) confirm the scaling benefits of large, diverse synthetic datasets, but DexMimicGen uniquely systematizes this for humanoid, bimanual, and vision-coordinated manipulation.

The pipeline’s modularity and empirical validation position it as a reference design for scalable data generation and policy learning in high-DOF manipulation settings. Practical considerations include the requirement for reliable object pose estimation and the current manual segmentation steps—a plausible implication is that automating these steps could further generalize the approach across broader manipulation domains.


For further technical details and resources, the dataset, benchmark environments, and code are provided at https://dexmimicgen.github.io/ (Jiang et al., 2024).
