GenMimicBench: Synthetic Benchmark for Humanoid Imitation
- GenMimicBench is a synthetic human-motion benchmark that tests zero-shot humanoid imitation by stressing policies with diverse generative video artifacts and morphological distortions.
- It employs a two-stage processing pipeline using 4D human reconstruction and robot retargeting to convert noisy videos into actionable motion commands.
- Extensive simulation and real-world trials using metrics like SR, MPKPE, and VSR reveal the benchmark’s capacity to expose robustness challenges in imitation learning.
GenMimicBench is a synthetic human-motion benchmark designed to rigorously evaluate zero-shot generalization and policy robustness in humanoid robots tasked with mimicking complex actions depicted in generated videos. It was introduced to address the core challenge of benchmarking policy robustness when imitating human actions from state-of-the-art generative video models, which produce data that is noisier and more diverse than traditional motion capture sources. GenMimicBench serves as a standardized stress test for evaluating physics-aware imitation policies under the morphological, kinetic, and observational distortions inherent to current video generation techniques (Ni et al., 4 Dec 2025).
1. Motivation and Design Framework
The principal motivation for GenMimicBench stems from advances in generative video modeling (notably diffusion and transformer architectures such as Wan2.1-VACE-14B and Cosmos-Predict2-14B), which can synthesize rich and varied human motions in settings unconstrained by existing mocap corpora. However, direct imitation from these videos is impeded by artifacts including morphological distortions, view-dependent occlusions, discontinuities, and physically implausible trajectories. GenMimicBench was curated to meet the following criteria:
- Diversity of Inputs: Inclusion of multiple subjects, scene backgrounds ranging from controlled indoor environments to cluttered "web-style" contexts, and systematic variation in viewpoints.
- Action Taxonomy: A hierarchical set of actions, from simple gestures through composite multi-step activities and action sequences involving both locomotion and object interaction.
- Emphasis on Distortion: Purposeful inclusion of artifacts such as camera drift, joint discontinuities, partial occlusion, and erroneous pose configurations.
- Scalable, Reproducible Protocol: All data are synthesized from public human motion and video corpora to ensure reproducibility and extensibility.
GenMimicBench thereby provides a critical platform for policies targeting “zero-shot” transfer from generated video to humanoid actuation, something rarely tractable with prior datasets (Ni et al., 4 Dec 2025).
2. Dataset Construction: Video Generation, Content, and Properties
GenMimicBench comprises 428 synthetic video clips constructed as follows:
- Video-Generation Models: Two primary backbones—Wan2.1-VACE-14B (conditioned on NTU RGB+D frames) and Cosmos-Predict2-14B (conditioned on PennAction frames).
- Subset Characteristics:
- Wan2.1: 217 videos, 5 subjects, multi-view (front/left/right), action categories include (a) simple upper-body motions (touch head, thumbs up, wave arms), (b) upper-body plus walking, (c) composite upper-body chains, (d) composite plus walking.
- Cosmos-Predict2: 211 videos, 8 subjects, "web-style" environments (clutter, lighting and camera motion), action coverage includes both standard gestures and object-centric manipulations (e.g., opening doors, lifting books).
- Specifications:
- Wan2.1: 832×480 px, 16 fps, 5 s.
- Cosmos-Predict2: 768×432 px, 16 fps, 5.8 s.
- Text prompts explicitly describe actions and, for Wan2.1, scene context. Single reference frames control subject identity and background initialization.
- Diversity: Subjects vary noticeably in body shape and clothing, and are seen under distinct environmental/clutter conditions and viewpoints.
All clips retain the intrinsic noise of generative modeling and are used without additional augmentation, so as to stress-test downstream control pipelines.
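For reference, the generation settings of the two subsets can be summarized in a small configuration structure. The sketch below only restates the figures listed above; the dictionary layout and field names are illustrative, not part of the released benchmark.

```python
# Illustrative summary of the two GenMimicBench subsets (field names are
# hypothetical; the values restate the specifications given above).
GENMIMICBENCH_SUBSETS = {
    "wan2.1-vace-14b": {
        "num_videos": 217,
        "num_subjects": 5,
        "conditioning_source": "NTU RGB+D reference frames",
        "resolution": (832, 480),   # width x height, pixels
        "fps": 16,
        "duration_s": 5.0,
        "views": ["front", "left", "right"],
    },
    "cosmos-predict2-14b": {
        "num_videos": 211,
        "num_subjects": 8,
        "conditioning_source": "PennAction reference frames",
        "resolution": (768, 432),
        "fps": 16,
        "duration_s": 5.8,
        "views": ["web-style scenes with camera motion"],
    },
}

# Sanity check against the reported total of 428 clips.
total_clips = sum(s["num_videos"] for s in GENMIMICBENCH_SUBSETS.values())
assert total_clips == 428
```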
3. Data Processing and Policy Conditioning
Transformation of the raw synthetic video data to robot-usable formats employs a two-stage pipeline:
- 4D Human Reconstruction: Each RGB frame is processed with TRAM or 4DHumans to extract SMPL parameters: global pose, shape ($\beta$), and joint angles ($\theta$).
- Robot Retargeting: The resulting SMPL trajectories are mapped to the 23-degree-of-freedom (DoF) configuration of the Unitree G1 humanoid, yielding desired robot joint angles and 3D body keypoints.
GenMimic policies are conditioned not on joint angles but on predicted keypoints over a 10-frame horizon. This choice is justified by the greater robustness of 3D keypoints to morphological mismatch and noise, particularly under the severe distortions introduced by generated video.
Training sequences from AMASS are filtered to exclude problematic cases, but no additional filtering or augmentation is performed on GenMimicBench itself.
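As an illustration of the keypoint conditioning described above, the sketch below assembles a 10-frame window of retargeted 3D keypoints for the policy observation. Only the 10-frame horizon and the use of 3D keypoints (rather than joint angles) come from the text; the array shapes, keypoint count, and end-of-clip padding convention are assumptions for illustration.

```python
import numpy as np

HORIZON = 10  # the policy is conditioned on predicted keypoints over a 10-frame horizon

def keypoint_window(keypoints: np.ndarray, t: int, horizon: int = HORIZON) -> np.ndarray:
    """Stack the next `horizon` frames of retargeted 3D keypoints starting at frame t.

    keypoints: (T, K, 3) array of retargeted body keypoints produced by the
    4D reconstruction + retargeting pipeline (shapes are illustrative).
    Frames past the end of the clip are padded by repeating the last frame.
    """
    T = keypoints.shape[0]
    idx = np.clip(np.arange(t, t + horizon), 0, T - 1)
    return keypoints[idx]  # (horizon, K, 3); flattened downstream into the observation

# Usage with a dummy 5 s clip at 16 fps (80 frames) and K = 15 keypoints:
dummy = np.random.randn(80, 15, 3)
obs = keypoint_window(dummy, t=74)
print(obs.shape)  # (10, 15, 3); frames beyond the clip end repeat the final pose
```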
4. Statistical Composition and Artifact Profile
GenMimicBench is structured to maximize coverage and variability:
| Subset | # Videos | # Subjects | Action Types |
|---|---|---|---|
| Wan2.1 | 217 | 5 | Gestures, composites, locomotion+actions |
| Cosmos-Predict2 | 211 | 8 | Gestures, object interactions, web contexts |
The dataset systematically varies subject morphology (body proportion, clothing), viewpoint (multi-view and moving camera), and environment (controlled vs. cluttered backgrounds).
Crucially, all videos retain real generative artifacts—occlusions, abrupt motion jumps, physically inconsistent joint angles, and camera jitter—with no attempt to cleanse or smooth these elements. No explicit artifact statistics are reported, as the noise is a natural consequence of the generative process (Ni et al., 4 Dec 2025).
5. Evaluation Methodologies and Metrics
Benchmarking protocols are defined as follows:
- Simulation Testing: For each clip, 256 IsaacGym rollouts are performed. Key metrics (a computation sketch is given after this list):
- Success Rate (SR): Proportion of rollouts where the robot remains upright and within 0.5 m of the planned trajectory.
- Mean Per-Keypoint Position Error (MPKPE): Average distance between robot and target keypoints.
- Local MPKPE (LMPKPE): MPKPE in the robot’s local (pelvis-centered) frame.
- No-Termination Variants (MPKPE-NT, LMPKPE-NT): errors computed over all frames regardless of early termination, correcting for the artificial reduction in error caused by falls.
- Policy Optimization and Regularization:
- Weighted Keypoint Reward: the keypoint-tracking reward weights per-keypoint position errors, with higher weights for end-effectors and lower weights for the trunk and legs.
- Symmetry Regularization: the mirrored probability ratio (the policy's likelihood of the mirrored action in the mirrored state, relative to the original action in the original state) is used to encourage bilateral symmetry.
- Real-World Robotic Evaluation: Visual Success Rate (VSR) quantifies the percentage of live trials (2–6 per motion) in which a Unitree G1 executes the given action with visual fidelity: no stumbles, and hands/feet maintained near the synthetic targets.
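To make the simulation metrics concrete, the following sketch computes SR, MPKPE, and MPKPE-NT from per-rollout logs. Only the 256-rollout protocol, the 0.5 m trajectory-deviation threshold, and the no-termination correction come from the definitions above; the array layout, the per-frame upright test, and the convention that post-fall frames still carry an error value are assumptions for illustration.

```python
import numpy as np

def simulation_metrics(kp_err, upright, traj_dev, terminated):
    """Compute SR, MPKPE, and MPKPE-NT for one clip.

    kp_err:     (R, T) mean per-keypoint position error per rollout and frame, metres
                (frames after a fall are assumed to still hold an error value, so the
                 no-termination variant can be computed over all frames)
    upright:    (R, T) bool, robot has not fallen at this frame
    traj_dev:   (R, T) deviation from the reference trajectory, metres
    terminated: (R,)  index of the first terminated frame (T if no termination)
    """
    R, T = kp_err.shape
    # SR: rollout stays upright and within 0.5 m of the trajectory for the whole clip.
    success = (upright & (traj_dev <= 0.5)).all(axis=1)
    sr = success.mean()

    # MPKPE: average error over frames actually executed before termination.
    executed = np.arange(T)[None, :] < terminated[:, None]
    mpkpe = kp_err[executed].mean()

    # MPKPE-NT: average over all frames regardless of early termination,
    # removing the artificial error reduction caused by truncated (fallen) rollouts.
    mpkpe_nt = kp_err.mean()
    return sr, mpkpe, mpkpe_nt

# Dummy example: 256 rollouts of an 80-frame clip, no early terminations.
rng = np.random.default_rng(0)
kp_err = np.abs(rng.normal(0.1, 0.05, size=(256, 80)))
upright = rng.random((256, 80)) > 0.05
traj_dev = np.abs(rng.normal(0.2, 0.1, size=(256, 80)))
terminated = np.full(256, 80)
print(simulation_metrics(kp_err, upright, traj_dev, terminated))
```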
6. Benchmark Results and Comparative Performance
Comprehensive benchmarking highlights the following results:
Simulation Performance:
- Student policy (GenMimic): SR=29.8%, well above strong baselines (GMT: 4.3%, TWIST: 7.5%, BeyondMimic (teacher): 23.8%), with a superior MPKPE-NT of 62.5 cm.
- Privileged teacher policy (GenMimic): SR=86.8%, MPKPE=16.6 cm, MPKPE-NT=20.5 cm (state of the art on the benchmark).
- Key ablations: Conditioning on 3D keypoints vs. DoF improves SR (40% vs. 23.8%); adding weighted keypoint rewards raises SR to 77.4%; combining keypoints, weights, and symmetry (“3DP+Weights+Symmetry”) achieves 86.8% SR.
- Real-World (Unitree G1) Success:
- In-place gestures: VSR=100%
- Step+action: VSR=40%
- Turn+action: VSR=41%
- Walk+action: VSR=60%
Overall, GenMimicBench exposes the fragility of methods designed for clean, mocap-style data, confirming the necessity of noise-aware policy architectures.
7. Technical Insights, Limitations, and Prospects
Key observations and future directions are as follows:
- Artifact Tolerance: Robustness to generative video artifacts (occlusion, motion discontinuity, and morphological mismatch) is essential; policies must handle severe deviations from the plausible kinematics seen in real-world data.
- Reward Structuring: Weighted keypoint tracking and symmetry regularization are critical inductive biases, substantially improving robustness under noisy targets (a minimal sketch of this reward shaping follows this list).
- Reconstruction Bottlenecks: Tracking fidelity is currently bounded by the accuracy of 4D pose extraction from synthetic clips (TRAM, 4DHumans); closing the domain gap between generated and real video remains unresolved.
- Data Diversity: Policies are trained only on AMASS, which limits robustness to interactions, especially those involving objects; expanded training sets may yield better zero-shot generalization.
- Representation Learning: A plausible implication is that learning policies in a latent motion representation space—rather than conditioning directly on 3D keypoints—could reduce susceptibility to noise and artifacts, supporting smoother, more plausible robot behavior.
- Task Scope: The current benchmark is focused on gestures and moderate locomotion. Extension to dynamic, athletic, or multi-agent actions, as well as full-contact object manipulation, are active areas for future expansion (Ni et al., 4 Dec 2025).
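As a concrete illustration of the reward-structuring point above, the sketch below implements a weighted keypoint-tracking reward and a mirrored-probability-ratio symmetry term. The exponentiated form, the scale factor, the specific weight values, and the `policy.log_prob` / `mirror_obs` / `mirror_act` interfaces are assumptions chosen to match the qualitative description in Sections 5 and 7, not the paper's exact formulation.

```python
import numpy as np

def weighted_keypoint_reward(robot_kp, target_kp, weights, scale=5.0):
    """Weighted keypoint-tracking reward (an exponentiated weighted error is
    assumed here; the exact functional form is not specified in the text).

    robot_kp, target_kp: (K, 3) keypoint positions; weights: (K,) per-keypoint weights,
    higher for end-effectors and lower for trunk/leg keypoints.
    """
    sq_err = np.sum((robot_kp - target_kp) ** 2, axis=-1)      # (K,)
    err = np.sum(weights * sq_err) / weights.sum()              # weighted mean squared error
    return float(np.exp(-scale * err))

def mirrored_probability_ratio(policy, obs, act, mirror_obs, mirror_act):
    """Mirrored probability ratio: the policy's likelihood of the mirrored action in
    the mirrored state relative to the original action; pushing this toward 1
    encourages bilateral symmetry. `policy.log_prob`, `mirror_obs`, and `mirror_act`
    are assumed interfaces for illustration."""
    return float(np.exp(policy.log_prob(mirror_act(act), mirror_obs(obs))
                        - policy.log_prob(act, obs)))

# Dummy usage of the tracking reward: 15 keypoints, hypothetical end-effector
# indices weighted higher than the rest.
kp_target = np.random.randn(15, 3)
w = np.ones(15)
w[[6, 9, 12, 14]] = 2.0
print(weighted_keypoint_reward(kp_target + 0.01, kp_target, w))
```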
GenMimicBench thus establishes a rigorous, diverse, and intentionally challenging framework for evaluating humanoid policies on synthetic, artifact-rich human videos, and demonstrates the need for specialized reward functions, robust representations, and artifact-aware training protocols in next-generation imitation learning for robotics.