
Zero-shot humanoid execution from generated videos

Determine how to enable a humanoid robot to execute human actions depicted in videos generated by video generative models in a zero-shot manner, without any task-specific fine-tuning or retraining.


Background

The paper explores using state-of-the-art video generative models as high-level planners for humanoid control, but generated videos often contain noise and morphological distortions that make them harder to imitate directly than real video. This motivates the need for a robust approach that can translate synthetic human-action videos into physically plausible humanoid trajectories without fine-tuning.

To address this challenge, the authors propose a two-stage pipeline: the first stage lifts video pixels to a 4D human representation and retargets it to the robot, and the second stage uses GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints with symmetry regularization and weighted keypoint tracking, to imitate the action on the humanoid. The open question frames the broader objective of enabling zero-shot execution directly from generated videos.
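To make the second stage more concrete, the sketch below assembles a reward that combines weighted 3D keypoint tracking with a left/right symmetry penalty. It is a minimal illustration only: the keypoint layout, per-keypoint weights, symmetry pairs, exponential form of the tracking term, and penalty coefficient are assumptions for this example, not the paper's actual GenMimic formulation.

```python
import numpy as np

# Assumed keypoint layout for this sketch (not taken from the paper):
# 0 pelvis, 1 left hand, 2 right hand, 3 left elbow, 4 right elbow,
# 5 left foot, 6 right foot.
KEYPOINT_WEIGHTS = np.array([1.0, 2.0, 2.0, 1.5, 1.5, 1.0, 1.0])

# Left/right index pairs used by the symmetry term (assumed pairing).
SYMMETRY_PAIRS = [(1, 2), (3, 4), (5, 6)]


def weighted_tracking_reward(robot_kp, target_kp, sigma=0.5):
    """Exponentiated, weighted L2 error between robot and reference 3D keypoints.

    robot_kp, target_kp: (K, 3) arrays of keypoint positions in a shared frame.
    """
    per_kp_err = np.linalg.norm(robot_kp - target_kp, axis=-1)            # (K,)
    weighted_err = np.sum(KEYPOINT_WEIGHTS * per_kp_err) / np.sum(KEYPOINT_WEIGHTS)
    return np.exp(-weighted_err / sigma)


def symmetry_penalty(robot_kp, lateral_axis=1, pelvis_idx=0):
    """Reflect left-side keypoints across a pelvis-centered sagittal plane and
    measure their distance to the right-side counterparts. A simple stand-in
    for the symmetry regularization mentioned in the paper."""
    centered = robot_kp - robot_kp[pelvis_idx]
    penalty = 0.0
    for left_idx, right_idx in SYMMETRY_PAIRS:
        mirrored_left = centered[left_idx].copy()
        mirrored_left[lateral_axis] *= -1.0   # reflect across the sagittal plane
        penalty += np.linalg.norm(mirrored_left - centered[right_idx])
    return penalty / len(SYMMETRY_PAIRS)


def total_reward(robot_kp, target_kp, lambda_sym=0.1):
    """Combined objective: weighted keypoint tracking minus a symmetry penalty."""
    return weighted_tracking_reward(robot_kp, target_kp) - lambda_sym * symmetry_penalty(robot_kp)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(7, 3))                  # reference keypoints from the retargeted 4D human
    robot = target + 0.05 * rng.normal(size=(7, 3))   # slightly perturbed robot pose
    print("reward:", total_reward(robot, target))
```

In a full pipeline, the target keypoints would come from the retargeted 4D human representation at each timestep, and the reward would be one term among others (e.g. regularization on actions and joint limits) in the physics-based RL objective.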

References

To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner?

From Generated Human Videos to Physically Plausible Robot Trajectories (2512.05094 - Ni et al., 4 Dec 2025) in Abstract