Do as I Do: Learning Dexterous Manipulation from Everyday Human Videos

This lightning talk explores a breakthrough pipeline that transforms ordinary monocular human videos into robot-executable manipulation trajectories. By combining foundation models for 4D hand-object reconstruction with dynamics-aware retargeting, the work enables robots with multi-fingered hands to learn complex manipulation skills simply by watching people perform everyday tasks, overcoming the traditional barriers of specialized sensors and curated datasets.
Script
Teaching robots dexterous manipulation has always required expensive motion capture rigs or carefully curated datasets. But what if we could train them simply by showing them everyday videos of people using their hands?
The authors introduce DO AS I DO, a two-stage pipeline that first reconstructs the full 4D hand and object trajectories from a single camera view, then retargets those human motions onto a robot hand through physics-aware optimization.
Three innovations make the retargeting robust to noisy reconstructions: warmup steps that let the robot reach feasible starting poses, random force perturbations that simulate real-world uncertainties, and transition rewards that penalize missed grasps or placements.
On 655 challenging in-the-wild video clips, the method achieves a 71% success rate in generating executable robot trajectories, nearly tripling the performance of prior sampling-based approaches that plateau at just 25%.
The authors discovered that fewer than 5% of raw human activity videos meet the quality bar for robot learning, revealing that scalability depends not just on methods but on careful data curation and quality control.
By closing the gap between human observation and robot execution, this work makes it possible to scale dexterous manipulation datasets from the billions of everyday videos already online. Explore the full technical depth and create your own video summaries at EmergentMind.com.