
Complex interactions involving deformable objects and human collaboration remain unexplored

Extend the VisualMimic hierarchical sim-to-real framework, which pairs a task-agnostic low-level keypoint tracker trained from human motion data with a task-specific high-level visuomotor keypoint generator, to support complex humanoid loco-manipulation involving deformable objects and collaboration with humans. The extended system should execute such tasks from egocentric visual and proprioceptive inputs alone, without external object-state estimation.


Background

VisualMimic integrates egocentric visual perception with hierarchical whole-body control for humanoid robots. A low-level keypoint tracker learns dexterity priors from human motion data, while a high-level policy generates keypoint commands from vision and proprioception. This design enables zero-shot sim-to-real transfer on diverse loco-manipulation tasks.
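The two-level control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all interface shapes (keypoint count, proprioception and action dimensions, embedding size) are hypothetical, and the learned policies are stood in for by fixed random linear maps.

```python
import numpy as np

# Hypothetical interface dimensions (not taken from the paper).
NUM_KEYPOINTS, PROPRIO_DIM, ACTION_DIM, FEAT_DIM = 3, 8, 12, 16

rng = np.random.default_rng(0)
# Frozen random weights standing in for the learned high- and low-level policies.
W_HI = rng.standard_normal((NUM_KEYPOINTS * 3, FEAT_DIM + PROPRIO_DIM)) * 0.1
W_LO = rng.standard_normal((ACTION_DIM, NUM_KEYPOINTS * 3 + PROPRIO_DIM)) * 0.1

def high_level_policy(ego_image: np.ndarray, proprio: np.ndarray) -> np.ndarray:
    """Task-specific generator: egocentric vision + proprioception -> keypoint targets."""
    feat = ego_image.reshape(-1, ego_image.shape[-1]).mean(axis=0)  # crude global pooling
    feat = np.resize(feat, FEAT_DIM)                                # fixed-size embedding
    x = np.concatenate([feat, proprio])
    return (W_HI @ x).reshape(NUM_KEYPOINTS, 3)

def low_level_tracker(keypoints: np.ndarray, proprio: np.ndarray) -> np.ndarray:
    """Task-agnostic tracker: keypoint commands + proprioception -> joint targets."""
    x = np.concatenate([keypoints.ravel(), proprio])
    return W_LO @ x

def control_step(ego_image: np.ndarray, proprio: np.ndarray) -> np.ndarray:
    """One control tick: note no external object-state estimator is queried."""
    targets = high_level_policy(ego_image, proprio)
    return low_level_tracker(targets, proprio)

action = control_step(np.zeros((64, 64, 3)), np.zeros(PROPRIO_DIM))
print(action.shape)  # (12,)
```

The key property the sketch preserves is the division of labor: only the high level sees vision, only the low level emits joint commands, and the keypoint targets are the sole interface between them.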

Despite the framework's strong performance on rigid-object tasks such as pushing, lifting, and kicking, the paper explicitly notes that more complex interaction scenarios, specifically those involving deformable objects and collaboration with humans, have not yet been explored within VisualMimic's current design, motivating future extensions to these domains.

References

While our hierarchical design generalizes across a range of loco-manipulation tasks, more complex interactions involving deformable objects or human collaboration remain unexplored.

VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation (2509.20322 - Yin et al., 24 Sep 2025) in Conclusions and Limitations, Limitations paragraph