UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
The paper "UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations" presents a novel framework for translating human-demonstrated skills into robot-executable actions without the need for paired data or semantic labels. Focused on addressing the challenges posed by disparate embodiments between humans and robots, UniSkill introduces a scalable solution by learning embodiment-agnostic skill representations from large-scale, unlabeled video datasets.
The core of the approach learns these skill representations with two coupled models: an Inverse Skill Dynamics (ISD) model and a Forward Skill Dynamics (FSD) model. The ISD model extracts a skill from the dynamic change between two temporally distant video frames, and the FSD model uses that skill to predict the future frame, treating prediction as an image-editing problem that emphasizes dynamic over static content. Because the representations capture motion rather than the specific morphology of the actor in the video, the learned skills generalize across diverse human and robot embodiments.
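To make this recipe concrete, here is a minimal sketch of the inverse/forward skill dynamics idea in PyTorch. The module names, network sizes, and the simple reconstruction loss are illustrative assumptions made for this summary; the paper frames FSD as an image-editing-style future-frame predictor, which is replaced here by a small convolutional stand-in, so this conveys only the structure of the objective, not the authors' implementation.

```python
# Minimal sketch of the inverse/forward skill dynamics idea.
# Hypothetical module names and shapes; not the authors' implementation.
import torch
import torch.nn as nn

class InverseSkillDynamics(nn.Module):
    """Encodes the change between two temporally distant frames into a
    compact, embodiment-agnostic skill vector z."""
    def __init__(self, skill_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(          # simple CNN over stacked frames
            nn.Conv2d(6, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, skill_dim),
        )

    def forward(self, frame_t, frame_tk):
        # Concatenate the two frames channel-wise so the network sees
        # "what changed" rather than either frame alone.
        return self.encoder(torch.cat([frame_t, frame_tk], dim=1))

class ForwardSkillDynamics(nn.Module):
    """Predicts the future frame from the current frame and the skill z,
    standing in for the image-editing-style generator described in the paper."""
    def __init__(self, skill_dim=64):
        super().__init__()
        self.film = nn.Linear(skill_dim, 32)   # condition features on z
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.dec = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, frame_t, skill):
        h = self.enc(frame_t)
        h = h + self.film(skill)[:, :, None, None]  # inject skill as a bias
        return self.dec(h)

# Joint training step: the only supervision is future-frame prediction,
# so no action labels or human-robot pairing is required.
isd, fsd = InverseSkillDynamics(), ForwardSkillDynamics()
opt = torch.optim.Adam(list(isd.parameters()) + list(fsd.parameters()), lr=1e-4)
frame_t, frame_tk = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)  # dummy batch
z = isd(frame_t, frame_tk)
loss = nn.functional.mse_loss(fsd(frame_t, z), frame_tk)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the skill vector is only ever asked to explain the change between frames, whatever is constant in both frames (including the actor's appearance) carries no gradient incentive to be encoded, which is the intuition behind the embodiment-agnostic claim.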
In experimental evaluations, UniSkill transfers skills from human video prompts to robot policies. Across both simulation and real-world setups, it guides robots to reproduce human-demonstrated behaviors with strong success rates, even without explicit guidance such as language instructions or trajectory alignment. This is demonstrated on tabletop and kitchen manipulation tasks in the real world and on tasks from the LIBERO simulation benchmark.
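As a hedged illustration of how a human prompt video might drive a robot at test time, the sketch below extracts one skill per fixed-length segment of the prompt and conditions a policy on it. The policy architecture, the segment length k, and the environment hooks (get_obs, step_env) are hypothetical placeholders introduced for this summary, not details taken from the paper.

```python
# Hypothetical test-time loop: skills extracted from a human prompt video
# are fed, one segment at a time, to a skill-conditioned robot policy.
import torch
import torch.nn as nn

class SkillConditionedPolicy(nn.Module):
    """Maps the robot's current observation plus a skill vector to an action."""
    def __init__(self, obs_dim=128, skill_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + skill_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, skill):
        return self.net(torch.cat([obs, skill], dim=-1))

def execute_prompt(policy, isd, prompt_video, get_obs, step_env, k=16):
    """Slide over the human video with a fixed offset k, extract one skill
    per segment (e.g. with the ISD model from the sketch above), and let
    the policy act for that segment's horizon."""
    for t in range(0, prompt_video.shape[0] - k, k):
        skill = isd(prompt_video[t:t+1], prompt_video[t+k:t+k+1])
        for _ in range(k):                      # act until the segment is consumed
            action = policy(get_obs(), skill)
            step_env(action)
```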
The framework's embodiment-agnostic skill representations remain effective in both seen and novel scenarios, which points to strong generalization. They also let UniSkill absorb diverse video sources, combining datasets such as Something-Something V2, H2O, and DROID, among others, to expand its action repertoire. The authors emphasize that scaling to more diverse video data improves the system's robustness and versatility.
Despite the promising outcomes, the authors note several limitations. The reliance on a fixed skill interval may constrain adaptability to tasks executed at different speeds, and abrupt viewpoint changes, especially in egocentric videos, remain challenging. Future work could address these issues, for example with more flexible temporal dynamics modeling and stronger environment generalization.
Overall, UniSkill represents a step toward closing the gap between human and robot capabilities through scalable, cross-embodiment skill representations. Its ability to learn from large, unlabeled video datasets and to generalize across diverse environments makes it a noteworthy contribution to robot learning from visual inputs. Further refinement of the framework could extend its reach, particularly to dynamic and unstructured environments.