UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
The paper "UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations" presents a novel framework for translating human-demonstrated skills into robot-executable actions without the need for paired data or semantic labels. Focused on addressing the challenges posed by disparate embodiments between humans and robots, UniSkill introduces a scalable solution by learning embodiment-agnostic skill representations from large-scale, unlabeled video datasets.
The core of the approach learns these skill representations with two coupled models: an Inverse Skill Dynamics (ISD) model and a Forward Skill Dynamics (FSD) model. The ISD model extracts a skill from the dynamic change between two temporally distant video frames, and the FSD model uses that skill to predict the future frame, treating prediction as an image-editing problem that emphasizes dynamic over static content. Because the representations capture motion rather than the specific morphology of the actor in the video, the learned skills generalize across diverse human and robot embodiments.
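To make this recipe concrete, here is a minimal sketch of the inverse/forward skill dynamics idea in PyTorch. The module names, network sizes, and the simple reconstruction loss are illustrative assumptions made for this summary; the paper frames FSD as an image-editing-style future-frame predictor, which is replaced here by a small convolutional stand-in, so this conveys only the structure of the objective, not the authors' implementation.

```python
# Minimal sketch of the inverse/forward skill dynamics idea.
# Hypothetical module names and shapes; not the authors' implementation.
import torch
import torch.nn as nn

class InverseSkillDynamics(nn.Module):
    """Encodes the change between two temporally distant frames into a
    compact, embodiment-agnostic skill vector z."""
    def __init__(self, skill_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(          # simple CNN over stacked frames
            nn.Conv2d(6, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, skill_dim),
        )

    def forward(self, frame_t, frame_tk):
        # Concatenate the two frames channel-wise so the network sees
        # "what changed" rather than either frame alone.
        return self.encoder(torch.cat([frame_t, frame_tk], dim=1))

class ForwardSkillDynamics(nn.Module):
    """Predicts the future frame from the current frame and the skill z,
    standing in for the image-editing-style generator described in the paper."""
    def __init__(self, skill_dim=64):
        super().__init__()
        self.film = nn.Linear(skill_dim, 32)   # condition features on z
        self.enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.dec = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, frame_t, skill):
        h = self.enc(frame_t)
        h = h + self.film(skill)[:, :, None, None]  # inject skill as a bias
        return self.dec(h)

# Joint training step: the only supervision is future-frame prediction,
# so no action labels or human-robot pairing is required.
isd, fsd = InverseSkillDynamics(), ForwardSkillDynamics()
opt = torch.optim.Adam(list(isd.parameters()) + list(fsd.parameters()), lr=1e-4)
frame_t, frame_tk = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)  # dummy batch
z = isd(frame_t, frame_tk)
loss = nn.functional.mse_loss(fsd(frame_t, z), frame_tk)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the skill vector is only ever asked to explain the change between frames, whatever is constant in both frames (including the actor's appearance) carries no gradient incentive to be encoded, which is the intuition behind the embodiment-agnostic claim.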
In experimental evaluations, UniSkill transfers skills from human video prompts to robot policies. Across both simulation and real-world setups, it guides robots to reproduce human-demonstrated behaviors with strong success rates, even without explicit guidance such as language instructions or trajectory alignment. This is demonstrated on tabletop and kitchen manipulation tasks in the real world and on tasks from the LIBERO simulation benchmark.
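As a hedged illustration of how a human prompt video might drive a robot at test time, the sketch below extracts one skill per fixed-length segment of the prompt and conditions a policy on it. The policy architecture, the segment length k, and the environment hooks (get_obs, step_env) are hypothetical placeholders introduced for this summary, not details taken from the paper.

```python
# Hypothetical test-time loop: skills extracted from a human prompt video
# are fed, one segment at a time, to a skill-conditioned robot policy.
import torch
import torch.nn as nn

class SkillConditionedPolicy(nn.Module):
    """Maps the robot's current observation plus a skill vector to an action."""
    def __init__(self, obs_dim=128, skill_dim=64, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + skill_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, skill):
        return self.net(torch.cat([obs, skill], dim=-1))

def execute_prompt(policy, isd, prompt_video, get_obs, step_env, k=16):
    """Slide over the human video with a fixed offset k, extract one skill
    per segment (e.g. with the ISD model from the sketch above), and let
    the policy act for that segment's horizon."""
    for t in range(0, prompt_video.shape[0] - k, k):
        skill = isd(prompt_video[t:t+1], prompt_video[t+k:t+k+1])
        for _ in range(k):                      # act until the segment is consumed
            action = policy(get_obs(), skill)
            step_env(action)
```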
The framework's embodiment-agnostic skill representations remain effective in both seen and novel scenarios, which points to strong generalization. They also let UniSkill absorb diverse video sources, combining datasets such as Something-Something V2, H2O, and DROID, among others, to expand its action repertoire. The authors emphasize that scaling to more diverse video data improves the system's robustness and versatility.
Despite the promising outcomes, the authors note several limitations. The reliance on a fixed skill interval may constrain adaptability to tasks executed at different speeds, and abrupt viewpoint changes, especially in egocentric videos, remain challenging. Future work could address these issues, for example with more flexible temporal dynamics modeling and stronger environment generalization.
Overall, UniSkill represents a step toward closing the gap between human and robot capabilities through scalable, cross-embodiment skill representations. Its ability to learn from large, unlabeled video datasets and to generalize across diverse environments makes it a noteworthy contribution to robot learning from visual inputs. Further refinement of the framework could extend its reach, particularly to dynamic and unstructured environments.