- The paper introduces an unsupervised encoder-decoder architecture that maps human and robot poses into a shared latent space using adaptive contrastive learning.
- It employs a novel cross-domain similarity metric based on global rotations of body limbs to cluster similar poses and separate dissimilar ones.
- Results show lower mean squared error than a supervised baseline and a higher control frequency, highlighting its potential for efficient real-time human-robot interaction.
An Overview of ImitationNet for Human-to-Robot Motion Retargeting
In the paper titled "ImitationNet: Unsupervised Human-to-Robot Motion Retargeting via Shared Latent Space," the authors address the task of translating human motion to robot motion, a fundamental challenge in deploying robotics for natural human-robot interaction (HRI). The paper distinguishes itself by proposing an unsupervised deep learning approach, which removes the need for paired human-to-robot data and thereby enables motion retargeting across diverse robotic platforms.
Methodology and Novel Contributions
At the core of the presented technique is an encoder-decoder architecture that establishes a shared latent space through adaptive contrastive learning. The novelty lies in bypassing the paired datasets traditionally required to train motion-retargeting models: instead, the method constructs a latent space that jointly represents human and robot poses. This is achieved by:
- Cross-Domain Similarity Metric: The authors introduce a similarity measure based on the global rotations of body limbs to capture the visual likeness of human and robot poses. This metric defines the structure of the shared latent space (see the first sketch after this list).
- Encoder-Decoder Architecture: Two encoders project human and robot pose data into a unified latent space, and a single decoder transforms these latent representations into robot joint angles suitable for direct actuation (see the second sketch below).
- Contrastive Learning: A triplet loss enforces the clustering of similar poses and the separation of dissimilar ones within the latent space, enabling effective unsupervised learning (see the third sketch below).
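
The paper defines its similarity measure over the global rotations of body limbs. Below is a minimal sketch of one plausible form of such a metric: the mean geodesic distance between corresponding limb rotations. The function name, the (L, 3, 3) rotation-matrix layout, and the human-to-robot limb correspondence are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def limb_rotation_distance(rot_a: torch.Tensor, rot_b: torch.Tensor) -> torch.Tensor:
    """Mean geodesic distance between two sets of global limb rotations.

    rot_a, rot_b: (L, 3, 3) rotation matrices, one per body limb (the limb
    correspondence between skeletons is an assumption of this sketch).
    A smaller value means the two poses look more alike.
    """
    # Relative rotation between corresponding limbs: R_rel = R_a^T R_b
    rel = rot_a.transpose(-1, -2) @ rot_b
    # Recover the rotation angle from the trace: cos(theta) = (tr(R_rel) - 1) / 2
    trace = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos_theta = ((trace - 1.0) / 2.0).clamp(-1.0, 1.0)  # clamp for numerical safety
    return torch.acos(cos_theta).mean()                  # average angle over all limbs
```

In a contrastive setup, a metric like this can decide which human-robot pose pairs count as positives (visually similar) and which as negatives.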
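The two-encoder, one-decoder layout can be sketched as follows. The pose dimensionalities, layer sizes, and plain MLP blocks are assumptions for illustration; the paper's actual network may differ.

```python
import torch
import torch.nn as nn

class PoseRetargetingModel(nn.Module):
    """Minimal sketch of the two-encoder / one-decoder layout.

    human_dim, robot_dim, and latent_dim are illustrative defaults,
    not the paper's configuration.
    """
    def __init__(self, human_dim=51, robot_dim=25, latent_dim=64):
        super().__init__()
        def mlp(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, d_out))
        self.human_encoder = mlp(human_dim, latent_dim)   # human pose -> latent
        self.robot_encoder = mlp(robot_dim, latent_dim)   # robot pose -> latent
        self.decoder = mlp(latent_dim, robot_dim)         # latent -> joint angles

    def retarget(self, human_pose: torch.Tensor) -> torch.Tensor:
        """Map a human pose directly to actuable robot joint angles."""
        return self.decoder(self.human_encoder(human_pose))
```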
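Finally, a standard triplet loss over latent embeddings captures the clustering behavior described above; the margin value here is illustrative, and PyTorch's built-in `nn.TripletMarginLoss` implements the same objective.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the positive embedding closer to the anchor than the negative,
    by at least `margin`; the loss is zero once that gap is achieved."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```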
Results and Practical Implications
The proposed methodology was evaluated with both qualitative and quantitative metrics. Specifically, the mean squared error (MSE) of predicted joint angles was notably lower than that of a supervised baseline model, demonstrating higher precision without the need for paired data. Moreover, the retargeting process runs at a control frequency significantly higher than comparable approaches, strengthening its real-time applicability.
The practical implications of this work are multifaceted. By effectively mapping motions from humans to robots through unsupervised learning, this method advances the potential for HRI in various domains, including entertainment, therapy, and industrial automation. Moreover, the ability to interpolate between key poses within the latent space introduces a level of motion fluidity crucial for certain applications, like animating lifelike robotic performances or enabling smooth transitions in robotic teleoperation tasks (sketched below).
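
Latent-space interpolation of the kind described above can be sketched in a few lines, reusing the hypothetical `PoseRetargetingModel` from earlier. Linear interpolation between the two latent codes is an assumption here; the paper does not commit this summary to a particular path.

```python
import torch

def interpolate_keyposes(model, pose_a, pose_b, steps=10):
    """Interpolate between two human key poses in the shared latent space
    and decode each intermediate point to robot joint angles."""
    z_a = model.human_encoder(pose_a)
    z_b = model.human_encoder(pose_b)
    trajectory = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_t = (1 - t) * z_a + t * z_b      # convex combination of latents
        trajectory.append(model.decoder(z_t))
    return torch.stack(trajectory)          # (steps, robot_dim) joint-angle path
```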
Future Directions
The paper points to possible advancements such as further refining the similarity metric or integrating the shared latent space with broader contextual data, such as textual descriptions, to enhance semantic understanding in motion retargeting. As deep learning frameworks continue to evolve, these directions could significantly bolster the adaptability and intelligence of robotic systems operating in human environments.
Conclusion
"ImitationNet" represents an important contribution to the field of robotics, providing a robust framework for translating human motion to robotic systems efficiently and precisely without the prior need for cumbersome paired datasets. This innovation not only enhances the scope of HRI but also paves the way for broader adoption of robots in everyday human contexts by reducing the complexity and cost of deployment across various robotic platforms.