OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation (2410.11792v1)

Published 15 Oct 2024 in cs.RO, cs.AI, cs.CV, and cs.LG

Abstract: We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalizations across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website https://ut-austin-rpl.github.io/OKAMI/.

Citations (9)

Summary

  • The paper introduces a novel framework that teaches humanoid robot manipulation from a single RGB-D video demonstration using object-aware retargeting.
  • It leverages open-world vision models and an improved SLAHMR motion-reconstruction algorithm to generate reference plans with detected subgoals.
  • Experiments show a 71.7% task success rate for OKAMI and a 79.2% average success rate for closed-loop visuomotor policies trained on its rollouts, outperforming existing methods.

Analysis of OKAMI: Single-Video Imitation for Humanoid Robots

The paper "OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation" introduces OKAMI, a novel framework aimed at enabling humanoid robots to learn manipulation skills from single RGB-D video demonstrations. The primary focus is on object-aware retargeting, allowing humanoid robots to mimic human actions while adapting to varying object locations within distinct environments. This method demonstrates significant advancements in generalizing robot imitation from observation under diverse spatial and visual conditions, surpassing existing state-of-the-art methods.

Key Contributions

The OKAMI framework is composed of two main stages:

  1. Reference Plan Generation:
    • OKAMI leverages open-world vision models to identify task-relevant objects without manual annotations.
    • Human motions are reconstructed using an improved SLAHMR algorithm, incorporating the SMPL-H model for body and hand poses.
    • It creates a manipulation plan based on detected subgoals, identified through unsupervised temporal segmentation and geometric/semantic heuristics.
  2. Object-Aware Retargeting:
    • The framework applies a factorized process, separately adapting arm and hand motions.
    • Human motions are retargeted to the humanoid by warping the demonstrated trajectories to the observed object locations, with inverse kinematics producing feasible joint commands (see the sketch after this list).
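
To make the retargeting stage concrete, below is a minimal, hypothetical Python sketch of the object-aware idea: each subgoal stores the demonstrated wrist trajectory relative to its task-relevant object, and at deployment the trajectory is re-anchored to the object's newly detected position before being converted to joint commands. The names (`Subgoal`, `retarget_subgoal`, `solve_ik`) and the position-only warping are illustrative assumptions, not the authors' implementation, which also handles orientations and hand-pose mapping.

```python
import numpy as np


class Subgoal:
    """One step of a reference plan extracted from a demonstration video (illustrative)."""

    def __init__(self, object_name, rel_hand_traj, hand_pose):
        self.object_name = object_name      # task-relevant object for this step
        self.rel_hand_traj = rel_hand_traj  # (T, 3) wrist waypoints in the object's frame
        self.hand_pose = hand_pose          # finger configuration to replay directly


def retarget_subgoal(subgoal, object_pos_now, solve_ik):
    """Warp the demonstrated trajectory to the object's current position and
    turn each warped target into arm joint commands via inverse kinematics."""
    arm_commands = []
    for rel_target in subgoal.rel_hand_traj:
        # Object-aware warping: place the demonstrated wrist target relative to
        # where the object is observed *now*, not where it was in the video.
        world_target = object_pos_now + rel_target
        # solve_ik stands in for a robot-specific IK solver that returns
        # feasible arm joint angles for a desired wrist position.
        arm_commands.append(solve_ik(world_target))
    # Hand poses are retargeted separately, per the factorized scheme.
    return arm_commands, subgoal.hand_pose


if __name__ == "__main__":
    def fake_ik(target):
        """Placeholder for a real IK solver; simply echoes the rounded target."""
        return np.round(target, 2)

    demo_traj = np.array([[0.0, 0.0, 0.20],       # approach above the object
                          [0.0, 0.0, 0.05]])      # descend to grasp height
    goal = Subgoal("cup", demo_traj, hand_pose="close")
    cup_pos_at_test = np.array([0.45, -0.10, 0.80])  # object position detected at deployment
    arm, hand = retarget_subgoal(goal, cup_pos_at_test, fake_ik)
    print(arm, hand)
```

The design choice mirrored here is the factorization described above: arm targets are warped to the current object location and solved with inverse kinematics, while finger poses are replayed directly.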

Experimental Insights

The framework was tested on several tasks requiring nuanced manipulation, such as placing objects and interacting with articulated components. OKAMI achieved a task success rate of 71.7%, outperforming the baseline method, ORION, by a significant margin, and closed-loop visuomotor policies trained on OKAMI-generated rollouts reached an average success rate of 79.2%.

Implications and Future Directions

The success of OKAMI indicates a promising direction for simplifying the data acquisition process in imitation learning, reducing reliance on teleoperation. The deployment of humanoid robots in open-world settings becomes more feasible through such advancements.

However, the method currently relies on RGB-D input, suggesting an opportunity to extend it to standard RGB videos from the Internet and further broaden its applicability. Additionally, integrating lower-body locomotion for more complex tasks could widen the practical deployment of humanoids trained via video imitation.

Conclusion

The methodologies proposed in OKAMI make notable strides in the domain of video-based imitation learning for humanoids. Its object-aware retargeting provides a robust mechanism for real-world deployment, adapting efficiently to spatial and visual variations. This research pushes the boundaries of robot learning from minimal inputs, holding potential for scalable, efficient robot training paradigms in increasingly complex tasks and environments.
