Papers

Topics

Authors

Recent

View all

Gemini 2.5 Flash

Gemini 2.5 Pro

GPT-5

GPT-4o

DeepSeek R1 via Azure

2000 character limit reached

EgoMimic: Scaling Imitation Learning via Egocentric Video (2410.24221v1)

Published 31 Oct 2024 in cs.RO and cs.CV

Abstract: The scale and diversity of demonstration data required for imitation learning is a significant challenge. We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data, specifically egocentric human videos paired with 3D hand tracking. EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on human and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvement on a diverse set of long-horizon, single-arm and bimanual manipulation tasks over state-of-the-art imitation learning methods and enables generalization to entirely new scenes. Finally, we show a favorable scaling trend for EgoMimic, where adding 1 hour of additional hand data is significantly more valuable than 1 hour of additional robot data. Videos and additional information can be found at https://egomimic.github.io/

References (53)

Citations (1)

View on Semantic Scholar

Summary

The paper proposes EgoMimic, a framework leveraging egocentric human video with 3D hand tracking as a scalable data source for imitation learning.
EgoMimic achieved substantial performance gains (8-33% success rate, 34-228% score) on complex manipulation and dual-arm coordination tasks compared to baselines.
The method shows strong generalization to new objects and environments and demonstrates superior scaling properties from human data over robot data.

Overview of EgoMimic: Scaling Imitation Learning via Egocentric Video

The paper "EgoMimic: Scaling Imitation Learning via Egocentric Video" addresses a key challenge in the field of imitation learning: the scaling and diversification of demonstration data necessary for effective learning. This paper proposes a novel method, EgoMimic, which utilizes egocentric human videos paired with 3D hand tracking as a scalable data source for imitation learning. The approach marks a departure from traditional methods that predominantly rely on teleoperation data, by treating human and robot data as equally valuable sources.

Methodology

Data Capture and Processing: The paper introduces a full-stack framework called EgoMimic, designed to scale manipulation via human embodiment data. The data is captured using ergonomic Project Aria glasses, which provide egocentric RGB video, 3D hand tracking, and device SLAM information. This data collection method enables passive scalability similar to the data sources that have propelled advances in computer vision and natural language processing.

Hardware System: The authors developed a low-cost bimanual robot system that attempts to bridge the kinematic and observational gap between human and robotic systems. The robot mimics the kinematic range of human arms and uses Project Aria glasses to align its visual frame with human demonstrations.

Cross-Domain Alignment: Critical to the effectiveness of EgoMimic is its data alignment techniques, which address the inherent differences in kinematics, distributions, and appearance between human and robot data. Key innovative steps include transforming human hand trajectories to align with robot end-effector actions and normalizing action distributions to counter discrepancies.

Unified Imitation Learning Architecture: The architecture enables co-training using both human and robot data, thereby leveraging their synergies to improve task performance. This contrasts with approaches that handle these data sources separately, resulting in constrained scalability and generalization potential.

Results and Evaluation

The framework was empirically evaluated on complex long-horizon tasks, including object manipulation and dual-arm coordination. EgoMimic demonstrated substantial performance improvements, yielding task success rate improvements ranging between 8-33% and task score improvements by 34-228% over baseline methods. The framework’s ability to generalize to unseen scenarios was particularly notable, demonstrating performance in both new object categories and new environments without additional robot data training.

Moreover, the evaluation highlighted the favorable scaling properties of human data compared to robot data. Additional human data significantly boosted the performance more effectively than equivalent amounts of additional robot data, underscoring the potential for passive data scalability.

Implications and Future Directions

The implications of EgoMimic’s methodology span both practical and theoretical aspects of robot learning. Practically, it suggests a pathway toward leveraging consumer-grade devices for scalable data aggregation in robotics, akin to the massive datasets that have catalyzed progress in vision and language domains. Theoretically, it challenges the conventional boundaries between human demonstration data and robotic teleoperation by suggesting a continuous spectrum of embodiment data sources.

Future research could focus on expanding the scope of generalized policies to new robotic embodiments and behaviors learned solely from human data. Investigating the translation of this approach to other robotic tasks and environments could further validate its utility and scalability. Moreover, the integration of more sophisticated cross-embodiment learning techniques could potentially enhance generalization across diverse task domains.

The paper offers a compelling method to bridge current limitations in robotic learning and provides substantial groundwork for future advancements in scalable imitation learning.

PDF Markdown

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (8)

GitHub

Tweets

https://twitter.com/danfei_xu/status/1858903230538760611

https://twitter.com/gm8xx8/status/1852171728568381767

https://twitter.com/archerlin__/status/1939695219429253518