EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
(2505.11709v1)
Published 16 May 2025 in cs.CV, cs.LG, and cs.RO
Abstract: Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models.
The paper "EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video" (Hoque et al., 16 May 2025) introduces a new large-scale dataset and benchmark designed to address the data scarcity problem in robot imitation learning, particularly for dexterous manipulation. The core idea is to leverage egocentric human video as a passively scalable data source, contrasting with labor-intensive robot teleoperation.
The EgoDex dataset is the primary contribution. Collected using Apple Vision Pro, it comprises 829 hours of 1080p, 30 Hz egocentric video, totaling 90 million frames and 338,000 task demonstrations across 194 distinct tabletop manipulation tasks. This makes it significantly larger and more diverse than existing egocentric datasets like Ego4D (which lacks fine-grained hand pose) or robot teleoperation datasets like DROID and BridgeData V2. Key properties that make EgoDex well suited to learning dexterous manipulation include:
Passive Scalability: Data is a byproduct of human activity, similar to how text/images are generated online.
Rich Modalities: Includes high-resolution egocentric video, precise 3D pose data for the upper body and all 25 joints of each hand (collected via on-device SLAM and calibrated cameras), pose confidence scores, and detailed natural language annotations.
Behavioral Diversity: Covers a wide range of dexterous tasks beyond simple pick-and-place, such as tying shoelaces, folding laundry, unscrewing bottle caps, and dealing cards. Tasks are categorized as Reversible, Reset-free, and Reset, enabling efficient data collection.
Camera Information: Provides camera intrinsics and extrinsics, crucial for accurate 3D reasoning and reconstruction.
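Because intrinsics, extrinsics, and 3D joint positions are all provided, 2D supervision such as projected hand keypoints can be derived directly. The snippet below is a minimal sketch of that projection; the array shapes and the assumption that the extrinsic is a world-to-camera transform are illustrative, not a description of the released file format.

```python
import numpy as np

def project_hand_joints(joints_world, extrinsic, intrinsic):
    """Project 3D hand joints into the egocentric image plane.

    joints_world: (J, 3) joint positions in the world/SLAM frame.
    extrinsic:    (4, 4) world-to-camera transform (assumed convention).
    intrinsic:    (3, 3) pinhole camera matrix.
    Returns (J, 2) pixel coordinates.
    """
    # Homogeneous coordinates, then world -> camera frame.
    joints_h = np.concatenate([joints_world, np.ones((len(joints_world), 1))], axis=1)
    joints_cam = (extrinsic @ joints_h.T).T[:, :3]
    # Pinhole projection: apply intrinsics, then divide by depth.
    uv = (intrinsic @ joints_cam.T).T
    return uv[:, :2] / uv[:, 2:3]
```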
The authors define EgoDex Benchmarks for evaluating progress. The action representation for learning policies is a 48-dimensional vector per timestep, 24 dimensions per hand: the 3D position of the wrist, the 6D orientation of the wrist, and the 3D positions of the five fingertips (a sketch of this layout follows the two benchmark tasks below). Actions are predicted in chunks over a fixed time horizon H. Two main benchmark tasks are proposed:
Dexterous Trajectory Prediction: Given past egocentric video frames, skeletal poses, and a natural language task description, predict the future hand trajectories for a horizon H.
Inverse Dynamics (Visually Goal-Conditioned): Similar to trajectory prediction, but also provided with a goal image corresponding to the desired state at the end of the horizon H. This aims to mitigate the multimodality inherent in human motion.
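To make the action layout concrete, here is a minimal sketch assuming 24 dimensions per hand (3 for wrist position, 6 for wrist orientation, 5 x 3 for fingertips), stacked to 48 per timestep and chunked over a horizon of H frames. The function names and tuple inputs are illustrative, and the 6D orientation is assumed to be the usual two-column rotation-matrix encoding.

```python
import numpy as np

def hand_action(wrist_pos, wrist_rot6d, fingertip_pos):
    """Per-hand action: 3 (wrist position) + 6 (wrist 6D rotation) + 5*3 (fingertips) = 24 dims."""
    return np.concatenate([wrist_pos, wrist_rot6d, fingertip_pos.reshape(-1)])

def action_chunk(left_frames, right_frames):
    """Stack per-timestep left/right hand actions into an (H, 48) chunk over a horizon of H frames."""
    return np.stack([
        np.concatenate([hand_action(*left), hand_action(*right)])
        for left, right in zip(left_frames, right_frames)
    ])
```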
Evaluation uses a "best of K" metric: the policy samples K candidate trajectories, and the reported error is the lowest average Euclidean distance, in meters, between the predicted 3D keypoint positions (wrists and fingertips) and the ground truth over the prediction horizon. Taking the best of K samples accounts for the multimodality of human demonstrations. A fixed 1% held-out test set ensures reproducible evaluations.
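A minimal sketch of this metric, assuming K sampled trajectories over J keypoints (e.g., 2 wrists + 10 fingertips = 12 for both hands) and a horizon of H steps; the array layout is an assumption, not the official evaluation code.

```python
import numpy as np

def best_of_k_error(pred_trajs, gt_traj):
    """Best-of-K keypoint error.

    pred_trajs: (K, H, J, 3) K sampled trajectories of J keypoints over horizon H.
    gt_traj:    (H, J, 3) ground-truth keypoint trajectory.
    Returns the lowest (over K samples) mean Euclidean distance in meters.
    """
    # Per-sample mean over horizon and keypoints of the L2 distance, then min over samples.
    dists = np.linalg.norm(pred_trajs - gt_traj[None], axis=-1)  # (K, H, J)
    return dists.mean(axis=(1, 2)).min()
```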
Experiments were conducted using state-of-the-art imitation learning policies (Behavior Cloning, Denoising Diffusion, Flow Matching) and Transformer architectures (encoder-decoder and decoder-only) from the X-IL framework. Practical findings include:
Encoder-decoder architectures generally outperformed decoder-only models.
Flow Matching and Denoising Diffusion models achieved lower error for K > 1 (capturing multimodality), while Behavior Cloning was better for K=1 (average prediction).
Predictive accuracy decreased as the prediction horizon increased.
Visual goal-conditioning significantly improved performance, especially the final distance error, by providing a clear visual target.
Performance scaled positively with the amount of training data used, reinforcing the value of large datasets.
Medium-sized models (200M parameters) were sufficient for the current dataset size, suggesting that significant gains might require even larger datasets or models.
Research Use Cases enabled by EgoDex span multiple fields:
Robotics: Training policies for humanoid robots with dexterous hands. Strategies for bridging the human-robot embodiment gap include co-training on a mixture of human and (smaller) robot datasets, fine-tuning policies pre-trained on the human data, and learning visual representations or manipulation priors from the human data that are then transferred to robot control (a co-training sketch follows this list).
Perception: Training models for egocentric action recognition, human-object interaction detection, tracking contact points, understanding object affordances, and analyzing complex tool use from an egocentric viewpoint.
Video Generation and World Models: EgoDex provides rich data (video + structured 3D pose + language) for training egocentric generative models and world models, which could simulate future visual states and potentially aid in decision-making for AI agents operating from a first-person perspective.
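As referenced in the Robotics item above, one simple way to co-train on human and robot data is to fix a per-batch mixing ratio. The sketch below is illustrative only; the sampling ratio, batch size, and data containers are assumptions, not choices made in the paper.

```python
import random

def cotraining_batches(human_data, robot_data, robot_fraction=0.25, batch_size=64):
    """Yield mixed batches: a fixed fraction of robot samples, the rest human.

    human_data / robot_data: lists of transitions in a shared action space.
    robot_fraction and batch_size are illustrative hyperparameters.
    """
    n_robot = int(batch_size * robot_fraction)
    while True:
        batch = random.sample(robot_data, n_robot) + \
                random.sample(human_data, batch_size - n_robot)
        random.shuffle(batch)
        yield batch
```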
While EgoDex provides unprecedented scale and detail for dexterous manipulation data, the authors note limitations in scene diversity (primarily tabletop environments) and potential inaccuracies in pose tracking during heavy occlusion or high-speed movements. Future work aims to address scene diversity through data augmentation and potentially improve pose tracking.
The dataset is publicly available, aiming to accelerate research in imitation learning, computer vision, and foundation models by providing a rich, large-scale resource for understanding and replicating human dexterous skills. Access details and a full list of tasks and joint annotations are provided in the appendix.