- The paper introduces a unified dataset combining 1107 hours of egocentric video with standardized 3D hand-pose annotations for dexterous manipulation research.
- It details a harmonization pipeline that maps heterogeneous sources to the MANO 21-joint format and augments videos with temporally localized language primitives.
- Baseline experiments with a ViLT-based policy demonstrate accurate short-horizon hand-trajectory prediction, underscoring the dataset's potential for scalable vision-language-action learning.
OpenEgo: A Unified Multimodal Egocentric Dataset for Dexterous Manipulation
Motivation and Context
The OpenEgo dataset addresses a critical bottleneck in dexterous robotic manipulation: the lack of large-scale, unified egocentric datasets with both fine-grained, temporally localized action annotations and standardized 3D hand-pose trajectories. Prior datasets lack dexterous hand labels, intention-aligned language primitives, or sufficient scale and diversity. This fragmentation impedes progress in vision-language-action (VLA) models, imitation learning, and hierarchical policy architectures that require temporally and semantically rich supervision. OpenEgo consolidates six major egocentric datasets, standardizes their hand-pose representations, and augments them with detailed, timestamped action primitives, thereby providing a comprehensive resource for the community.
Dataset Construction and Standardization
OpenEgo comprises 1107 hours of egocentric video, 119.6 million frames, 290 manipulation tasks, and over 344,000 recordings, spanning 600+ environments and 1,400 distinct objects. The environments include diverse settings such as kitchens, assembly lines, and daily activity spaces, with at least 258 unique participants. The dataset unifies hand-pose annotations to the MANO 21-joint format, expressed in the camera coordinate frame, and provides a binary visibility mask for each joint to account for occlusions and missing data.
The hand-pose standardization pipeline is dataset-specific (a code sketch follows the list):
- For sources lacking native 3D hand poses, 2D landmarks are detected and back-projected using per-pixel depth and camera intrinsics.
- For datasets with world-frame hand poses, rigid transformations using per-frame extrinsics convert poses to the camera frame.
- All hand-pose data are mapped to the MANO-21 joint layout, with non-MANO joints dropped and the remaining joints reindexed as necessary.
This harmonization enables consistent downstream learning and evaluation across heterogeneous sources.
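The following minimal sketch illustrates these three steps, assuming a pinhole camera model, world-to-camera extrinsics (R, t), and a source-specific joint index map; the function names and the mapping are illustrative, not the released pipeline.

```python
import numpy as np

def backproject_2d(landmarks_2d, depth, K):
    """Lift 2D hand landmarks (J, 2) to camera-frame 3D points (J, 3)
    using per-pixel depth and pinhole intrinsics K (3x3)."""
    u, v = landmarks_2d[:, 0], landmarks_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]        # per-pixel depth lookup
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def world_to_camera(joints_world, R, t):
    """Rigidly transform world-frame joints (J, 3) into the camera frame
    using per-frame extrinsics (rotation R, translation t)."""
    return joints_world @ R.T + t

def to_mano21(joints, source_to_mano):
    """Reindex source joints into the MANO 21-joint layout and build a
    binary visibility mask (1 = joint present, 0 = missing/dropped)."""
    out = np.full((21, 3), np.nan)
    for src_idx, mano_idx in source_to_mano.items():
        out[mano_idx] = joints[src_idx]
    visible = ~np.isnan(out).any(axis=-1)
    return np.nan_to_num(out), visible.astype(np.uint8)
```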
Intention-Aligned Language Primitives
A distinguishing feature of OpenEgo is its comprehensive annotation of intention-aligned action primitives. Each primitive specifies the manipulated object, action, actor (left/right/both hands), and absolute temporal boundaries (start and end timestamps). High-level task labels are also provided. The annotation protocol ensures that only directly observed actions are labeled, with gaps retained to reflect natural task structure. This dual-level annotation supports both hierarchical policy learning and fine-grained action segmentation.
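A hypothetical record type makes the annotation fields concrete; the schema and field names below are illustrative and may differ from the released format.

```python
from dataclasses import dataclass

@dataclass
class ActionPrimitive:
    """One intention-aligned, temporally localized primitive (illustrative schema)."""
    object_name: str   # manipulated object, e.g. "mug"
    action: str        # low-level action verb, e.g. "pick up"
    actor: str         # "left", "right", or "both"
    start_s: float     # absolute start timestamp in seconds
    end_s: float       # absolute end timestamp in seconds
    task: str          # high-level task label, e.g. "make coffee"

# Example: one primitive within a longer high-level task.
primitive = ActionPrimitive("mug", "pick up", "right", start_s=12.4, end_s=14.1, task="make coffee")
```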
Figure 1: Illustration of high- and low-level task annotations in OpenEgo.
Experimental Protocol and Baseline Results
The primary evaluation task is language-conditioned 3D hand-trajectory prediction. Each frame is represented by stacked left/right hand joints (q_t ∈ R^{42×3}) and a binary visibility mask. Given an RGB observation, a language prompt describing the intended manipulation, and the current hand configuration, the policy predicts the next T hand configurations.
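A shape-level sketch of this interface is shown below; the `policy` call signature is hypothetical, and only the tensor shapes follow the description above.

```python
import torch

def predict_trajectory(policy, rgb, prompt, q_t, horizon):
    """Language-conditioned hand-trajectory prediction (interface sketch).

    rgb:     (3, H, W) egocentric RGB frame
    prompt:  text describing the intended manipulation
    q_t:     (42, 3) stacked left/right MANO joints in the camera frame
    returns: (horizon, 42, 3) predicted future hand configurations
    """
    with torch.no_grad():
        pred = policy(rgb.unsqueeze(0), [prompt], q_t.unsqueeze(0), horizon)  # hypothetical signature
    return pred.squeeze(0)  # drop the batch dimension
```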
A ViLT-based policy is trained with a masked mean-squared error objective that ignores invisible joints. Training is conducted on a 0.1% subset of OpenEgo (13.44M sampled instances), with 10% held out for evaluation. The model is optimized for 15,000 steps with AdamW and a cosine-annealing learning-rate schedule on two NVIDIA RTX 4090 GPUs.
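A minimal sketch of such a masked objective, assuming a per-joint binary visibility mask as described above (the tensor layout is illustrative):

```python
import torch

def masked_mse(pred, target, visible):
    """Masked MSE over predicted hand trajectories, ignoring invisible joints.

    pred, target: (B, T, 42, 3) predicted / ground-truth joint positions
    visible:      (B, T, 42) binary mask, 0 where a joint is occluded or missing
    """
    sq_err = (pred - target).pow(2).sum(dim=-1)         # (B, T, 42) per-joint squared distance
    mask = visible.float()
    return (sq_err * mask).sum() / mask.sum().clamp(min=1.0)
```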
Performance is measured using three metrics (sketched in code after the list):
- AED (Average Euclidean Distance): Mean per-joint distance over the prediction horizon.
- FED (Final-step Euclidean Distance): Distance at the final predicted frame.
- DTW (Dynamic Time Warping): Alignment cost between predicted and reference joint sequences.
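These metrics can be sketched as follows; the exact DTW step pattern and any normalization used in the paper's evaluation may differ from this simple implementation.

```python
import numpy as np

def aed(pred, ref):
    """Average Euclidean Distance: mean per-joint distance over the horizon.
    pred, ref: (T, J, 3) predicted / reference joint trajectories."""
    return np.linalg.norm(pred - ref, axis=-1).mean()

def fed(pred, ref):
    """Final-step Euclidean Distance: mean per-joint distance at the last frame."""
    return np.linalg.norm(pred[-1] - ref[-1], axis=-1).mean()

def dtw(pred, ref):
    """Dynamic Time Warping: alignment cost between two joint sequences,
    with mean per-joint distance as the frame-to-frame cost."""
    T1, T2 = len(pred), len(ref)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(pred[i - 1] - ref[j - 1], axis=-1).mean()
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[T1, T2]
```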
Results indicate that short-horizon predictions yield lower errors, with AED and FED increasing smoothly as the horizon extends. DTW grows more rapidly, reflecting its sensitivity to temporal misalignment. The model demonstrates effective short-horizon dexterous motion prediction, with structured error scaling as a function of task difficulty and prediction length.
Limitations and Considerations
Several limitations are acknowledged:
- Missing Data: Hand joints are absent in some frames due to occlusions or annotation gaps; visibility masks flag these cases, but heavily occluded scenarios may still limit what can be learned.
- Annotation Quality: Language primitives are partially auto-generated and only partially verified, introducing potential temporal drift and ambiguity, especially for long or complex actions.
- 3D Joint Estimation: For sources lacking native 3D hand labels, the quality of reconstructed joints depends on the accuracy of 2D landmark detection and depth sensing.
- Experimental Scope: Baseline experiments utilize only a small fraction of the dataset and a single architecture; results should not be interpreted as upper bounds.
Implications and Future Directions
OpenEgo provides a unified, large-scale resource for research in dexterous manipulation, imitation learning, and VLA modeling. Its standardized hand-pose and language annotations enable reproducible benchmarking and facilitate the development of hierarchical and multimodal policies. The dataset is particularly well-suited for training world models, foundation VLMs, and hierarchical VLA architectures that require temporally and semantically rich supervision.
Potential future developments include:
- Scaling Experiments: Leveraging the full dataset for large-scale pretraining and fine-tuning of advanced policy architectures.
- Improved Annotation Pipelines: Enhancing the quality and coverage of language primitives through human verification and active learning.
- Cross-Domain Transfer: Investigating transfer learning from egocentric human demonstrations to robotic platforms, leveraging the unified hand-pose representation.
- Privacy and Ethics: Continued attention to privacy-preserving data release and responsible use, given the egocentric nature of the source material.
Conclusion
OpenEgo represents a significant step toward closing the gap between large-scale egocentric video and the requirements of dexterous manipulation learning. By consolidating diverse sources, standardizing hand-pose annotations, and providing intention-aligned language primitives, it establishes a new benchmark for research in vision-language-action learning and dexterous policy development. The dataset's scale, diversity, and annotation richness are poised to accelerate progress in both theoretical and applied aspects of embodied AI.