- The paper introduces a unified dataset combining 1107 hours of egocentric video with standardized 3D hand-pose annotations for dexterous manipulation research.
- It details a harmonization pipeline that maps heterogeneous sources to the MANO 21-joint format and augments videos with temporally localized language primitives.
- Baseline experiments with a ViLT-based policy demonstrate accurate short-horizon hand-trajectory prediction, underscoring the dataset's potential for scalable vision-language-action learning.
OpenEgo: A Unified Multimodal Egocentric Dataset for Dexterous Manipulation
Motivation and Context
The OpenEgo dataset addresses a critical bottleneck in dexterous robotic manipulation: the lack of large-scale, unified egocentric datasets with both fine-grained, temporally localized action annotations and standardized 3D hand-pose trajectories. Prior datasets lack dexterous hand labels, intention-aligned language primitives, or sufficient scale and diversity. This fragmentation impedes progress in vision-language-action (VLA) models, imitation learning, and hierarchical policy architectures that require temporally and semantically rich supervision. OpenEgo consolidates six major egocentric datasets, standardizes their hand-pose representations, and augments them with detailed, timestamped action primitives, thereby providing a comprehensive resource for the community.
Dataset Construction and Standardization
OpenEgo comprises 1107 hours of egocentric video, 119.6 million frames, 290 manipulation tasks, and over 344,000 recordings, spanning 600+ environments and 1,400 distinct objects. The environments include diverse settings such as kitchens, assembly lines, and daily activity spaces, with at least 258 unique participants. The dataset unifies hand-pose annotations to the MANO 21-joint format, expressed in the camera coordinate frame, and provides a binary visibility mask for each joint to account for occlusions and missing data.
The hand-pose standardization pipeline is dataset-specific (a code sketch follows the list):
- For sources lacking native 3D hand poses, 2D landmarks are detected and back-projected using per-pixel depth and camera intrinsics.
- For datasets with world-frame hand poses, rigid transformations using per-frame extrinsics convert poses to the camera frame.
- All hand-pose data are mapped to the MANO-21 joint layout, with non-MANO joints dropped and the remaining joints reindexed as necessary.
This harmonization enables consistent downstream learning and evaluation across heterogeneous sources.
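The following minimal sketch illustrates these three steps, assuming a pinhole camera model, world-to-camera extrinsics (R, t), and a source-specific joint index map; the function names and the mapping are illustrative, not the released pipeline.

```python
import numpy as np

def backproject_2d(landmarks_2d, depth, K):
    """Lift 2D hand landmarks (J, 2) to camera-frame 3D points (J, 3)
    using per-pixel depth and pinhole intrinsics K (3x3)."""
    u, v = landmarks_2d[:, 0], landmarks_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]        # per-pixel depth lookup
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def world_to_camera(joints_world, R, t):
    """Rigidly transform world-frame joints (J, 3) into the camera frame
    using per-frame extrinsics (rotation R, translation t)."""
    return joints_world @ R.T + t

def to_mano21(joints, source_to_mano):
    """Reindex source joints into the MANO 21-joint layout and build a
    binary visibility mask (1 = joint present, 0 = missing/dropped)."""
    out = np.full((21, 3), np.nan)
    for src_idx, mano_idx in source_to_mano.items():
        out[mano_idx] = joints[src_idx]
    visible = ~np.isnan(out).any(axis=-1)
    return np.nan_to_num(out), visible.astype(np.uint8)
```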
Intention-Aligned Language Primitives
A distinguishing feature of OpenEgo is its comprehensive annotation of intention-aligned action primitives. Each primitive specifies the manipulated object, action, actor (left/right/both hands), and absolute temporal boundaries (start and end timestamps). High-level task labels are also provided. The annotation protocol ensures that only directly observed actions are labeled, with gaps retained to reflect natural task structure. This dual-level annotation supports both hierarchical policy learning and fine-grained action segmentation.
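A hypothetical record type makes the annotation fields concrete; the schema and field names below are illustrative and may differ from the released format.

```python
from dataclasses import dataclass

@dataclass
class ActionPrimitive:
    """One intention-aligned, temporally localized primitive (illustrative schema)."""
    object_name: str   # manipulated object, e.g. "mug"
    action: str        # low-level action verb, e.g. "pick up"
    actor: str         # "left", "right", or "both"
    start_s: float     # absolute start timestamp in seconds
    end_s: float       # absolute end timestamp in seconds
    task: str          # high-level task label, e.g. "make coffee"

# Example: one primitive within a longer high-level task.
primitive = ActionPrimitive("mug", "pick up", "right", start_s=12.4, end_s=14.1, task="make coffee")
```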
Figure 1: Illustration of high- and low-level task annotations in OpenEgo.
Experimental Protocol and Baseline Results
The primary evaluation task is language-conditioned 3D hand-trajectory prediction. Each frame is represented by stacked left/right hand joints (q_t ∈ R^{42×3}) and a binary visibility mask. Given an RGB observation, a language prompt describing the intended manipulation, and the current hand configuration, the policy predicts the next T hand configurations.
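A shape-level sketch of this interface is shown below; the `policy` call signature is hypothetical, and only the tensor shapes follow the description above.

```python
import torch

def predict_trajectory(policy, rgb, prompt, q_t, horizon):
    """Language-conditioned hand-trajectory prediction (interface sketch).

    rgb:     (3, H, W) egocentric RGB frame
    prompt:  text describing the intended manipulation
    q_t:     (42, 3) stacked left/right MANO joints in the camera frame
    returns: (horizon, 42, 3) predicted future hand configurations
    """
    with torch.no_grad():
        pred = policy(rgb.unsqueeze(0), [prompt], q_t.unsqueeze(0), horizon)  # hypothetical signature
    return pred.squeeze(0)  # drop the batch dimension
```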
A ViLT-based policy is trained with a masked mean-squared error objective that ignores invisible joints. Training is conducted on a 0.1% subset of OpenEgo (13.44M sampled instances), with 10% held out for evaluation. The model is optimized for 15,000 steps with AdamW and a cosine-annealing learning-rate schedule on two NVIDIA RTX 4090 GPUs.
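A minimal sketch of such a masked objective, assuming a per-joint binary visibility mask as described above (the tensor layout is illustrative):

```python
import torch

def masked_mse(pred, target, visible):
    """Masked MSE over predicted hand trajectories, ignoring invisible joints.

    pred, target: (B, T, 42, 3) predicted / ground-truth joint positions
    visible:      (B, T, 42) binary mask, 0 where a joint is occluded or missing
    """
    sq_err = (pred - target).pow(2).sum(dim=-1)         # (B, T, 42) per-joint squared distance
    mask = visible.float()
    return (sq_err * mask).sum() / mask.sum().clamp(min=1.0)
```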
Performance is measured using three metrics (sketched in code after the list):
- AED (Average Euclidean Distance): Mean per-joint distance over the prediction horizon.
- FED (Final-step Euclidean Distance): Distance at the final predicted frame.
- DTW (Dynamic Time Warping): Alignment cost between predicted and reference joint sequences.
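These metrics can be sketched as follows; the exact DTW step pattern and any normalization used in the paper's evaluation may differ from this simple implementation.

```python
import numpy as np

def aed(pred, ref):
    """Average Euclidean Distance: mean per-joint distance over the horizon.
    pred, ref: (T, J, 3) predicted / reference joint trajectories."""
    return np.linalg.norm(pred - ref, axis=-1).mean()

def fed(pred, ref):
    """Final-step Euclidean Distance: mean per-joint distance at the last frame."""
    return np.linalg.norm(pred[-1] - ref[-1], axis=-1).mean()

def dtw(pred, ref):
    """Dynamic Time Warping: alignment cost between two joint sequences,
    with mean per-joint distance as the frame-to-frame cost."""
    T1, T2 = len(pred), len(ref)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(pred[i - 1] - ref[j - 1], axis=-1).mean()
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[T1, T2]
```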
Results indicate that short-horizon predictions yield lower errors, with AED and FED increasing smoothly as the horizon extends. DTW grows more rapidly, reflecting its sensitivity to temporal misalignment. The model demonstrates effective short-horizon dexterous motion prediction, with structured error scaling as a function of task difficulty and prediction length.
Limitations and Considerations
Several limitations are acknowledged:
- Missing Data: Hand joints are absent in some frames due to occlusions or annotation gaps; visibility masks flag these cases, but heavily occluded scenarios may still limit what can be learned.
- Annotation Quality: Language primitives are partially auto-generated and only partially verified, introducing potential temporal drift and ambiguity, especially for long or complex actions.
- 3D Joint Estimation: For sources lacking native 3D hand labels, the quality of reconstructed joints depends on the accuracy of 2D landmark detection and depth sensing.
- Experimental Scope: Baseline experiments utilize only a small fraction of the dataset and a single architecture; results should not be interpreted as upper bounds.
Implications and Future Directions
OpenEgo provides a unified, large-scale resource for research in dexterous manipulation, imitation learning, and VLA modeling. Its standardized hand-pose and language annotations enable reproducible benchmarking and facilitate the development of hierarchical and multimodal policies. The dataset is particularly well-suited for training world models, foundation VLMs, and hierarchical VLA architectures that require temporally and semantically rich supervision.
Potential future developments include:
- Scaling Experiments: Leveraging the full dataset for large-scale pretraining and fine-tuning of advanced policy architectures.
- Improved Annotation Pipelines: Enhancing the quality and coverage of language primitives through human verification and active learning.
- Cross-Domain Transfer: Investigating transfer learning from egocentric human demonstrations to robotic platforms, leveraging the unified hand-pose representation.
- Privacy and Ethics: Continued attention to privacy-preserving data release and responsible use, given the egocentric nature of the source material.
Conclusion
OpenEgo represents a significant step toward closing the gap between large-scale egocentric video and the requirements of dexterous manipulation learning. By consolidating diverse sources, standardizing hand-pose annotations, and providing intention-aligned language primitives, it establishes a new benchmark for research in vision-language-action learning and dexterous policy development. The dataset's scale, diversity, and annotation richness are poised to accelerate progress in both theoretical and applied aspects of embodied AI.