EgoMAN Dataset for 3D Hand Trajectory
- The EgoMAN dataset is a comprehensive egocentric benchmark offering stage-aware 3D hand trajectory annotations with rich semantic, spatial, and motion labels.
- It organizes 300 hours of video into precise wrist-centric 6-DoF trajectories from diverse settings, enabling detailed analysis of hand-object interactions.
- The dataset includes structured QA pairs and rigorous evaluation metrics (e.g., ADE, FDE, DTW, Rot) to support robust model training and cross-domain generalization.
The EgoMAN dataset is a large-scale egocentric benchmark designed for interaction stage-aware 3D hand trajectory prediction. Developed as part of "Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos" (Chen et al., 18 Dec 2025), it addresses limitations of prior datasets that decouple hand motion from semantic supervision, providing tightly linked semantic, spatial, and motion reasoning annotations. The dataset consists of approximately 300 hours of egocentric video, 219,000 wrist-centric 6-DoF trajectories sampled at 10 Hz, and 3 million structured QA (question-answer) pairs. Recordings are sourced from Meta Project Aria glasses across the EgoExo4D, Nymeria, and HOT3D-Aria collections, featuring both laboratory and in-the-wild scenes with varied object manipulations. Data structuring and annotation processes are tailored to enable holistic hand trajectory prediction and evaluation, incorporating explicit interaction stage segmentation, trajectory-token interfaces, and multimodal QA supervision.
1. Dataset Composition and Acquisition
EgoMAN comprises an extensive suite of egocentric human-interaction videos, totaling approximately 300 hours across roughly 300,000 five-second clips. Data is obtained using Meta Project Aria glasses equipped with fisheye cameras and IMUs. Scene diversity arises from three sources:
- HOT3D-Aria: Scripted laboratory-style manipulations (e.g. opening cabinets, handling tools).
- EgoExo4D: In-the-wild daily activities (e.g. cooking, bike repair).
- Nymeria: Large-scale, wild egocentric footage covering everyday tasks.
Both left and right wrists are tracked, generating 219,000 6-DoF trajectories at 10 Hz, aligned to the final pre-interaction visual frame. Trajectories are represented by position in camera coordinates and a continuous 6D rotation parameterization (Zhou et al., 2019), i.e. the first two columns of the rotation matrix, which avoids the discontinuities of quaternion and Euler-angle encodings. Participant demographics are anonymized, with coverage spanning more than 1,500 scenes.
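As a reference for this rotation parameterization, a minimal NumPy sketch of the Zhou et al. (2019) 6D conversion (keep the first two columns of the rotation matrix, recover the full matrix by Gram-Schmidt) might look as follows; the function names are illustrative and not part of the EgoMAN tooling.

```python
import numpy as np

def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
    """Flatten the first two columns of a 3x3 rotation matrix into the
    continuous 6D representation (Zhou et al., 2019)."""
    return R[:, :2].reshape(-1, order="F")  # column-major: [col1; col2]

def sixd_to_rotmat(d6: np.ndarray) -> np.ndarray:
    """Recover a valid rotation matrix from the 6D representation via
    Gram-Schmidt orthonormalization."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)
```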
2. Interaction Stage Segmentation and Semantic Labeling
Each hand-object interaction segment is subdivided into two atomic stages:
- Approach: The hand advances toward the manipulation region but has not yet contacted the object.
- Manipulation: Upon contact, the hand executes the atomic action (e.g. grasp, lift).
Annotation is performed at atomic-action timestamps using GPT-4.1 prompted over five-second clips, with HOT3D segments automatically inferred from object-motion onset analysis (0.5–2 seconds before/after motion). For each segment, trajectory-encoded semantic tokens include:
| Trajectory Token | Description |
|---|---|
| `<ACT>` | Encodes the atomic action phrase |
| `<START>` | Approach-stage onset waypoint (t = 0) |
| `<CONTACT>` | Manipulation-onset waypoint (first contact) |
| `<END>` | Manipulation-completion waypoint |
Waypoint labels comprise timestamp, 3D position, and 6D rotation, and are supervised only when the waypoint is visible, to avoid annotation ambiguity.
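For concreteness, the sketch below models one annotated segment and its stage waypoints as plain Python dataclasses; the field names are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Waypoint:
    t: float                  # timestamp in seconds, relative to the clip start
    position: List[float]     # 3D position (x, y, z) in camera coordinates, meters
    rotation_6d: List[float]  # continuous 6D rotation (Zhou et al., 2019)
    visible: bool = True      # waypoints are supervised only when visible

@dataclass
class InteractionSegment:
    hand: str                        # "left" or "right" wrist
    action: str                      # <ACT>: atomic action phrase, e.g. "grasp the red can"
    start: Waypoint                  # <START>: approach-stage onset (t = 0)
    contact: Optional[Waypoint]      # <CONTACT>: first hand-object contact
    end: Optional[Waypoint]          # <END>: manipulation completion
    trajectory_10hz: List[Waypoint]  # full 6-DoF wrist sequence at 10 Hz
```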
3. Structured Question-Answer Annotation Scheme
EgoMAN features approximately 3 million QA pairs distributed as follows:
- Semantic reasoning: 21.6%
- Spatial reasoning: 42.6%
- Motion reasoning: 35.8%
QA generation leverages GPT-4.1 with custom prompts that enforce grounding in the interaction annotations, with outputs structured as JSON arrays. Quality assurance is performed via image-conditioned LLM judgments that filter ambiguous or low-quality samples. Representative QA examples include:
- Semantic: “What will be the next atomic action?” → “grasp the red can”
- Spatial: “Where and when will the left hand complete the manipulation?” → “at (0.32 m, –0.15 m, 0.8 m) at 1.4 s”
- Motion: “Given the past 0.5 s of motion, where will the right hand complete the approach stage?” → “at (–0.05 m, 0.20 m, 0.75 m)”
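As an illustration only (the released per-clip JSON schema may differ), a single spatial-reasoning QA record could be represented along these lines; all field names and the clip identifier are hypothetical.

```python
# Hypothetical QA record; field names are illustrative, not the released schema.
qa_record = {
    "clip_id": "egoexo4d/scene_0123/clip_0007",  # hypothetical identifier
    "category": "spatial",                        # semantic | spatial | motion
    "hand": "left",
    "question": "Where and when will the left hand complete the manipulation?",
    "answer": "at (0.32 m, -0.15 m, 0.8 m) at 1.4 s",
    "grounding": {
        "end_waypoint": {"t": 1.4, "position": [0.32, -0.15, 0.80]},
    },
}
```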
4. Data Structure, Splits, and Access
Splits are organized at the scene level (>1,500 scenes) as:
| Split | Scenes (%) | Trajectories | QA Pairs | Description |
|---|---|---|---|---|
| Pretraining | 1,014 (64%) | 74,000 | 1,000,000 (noisier) | Lower-quality, broad coverage |
| Finetuning | 498 (31%) | 17,000 | ≈Y | High-quality trajectories |
| Testing | 78 (5%) | EgoMAN-Unseen: 2,844<br>HOT3D-OOD: 990 | — | Held-out scenes for domain/generalization benchmarks |
Data formats include JSON for trajectory sequences (timestamps, 3D positions, and 6D rotations for `<START>`, `<CONTACT>`, and `<END>`, plus the full 10 Hz sequence), per-clip JSON arrays for QA, and video frames as image sequences or MP4s organized by a source/scene/clip hierarchy. Licensing follows the terms of the underlying EgoExo4D, Nymeria, and HOT3D-Aria collections, with EgoMAN annotations and splits slated for release at https://egoman-project.github.io/ (access via institutional email).
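A minimal loading sketch under the assumption of a per-clip JSON layout matching the description above (the actual released file structure may differ):

```python
import json
from pathlib import Path

def iter_clip_annotations(root: str):
    """Iterate trajectory annotations under a hypothetical
    <source>/<scene>/<clip>.json hierarchy; the layout is assumed, not official."""
    for path in sorted(Path(root).glob("*/*/*.json")):
        with open(path) as f:
            clip = json.load(f)
        # Stage waypoints (timestamp, 3D position, 6D rotation) plus the
        # full 10 Hz wrist sequence, as described above.
        waypoints = {k: clip[k] for k in ("START", "CONTACT", "END") if k in clip}
        yield path, waypoints, clip.get("trajectory_10hz", [])
```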
5. Evaluation Protocols and Baseline Results
Trajectory prediction evaluation leverages four primary metrics:
- ADE (m): average Euclidean distance over all future timesteps
- FDE (m): final timestep Euclidean distance
- DTW (m): dynamic time warping distance (shape/timing)
- Rot (°): mean geodesic rotation error
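These definitions admit a compact reference implementation. The NumPy sketch below (illustrative, not the official evaluation code) computes ADE, FDE, an unnormalized DTW distance, and the mean geodesic rotation error:

```python
import numpy as np

def ade(pred, gt):
    """Average Euclidean distance over all future timesteps; pred, gt: (T, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def fde(pred, gt):
    """Euclidean distance at the final timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def dtw(pred, gt):
    """Basic dynamic-time-warping distance between two 3D trajectories."""
    T, S = len(pred), len(gt)
    D = np.full((T + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[T, S])

def rot_error_deg(R_pred, R_gt):
    """Mean geodesic rotation error in degrees; inputs: (T, 3, 3) rotation matrices."""
    R_rel = np.einsum("tij,tkj->tik", R_pred, R_gt)  # R_pred[t] @ R_gt[t].T
    cos = (np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0
    ang = np.arccos(np.clip(cos, -1.0, 1.0))
    return float(np.degrees(ang).mean())
```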
Baseline and state-of-the-art results for EgoMAN-Unseen and HOT3D-OOD splits (best-of-10 sampling):
| Method | Unseen ADE↓ | Unseen FDE↓ | Unseen DTW↓ | Unseen Rot↓ | OOD ADE↓ | OOD FDE↓ | OOD DTW↓ | OOD Rot↓ |
|---|---|---|---|---|---|---|---|---|
| USST* | 0.233 | 0.394 | 0.220 | 46.98 | 0.245 | 0.409 | 0.226 | 55.80 |
| MMTwin* | 0.206 | 0.256 | 0.204 | 48.98 | 0.209 | 0.259 | 0.207 | 44.37 |
| HandsOnVLM* | 0.171 | 0.228 | 0.161 | 35.22 | 0.194 | 0.262 | 0.186 | 38.13 |
| FM-base | 0.160 | 0.229 | 0.144 | 37.00 | 0.161 | 0.237 | 0.147 | 39.47 |
| EgoMAN (Ours) | 0.124 | 0.179 | 0.111 | 32.75 | 0.141 | 0.217 | 0.130 | 35.09 |
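The table reports best-of-10 sampling, conventionally taken as the minimum-error sample among 10 stochastic predictions per example. A minimal sketch, assuming a hypothetical `sample_trajectory` callable:

```python
import numpy as np

def best_of_k(sample_trajectory, gt, k=10):
    """Best-of-k protocol: draw k samples, score each by ADE, keep the best.
    `sample_trajectory` is a hypothetical callable returning a (T, 3) array."""
    ade = lambda p, g: float(np.linalg.norm(p - g, axis=-1).mean())
    candidates = [sample_trajectory() for _ in range(k)]
    scores = [ade(p, gt) for p in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]
```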
Waypoint-only evaluation (EgoMAN-WP):
| Method | Contact↓ | Traj↓ | FPS↑ |
|---|---|---|---|
| VRB* | 0.300 | 0.271 | 0.03 |
| VidBot | 0.290 | 0.269 | 0.04 |
| EgoMAN-WP | 0.192 | 0.127 | 3.45 |
Ablation studies reveal:
- Reasoning pre-training alone (no flow matching): ADE rises from 0.150 to 0.215 on Unseen.
- Flow-matching pre-training alone (no reasoning): ADE rises from ≈0.188 to 0.215.
- Combining both with explicit 6-DoF waypoint supervision yields the best ADE, improving from 0.151 to 0.124.
6. Problem Formalization and Loss Functions
The core formulation is to predict the future 6-DoF wrist trajectory (3D position plus 6D rotation at 10 Hz) and its stage waypoints from the egocentric observation, the trajectory tokens, and the associated reasoning context. Training combines:
- Reasoning losses: a next-token language-modeling loss over the reasoning and trajectory tokens, plus regression terms on the position and rotation of the `<START>`, `<CONTACT>`, and `<END>` waypoints, applied only when the corresponding waypoint is visible (per the visibility-gated supervision described above).
- Flow-matching loss: supervises the continuous trajectory expert against the ground-truth 10 Hz wrist sequence.
- Total loss: the combination of the reasoning and flow-matching terms.
- Inference: the flow-matching expert generates the continuous trajectory conditioned on the predicted trajectory tokens.
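For intuition only, a generic conditional flow-matching training objective can be sketched as below (PyTorch); the linear interpolation path, the `model` signature, and the conditioning input are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, traj_gt, cond):
    """Generic conditional flow-matching objective (illustrative only).
    traj_gt: (B, T, D) ground-truth trajectory; cond: conditioning features
    (e.g. reasoning / trajectory-token embeddings)."""
    B = traj_gt.shape[0]
    x0 = torch.randn_like(traj_gt)                  # noise sample
    t = torch.rand(B, 1, 1, device=traj_gt.device)  # interpolation time in [0, 1]
    x_t = (1.0 - t) * x0 + t * traj_gt              # point on the linear path
    v_target = traj_gt - x0                         # target velocity field
    v_pred = model(x_t, t.view(B), cond)            # hypothetical model signature
    return F.mse_loss(v_pred, v_target)
```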
Key figures outline the data/model pipeline, reasoning module, trajectory-token interface, flow-matching expert, prediction results, waypoint accuracy, and scaling/ablation findings.
7. Research Context and Utility
EgoMAN advances egocentric 3D hand trajectory prediction by integrating semantic, spatial, and motion supervision, overcoming the decoupling of motion and semantic annotation in prior datasets. Its stage-aware annotation, structured QA pairs, and rigorous evaluation metrics facilitate end-to-end learning, model comparison, and generalization analysis across diverse real-world and scripted domains. The held-out splits (EgoMAN-Unseen, HOT3D-OOD) support cross-domain generalization tasks. The data and benchmarks provide a foundation for vision-language reasoning, action anticipation, and trajectory modeling in human-robot interaction, AR/VR, and the study of natural hand manipulation.
The dataset, along with its access protocols, evaluation schemes, and detailed annotation design, constitutes a comprehensive benchmark for stage-aware, semantically-linked 3D hand trajectory modeling in egocentric contexts (Chen et al., 18 Dec 2025).