
EgoMAN Dataset for 3D Hand Trajectory

Updated 20 December 2025
  • EgoMAN dataset is a comprehensive egocentric benchmark offering stage-aware 3D hand trajectory annotations with rich semantic, spatial, and motion labels.
  • It organizes 300 hours of video into precise wrist-centric 6-DoF trajectories from diverse settings, enabling detailed analysis of hand-object interactions.
  • The dataset includes structured QA pairs and rigorous evaluation metrics (e.g., ADE, FDE, DTW, Rot) to support robust model training and cross-domain generalization.

The EgoMAN dataset is a large-scale, egocentric benchmark designed for interaction stage-aware 3D hand trajectory prediction. Developed as part of "Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos" (Chen et al., 18 Dec 2025), it addresses limitations in prior datasets that decouple hand motion from semantic supervision and provides tightly linked semantic, spatial, and motion reasoning annotations. The dataset consists of approximately 300 hours of egocentric video, 219,000 wrist-centric 6-DoF trajectories sampled at 10 Hz, and 3 million structured QA (question-answer) pairs. Recordings are captured with Meta Project Aria glasses and drawn from the EgoExo4D, Nymeria, and HOT3D-Aria collections, featuring both laboratory and in-the-wild scenes with varied object manipulations. Data structuring and annotation processes are tailored to enable holistic hand trajectory prediction and evaluation, incorporating explicit interaction stage segmentation, trajectory-token interfaces, and multimodal QA supervision.

1. Dataset Composition and Acquisition

EgoMAN comprises an extensive suite of egocentric human-interaction videos, totaling approximately 300 hours across roughly 300,000 five-second clips. Data is obtained using Meta Project Aria glasses equipped with fisheye cameras and IMUs. Scene diversity arises from three sources:

  • HOT3D-Aria: Scripted laboratory-style manipulations (e.g. opening cabinets, handling tools).
  • EgoExo4D: In-the-wild daily activities (e.g. cooking, bike repair).
  • Nymeria: Large-scale, wild egocentric footage covering everyday tasks.

Both left and right wrists are tracked, generating 219,000 6-DoF trajectories at 10 Hz, aligned to the final pre-interaction visual frame. Each trajectory is represented by a position in camera coordinates, p_t = (x_t, y_t, z_t), and a continuous 6D rotation parameterization (Zhou et al. 2019), i.e., two three-dimensional vectors corresponding to the first two columns of the rotation matrix. Participant demographics are anonymized, with coverage spanning more than 1,500 scenes.
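
The 6D rotation parameterization can be converted to and from a rotation matrix as sketched below; this is a minimal NumPy illustration of the Zhou et al. 2019 scheme, not code from the EgoMAN release.

```python
import numpy as np

def rotmat_to_6d(R: np.ndarray) -> np.ndarray:
    """Take the first two columns of a 3x3 rotation matrix (Zhou et al. 2019)."""
    return R[:, :2].reshape(-1, order="F")  # (6,): column 1 followed by column 2

def rotmat_from_6d(d6: np.ndarray) -> np.ndarray:
    """Recover a valid rotation matrix from the 6D vector via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                   # third column from the cross product
    return np.stack([b1, b2, b3], axis=1)
```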

2. Interaction Stage Segmentation and Semantic Labeling

Hand-object interaction segments are subdivided into two atomic stages:

  1. Approach: The hand advances toward the manipulation region but has not yet contacted the object.
  2. Manipulation: Upon contact, the hand executes the atomic action (e.g. grasp, lift).

Annotation is performed at atomic-action timestamps using GPT-4.1 prompted over five-second clips, with HOT3D segments automatically inferred from object-motion onset analysis (0.5–2 seconds before/after motion). For each segment, trajectory-encoded semantic tokens include:

| Trajectory Token | Description |
|---|---|
| <ACT> | Encodes the atomic action phrase |
| <START> | Approach-stage onset waypoint (t = 0) |
| <CONTACT> | Manipulation onset waypoint (first contact) |
| <END> | Manipulation completion waypoint |

Waypoint labels comprise timestamp, 3D position, and 6D rotation, and are supervised only when the waypoint is visible, to avoid annotation ambiguity.
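
As a concrete illustration, one per-segment annotation record could look like the following sketch; the field names are assumptions for exposition, not the released schema.

```python
# Hypothetical annotation record for one right-hand manipulation segment (illustrative only).
segment = {
    "hand": "right",
    "action": "grasp the red can",  # phrase carried by the <ACT> token
    "waypoints": {
        "START":   {"t": 0.0, "p": [0.10, -0.05, 0.62], "rot6d": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0], "visible": True},
        "CONTACT": {"t": 0.8, "p": [0.28, -0.12, 0.74], "rot6d": [0.9, 0.1, 0.0, -0.1, 0.9, 0.0], "visible": True},
        "END":     {"t": 1.4, "p": [0.32, -0.15, 0.80], "rot6d": [0.9, 0.1, 0.0, -0.1, 0.9, 0.0], "visible": False},  # not supervised when occluded
    },
}
```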

3. Structured Question-Answer Annotation Scheme

EgoMAN features approximately 3 million QA pairs distributed as follows:

  • Semantic reasoning: 21.6%
  • Spatial reasoning: 42.6%
  • Motion reasoning: 35.8%

QA generation leverages GPT-4.1 with custom prompts that enforce grounding in the interaction annotations; outputs are structured as JSON arrays. Quality assurance is performed via image-conditioned LLM judgments to filter ambiguous or low-quality samples. Representative QA examples include:

  • Semantic: “What will be the next atomic action?” → “grasp the red can”
  • Spatial: “Where and when will the left hand complete the manipulation?” → “at (0.32 m, –0.15 m, 0.8 m) at 1.4 s”
  • Motion: “Given the past 0.5 s of motion, where will the right hand complete the approach stage?” → “at (–0.05 m, 0.20 m, 0.75 m)”
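
A per-clip QA array in this scheme might be serialized as in the sketch below; the JSON keys are illustrative assumptions, not the published format.

```python
import json

qa_pairs = [
    {"type": "semantic", "question": "What will be the next atomic action?",
     "answer": "grasp the red can"},
    {"type": "spatial", "question": "Where and when will the left hand complete the manipulation?",
     "answer": "at (0.32 m, -0.15 m, 0.8 m) at 1.4 s"},
    {"type": "motion", "question": "Given the past 0.5 s of motion, where will the right hand complete the approach stage?",
     "answer": "at (-0.05 m, 0.20 m, 0.75 m)"},
]
print(json.dumps(qa_pairs, indent=2))  # one JSON array per clip
```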

4. Data Structure, Splits, and Access

Splits are organized at the scene level (>1,500 scenes) as:

| Split | Scenes (%) | Trajectories | QA Pairs | Description |
|---|---|---|---|---|
| Pretraining | 1,014 (64%) | 74,000 | 1,000,000 (noisy) | Lower-quality, broad coverage |
| Finetuning | 498 (31%) | 17,000 | ≈Y | High-quality trajectories |
| Testing | 78 (5%) | EgoMAN-Unseen: 2,844; HOT3D-OOD: 990 | — | Held-out scenes for domain/generalization benchmarks |

Data formats include JSON for trajectory sequences (timestamps, positions p_t, and 6D rotations for <START>, <CONTACT>, <END>, plus the full 10 Hz sequence), QA in per-clip JSON arrays, and video frames as image sequences or MP4s organized in a source/scene/clip hierarchy. Licensing adheres to the foundational requirements of EgoExo4D, Nymeria, and HOT3D-Aria, with EgoMAN annotations and splits slated for release at https://egoman-project.github.io/ (access via institutional email).
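
A minimal loading sketch under the layout described above; the directory layout and file names are assumptions, not the published package structure.

```python
import json
from pathlib import Path

root = Path("EgoMAN")  # hypothetical root: EgoMAN/<source>/<scene>/<clip>/
for traj_file in root.glob("*/*/*/trajectory.json"):        # assumed filename
    clip_dir = traj_file.parent
    traj = json.loads(traj_file.read_text())                 # waypoints plus the full 10 Hz sequence
    qa = json.loads((clip_dir / "qa.json").read_text())      # per-clip QA array (assumed filename)
    print(clip_dir, len(traj.get("sequence", [])), "steps,", len(qa), "QA pairs")
```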

5. Evaluation Protocols and Baseline Results

Trajectory prediction evaluation leverages four primary metrics:

  • ADE (m): average Euclidean distance over all future timesteps
  • FDE (m): final timestep Euclidean distance
  • DTW (m): dynamic time warping distance (shape/timing)
  • Rot (°): mean geodesic rotation error
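
The metrics above can be computed as in this minimal sketch, assuming NumPy arrays of predicted and ground-truth positions (T, 3) and rotation matrices (T, 3, 3); it illustrates the definitions rather than reproducing the official evaluation code.

```python
import numpy as np

def ade(pred, gt):                       # (T, 3) positions in meters
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    return np.linalg.norm(pred[-1] - gt[-1])

def dtw(pred, gt):
    """Classic O(T^2) dynamic time warping over Euclidean point distances."""
    n, m = len(pred), len(gt)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / max(n, m)           # normalization here is an assumption

def rot_error_deg(R_pred, R_gt):         # (T, 3, 3) rotation matrices
    R_rel = np.einsum("tij,tkj->tik", R_pred, R_gt)              # R_pred @ R_gt^T per step
    cos = np.clip((np.trace(R_rel, axis1=1, axis2=2) - 1) / 2, -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()                     # mean geodesic angle
```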

Baseline and state-of-the-art results for EgoMAN-Unseen and HOT3D-OOD splits (best-of-10 sampling):

| Method | ADE↓ (Unseen) | FDE↓ (Unseen) | DTW↓ (Unseen) | Rot↓ (Unseen) | ADE↓ (OOD) | FDE↓ (OOD) | DTW↓ (OOD) | Rot↓ (OOD) |
|---|---|---|---|---|---|---|---|---|
| USST* | 0.233 | 0.394 | 0.220 | 46.98 | 0.245 | 0.409 | 0.226 | 55.80 |
| MMTwin* | 0.206 | 0.256 | 0.204 | 48.98 | 0.209 | 0.259 | 0.207 | 44.37 |
| HandsOnVLM* | 0.171 | 0.228 | 0.161 | 35.22 | 0.194 | 0.262 | 0.186 | 38.13 |
| FM-base | 0.160 | 0.229 | 0.144 | 37.00 | 0.161 | 0.237 | 0.147 | 39.47 |
| EgoMAN (Ours) | 0.124 | 0.179 | 0.111 | 32.75 | 0.141 | 0.217 | 0.130 | 35.09 |

Waypoint-only evaluation (EgoMAN-WP):

| Method | Contact↓ | Traj↓ | FPS↑ |
|---|---|---|---|
| VRB* | 0.300 | 0.271 | 0.03 |
| VidBot | 0.290 | 0.269 | 0.04 |
| EgoMAN-WP | 0.192 | 0.127 | 3.45 |

Ablation studies reveal:

  • Reasoning pre-training alone (no FM): ADE rises from 0.150 → 0.215 (Unseen).
  • Flow-Matching pre-training alone (no reasoning): ADE ≈ 0.188 → 0.215.
  • Combination with explicit 6-DoF waypoints yields optimal ADE: 0.151 → 0.124.

6. Problem Formalization and Loss Functions

The core formulation is:

F: (V_t, \{L_{t-H\dots t}, R_{t-H\dots t}\}, \text{Intent } I) \rightarrow \{\hat{L}_{t+1\dots t+T}, \hat{R}_{t+1\dots t+T}\}
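
Read as an interface, the formulation above could be typed as in the following sketch; the tensor shapes, the 9-D per-step encoding (3D position plus 6D rotation), and the function name are illustrative assumptions, not the authors' API.

```python
import numpy as np
from typing import Dict

def predict_hand_trajectories(
    video: np.ndarray,        # V_t: observed egocentric frames up to time t, e.g. (num_frames, height, width, 3)
    past_left: np.ndarray,    # L_{t-H..t}: (H + 1, 9) = 3D position + 6D rotation per step
    past_right: np.ndarray,   # R_{t-H..t}: (H + 1, 9)
    intent: str,              # I: language intent, e.g. "grasp the red can"
) -> Dict[str, np.ndarray]:
    """Return future 6-DoF trajectories {"left": (T, 9), "right": (T, 9)}."""
    raise NotImplementedError  # placeholder: the model couples reasoning with a flow-matching expert
```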

Principal losses include:

  • Reasoning Losses:
    • L_{text}: next-token language-modeling loss
    • L_{act}: action loss over K embedding pairs (z_i, z_i^+); if K < \kappa,
      L_{act} = 1 - \frac{1}{K}\sum_i \cos(z_i, z_i^+),
      otherwise
      L_{act} = -\frac{1}{K}\sum_i \log \frac{\exp(\mathrm{sim}(z_i, z_i^+)/\tau)}{\sum_j \exp(\mathrm{sim}(z_i, z_j^+)/\tau)}
    • L_{wp} = \lambda_t L_{time} + \lambda_{3D} L_{3D} + \lambda_{2D} L_{2D} + \lambda_r L_{rot6D} + \lambda_{geo} L_{geo}
  • Total Loss: L_{total} = L_{text} + \lambda_{wp} L_{wp} + \lambda_{act} L_{act}
  • Flow-Matching Loss: L_{FM} = \| \hat{v} - (x_1 - x_0) \|_2^2
    • Inference: x_{k+1} = x_k + \Delta t \cdot \hat{v}(x_k, t_k), \quad \Delta t = 1/N
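
A minimal sketch of the flow-matching objective and Euler-integration inference above, assuming NumPy, a generic velocity model, and a straight path between noise x0 and target trajectory x1; it illustrates the equations rather than the authors' training code.

```python
import numpy as np

def flow_matching_loss(model, x0, x1, rng=np.random):
    """L_FM = || v_hat - (x1 - x0) ||_2^2 along the straight path x_t = (1 - t) x0 + t x1."""
    t = rng.uniform(size=(x0.shape[0], 1))   # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1
    v_hat = model(xt, t)                     # predicted velocity field
    return np.mean(np.sum((v_hat - (x1 - x0)) ** 2, axis=-1))

def euler_sample(model, x0, n_steps=10):
    """Integrate x_{k+1} = x_k + dt * v_hat(x_k, t_k) with dt = 1/N from noise to a trajectory."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * model(x, t)
    return x
```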

Key figures outline the data/model pipeline, reasoning module, trajectory-token interface, flow-matching expert, prediction results, waypoint accuracy, and scaling/ablation findings.

7. Research Context and Utility

EgoMAN advances egocentric 3D hand trajectory prediction by integrating semantic, spatial, and motion supervision, overcoming the decoupling of hand motion from semantic supervision in prior datasets. Its stage-aware annotation, structured QA pairs, and rigorous evaluation metrics facilitate end-to-end learning, model comparison, and generalization analysis across diverse real-world and scripted domains. The held-out splits (EgoMAN-Unseen, HOT3D-OOD) support cross-domain generalization tasks. The data and benchmarks provide a foundation for vision-language reasoning, action anticipation, and trajectory modeling in human-robot interaction, AR/VR, and the study of natural hand manipulation.

The dataset, along with its access protocols, evaluation schemes, and detailed annotation design, constitutes a comprehensive benchmark for stage-aware, semantically-linked 3D hand trajectory modeling in egocentric contexts (Chen et al., 18 Dec 2025).

References (1)

  1. Chen et al. "Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos." 18 Dec 2025.
