
EgoMAN: Egocentric 3D Hand Trajectory Model

Updated 20 December 2025
  • EgoMAN is a unified reasoning-to-motion framework that integrates high-level vision-language intent with low-level 6-DoF hand trajectory prediction via a discrete trajectory-token interface.
  • The model employs a two-stage pipeline with specialized vision, language, and past-motion encoders combined with a multimodal transformer for precise waypoint and motion generation.
  • Empirical results show significant ADE reductions relative to prior methods and robust generalization across diverse real-world scenes.

EgoMAN is a reasoning-to-motion framework for egocentric 3D hand trajectory prediction that links high-level vision-language reasoning with low-level 6-DoF motion generation through a trajectory-token interface. It is developed alongside the EgoMAN dataset, which provides extensive egocentric interaction data comprising 219,000 6-DoF trajectories and 3 million structured question–answer pairs. By coupling semantic reasoning and motion generation, which prior methods treated separately, EgoMAN offers a unified, stage-aware approach with strong generalization across diverse real-world scenes (Chen et al., 18 Dec 2025).

1. Architectural Overview

EgoMAN employs a two-stage pipeline that decouples reasoning and motion generation but bridges them via special trajectory tokens:

  • Vision Encoder: Uses a frozen multimodal backbone composed of Qwen2.5-VL’s vision encoder and DINOv3. This component extracts both global intent cues and local scene geometry from RGB video frames. Each frame is represented by a 768-dimensional feature vector.
  • Language Encoder: Processes intent queries or QA questions with the Qwen2.5-VL text encoder, producing 768-dimensional contextual embeddings suitable for either reasoning or next-token prediction.
  • Past-Motion Encoder: The five most recent 6-DoF wrist poses (sampled at 10 Hz), each a 3D position paired with a 6D over-parameterization of SO(3) [Zhou et al., 2019], are embedded via a 4-layer MLP into the same 768-D latent space.
  • Reasoning Module: Built on Qwen2.5-VL’s multimodal transformer, this module ingests visual, linguistic, and past-motion inputs. It outputs either a natural-language answer (for QA) or a four-token trajectory specification: <ACT> (intent), <START>, <CONTACT>, and <END> (stage-aware waypoints). Each waypoint token is mapped by an MLP to a continuous-valued (timestamp, 3D position, 6D rotation) tuple.
  • Trajectory-Token Interface: Discrete output tokens (<ACT>, <START>, <CONTACT>, <END>) encapsulate all high-level semantic and waypoint guidance for subsequent motion generation. <ACT> provides an action-semantic embedding; each waypoint token yields a continuous keypoint specification. A minimal sketch of this interface follows the list.
  • Motion Expert (Flow-Matching Transformer): An encoder–decoder transformer (6 layers, 8 heads, hidden dimension 768) processes a sequence composed of past-motion tokens, the positionally-inserted waypoint tokens, future query tokens, and non-temporal context features. This module learns a conditional velocity field via Flow Matching, integrating velocities at inference to output a smooth 6-DoF wrist trajectory.
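
The following is a minimal PyTorch-style sketch of the trajectory-token interface. The module names (`WaypointHead`, `ActHead`), head depths, and the CLIP dimension of 512 are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 768  # shared latent width used by the encoders described above

class WaypointHead(nn.Module):
    """Maps one waypoint token embedding (<START>/<CONTACT>/<END>) to a
    continuous (timestamp, 3D position, 6D rotation) specification."""
    def __init__(self, hidden_dim: int = HIDDEN_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, 1 + 3 + 6),  # t_k, p_k, r_k
        )

    def forward(self, token_emb: torch.Tensor):
        out = self.mlp(token_emb)
        return out[..., :1], out[..., 1:4], out[..., 4:]  # (t_k, p_k, r_k)

class ActHead(nn.Module):
    """Projects the <ACT> token embedding into a semantic space that is
    contrastively aligned with CLIP text embeddings during training."""
    def __init__(self, hidden_dim: int = HIDDEN_DIM, clip_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, clip_dim)

    def forward(self, act_emb: torch.Tensor) -> torch.Tensor:
        return self.proj(act_emb)

# The reasoning module's hidden states at the <ACT>, <START>, <CONTACT>, <END>
# positions are decoded by these heads; the resulting action embedding and
# waypoints then condition the flow-matching motion expert.
tokens = torch.randn(4, HIDDEN_DIM)                   # [ACT, START, CONTACT, END]
act_vec = ActHead()(tokens[0])
waypoints = [WaypointHead()(tok) for tok in tokens[1:]]
```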

2. Input and Output Representations

EgoMAN’s core entities are structured as follows:

| Component | Representation | Processing Pipeline |
|---|---|---|
| Video frame $V_t$ | Feature map $f \in \mathbb{R}^{768}$ | Qwen2.5-VL + DINOv3 vision encoder |
| Language intent $I$ | Text embedding $e \in \mathbb{R}^{768}$ | Qwen2.5-VL text tokenizer/encoder |
| Past trajectory | $\{(L_\tau, R_\tau)\}$, $L_\tau \in \mathbb{R}^3$, $R_\tau \in \mathbb{R}^6$ | 4-layer MLP |
| Trajectory tokens | {ACT, START, CONTACT, END} | Qwen2.5-VL transformer → MLP heads |

The output for each waypoint token ($k \in \{\text{START}, \text{CONTACT}, \text{END}\}$) is a tuple $(t_k, p_k, r_k)$ of timestamp, 3D position, and 6D rotation. The ACT token is mapped to a semantic embedding via a learned projection.
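
As a concrete illustration of the 6D rotation representation used for $r_k$, the sketch below converts the over-parameterization of [Zhou et al., 2019] (the first two columns of a rotation matrix) into a full rotation matrix via Gram–Schmidt orthonormalization; whether EgoMAN uses exactly this column convention is an assumption.

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(r6: torch.Tensor) -> torch.Tensor:
    """Convert a 6D rotation parameterization [Zhou et al., 2019] into a
    3x3 rotation matrix. r6: (..., 6) tensor -> returns (..., 3, 3)."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    # Remove the b1 component from a2, then normalize.
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-1)  # b1, b2, b3 as matrix columns
```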

3. Trajectory-Token Formalization and Supervision

Each trajectory-token output is trained to match ground-truth waypoints with a weighted composite loss:

$$\mathcal{L}_{\mathrm{wp}} = \sum_k \Big[ \lambda_t\,\mathcal{L}_{\mathrm{time}}(\hat{t}_k, t^*_k) + \lambda_{3D}\,\mathcal{L}_{3D}(\hat{p}_k, p^*_k) + \lambda_{2D}\,\mathcal{L}_{2D}\big(\pi(\hat{p}_k), \pi(p^*_k)\big) + \lambda_{r}\,\mathcal{L}_{\mathrm{rot6D}}(\hat{r}_k, r^*_k) + \lambda_{\mathrm{geo}}\,\mathcal{L}_{\mathrm{geo}}(\hat{r}_k, r^*_k) \Big]$$

where loss components include L1, Huber, 2D projection, and geodesic errors; loss weights are $\lambda_t=1.0$, $\lambda_{3D}=2.0$, $\lambda_{2D}=0.5$, $\lambda_r=0.5$, $\lambda_{\mathrm{geo}}=0.15$.
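
A sketch of this composite loss is given below. The text above does not specify which error type attaches to which term, so the pairing here (L1 for time, Huber for 3D position, L1 on the 6D rotation, geodesic on the recovered rotation matrices) and the `project` callable standing in for the camera projection $\pi(\cdot)$ are assumptions; `to_matrix` refers to a 6D-to-matrix converter such as the one sketched in Section 2.

```python
import torch
import torch.nn.functional as F

# Loss weights as stated above.
W_T, W_3D, W_2D, W_ROT, W_GEO = 1.0, 2.0, 0.5, 0.5, 0.15

def geodesic_angle(R_pred: torch.Tensor, R_gt: torch.Tensor) -> torch.Tensor:
    """Geodesic distance (radians) between batches of 3x3 rotation matrices."""
    trace = (R_pred.transpose(-1, -2) @ R_gt).diagonal(dim1=-2, dim2=-1).sum(-1)
    return torch.acos(((trace - 1.0) / 2.0).clamp(-1 + 1e-6, 1 - 1e-6))

def waypoint_loss(pred, gt, project, to_matrix):
    """Composite loss for one waypoint token.
    pred / gt:  dicts with 't' (B,1), 'p' (B,3), 'r6' (B,6) tensors.
    project:    camera projection pi(.) mapping 3D points to 2D pixels.
    to_matrix:  6D-rotation -> 3x3 matrix converter (see the Section 2 sketch)."""
    l_time = F.l1_loss(pred["t"], gt["t"])
    l_3d   = F.smooth_l1_loss(pred["p"], gt["p"])             # Huber
    l_2d   = F.l1_loss(project(pred["p"]), project(gt["p"]))
    l_rot  = F.l1_loss(pred["r6"], gt["r6"])
    l_geo  = geodesic_angle(to_matrix(pred["r6"]), to_matrix(gt["r6"])).mean()
    return W_T * l_time + W_3D * l_3d + W_2D * l_2d + W_ROT * l_rot + W_GEO * l_geo
```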

The ACT token’s embedding is contrastively aligned to a CLIP-encoded ground-truth vector, with an objective that switches between cosine similarity and a temperature-scaled softmax depending on a batch-size threshold.
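
A minimal sketch of this batch-size-dependent alignment objective follows; the threshold of 8 and temperature of 0.07 are illustrative values, as the exact settings are not stated above.

```python
import torch
import torch.nn.functional as F

def act_alignment_loss(act_emb, clip_gt, batch_threshold=8, temperature=0.07):
    """Align predicted <ACT> embeddings with CLIP-encoded ground truth.
    act_emb, clip_gt: (B, D) tensors."""
    a = F.normalize(act_emb, dim=-1)
    c = F.normalize(clip_gt, dim=-1)
    if a.shape[0] < batch_threshold:
        # Small batches: plain cosine-similarity objective.
        return 1.0 - (a * c).sum(-1).mean()
    # Larger batches: temperature-scaled softmax (InfoNCE-style) over the batch.
    logits = a @ c.t() / temperature
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)
```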

4. Progressive Training Paradigm

EgoMAN employs a three-stage training process to align semantics with motion:

  1. Reasoning Pre-training: Utilizes 1 million QA pairs. Supervision combines text (next-token cross-entropy), ACT (contrastive loss), and waypoint (composite loss) terms. Loss weights: $\lambda_{\mathrm{act}} = 0.1$, $\lambda_{\mathrm{wp}} = 0.3$.
  2. Motion Expert Pre-training: Trained on 17,000 high-quality trajectories, conditioning on ground-truth ACT semantics and waypoints. Supervision uses the Flow Matching objective, which optimizes a learned velocity field that transports noisy trajectories toward the ground truth (see the sketch after this list).
  3. Joint Fine-tuning: The predicted tokens from the Reasoning Module directly condition the Motion Expert, with objectives over token sequence and trajectory. This phase aligns the modules under realistic token noise at inference.
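
The sketch below illustrates one Flow Matching training step under the standard linear-interpolation path with target velocity $x_1 - x_0$; EgoMAN's exact noise schedule, conditioning format, and `motion_expert` signature are assumptions. At inference, the learned velocity field is integrated (e.g., with Euler steps) from noise at $t=0$ to a trajectory at $t=1$, matching the description in Section 1.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(motion_expert, traj_gt, cond):
    """One Flow Matching training step for the motion expert (sketch).
    traj_gt: (B, T, 9) ground-truth future wrist trajectory (3D pos + 6D rot).
    cond:    conditioning features (past motion, ACT semantics, waypoints, context)."""
    x1 = traj_gt
    x0 = torch.randn_like(x1)                             # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # per-sample time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                         # point on the probability path
    v_target = x1 - x0                                    # constant velocity along the path
    v_pred = motion_expert(x_t, t.squeeze(-1).squeeze(-1), cond)
    return F.mse_loss(v_pred, v_target)
```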

5. Hyper-parameters and Optimization

Key hyper-parameters are as follows; a consolidated configuration sketch is given after the list:

  • Reasoning Module: Qwen2.5-VL (7B), 4-layer motion MLP (hidden 768), single/4-layer MLP heads, AdamW optimizer ($\beta_1=0.9$, $\beta_2=0.999$), learning rate $1\times10^{-5}$, batch size 256 (8 × A100), bf16.
  • Motion Expert: 6 encoder + 6 decoder layers, 8 heads, hidden 768, sinusoidal time embedding (256-D), FiLM modulation, AdamW (learning rate $1\times10^{-4}$), batch 256 (1 × A100), FP32, 50 future steps (at 10 Hz).
  • Joint Fine-tuning: learning rate $5\times10^{-6}$, batch 128 (8 × A100), 60 epochs, FP16 for reasoning, FP32 for motion.
  • Token Vocabulary: Four special tokens {ACT, START, CONTACT, END}; all other text uses the standard Qwen2.5-VL tokenizer.
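
The configuration sketch below consolidates the values listed above into dataclasses; the field names are hypothetical and only values stated in the list are included.

```python
from dataclasses import dataclass

@dataclass
class ReasoningConfig:
    backbone: str = "Qwen2.5-VL-7B"
    motion_mlp_layers: int = 4
    hidden_dim: int = 768
    betas: tuple = (0.9, 0.999)   # AdamW
    lr: float = 1e-5
    batch_size: int = 256         # 8 x A100, bf16

@dataclass
class MotionExpertConfig:
    encoder_layers: int = 6
    decoder_layers: int = 6
    num_heads: int = 8
    hidden_dim: int = 768
    time_embed_dim: int = 256     # sinusoidal, injected via FiLM
    lr: float = 1e-4
    batch_size: int = 256         # 1 x A100, FP32
    future_steps: int = 50        # at 10 Hz

@dataclass
class JointFinetuneConfig:
    lr: float = 5e-6
    batch_size: int = 128         # 8 x A100
    epochs: int = 60
```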

6. Evaluation and Empirical Results

EgoMAN’s performance is assessed on both in-domain (EgoMAN-Unseen) and out-of-distribution (HOT3D-OOD) splits using metrics such as Average Displacement Error (ADE), Final Displacement Error (FDE), Dynamic Time Warping (DTW), rotation error, and waypoint accuracy.
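
For reference, ADE and FDE can be computed as below; DTW, rotation error, and waypoint accuracy are evaluated separately. The tensor shapes are assumptions for this sketch.

```python
import torch

def ade_fde(pred: torch.Tensor, gt: torch.Tensor):
    """ADE and FDE over 3D trajectories.
    pred, gt: (B, T, 3) predicted and ground-truth 3D wrist positions.
    ADE averages per-step Euclidean error over all timesteps; FDE uses
    only the final timestep."""
    dist = torch.linalg.norm(pred - gt, dim=-1)   # (B, T) per-step errors in metres
    return dist.mean().item(), dist[:, -1].mean().item()
```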

Results Summary:

| Metric | EgoMAN (EgoMAN-Unseen) | HandsOnVLM* (EgoMAN-Unseen) | EgoMAN (HOT3D-OOD) | HandsOnVLM* (HOT3D-OOD) |
|---|---|---|---|---|
| ADE (m) | 0.124 | 0.171 | 0.141 | 0.194 |
| FDE (m) | 0.179 | — | 0.217 | — |
| DTW (m) | 0.111 | — | 0.130 | — |
| Rot (°) | 32.75 | — | 35.09 | — |

  • On EgoMAN-Unseen, ADE is reduced by 27.5% versus HandsOnVLM* (from 0.171m to 0.124m).
  • On HOT3D-OOD, ADE is reduced by 27.3% versus HandsOnVLM* (from 0.194m to 0.141m).
  • Contact and trajectory waypoint errors for EgoMAN-WP (reasoning module only) reach 0.192m and 0.127m on EgoMAN-Unseen.
  • Inference speed of EgoMAN-WP is 3.45 FPS, significantly faster than affordance baselines (VRB*, VidBot <0.05 FPS).
  • Motion-to-text alignment metrics: Recall@3 = 43.9%, FID = 0.04 (239 verbs), outperforming HandsOnVLM* (27.9%/0.10).

Ablation Studies:

  • Removing reasoning pretraining and waypoints (EgoMAN-ACT) increases ADE from 0.151m to 0.215m.
  • Omitting FM pretraining increases ADE to 0.273m and rotation error to 51.8°.
  • Replacing explicit 6DoF waypoints with implicit embeddings results in minor degradation.
  • Using only 20% of the data, EgoMAN maintains better ADE compared to baselines.

Module Scaling: Larger reasoning models yield lower ADE/FDE and rotation errors across datasets, although models above 4B parameters provide diminishing returns on spatial waypoint accuracy, with semantic alignment peaking at 7B.

7. Significance and Context

EgoMAN achieves state-of-the-art accuracy and generalization for 3D hand trajectory forecasting in egocentric, real-world settings. Its compact trajectory-token interface enables tight coupling between semantic reasoning and physical motion. The progressive training paradigm is critical to aligning high-level vision–language intent with low-level motion execution. The availability of large-scale QA-supervised egocentric data supports reasoning capabilities not previously possible with trajectory prediction frameworks (Chen et al., 18 Dec 2025).

EgoMAN demonstrates substantial improvements in both accuracy and intent-to-motion alignment relative to prior work, such as HandsOnVLM*. The correlation between intent understanding and motion realism is strengthened via the explicit trajectory-token mediation. A plausible implication is the framework’s utility for embodied AI agents in applications that require nuanced scene understanding and robust hand-object interaction modeling from egocentric perspectives.
