
Gaze-Primed Human Motion Sequences

Updated 19 December 2025
  • Gaze-primed human motion sequences are episodes in which eye or head shifts reliably precede and predict forthcoming body movements, revealing human intent.
  • High-fidelity datasets and multi-modal fusion techniques enable accurate extraction of prime-to-action events and robust kinematic forecasting.
  • Integrating anticipatory gaze cues substantially improves prediction metrics such as MPJPE and reach success in robotics and HCI applications.

Gaze-primed human motion sequences are temporally coherent episodes of bodily movement whose onset, target selection, or trajectory is explicitly anticipated (“primed”) by preceding eye or head gaze shifts toward salient environmental objects or locations. Such sequences are foundational to understanding human intent inference, naturalistic kinematic forecasting, and the integration of anticipatory cues into robotics and human-computer interaction systems. Recent empirical and modeling advances have elucidated both the neurobehavioral underpinnings and computational significance of gaze priming for accurate, physically plausible, and semantically meaningful motion prediction across manipulation, locomotion, and reach scenarios.

1. Definition and Behavioral Foundations

Gaze priming describes the phenomenon in which a subject’s eye or head orientation—typically measured as a 3D fixation point, direction vector, or intersection with a semantic scene element—precedes and statistically predicts the kinematic initiation or modulation of purposeful body movement. Empirical studies using motion capture and wearable gaze trackers have documented that, in both manipulation and locomotion contexts, humans reliably fixate intended targets (objects, waypoints, or future trajectory endpoints) 0.5–1 s prior to gross arm, hand, or trunk motion, with premotor gaze shifts encoding intent information unattainable from motion features alone (Kratzer et al., 2020, Schreiter et al., 2022).

Canonical “gaze-primed motion” events are identified by aligning temporally contiguous time windows in which a participant’s gaze vector or point intersects the 3D bounding box of a future target, followed by the onset of coordinated movement toward that target (Hatano et al., 18 Dec 2025). This timing—the “prime gap”—is a central parameter in mechanistic annotation and automatic extraction pipelines.
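As a concrete illustration, the following Python sketch detects one prime-to-action event from synchronized gaze and wrist trajectories using a standard ray-box (slab) intersection test. The function names, the 0.25 m/s wrist-speed onset threshold, and the array layout are illustrative assumptions, not the annotation pipeline of any cited work.

```python
import numpy as np

def ray_intersects_aabb(origin, direction, box_min, box_max):
    """Slab test: does a gaze ray hit an axis-aligned bounding box?"""
    direction = direction / (np.linalg.norm(direction) + 1e-9)
    safe_dir = np.where(np.abs(direction) < 1e-9, 1e-9, direction)
    t1 = (box_min - origin) / safe_dir
    t2 = (box_max - origin) / safe_dir
    t_near = np.max(np.minimum(t1, t2))
    t_far = np.min(np.maximum(t1, t2))
    return bool(t_far >= max(t_near, 0.0))

def find_prime_to_action_event(gaze_origins, gaze_dirs, wrist_pos,
                               box_min, box_max, fps=120.0, speed_thresh=0.25):
    """Return (prime_frame, onset_frame, prime_gap_seconds), or None.

    prime_frame : first frame whose gaze ray intersects the target box.
    onset_frame : first subsequent frame where wrist speed exceeds
                  speed_thresh (m/s), taken as motion onset.
    """
    prime_frame = None
    for t in range(len(gaze_dirs)):
        if ray_intersects_aabb(gaze_origins[t], gaze_dirs[t], box_min, box_max):
            prime_frame = t
            break
    if prime_frame is None:
        return None
    # Per-frame wrist speed from finite differences of 3D positions.
    speed = np.linalg.norm(np.diff(wrist_pos, axis=0), axis=1) * fps
    later = np.nonzero(speed[prime_frame:] > speed_thresh)[0]
    if len(later) == 0:
        return None
    onset_frame = prime_frame + int(later[0])
    return prime_frame, onset_frame, (onset_frame - prime_frame) / fps
```

The returned prime gap is the quantity that annotation pipelines threshold (e.g., requiring a fixation within roughly one second of motion onset) when labeling a sequence as gaze-primed.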

2. Datasets and Sequence Extraction

High-fidelity datasets are essential for modeling gaze-primed motion. The MoGaze dataset (Kratzer et al., 2020) contains 1627 pick-and-place sequences with synchronized 3D full-body MoCap (120 Hz) and eye gaze (200 Hz), workspace geometry, and object annotations, enabling precise analysis of gaze-to-action causal chains. The GIMO dataset (Zheng et al., 2022) provides 217 long-horizon trajectories with ego-centric video, LiDAR scene scans, hand pose, and 3D gaze in diverse real indoor environments. The Magni dataset (Schreiter et al., 2022) offers mm-precise head/body trajectories, high-rate gaze, scenario tags, and context annotations for modeling navigation and object transport tasks.

Extraction of gaze-primed sequence windows follows the procedure: detect fixations where the gaze ray intersects a task-relevant region, record the window from fixation (prime) to subsequent movement threshold crossing, and include context before and after the predicted event (Hatano et al., 18 Dec 2025, Schreiter et al., 2022).
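A minimal sketch of the window-extraction step described above, assuming the prime and onset frames have already been detected (e.g., with a routine like the one in Section 1). The context paddings and output layout are placeholders rather than the format of any specific dataset.

```python
import numpy as np

def extract_primed_window(poses, gaze, prime_frame, onset_frame,
                          fps=120.0, pre_context_s=1.0, post_context_s=1.5):
    """Cut one gaze-primed sample around a detected prime-to-action event.

    poses : (T, J, 3) joint positions; gaze : (T, 3) unit gaze directions,
    both already resampled onto a common frame timeline.
    """
    start = max(0, prime_frame - int(pre_context_s * fps))
    end = min(len(poses), onset_frame + int(post_context_s * fps))
    return {
        "poses": poses[start:end],
        "gaze": gaze[start:end],
        "prime_idx": prime_frame - start,  # priming fixation within the window
        "onset_idx": onset_frame - start,  # motion onset within the window
    }
```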

| Dataset | Domain | Sensors | Rates (Hz, body/gaze) | Sequence Criteria |
|---|---|---|---|---|
| MoGaze | Manipulation | MoCap + gaze | 120 / 200 | Fixation on object ≤1 s before grasp + motion onset |
| GIMO | Mobile, AR | IMUs + gaze + 3D scan | 96 | Gaze point intersects LiDAR scene + action annotation |
| Magni | Locomotion | MoCap + gaze | 120 / 100 | Gaze on region, 2 s window before significant displacement |
| Prime&Reach | Manipulation | Varies | Varies | Automated prime detection using 3D intersection + event times |
These annotated sequences serve as ground truth for training and evaluation of prediction models, and as the basis for defining metrics such as Prime Success and Reach Success (Hatano et al., 18 Dec 2025).

3. Computational Modeling Paradigms

Modern gaze-primed motion predictors leverage multi-modal architectures that explicitly fuse streams of gaze data, body pose history, and—frequently—contextual or object scene features. Two principal modeling strategies have emerged:

  • Deterministic Graph-based Models: Approaches such as GazeMotion (Hu et al., 14 Mar 2024) and the hand-focused method of (He et al., 27 Mar 2025) use spatio-temporal graphs or graph convolutional networks (GCNs) to encode body+gaze histories. In GazeMotion, a future gaze prediction module first forecasts likely eye trajectories which are then fused as dedicated graph nodes with joint pose history, all processed through a deep residual GCN. The hand-motion pipeline (He et al., 27 Mar 2025) employs a VQ-VAE to encode discrete hand pose histories, with concatenated gaze tokens in a transformer-based sequence generator. Gaze and motion features are fused at the input level or via cross-modal attention, and training objectives include angular error for gaze and joint-space losses (e.g., MPJPE, trajectory norms).
  • Stochastic Diffusion Models with Cross-modal Attention: Models such as GazeMoDiff (Yan et al., 2023) and Prime and Reach (Hatano et al., 18 Dec 2025) implement denoising diffusion processes conditioned on gaze and scene priors. GazeMoDiff concatenates pose and gaze into a spatio-temporal graph, fuses them with a GAT, and injects the fused features into each diffusion block via cross-attention. Prime and Reach curates large-scale prime+reach sequences and conditions transformer-based diffusion synthesis on scene, initial state, and target object/pose embeddings, enabling quantitative evaluation of both priming (head/eye alignment with the goal) and reach (wrist to target) success. A minimal cross-modal fusion sketch follows this list.
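The sketch below illustrates the cross-modal attention pattern shared by these models: pose tokens first attend to themselves and then cross-attend to gaze tokens before a feed-forward update. The dimensions, layer counts, and normalization choices are placeholders and do not reproduce GazeMotion, GazeMoDiff, or Prime and Reach.

```python
import torch
import torch.nn as nn

class GazeConditionedBlock(nn.Module):
    """One prediction block in which pose tokens cross-attend to gaze tokens.

    Hyperparameters are illustrative, not taken from any of the cited models.
    """
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, pose_tokens, gaze_tokens):
        # pose_tokens: (B, T_pose, d); gaze_tokens: (B, T_gaze, d)
        x = pose_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]          # temporal self-attention
        x = x + self.cross_attn(self.norm2(x), gaze_tokens, gaze_tokens)[0]
        x = x + self.ff(self.norm3(x))               # position-wise update
        return x
```

In diffusion-based predictors, a stack of such blocks would form the denoiser, with the gaze (and scene) embeddings supplied as the conditioning stream at every denoising step.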

Contemporary frameworks integrate additional context modalities, most notably environment point clouds, with specialized attention mechanisms (e.g., ternary intention-aware attention (Lou et al., 5 May 2024)) to ensure physical plausibility and resolve motion-goal ambiguity.

4. Evaluation Metrics, Quantitative Performance, and Ablation

The effectiveness of gaze-primed models is measured by metrics sensitive to both destination accuracy and anticipatory alignment. Key metrics include the following; a computational sketch of each appears after the list:

  • Prime Success: The proportion of predicted sequences where the agent’s head-forward vector aligns within a defined angular threshold of ground-truth gaze at annotated prime time (e.g., θ=16°, σ=0.2s window (Hatano et al., 18 Dec 2025)).
  • Reach Success: Fraction of sequences where the predicted hand/wrist joints achieve spatial proximity (≤10 cm) to the goal location at movement endpoint (Hatano et al., 18 Dec 2025).
  • Mean Per-Joint Position Error (MPJPE): Average joint-wise Euclidean error, both path-wise and at motion destination (Hu et al., 14 Mar 2024, Zheng et al., 2022).
  • Final/average displacement error (FDE/ADE): Path deviation against ground truth (Yan et al., 2023, Lou et al., 5 May 2024).
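A compact NumPy sketch of how these metrics can be computed. For brevity the Prime Success check evaluates a single time step and omits the σ = 0.2 s tolerance window used in the cited evaluation; all function names and array layouts are illustrative.

```python
import numpy as np

def prime_success(pred_head_fwd, gt_gaze_dir, theta_deg=16.0):
    """Angle between predicted head-forward vector and ground-truth gaze
    at the annotated prime time, compared against an angular threshold."""
    a = pred_head_fwd / np.linalg.norm(pred_head_fwd)
    b = gt_gaze_dir / np.linalg.norm(gt_gaze_dir)
    angle = np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))
    return angle <= theta_deg

def reach_success(pred_wrist_end, goal_pos, radius_m=0.10):
    """Predicted wrist within radius_m of the goal at the motion endpoint."""
    return np.linalg.norm(pred_wrist_end - goal_pos) <= radius_m

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error over (T, J, 3) sequences."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def ade_fde(pred_path, gt_path):
    """Average and final displacement error over (T, 3) trajectories."""
    d = np.linalg.norm(pred_path - gt_path, axis=-1)
    return d.mean(), d[-1]
```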

Empirical results demonstrate that including gaze as a conditioning signal yields substantial and statistically significant improvements. For example, GazeMotion achieves up to 7.4% improvement in MPJPE on MoGaze over the strongest pose-only baseline (Hu et al., 14 Mar 2024). GazeMoDiff reduces FDE by 15.24% and multi-modal FDE by 18.8% over HumanMAC on MoGaze, and Prime and Reach reports 60% prime success and 89% reach success on HD-EPIC, values far above text- or pose-only control baselines (Hatano et al., 18 Dec 2025). Qualitative human studies confirm the greater perceived realism and precision achieved by gaze-primed models (Hu et al., 14 Mar 2024, Yan et al., 2023).

Ablation studies consistently show that removing or corrupting gaze input leads to marked degradation in destination accuracy, anticipation (priming) fidelity, or human-judged naturalness (Zheng et al., 2022, Lou et al., 5 May 2024).

5. Algorithmic Mechanisms and Modal Fusion

Efficient fusion of gaze with pose and scene information underpins the gains of gaze-primed models. Technical mechanisms documented in the works above include:

  • Gaze represented as dedicated nodes in spatio-temporal graphs, processed jointly with joint-pose history by residual GCNs (Hu et al., 14 Mar 2024).
  • Auxiliary future-gaze prediction, whose forecast eye trajectories condition the downstream motion predictor (Hu et al., 14 Mar 2024).
  • Graph attention (GAT) fusion of concatenated pose and gaze graphs, with the fused features injected into each diffusion block via cross-attention (Yan et al., 2023).
  • Discrete tokenization of pose history (e.g., via a VQ-VAE) with gaze tokens concatenated in a transformer-based sequence generator (He et al., 27 Mar 2025).
  • Intention-aware attention over environment point clouds that couples gaze-derived goals with scene geometry (Lou et al., 5 May 2024).

These components ensure that the network can resolve multimodal ambiguity (e.g., selecting among plausible paths or objects), especially in dense or cluttered scenes where pose-only models often fail; a toy graph-construction sketch follows.
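As a toy illustration of the graph-node mechanism, the sketch below builds an adjacency matrix in which a dedicated gaze node is connected to every body joint, so gaze features can propagate to the whole body during graph convolution. The chain skeleton is a placeholder; real models use the dataset's kinematic tree, typically with learnable adjacency.

```python
import numpy as np

def build_pose_gaze_adjacency(n_joints=21, n_gaze_nodes=1):
    """Adjacency for a graph whose nodes are body joints plus gaze node(s)."""
    n = n_joints + n_gaze_nodes
    adj = np.eye(n)
    # Placeholder skeleton: connect consecutive joints in a simple chain.
    for j in range(n_joints - 1):
        adj[j, j + 1] = adj[j + 1, j] = 1.0
    # Connect each gaze node to every joint so gaze information reaches
    # all body parts in a single graph-convolution step.
    for g in range(n_joints, n):
        adj[g, :n_joints] = adj[:n_joints, g] = 1.0
    return adj
```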

6. Generalization, Failure Modes, and Practical Recommendations

Gaze-primed sequence modeling extends beyond hand grasping and pick-and-place to general full-body locomotion, reach, and whole-scene navigation. The vector-quantized encoding and modular transformer or diffusion heads facilitate domain re-use (e.g., swapping in additional joint channels or alternative intention priors such as contact cues) (He et al., 27 Mar 2025).

Challenges include sensitivity to gaze-measurement noise and ambiguous environments with competing salient targets. Models may mispredict when gaze is dominated by distractors or lacks a clear intent structure (Zheng et al., 2022, He et al., 27 Mar 2025). Practical guidance emphasizes tight time synchronization of gaze and motion streams (sampling rates ≥120 Hz), global coordinate calibration, and the use of precomputed gaze-object intersection features for downstream fusion (Kratzer et al., 2020, Schreiter et al., 2022).
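A simple way to act on the synchronization recommendation is to resample the higher-rate gaze stream onto the MoCap timestamps, as in the sketch below. It assumes both streams already share a common clock (e.g., via hardware sync), and linear interpolation of unit vectors is only an approximation; spherical interpolation would be more precise.

```python
import numpy as np

def resample_gaze_to_mocap(gaze_t, gaze_dirs, mocap_t):
    """Interpolate gaze directions onto MoCap timestamps.

    gaze_t : (N,) monotonically increasing gaze timestamps (s)
    gaze_dirs : (N, 3) unit gaze direction vectors
    mocap_t : (M,) MoCap timestamps (s) on the same clock
    """
    resampled = np.stack(
        [np.interp(mocap_t, gaze_t, gaze_dirs[:, k]) for k in range(3)], axis=1)
    # Re-normalize: interpolated unit vectors are slightly shorter than 1.
    return resampled / np.linalg.norm(resampled, axis=1, keepdims=True)
```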

7. Significance and Future Directions

Gaze-primed human motion sequence modeling provides a principled, empirically validated pathway for decoding intent and improving short- and long-horizon motion prediction. Applications span assistive robotics, AR/VR embodiment, collaborative autonomy, and neurorehabilitation. Current work extends these models to finer-grained hand and finger grasp synthesis, complex multi-agent settings, and additional intention signals beyond gaze (e.g., facial expression, language cues) (He et al., 27 Mar 2025, Hatano et al., 18 Dec 2025).

The systematic integration of gaze into prediction architectures not only improves quantitative accuracy but also enables interpretable, physically realizable behavior synthesis, making gaze priming a central tenet in next-generation human-robot and human-computer interaction research.
