HAGI++: Head-Assisted Gaze Imputation & Generation

Updated 10 November 2025
  • HAGI++ is a multi-modal diffusion framework that leverages head orientation and wrist signals to impute and generate realistic gaze trajectories.
  • It employs cross-modal transformers with self- and cross-attention to capture the natural coupling between eye, head, and body movements.
  • Empirical results show up to a 25.3% reduction in MAE and close matching of gaze velocity distributions across diverse mobile eye-tracking datasets.

HAGI++ (Head-Assisted Gaze Imputation and Generation) is a multi-modal diffusion-based framework for repairing and generating mobile eye-tracking data using head orientation and additional wearable body signals. It is designed to address missing gaze data due to blinks, detection errors, or environmental factors, advancing the state of the art in both imputation and pure generation of human gaze signals. By leveraging the strong correlation between eye, head, and body movements, HAGI++ produces gaze reconstructions and synthetic trajectories that closely mirror true human visual behaviour, as validated on large-scale mobile eye-tracking datasets.

1. Problem Context and Motivation

Mobile head-mounted eye tracking encounters pervasive data loss from sources such as physiological blinks (typically 150–450 ms per blink), occlusions, pupil-detection failures, and adverse illumination. Such losses introduce discontinuities, degrade data quality, and can render sequences unsuitable for downstream analyses in machine learning and XR interaction.

Classical methods—linear, nearest-neighbour, or spline interpolation—ensure continuity but generate overly smooth, unrealistic trajectories that fail to match empirically observed gaze velocity distributions and ignore the biological coupling of eye–head movements. Deep-learning imputation frameworks (e.g., BRITS [Bi-RNN], TimesNet [CNN], iTransformer/Informer/Crossformer [Transformers], DLinear [MLP], GP-VAE, US-GAN, CSDI [score-based diffusion]) address general time-series gaps, but remain single-modal and do not accurately reflect oculomotor dynamics, often overfitting fixations and producing unrealistic velocities. No existing method prior to HAGI++ systematically harnesses body-mounted sensors, notably head orientation and wrist-wearable movement, for context-aware gaze imputation or pure gaze generation.

2. Multi-Modal Diffusion Model Design

HAGI++ extends the Conditional Score-based Diffusion model for Imputation (CSDI), generalizing it to accept body movements as conditioning context. The imputation target $x_0^{\text{storm}}$ is treated as a missing slice of the gaze sequence, and the forward process degrades it with Gaussian noise over $T$ steps:

$$q(x_t^{\text{storm}} \mid x_0^{\text{storm}}) = \mathcal{N}\left(x_t^{\text{storm}};\ \sqrt{\alpha_t}\,x_0^{\text{storm}},\ (1-\alpha_t)I\right)$$

which yields, at each timestep,

$$x_t^{\text{storm}} = \sqrt{\alpha_t}\,x_0^{\text{storm}} + \sqrt{1-\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$$

The reverse (denoising) process reconstructs the missing values conditioned on both the observed gaze $x_0^{co}$ and the body movements $B$ (head $H$ plus, optionally, wrist $W$), via

pθ(xt1stormxtstorm,x0co,B)=N(xt1storm;μθ(xtstorm,tx0co,B),σtI)p_{\theta}(x_{t-1}^{\text{storm}}|x_t^{\text{storm}}, x_0^{co}, B) = \mathcal{N}(x_{t-1}^{\text{storm}}; \mu_{\theta}(x_t^{\text{storm}}, t \mid x_0^{co}, B), \sigma_t I)

where the mean is parameterized by the noise-prediction network $\epsilon_\theta$ as

$$\mu_{\theta}(x_t, t \mid x_0^{co}, B) = \frac{1}{\sqrt{\alpha_t}} \left[ x_t - \frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\, \epsilon_\theta(x_t, t \mid x_0^{co}, B) \right]$$

The loss used for optimization is the denoising score-matching loss,

$$L = \mathbb{E}_{t, \epsilon} \left\| \epsilon - \epsilon_\theta(x_t, t \mid x_0^{co}, B) \right\|_2^2.$$

No additional auxiliary or adversarial losses are employed.
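The following is a minimal PyTorch sketch of this training objective and one reverse denoising step, written to mirror the equations above; the `eps_model` interface, tensor shapes, and masking convention are illustrative assumptions rather than the authors' implementation.

```python
import torch

def diffusion_training_step(eps_model, x0, x_obs, body, target_mask, alphas_cum, T):
    """One training step: noise the missing (target) slice of the gaze sequence and
    regress the injected noise, conditioned on observed gaze and body context.

    x0:          (B, L, 2) ground-truth gaze sequence
    x_obs:       (B, L, 2) observed gaze (zeros where missing)
    body:        (B, L, D_b) head/wrist conditioning features
    target_mask: (B, L, 1) 1 where gaze is treated as missing (imputation target)
    alphas_cum:  (T,) noise-schedule coefficients alpha_t
    """
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # random diffusion step
    a_t = alphas_cum[t].view(-1, 1, 1)                           # alpha_t per sample

    eps = torch.randn_like(x0)                                   # epsilon ~ N(0, I)
    x_t = torch.sqrt(a_t) * x0 + torch.sqrt(1.0 - a_t) * eps     # forward noising

    eps_hat = eps_model(x_t, t, x_obs, body, target_mask)        # predict the noise
    # Denoising score-matching loss, restricted to the imputation target
    return ((eps - eps_hat) ** 2 * target_mask).sum() / target_mask.sum()

def reverse_step(eps_model, x_t, t, x_obs, body, target_mask, alphas_cum, sigma_t):
    """One reverse (denoising) update using the parameterized mean from the text."""
    a_t = alphas_cum[t].view(-1, 1, 1)
    eps_hat = eps_model(x_t, t, x_obs, body, target_mask)
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - a_t) * eps_hat) / torch.sqrt(a_t)
    return mean + sigma_t * torch.randn_like(x_t)
```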

3. Cross-Modal Transformer: Architecture and Fusion

HAGI++ is distinguished by a stack of $N$ cross-modal transformer blocks that integrate multiple fusion operations:

  • Input encoding: Observed and noisy gaze tokens, flattened SE(3) head (and, optionally, wrist) matrices (Fourier-encoded to augment frequency sensitivity), and a binary missing-data mask; each is transformed via an MLP into a $D$-dimensional latent and positionally encoded.
  • Self-attention (gaze-to-gaze): $Q$, $K$, $V$ projections from gaze tokens, yielding intra-gaze dependencies.
  • Cross-attention (gaze-to-body): Querying gaze tokens, with keys and values from body context, captures gaze–head/wrist coordination.
  • Hybrid FiLM-based fusion: At each block, pooled head and wrist features modulate gaze tokens via feature-wise linear modulation; that is,

$$G \leftarrow G \odot \phi_w(\phi(C)) + \phi_b(\phi(C)),$$

where $C$ is the concatenated, pooled body representation, $\phi$, $\phi_w$, and $\phi_b$ are learned linear transformations, and $\odot$ denotes element-wise multiplication.

The final stack outputs are mapped by an MLP to $\hat{\epsilon}$, predicting the noise required for the denoising objective.
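As an illustration of how these fusion operations can be composed, the following PyTorch sketch implements one cross-modal block with gaze self-attention, gaze-to-body cross-attention, and FiLM modulation from pooled body features. Module names, the feed-forward sub-layer, and the single-linear FiLM head (collapsing $\phi$, $\phi_w$, $\phi_b$ into one projection) are simplifying assumptions, not the released HAGI++ architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.film = nn.Linear(d_model, 2 * d_model)   # produces FiLM scale and shift
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, gaze, body):
        # gaze: (B, L, d_model) gaze tokens; body: (B, L, d_model) head/wrist tokens
        g = self.norm1(gaze)
        gaze = gaze + self.self_attn(g, g, g)[0]          # gaze-to-gaze self-attention

        g = self.norm2(gaze)
        gaze = gaze + self.cross_attn(g, body, body)[0]   # gaze-to-body cross-attention

        # FiLM: pooled body context modulates gaze tokens feature-wise
        pooled = body.mean(dim=1, keepdim=True)           # (B, 1, d_model)
        scale, shift = self.film(pooled).chunk(2, dim=-1)
        gaze = gaze * scale + shift

        gaze = gaze + self.ffn(self.norm3(gaze))
        return gaze
```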

4. Body Movement Integration: Head and Wrist Signals

Body context is operationalized by:

  • Head orientation: For each frame $l$, the SLAM pose $T^{(l)}_{\text{world,tracker}} \in SE(3)$ (a 4×3 matrix), with incremental relative rotation/translation $h_l = (T^{(l)})^{-1} T^{(l+1)} \in SE(3)$. Each transform is vectorized (12D) and Fourier positionally encoded.
  • Wrist/hand movement: Wearable devices provide $T^{(l)}_{\text{world,band}}$; relative wrist motion for frame $l$ is $w_l = (T^{(l)}_{\text{world,tracker}})^{-1} T^{(l)}_{\text{world,band}}$. This is similarly flattened and Fourier-encoded. Ablation studies (see Section 7) show that rotational components contribute more strongly to predictive power than translations, and that using both wrists (where available) further improves performance, especially over sustained missing intervals. A sketch of this preprocessing follows this list.
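A minimal NumPy sketch of the preprocessing described above, assuming 4×4 homogeneous pose matrices sampled at 30 Hz and an illustrative number of Fourier bands; function names and array layouts are not taken from the paper.

```python
import numpy as np

def relative_pose(T_a: np.ndarray, T_b: np.ndarray) -> np.ndarray:
    """Relative rigid-body transform T_a^{-1} T_b for 4x4 homogeneous SE(3) poses."""
    return np.linalg.inv(T_a) @ T_b

def flatten_se3(T: np.ndarray) -> np.ndarray:
    """Drop the homogeneous bottom row and flatten rotation and translation to 12 values."""
    return T[:3, :4].reshape(-1)

def fourier_encode(x: np.ndarray, num_bands: int = 8) -> np.ndarray:
    """Sinusoidal (Fourier) features at geometrically spaced frequencies."""
    freqs = 2.0 ** np.arange(num_bands)                  # (num_bands,)
    angles = x[..., None] * freqs                        # (..., 12*k, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*x.shape[:-1], -1)

def body_features(T_head: np.ndarray, T_wrist: np.ndarray) -> np.ndarray:
    """T_head, T_wrist: (L, 4, 4) world-frame pose sequences.
    Returns Fourier-encoded incremental head motion and head-relative wrist pose."""
    h = [flatten_se3(relative_pose(T_head[l], T_head[l + 1])) for l in range(len(T_head) - 1)]
    w = [flatten_se3(relative_pose(T_head[l], T_wrist[l])) for l in range(len(T_head) - 1)]
    feats = np.concatenate([np.stack(h), np.stack(w)], axis=-1)   # (L-1, 24)
    return fourier_encode(feats)
```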

5. Empirical Evaluation: Datasets, Protocols, and Metrics

Three major datasets are used for quantitative benchmarks:

  • Nymeria: 300 h, 264 participants, 50 sites; 30 Hz gaze, head SLAM, wrist, full-body motion, annotations.
  • Ego-Exo4D: 4.6 h, 72 recordings.
  • HOT3D: 3.5 h, 111 recordings.

Records are partitioned 80%/5%/15% for train/val/test splits on Nymeria. Cross-dataset evaluation assesses generalisation.

Missing-data protocol:

  • For imputation, gaps of 10%, 30%, 50%, and 90% are created: the 10% setting masks segments of ≈450 ms (a canonical blink duration), while the other settings mask randomly placed segments of at least 150 ms. A sketch of one possible masking procedure follows this list.
  • A 100% missing setting tests pure synthetic gaze generation.
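A minimal sketch of such a masking protocol, with segment lengths and placement rules assumed rather than taken from the paper: at 30 Hz, contiguous blink-length segments are masked until the target missing ratio is reached.

```python
import numpy as np

def make_missing_mask(seq_len: int, missing_ratio: float, fps: int = 30,
                      min_gap_s: float = 0.15, max_gap_s: float = 0.45,
                      rng=None) -> np.ndarray:
    """Return a boolean mask (True = missing) covering roughly missing_ratio of the sequence."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    target = int(round(missing_ratio * seq_len))
    while mask.sum() < target:
        # gap of roughly min_gap_s to max_gap_s seconds, in frames
        gap = rng.integers(int(min_gap_s * fps), int(max_gap_s * fps) + 1)
        start = rng.integers(0, max(1, seq_len - gap))
        mask[start:start + gap] = True
    return mask
```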

Baselines:

  • Classical interpolation: Linear, Nearest.
  • Head-direction proxy.
  • Deep learning: iTransformer, DLinear, TimesNet, BRITS, CSDI.
  • Gaze–head: HAGI.
  • Generation: Pose2Gaze (full-body pose based).

Metrics:

  • Mean Angular Error (MAE): $\text{MAE} = \frac{1}{J} \sum_{j=1}^{J} \arccos\left(\frac{g_j \cdot \hat{g}_j}{|g_j|\,|\hat{g}_j|}\right)$, the mean angle between ground-truth and predicted gaze directions.
  • Jensen–Shannon (JS) divergence between imputed/generated and true gaze velocity distributions, computed over velocity histograms (30 Hz, 100 bins). A sketch of both metric computations follows this list.
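The following NumPy/SciPy sketch shows how both metrics can be computed from unit gaze direction vectors; the histogram range and the use of `scipy.spatial.distance.jensenshannon` are implementation assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_angular_error(g_true: np.ndarray, g_pred: np.ndarray) -> float:
    """g_true, g_pred: (J, 3) gaze direction vectors. Returns MAE in degrees."""
    cos = np.sum(g_true * g_pred, axis=-1) / (
        np.linalg.norm(g_true, axis=-1) * np.linalg.norm(g_pred, axis=-1))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def velocity_js_divergence(g_true, g_pred, fps: int = 30, bins: int = 100) -> float:
    """JS divergence between angular-velocity histograms of two unit-vector gaze sequences."""
    def velocities(g):
        cos = np.clip(np.sum(g[1:] * g[:-1], axis=-1), -1.0, 1.0)
        return np.degrees(np.arccos(cos)) * fps          # deg/s between consecutive frames
    v_t, v_p = velocities(g_true), velocities(g_pred)
    hi = max(v_t.max(), v_p.max())
    h_t, _ = np.histogram(v_t, bins=bins, range=(0, hi), density=True)
    h_p, _ = np.histogram(v_p, bins=bins, range=(0, hi), density=True)
    return float(jensenshannon(h_t, h_p, base=2) ** 2)   # squared distance = JS divergence
```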

6. Quantitative Performance and Statistical Behaviour

Within-Dataset Results (Nymeria)

| Method  | 10%   | 30%   | 50%   | 90%    |
|---------|-------|-------|-------|--------|
| Linear  | 4.96° | 6.88° | 9.68° | 11.54° |
| Nearest | 5.29° | 6.52° | 8.34° | 12.61° |
| CSDI    | 4.72° | 5.90° | 7.44° | 10.54° |
| HAGI    | 3.67° | 4.55° | 5.77° | 8.53°  |
| HAGI++  | 3.54° | 4.40° | 5.58° | 8.18°  |

Table: MAE in degrees (lower is better) as a function of missing-data ratio (Nymeria dataset).

| Method  | 10%   | 30%   | 50%   | 90%   |
|---------|-------|-------|-------|-------|
| Nearest | 0.081 | 0.073 | 0.103 | 0.135 |
| CSDI    | 0.044 | 0.042 | 0.037 | 0.030 |
| HAGI    | 0.042 | 0.040 | 0.035 | 0.017 |
| HAGI++  | 0.045 | 0.042 | 0.036 | 0.017 |

Table: JS divergence (lower is better) between imputed and true gaze velocity distributions (Nymeria).

Cross-dataset evaluation mirrors these results: HAGI++ consistently demonstrates the lowest MAE (e.g., ≈2.98° at 10% missing on Ego-Exo4D) and matches or surpasses HAGI in JS divergence.

Statistical analyses: Velocity histograms computed at 30 Hz (100 bins) reveal that HAGI++ outputs not only exhibit lower angular errors but also closely replicate human gaze velocity statistics, as measured by JS divergence across all missing-data protocols.

7. Pure Gaze Generation and Ablation

In the scenario of 100% missing gaze ("pure generation"), HAGI++ leverages only body movement signals:

| Method             | MAE    | JS    |
|--------------------|--------|-------|
| Pose2Gaze          | 13.09° | 0.238 |
| HAGI++ (head only) | 11.65° | 0.138 |
| + left wrist       | 11.28° | 0.153 |
| + right wrist      | 10.98° | 0.156 |
| + both wrists      | 10.79° | 0.064 |

Table: Gaze generation performance (Nymeria, 100% missing).

  • Incorporating both wrist signals yields a 17.6% MAE reduction over Pose2Gaze and a ≈73% reduction in JS divergence.
  • Ablation indicates that both rotation and translation features of wrist inputs contribute, with combination yielding best results.
  • HAGI++ achieves better or comparable results to full-body pose-based generation using only wearable head and wrist signals.

A plausible implication is that commercial wearable sensors can substitute for full-body motion capture in challenging generation regimes.

8. Generalization, Efficiency, and Future Directions

HAGI++ generalises effectively across datasets (e.g., Nymeria, Ego-Exo4D, HOT3D) without the need for fine-tuning, and is reported as computationally efficient for batch inference. Its design accommodates both offline post-processing and real-time imputation/generation.

Limitations include untested performance on higher-frequency gaze data, potential sensitivity to large head–eye misalignments, and the exclusion of egocentric scene visual features. Future investigations aim to integrate egocentric images, analyze temporal misalignments, and optimize for real-time, on-device execution, particularly for interactive XR scenarios.

9. Summary of Key Contributions

HAGI++ introduces a diffusion-based conditional imputation and generation model specialized for mobile gaze data, with central innovations:

  • Multi-modal fusion of gaze, head, and wrist/hand movements via cross-modal transformers utilizing self-attention, cross-attention, and FiLM-based mechanisms.
  • Statistically realistic gaze reconstructions across a spectrum of missing-data conditions, validated on diverse real-world datasets.
  • Demonstrated applicability for both data restoration and full synthetic gaze generation, using only wearable sensor data for the latter.
  • Empirical reductions in MAE (up to 25.3% over baseline imputation) and close statistical matching to true gaze velocity distributions.
  • Design generalization and computational efficiency suited to both research post-processing and low-latency interactive applications.

Experimental findings confirm that head orientation signals are especially predictive of gaze, and that wrist motion yields additional gains, particularly for reconstruction over extended gaps. The modular architecture allows for future extension to incorporate further body cues or contextual scene information.
