HAGI++: Head-Assisted Gaze Imputation & Generation
- HAGI++ is a multi-modal diffusion framework that leverages head orientation and wrist signals to impute and generate realistic gaze trajectories.
- It employs cross-modal transformers with self- and cross-attention to capture the natural coupling between eye, head, and body movements.
- Empirical results show up to a 25.3% reduction in MAE and close matching of gaze velocity distributions across diverse mobile eye-tracking datasets.
HAGI++ (Head-Assisted Gaze Imputation and Generation) is a multi-modal diffusion-based framework for repairing and generating mobile eye-tracking data using head orientation and additional wearable body signals. It is designed to address missing gaze data due to blinks, detection errors, or environmental factors, advancing the state of the art in both imputation and pure generation of human gaze signals. By leveraging the strong correlation between eye, head, and body movements, HAGI++ produces gaze reconstructions and synthetic trajectories that closely mirror true human visual behaviour, as validated on large-scale mobile eye-tracking datasets.
1. Problem Context and Motivation
Mobile head-mounted eye tracking encounters pervasive data loss from sources such as physiological blinks (typically 150–450 ms per blink), occlusions, pupil-detection failures, and adverse illumination. Such losses introduce discontinuities, degrade data quality, and can render sequences unsuitable for downstream analyses in machine learning and XR interaction.
Classical methods—linear, nearest-neighbour, or spline interpolation—ensure continuity but generate overly smooth, unrealistic trajectories that fail to match empirically observed gaze velocity distributions and ignore the biological coupling of eye–head movements. Deep-learning imputation frameworks (e.g., BRITS [Bi-RNN], TimesNet [CNN], iTransformer/Informer/Crossformer [Transformers], DLinear [MLP], GP-VAE, US-GAN, CSDI [score-based diffusion]) address general time-series gaps, but remain single-modal and do not accurately reflect oculomotor dynamics, often overfitting fixations and producing unrealistic velocities. No existing method prior to HAGI++ systematically harnesses body-mounted sensors, notably head orientation and wrist-wearable movement, for context-aware gaze imputation or pure gaze generation.
2. Multi-Modal Diffusion Model Design
HAGI++ extends the Conditional Score-based Diffusion model for Imputation (CSDI), generalizing it to accept body movements as conditioning context. The imputation target $x_0^{\mathrm{ta}}$ is treated as the missing slice of the gaze sequence, with the forward process degrading it via Gaussian noise over $T$ steps according to
$$q\!\left(x_t^{\mathrm{ta}} \mid x_{t-1}^{\mathrm{ta}}\right) = \mathcal{N}\!\left(x_t^{\mathrm{ta}};\ \sqrt{1-\beta_t}\,x_{t-1}^{\mathrm{ta}},\ \beta_t \mathbf{I}\right).$$
This yields at each timestep $t$ (with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$):
$$x_t^{\mathrm{ta}} = \sqrt{\bar{\alpha}_t}\,x_0^{\mathrm{ta}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$
The reverse (denoising) process reconstructs the missing values conditioned both on the observed gaze $x^{\mathrm{co}}$ and the body movements $c$ (head plus optional wrist), via
$$p_\theta\!\left(x_{t-1}^{\mathrm{ta}} \mid x_t^{\mathrm{ta}}, x^{\mathrm{co}}, c\right) = \mathcal{N}\!\left(x_{t-1}^{\mathrm{ta}};\ \mu_\theta\!\left(x_t^{\mathrm{ta}}, t \mid x^{\mathrm{co}}, c\right),\ \sigma_t^2 \mathbf{I}\right),$$
where the mean is parameterized by the denoising network $\epsilon_\theta$ as
$$\mu_\theta\!\left(x_t^{\mathrm{ta}}, t \mid x^{\mathrm{co}}, c\right) = \frac{1}{\sqrt{\alpha_t}}\left(x_t^{\mathrm{ta}} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\!\left(x_t^{\mathrm{ta}}, t \mid x^{\mathrm{co}}, c\right)\right).$$
The loss used for optimization is the denoising score-matching (noise-prediction) loss,
$$\mathcal{L}(\theta) = \mathbb{E}_{x_0^{\mathrm{ta}},\,\epsilon,\,t}\left[\left\|\epsilon - \epsilon_\theta\!\left(x_t^{\mathrm{ta}}, t \mid x^{\mathrm{co}}, c\right)\right\|_2^2\right].$$
No additional auxiliary or adversarial losses are employed.
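To make the objective concrete, here is a minimal sketch of one training step under the noise-prediction loss above. The function name, tensor shapes (2-D gaze samples), the placeholder denoiser `eps_net`, and the precomputed $\bar{\alpha}$ schedule are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def diffusion_training_step(eps_net, x_target, x_observed, body_ctx, mask, alpha_bar):
    """One denoising score-matching step on the missing gaze slice.

    eps_net:    noise-prediction network (placeholder for the cross-modal
                transformer of Section 3)
    x_target:   (B, L, 2) ground-truth gaze samples to be imputed
    x_observed: (B, L, 2) observed gaze with missing entries zeroed
    body_ctx:   (B, L, D) encoded head (and optional wrist) features
    mask:       (B, L, 1) float, 1 = observed, 0 = missing
    alpha_bar:  (T,) cumulative noise schedule \bar{alpha}_t
    """
    B, T = x_target.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x_target.device)   # random diffusion step
    a_bar = alpha_bar[t].view(B, 1, 1)
    noise = torch.randn_like(x_target)
    # Forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    x_noisy = a_bar.sqrt() * x_target + (1.0 - a_bar).sqrt() * noise
    # Predict the injected noise, conditioned on observations and body context
    eps_hat = eps_net(x_noisy, x_observed, body_ctx, mask, t)
    # The loss is evaluated only on the missing (imputation-target) positions
    return (((noise - eps_hat) * (1.0 - mask)) ** 2).mean()
```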
3. Cross-Modal Transformer: Architecture and Fusion
HAGI++ is distinguished by a stack of cross-modal transformer blocks that integrate multiple fusion operations:
- Input encoding: Observed and noisy gaze tokens, flattened SE(3) head (and, optionally, wrist) pose matrices (Fourier-encoded to augment frequency sensitivity), and a binary missing-data mask are each transformed via an MLP into a $d$-dimensional latent and positionally encoded.
- Self-attention (gaze-to-gaze): Query, key, and value projections all derive from gaze tokens, yielding intra-gaze dependencies.
- Cross-attention (gaze-to-body): Querying gaze tokens, with keys and values from body context, captures gaze–head/wrist coordination.
- Hybrid FiLM-based fusion: At each block, pooled head and wrist features modulate gaze tokens via feature-wise linear modulation; that is,
$$\tilde{z} = \gamma(b) \odot z + \beta(b),$$
where $b$ is the concatenated, pooled body representation, $z$ is a gaze token, $\gamma(\cdot)$ and $\beta(\cdot)$ are learned linear transformations, and $\odot$ indicates element-wise multiplication.
The final stack outputs are mapped by an MLP to the noise estimate $\epsilon_\theta\!\left(x_t^{\mathrm{ta}}, t \mid x^{\mathrm{co}}, c\right)$, predicting the noise required for the denoising objective.
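The following PyTorch sketch illustrates one such block combining the three fusion operations (self-attention, cross-attention, FiLM). Layer sizes, normalisation placement, and the assumption that head and wrist tokens arrive pre-concatenated as `body_tok` are illustrative choices rather than the exact HAGI++ configuration.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-modal transformer block: gaze self-attention, gaze-to-body
    cross-attention, and FiLM modulation by pooled body features (sketch)."""

    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.film = nn.Linear(d_model, 2 * d_model)   # produces (gamma, beta)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, gaze_tok, body_tok):
        # Self-attention: Q, K, V all come from gaze tokens (intra-gaze dependencies)
        h = self.norm1(gaze_tok)
        gaze_tok = gaze_tok + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: gaze queries attend to body keys/values
        h = self.norm2(gaze_tok)
        gaze_tok = gaze_tok + self.cross_attn(h, body_tok, body_tok,
                                              need_weights=False)[0]
        # FiLM: temporally pooled body features modulate gaze tokens feature-wise
        gamma, beta = self.film(body_tok.mean(dim=1, keepdim=True)).chunk(2, dim=-1)
        gaze_tok = gamma * gaze_tok + beta
        return gaze_tok + self.ffn(self.norm3(gaze_tok))
```

In HAGI++ a stack of such blocks forms the denoiser; this sketch omits dropout, diffusion-step embeddings, and masking of padded positions.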
4. Body Movement Integration: Head and Wrist Signals
Body context is operationalized by:
- Head orientation: For each frame $i$, the SLAM head pose $\mathbf{T}_i$ (a four-by-three rotation–translation matrix) is used to compute the incremental relative rotation/translation $\Delta\mathbf{T}_i$ with respect to frame $i-1$. Each $\Delta\mathbf{T}_i$ is vectorized (12-D) and Fourier positionally encoded.
- Wrist/hand movement: Wearable devices provide wrist poses $\mathbf{W}_i$; the relative wrist motion for frame $i$ is $\Delta\mathbf{W}_i$, again computed with respect to frame $i-1$. This is similarly flattened and encoded. Ablation studies (see Section 7) show that rotational components contribute more strongly to predictive power than translations, and that using both wrists (where available) further improves performance, especially for sustained missing intervals.
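A minimal sketch of this preprocessing follows, assuming poses are available as homogeneous $4\times4$ SE(3) matrices and using an assumed set of Fourier frequency bands.

```python
import numpy as np

def relative_poses(poses):
    """poses: (N, 4, 4) homogeneous SE(3) poses per frame (assumption).
    Returns (N-1, 12) frame-to-frame increments, flattened [R | t] blocks."""
    rel = np.einsum('nij,njk->nik', np.linalg.inv(poses[:-1]), poses[1:])
    return rel[:, :3, :4].reshape(-1, 12)   # drop the constant bottom row

def fourier_encode(x, n_freqs=6):
    """Sin/cos encoding at geometrically spaced frequencies (bands assumed)."""
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi           # (F,)
    ang = x[..., None] * freqs                            # (..., 12, F)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).reshape(*x.shape[:-1], -1)
```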
5. Empirical Evaluation: Datasets, Protocols, and Metrics
Three major datasets are used for quantitative benchmarks:
- Nymeria: 300 h, 264 participants, 50 sites; 30 Hz gaze, head SLAM, wrist, full-body motion, annotations.
- Ego-Exo4D: 4.6 h, 72 recordings.
- HOT3D: 3.5 h, 111 recordings.
Records are partitioned 80%/5%/15% for train/val/test splits on Nymeria. Cross-dataset evaluation assesses generalisation.
Missing-data protocol:
- For imputation, missing-data ratios of 10%, 30%, 50%, and 90% are created: some masked gaps correspond to segments of ≈450 ms (a canonical blink duration), while others mask at least 150 ms at random positions.
- A 100% missing setting tests pure synthetic gaze generation.
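The masking protocol can be sketched roughly as below; the exact segment-placement strategy is an assumption, with only the 150–450 ms gap lengths and the target missing ratios taken from the protocol above.

```python
import numpy as np

def make_missing_mask(n_frames, missing_ratio, fs=30.0, min_gap_s=0.15,
                      max_gap_s=0.45, seed=None):
    """Return a boolean mask (True = observed) with roughly `missing_ratio`
    of frames removed in contiguous blink-like segments of 150-450 ms."""
    rng = np.random.default_rng(seed)
    mask = np.ones(n_frames, dtype=bool)
    target_missing = int(round(missing_ratio * n_frames))
    while (~mask).sum() < target_missing:
        gap_len = int(rng.uniform(min_gap_s, max_gap_s) * fs)   # 150-450 ms gap
        start = rng.integers(0, max(1, n_frames - gap_len))
        mask[start:start + gap_len] = False
    return mask
```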
Baselines:
- Classical interpolation: Linear, Nearest.
- Head-direction proxy.
- Deep learning: iTransformer, DLinear, TimesNet, BRITS, CSDI.
- Gaze–head: HAGI.
- Generation: Pose2Gaze (full-body pose based).
Metrics:
- Mean Angular Error (MAE): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\arccos\!\left(\hat{\mathbf{g}}_i \cdot \mathbf{g}_i\right)$, the average angular deviation between predicted unit gaze directions $\hat{\mathbf{g}}_i$ and ground-truth directions $\mathbf{g}_i$, reported in degrees.
- Jensen–Shannon (JS) divergence between imputed/generative and true gaze velocity distributions, computed over velocity histograms (30 Hz, 100 bins).
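Both metrics can be computed along the following lines, assuming gaze is represented as unit 3-D direction vectors sampled at 30 Hz; the histogram range and the distance-versus-divergence convention are assumptions noted in the code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def mean_angular_error(pred, gt):
    """pred, gt: (N, 3) unit gaze direction vectors; returns MAE in degrees."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

def velocity_js_divergence(pred, gt, fs=30.0, bins=100):
    """JS divergence between angular-velocity histograms of two gaze streams."""
    def angular_velocity(g):
        cos = np.clip(np.sum(g[1:] * g[:-1], axis=-1), -1.0, 1.0)
        return np.degrees(np.arccos(cos)) * fs            # deg/s at 30 Hz
    v_pred, v_gt = angular_velocity(pred), angular_velocity(gt)
    hi = max(v_pred.max(), v_gt.max())
    h_pred, _ = np.histogram(v_pred, bins=bins, range=(0.0, hi))
    h_gt, _ = np.histogram(v_gt, bins=bins, range=(0.0, hi))
    # scipy returns the JS *distance* (sqrt of the divergence); square it here
    return jensenshannon(h_pred, h_gt) ** 2
```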
6. Quantitative Performance and Statistical Behaviour
Within-Dataset Results (Nymeria)
| Method | 10% | 30% | 50% | 90% |
|---|---|---|---|---|
| Linear | 4.96° | 6.88° | 9.68° | 11.54° |
| Nearest | 5.29° | 6.52° | 8.34° | 12.61° |
| CSDI | 4.72° | 5.90° | 7.44° | 10.54° |
| HAGI | 3.67° | 4.55° | 5.77° | 8.53° |
| HAGI++ | 3.54° | 4.40° | 5.58° | 8.18° |
Table: MAE (lower is better) as a function of missing data ratio (Nymeria dataset).
| Method | 10% | 30% | 50% | 90% |
|---|---|---|---|---|
| Nearest | 0.081 | 0.073 | 0.103 | 0.135 |
| CSDI | 0.044 | 0.042 | 0.037 | 0.030 |
| HAGI | 0.042 | 0.040 | 0.035 | 0.017 |
| HAGI++ | 0.045 | 0.042 | 0.036 | 0.017 |
Table: JS divergence (lower is better) between imputed and true gaze velocity distributions (Nymeria).
Cross-dataset evaluation mirrors these results: HAGI++ consistently demonstrates the lowest MAE (e.g., 2.98° at 10% missing on Ego-Exo4D) and matches or surpasses HAGI in JS divergence.
Statistical analyses: Velocity histograms computed at 30 Hz (100 bins) reveal that HAGI++ outputs not only exhibit lower angular errors but also closely replicate human gaze velocity statistics, as measured by JS divergence across all missing-data protocols.
7. Pure Gaze Generation and Ablation
In the scenario of missing gaze ("pure generation"), HAGI++ leverages only body movement signals:
| Method | MAE | JS |
|---|---|---|
| Pose2Gaze | 13.09° | 0.238 |
| HAGI++ (head only) | 11.65° | 0.138 |
| + left wrist | 11.28° | 0.153 |
| + right wrist | 10.98° | 0.156 |
| + both wrists | 10.79° | 0.064 |
Table: Gaze generation performance (Nymeria, 100% missing).
- Incorporating both wrist signals yields a 17.6% MAE reduction over Pose2Gaze and a 73% reduction in JS.
- Ablation indicates that both rotation and translation features of wrist inputs contribute, with combination yielding best results.
- HAGI++ achieves better or comparable results to full-body pose-based generation using only wearable head and wrist signals.
A plausible implication is that commercial wearable sensors can substitute for full-body motion capture in challenging generation regimes.
8. Generalization, Efficiency, and Future Directions
HAGI++ generalises effectively across datasets (e.g., Nymeria, Ego-Exo4D, HOT3D) without the need for fine-tuning, and is reported as computationally efficient for batch inference. Its design accommodates both offline post-processing and real-time imputation/generation.
Limitations include untested performance on higher-frequency gaze data, potential sensitivity to large head–eye misalignments, and the exclusion of egocentric scene visual features. Future investigations aim to integrate egocentric images, analyze temporal misalignments, and optimize for real-time, on-device execution, particularly for interactive XR scenarios.
9. Summary of Key Contributions
HAGI++ introduces a diffusion-based conditional imputation and generation model specialized for mobile gaze data, with central innovations:
- Multi-modal fusion of gaze, head, and wrist/hand movements via cross-modal transformers utilizing self-attention, cross-attention, and FiLM-based mechanisms.
- Statistically realistic gaze reconstructions across a spectrum of missing-data conditions, validated on diverse real-world datasets.
- Demonstrated applicability for both data restoration and full synthetic gaze generation, using only wearable sensor data for the latter.
- Empirical reductions in MAE (up to 25.3% over baseline imputation) and close statistical matching to true gaze velocity distributions.
- Design generalization and computational efficiency suited to both research post-processing and low-latency interactive applications.
Experimental findings confirm that head orientation signals are especially predictive of gaze, and that wrist motion yields additional gains, particularly for reconstruction over extended gaps. The modular architecture allows for future extension to incorporate further body cues or contextual scene information.