GazeVLM: Gaze-Augmented Vision-Language Models
- GazeVLM is a methodology that integrates human or simulated gaze signals into vision-language models to provide enhanced grounding and spatial reasoning.
- It employs techniques such as gaze-conditioned fusion, regularized attention, and gaze-guided cropping to improve semantic prediction and reduce computational load.
- The integration leads to robust empirical gains in domains like robotics, medical imaging, and AR/VR by effectively anchoring visual attention and mitigating ambiguities.
GazeVLM encompasses a class of methodologies and systems that integrate eye-gaze and visual-attention cues (human, machine-simulated, or derived from a vision-language model's own attention dynamics) into the architecture and training of vision-language models (VLMs) for active perception, multimodal understanding, and embodied AI. These approaches leverage gaze as a privileged signal for grounding, anticipation, hallucination mitigation, and efficient encoding, going beyond traditional bottom-up feature extraction. Distinct paradigms include third-person gaze understanding, egocentric activity analysis, medical imaging integration, robotics navigation, and user-guided perception. GazeVLM systems employ diverse strategies such as gaze-conditioned fusion, gaze-regularized attention, gaze-informed spatial cropping, and gaze-shift-derived saliency computation. Built on transformer VLM backbones (BLIP-2, Qwen2-VL, OpenFlamingo, etc.), these architectures demonstrate robust empirical gains in semantic prediction, interpretability, and efficiency across static imagery, video, and interactive scenarios.
1. Foundations and Problem Taxonomy
GazeVLM originated from independent lines of research seeking to unify person detection, gaze localization, and object-of-attention identification within VLMs (Mathew et al., 9 Nov 2025). The core motivation is that gaze conveys high-value information: in third-person settings, it reveals "who looks where at what," while in egocentric contexts, it exposes intent and future actions (Pani et al., 24 Oct 2025). Conventional vision-only pipelines struggle with occlusion, crop misalignments, and brittle semantic grounding, whereas gaze augments spatial reasoning with cognitive priors. Tasks addressed by GazeVLM systems include:
| Setting | Task Set | Signal Source |
|---|---|---|
| Third-person | Person detection, gaze-point regression, gaze-object identification | Image, depth (HHA maps), text prompt |
| Egocentric | Future event anticipation, activity parsing | Eye-tracking heatmaps |
| Robotics | Navigation pose selection, orientation | VLM zero-shot scoring |
| Medical (CXR) | Report generation, diagnosis | Radiologist gaze video |
| Human-AI UI | QA grounding, object referencing | AR/VR gaze |
The taxonomic feature uniting these systems is their explicit or implicit use of gaze-derived spatial priors to guide model attention, output selection, or efficient representation.
2. Model Architectures and Gaze Integration Techniques
GazeVLM architectures span multiple integration paradigms:
(a) Cross-Modal Fusion with Gaze Signals
GazeVLM for multi-task gaze understanding (Mathew et al., 9 Nov 2025), built on Qwen2-VL-2B, fuses RGB and HHA depth features from a frozen vision encoder and concatenates the outputs for cross-attention in the text decoder. All tasks (detection, localization, identification) are posed as sequence generation conditioned on a textual prompt, with structured outputs such as <box_start>…<box_end> spans for bounding boxes.
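A minimal sketch of this serialization, assuming the <box_start>/<box_end> delimiters above and a hypothetical 0-1000 integer coordinate normalization (not the paper's exact specification):

```python
# Sketch: posing detection/localization as text generation with box tokens.
# The coordinate normalization and helper names are illustrative assumptions.

def box_to_text(box, img_w, img_h):
    """Serialize a bounding box (x1, y1, x2, y2) in pixels into a token span."""
    x1, y1, x2, y2 = box
    coords = [
        round(1000 * x1 / img_w), round(1000 * y1 / img_h),
        round(1000 * x2 / img_w), round(1000 * y2 / img_h),
    ]
    return "<box_start>" + ",".join(str(c) for c in coords) + "<box_end>"

def build_sample(task_prompt, boxes, img_w, img_h):
    """Pair a task prompt with its target sequence for supervised fine-tuning."""
    target = " ".join(box_to_text(b, img_w, img_h) for b in boxes)
    return {"prompt": task_prompt, "target": target}

# Example: a person-detection query with one ground-truth box.
sample = build_sample("Detect all people in the image.",
                      boxes=[(120, 80, 340, 560)], img_w=640, img_h=640)
```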
(b) Gaze-Regularized Attention Blocks
Egocentric GazeVLM (Pani et al., 24 Oct 2025) and Voila-A (Yan et al., 2023) insert gaze into the attention mechanism via heatmap-derived per-patch priors. In training, VLM attention maps are regularized against human gaze distributions using KL-divergence penalties, increasing attention–gaze overlap from ~42% (baseline) to ~68%.
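A minimal PyTorch sketch of such a gaze-regularization term, assuming patch-level attention weights and a gaze heatmap pooled onto the same patch grid; the KL direction and loss weighting are assumptions:

```python
import torch

def gaze_attention_loss(attn, gaze, eps=1e-8):
    """KL(gaze || attention) over image patches.

    attn: (batch, num_patches) model attention weights (rows sum to 1)
    gaze: (batch, num_patches) gaze heatmap pooled to the patch grid (rows sum to 1)
    """
    attn = attn.clamp_min(eps)
    gaze = gaze.clamp_min(eps)
    return (gaze * (gaze.log() - attn.log())).sum(dim=-1).mean()

def total_loss(task_loss, attn, gaze, lam=0.1):
    """Combine the task objective with the gaze-regularization penalty."""
    return task_loss + lam * gaze_attention_loss(attn, gaze)
```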
(c) Gaze-Guided Cropping and Feature Fusion
GazeLLM (Rekimoto, 31 Mar 2025) proposes region decomposition: high-resolution crops around the foveal gaze point (≈10% of the pixels) and a downsampled periphery are encoded separately, then fused by a weighted sum for efficient LLM input. This preserves comprehension comparable to full-resolution input while reducing memory by up to 8×.
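A minimal sketch of the decomposition and fusion steps, assuming a single gaze point per frame; the crop/downsample sizes and the fusion weight are illustrative assumptions:

```python
from PIL import Image

def gaze_decompose(img: Image.Image, gaze_xy, fovea=224, periphery=(112, 112)):
    """Full-resolution crop around the gaze point plus a coarse peripheral view.
    Crop and downsample sizes here are illustrative, not the paper's settings."""
    x, y = gaze_xy
    left = max(0, min(img.width - fovea, int(x - fovea / 2)))
    top = max(0, min(img.height - fovea, int(y - fovea / 2)))
    foveal = img.crop((left, top, left + fovea, top + fovea))
    peripheral = img.resize(periphery)
    return foveal, peripheral

def fuse(f_fovea, f_periphery, alpha=0.7):
    """Weighted sum of the separately encoded streams (alpha is an assumption)."""
    return alpha * f_fovea + (1.0 - alpha) * f_periphery
```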
(d) Gaze Shift–Based Saliency and Hallucination Mitigation
GIFT (Qi et al., 24 Oct 2025) computes "gaze shifts" as positive changes in visual attention over salient query tokens, producing a normalized saliency map that drives both visual and query attention amplification during decoding. This approach reduces caption-level hallucination by up to 20.7% (CHAIR) while maintaining general accuracy.
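A minimal sketch of this saliency computation, assuming per-patch attention vectors from consecutive decoding steps; the amplification rule and the gamma factor are assumptions:

```python
import torch

def gaze_shift_saliency(attn_prev, attn_curr, eps=1e-8):
    """Positive change in visual attention between consecutive decoding steps,
    normalized into a saliency map (an illustrative reading of 'gaze shift')."""
    shift = (attn_curr - attn_prev).clamp_min(0.0)
    return shift / (shift.sum(dim=-1, keepdim=True) + eps)

def amplify(attn, saliency, gamma=1.0):
    """Re-weight attention toward salient positions and renormalize."""
    boosted = attn * (1.0 + gamma * saliency)
    return boosted / boosted.sum(dim=-1, keepdim=True)
```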
(e) Sequential Gaze Video Representation
RadEyeVideo (Kim et al., 12 Jul 2025) embeds both the spatial and temporal dynamics of expert gaze in medical imaging. Eye-fixation sequences are rendered as video overlays, encoded via patch or spatio-temporal transformer backbones, and then fused with image features by gated concatenation for report generation and diagnosis.
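A minimal PyTorch sketch of gated concatenation between image features and gaze-video features; the feature dimensions and gating form are assumptions:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of image features with gaze-video features (sketch only)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_gaze = nn.Linear(dim, dim)

    def forward(self, img_feat, gaze_feat):
        # Gate computed from the concatenated streams, then used to blend them.
        g = torch.sigmoid(self.gate(torch.cat([img_feat, gaze_feat], dim=-1)))
        return g * self.proj_img(img_feat) + (1.0 - g) * self.proj_gaze(gaze_feat)
```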
3. Mathematical Formalizations and Algorithms
Formalisms are tailored to the gaze source and model type; hedged illustrative forms are sketched after the list:
- Candidate pose sets scored for robotic VLM navigation (Zhu et al., 12 Jul 2024).
- Gaze heatmaps constructed from fixation points (Rekimoto, 31 Mar 2025).
- Patch-wise attention–gaze alignment penalties (Pani et al., 24 Oct 2025).
- Gaze-shift saliency derived from changes in visual attention (Qi et al., 24 Oct 2025).
- Object-level AP, computed as the area under the precision–recall curve over LVIS classes (Mathew et al., 9 Nov 2025).
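Since the original equations are not reproduced here, the following are hedged illustrative forms consistent with the descriptions above; the symbols and normalizations are assumptions, not the papers' notation:

```latex
% Illustrative forms only, reconstructed from the textual descriptions above.
\begin{align}
G(p) &= \sum_{k} \exp\!\left(-\frac{\lVert p - g_k \rVert^2}{2\sigma^2}\right)
  && \text{gaze heatmap from fixations } g_k \\
\mathcal{L}_{\text{align}} &= \mathrm{KL}\!\left(G \,\Vert\, A\right)
  && \text{patch-wise attention--gaze alignment} \\
s_i &= \frac{\max\!\left(0,\, a_i^{(t)} - a_i^{(t-1)}\right)}
            {\sum_j \max\!\left(0,\, a_j^{(t)} - a_j^{(t-1)}\right)}
  && \text{gaze-shift saliency over patches } i
\end{align}
```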
Algorithmic implementations range from two-stage candidate scoring with VLM prompts (Zhu et al., 12 Jul 2024) and early fusion of VLM-extracted cues into transformer person tokens (Gupta et al., 6 Jun 2024) to gated multimodal concatenation after separate vision/video encoders (Kim et al., 12 Jul 2025).
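A minimal sketch of the two-stage candidate-scoring pattern, where `render_view` and `query_vlm` are hypothetical helpers rather than an API from the cited work:

```python
# Sketch: score candidate poses with a VLM prompt, then pick the best one.

def select_pose(candidate_poses, goal_text, render_view, query_vlm):
    """Stage 1: render each candidate pose; Stage 2: ask the VLM how well the
    rendered view satisfies the textual goal, then return the top-scoring pose."""
    scored = []
    for pose in candidate_poses:
        view = render_view(pose)
        prompt = (f"On a scale of 0-10, how well does this view satisfy the "
                  f"instruction: '{goal_text}'? Answer with a single number.")
        score = float(query_vlm(view, prompt))
        scored.append((score, pose))
    return max(scored, key=lambda sp: sp[0])[1]
```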
4. Empirical Evaluation and Key Results
GazeVLM methods exhibit robust quantitative gains across domains:
| Domain | Metric | Baseline | GazeVLM | Relative Gain |
|---|---|---|---|---|
| Robotics | DTG (m) | 1.79 | 0.56 | -68.8% |
| GazeFollow | AUC | 0.928 | 0.929 | +0.1% |
| GazeFollow | AvgDist | 0.122 | 0.131 | Small trade-off |
| VQA | Hallucination | 50.2% | 39.8% | -20.7% |
| Medical | CheXbert F1 (baseline normalized to 100) | 100 | 124 | +24.6% |
| Egocentric | Semantic Similarity | 0.6525 | 0.7505 | +15% |
Ablations reveal that RGB+HHA fusion outperforms naive depth inclusion (Mathew et al., 9 Nov 2025), and heatmap-based gaze tokens yield superior grounding versus discrete or bounding-box encodings (Yan et al., 2023). Visual prompts with the full image and an annotated ellipse further improve cue extraction in multi-person gaze following (Gupta et al., 6 Jun 2024). Latency and computational overhead are modest for GIFT (a 13% increase over greedy decoding; Qi et al., 24 Oct 2025), and gaze-crop pipelines offer an order-of-magnitude reduction in token count with preserved or improved coverage (Rekimoto, 31 Mar 2025).
5. Applications, Limitations, and Future Directions
GazeVLM underpins diverse applications:
- Active visual navigation and manipulation (robotics, scanning) by scoring candidate camera poses for accessibility and orientation (Zhu et al., 12 Jul 2024).
- Semantic event anticipation, wearable intelligence, and collaboration, leveraging human gaze for predictive cues in egocentric video (Pani et al., 24 Oct 2025, Rekimoto, 31 Mar 2025).
- Medical report generation and diagnosis via radiologist eye-movement video (Kim et al., 12 Jul 2025).
- Real-world QA and object-referencing in AR/VR, aligning model output with user gaze (Yan et al., 2023).
- Hallucination mitigation in open-domain VLMs using gaze-shift-inspired saliency (Qi et al., 24 Oct 2025).
Limitations include reliance on high-quality gaze data (misalignment degrades attention regularization; Pani et al., 24 Oct 2025), the inability to exploit temporal continuity in static-only frameworks (Mathew et al., 9 Nov 2025), susceptibility to prompt-format dependencies, and, occasionally, increased latency or memory footprint. Failure modes noted in counting, high-detail OCR, and certain ambiguous spatial queries remain open challenges.
Anticipated research directions encompass video-level gaze tracking, attention-aware multimodal fusion (e.g., deformable patches, multi-resolution encoding), multi-modal expansion (voice, gestures), efficient model distillation for real-time operation, and self-supervised learning on large unlabelled corpora.
6. Interpretations and Open Questions
GazeVLM systems suggest a unifying abstraction: gaze—whether human, machine-predicted, or model-internal—serves as a proxy for cognitive attention, enhancing selective fusion, semantic grounding, and anticipatory understanding. A plausible implication is that future VLMs will operate in continuous perception–action cycles, with gaze priors dynamically steering modality fusion and language output. The convergence of third-person and egocentric gaze pipelines remains an unresolved frontier, notably in multi-agent social interaction and embodied collaboration scenarios.
7. Comparative Table of Major GazeVLM Paradigms
| Name | Integration Method | Tasks | Key Results |
|---|---|---|---|
| GazeVLM (Mathew et al., 9 Nov 2025) | VLM fusion, RGB+HHA, prompts | Person/gaze/object detection | AUC 0.929 (GF), AP 0.23 |
| Gaze-VLM (Pani et al., 24 Oct 2025) | Gaze-regularized attention | Future event, activity prediction | +11%/7% semantic gain, C_I ↓ 32% |
| GazeLLM (Rekimoto, 31 Mar 2025) | Gaze-crop fusion | Activity parsing, instruction gen. | ≈ full-input performance, ~10× fewer tokens |
| GIFT (Qi et al., 24 Oct 2025) | Gaze-shift saliency | Hallucination mitigation | −20.7% CHAIR, +11.7% MMHal |
| Voila-A (Yan et al., 2023) | Gaze heatmap perceiver block | QA grounding, AR/VR agent | Highest helpfulness/grounding |
| RadEyeVideo (Kim et al., 12 Jul 2025) | Gaze video (sequential) | CXR report generation, diagnosis | +24.6% (F1); surpasses domain-specific LLMs |
| Navi2Gaze (Zhu et al., 12 Jul 2024) | VLM pose scoring, prompt | Navigation, object orientation | DTG -68.8%, SR +0.57 |
The grouping above is an editor's synthesis; the table distinguishes GazeVLM approaches by integration method and principal task domain.
GazeVLM research demonstrates that augmenting vision-language models with gaze, in all its forms, improves the selectivity, grounding, and anticipation of AI systems in vision, language, and action tasks. Continued progress in gaze-signal acquisition, multimodal fusion, and efficient architecture design is expected to further enhance real-world applications in robotics, medicine, social AI, and human-computer interaction.