
GazeVLM: Gaze-Augmented Vision-Language Models

Updated 16 November 2025
  • GazeVLM is a methodology that integrates human or simulated gaze signals into vision-language models to provide enhanced grounding and spatial reasoning.
  • It employs techniques such as gaze-conditioned fusion, regularized attention, and gaze-guided cropping to improve semantic prediction and reduce computational load.
  • The integration leads to robust empirical gains in domains like robotics, medical imaging, and AR/VR by effectively anchoring visual attention and mitigating ambiguities.

GazeVLM encompasses a class of methodologies and systems that integrate eye gaze and visual attention cues, whether human, machine-simulated, or derived from the attention dynamics of vision-language models (VLMs), into the architecture and training of VLMs for active perception, multimodal understanding, and embodied AI. These approaches leverage gaze as a privileged signal for grounding, anticipation, hallucination mitigation, and efficient encoding, going beyond traditional bottom-up feature extraction. Distinct paradigms include third-person gaze understanding, egocentric activity analysis, medical imaging integration, robotics navigation, and user-guided perception. GazeVLM systems employ diverse strategies such as gaze-conditioned fusion, gaze-regularized attention, gaze-informed spatial cropping, and gaze-shift-derived saliency computation. Built on transformer VLM backbones (BLIP-2, Qwen2-VL, OpenFlamingo, etc.), these architectures demonstrate robust empirical gains in semantic prediction, interpretability, and efficiency across static imagery, video, and interactive scenarios.

1. Foundations and Problem Taxonomy

GazeVLM originated from independent lines of research seeking to unify person detection, gaze localization, and object-of-attention identification within VLMs (Mathew et al., 9 Nov 2025). The core motivation is that gaze conveys high-value information: in third-person settings, it reveals "who looks where at what," while in egocentric contexts, it exposes intent and future actions (Pani et al., 24 Oct 2025). Conventional vision-only pipelines struggle with occlusion, crop misalignments, and brittle semantic grounding, whereas gaze augments spatial reasoning with cognitive priors. Tasks addressed by GazeVLM systems include:

| Setting | Task Set | Signal Source |
|---|---|---|
| Third-person | Person detection, gaze-point regression, gaze-object identification | Image, depth, text prompt, HHA maps |
| Egocentric | Future event anticipation, activity parsing | Eye-tracking heatmaps |
| Robotics | Navigation pose selection, orientation | VLM zero-shot scoring |
| Medical (CXR) | Report generation, diagnosis | Radiologist gaze video |
| Human-AI UI | QA grounding, object referencing | AR/VR gaze |

The taxonomic feature uniting these systems is their explicit or implicit use of gaze-derived spatial priors to guide model attention, output selection, or efficient representation.

2. Model Architectures and Gaze Integration Techniques

GazeVLM architectures span multiple integration paradigms:

(a) Cross-Modal Fusion with Gaze Signals

GazeVLM for multi-task gaze understanding (Mathew et al., 9 Nov 2025), built on Qwen2-VL-2B, fuses RGB and HHA depth features in a frozen vision encoder, concatenating the outputs for cross-attention in a text decoder. All tasks (detection, localization, identification) are cast as sequence generation conditioned on a textual prompt (e.g., <box_start>…<box_end> for bounding boxes).
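
A minimal sketch of this fusion pattern is shown below. Layer names, dimensions, and the box-serialization format are illustrative assumptions, not the exact Qwen2-VL-based implementation.

```python
import torch
import torch.nn as nn

class RGBHHAFusion(nn.Module):
    """Illustrative fusion of frozen RGB and HHA (depth-encoded) vision features.

    Hypothetical layer names and shapes; the actual GazeVLM wiring differs in detail.
    """
    def __init__(self, vision_encoder, hidden_dim=1536):
        super().__init__()
        self.vision_encoder = vision_encoder              # frozen ViT-style encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim)  # fuse the concatenated streams

    def forward(self, rgb, hha):
        with torch.no_grad():
            f_rgb = self.vision_encoder(rgb)              # (B, N_patches, hidden_dim)
            f_hha = self.vision_encoder(hha)              # (B, N_patches, hidden_dim)
        fused = torch.cat([f_rgb, f_hha], dim=-1)         # channel-wise concatenation
        return self.proj(fused)                           # visual tokens for the text decoder

# Tasks are phrased as prompted sequence generation; the target string serializes
# a bounding box between special tokens (coordinate format is hypothetical):
prompt = "Detect the person who is looking at the laptop."
target = "<box_start>212,87,341,298<box_end>"
```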

(b) Gaze-Regularized Attention Blocks

Egocentric GazeVLM (Pani et al., 24 Oct 2025) and Voila-A (Yan et al., 2023) insert gaze into the attention mechanism via heatmap-derived per-patch priors. During training, the VLM attention maps $A_t$ are regularized against human gaze distributions $\tilde H_t$ using KL-divergence penalties, increasing attention–gaze overlap from ~42% (baseline) to ~68%.
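
A minimal sketch of such a KL regularizer, matching the formula in Section 3; the tensor layout and the weighting coefficient are assumptions for illustration.

```python
import torch

def gaze_kl_penalty(attn, gaze_heatmap, eps=1e-8):
    """KL(A_t || H~_t) between per-patch model attention and a gaze-derived prior.

    attn:         (B, P) attention weights over P patches (e.g. pooled over heads)
    gaze_heatmap: (B, P) human gaze heatmap resampled to the same patch grid
    Both are renormalized to proper distributions before the divergence.
    """
    a = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    h = gaze_heatmap / (gaze_heatmap.sum(dim=-1, keepdim=True) + eps)
    return (a * (torch.log(a + eps) - torch.log(h + eps))).sum(dim=-1).mean()

# Hypothetical training objective: loss = lm_loss + lambda_gaze * gaze_kl_penalty(attn, prior)
```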

(c) Gaze-Guided Cropping and Feature Fusion

GazeLLM (Rekimoto, 31 Mar 2025) proposes region decomposition: high-res crops around foveal gaze (≈10% of pixels) and downsampled periphery are encoded separately, then fused by a weighted sum for efficient LLM input. This maintains full comprehension with up to 8× memory reduction.
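
The region decomposition can be sketched as follows; crop fractions, downsampling ratio, and fusion weights are hypothetical parameters, not GazeLLM's exact settings.

```python
import numpy as np

def gaze_decompose(frame, gaze_xy, fovea_frac=0.32, periph_scale=0.25):
    """Split a frame into a full-resolution foveal crop around the gaze point
    plus a coarse, downsampled periphery.

    frame:   (H, W, 3) uint8 image
    gaze_xy: (x, y) gaze point in pixel coordinates
    A 0.32 x 0.32 crop covers roughly 10% of the pixels at full resolution.
    """
    H, W = frame.shape[:2]
    ch, cw = int(H * fovea_frac), int(W * fovea_frac)
    x, y = gaze_xy
    top = int(np.clip(y - ch // 2, 0, H - ch))
    left = int(np.clip(x - cw // 2, 0, W - cw))
    fovea = frame[top:top + ch, left:left + cw]                # high-res region of interest
    stride = int(1 / periph_scale)
    periphery = frame[::stride, ::stride]                      # low-res global context
    return fovea, periphery

# Each stream is encoded separately and the token embeddings are combined,
# e.g. z = w_f * enc(fovea) + w_p * enc(periphery), before entering the LLM.
```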

(d) Gaze Shift–Based Saliency and Hallucination Mitigation

GIFT (Qi et al., 24 Oct 2025) computes "gaze shifts" as positive changes in visual attention over salient query tokens, producing a normalized saliency map $S$ that drives amplification of both visual and query attention during decoding. This approach reduces caption-level hallucination by up to 20.7% (CHAIR $C_s$) while maintaining general accuracy.
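
A minimal sketch of the gaze-shift saliency computation, following the formula in Section 3; the tensor layout and the rescaling gain are illustrative assumptions, not GIFT's exact code.

```python
import torch

def gaze_shift_saliency(attn_over_time, eps=1e-8):
    """Accumulate positive step-to-step increases in visual attention and
    min-max normalize the result.

    attn_over_time: (T, N_visual) attention over visual tokens at decoding steps T_r
    returns:        (N_visual,) saliency map S in [0, 1]
    """
    shifts = (attn_over_time[1:] - attn_over_time[:-1]).clamp(min=0)  # keep positive changes only
    s = shifts.sum(dim=0)
    return (s - s.min()) / (s.max() - s.min() + eps)

# S can then rescale visual (and salient query) attention during decoding,
# e.g. attn_visual = attn_visual * (1 + gamma * S) with a hypothetical gain gamma.
```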

(e) Sequential Gaze Video Representation

RadEyeVideo (Kim et al., 12 Jul 2025) embeds both the spatial and temporal dynamics of expert gaze in medical imaging. Eye-fixation sequences are rendered as video overlays, encoded via patch or spatio-temporal transformer backbones, and then fused with image features by gated concatenation for report generation and diagnosis.
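
A sketch of rendering a fixation sequence as a video overlay is given below; the disc radius, overlay color, and gated fusion expression are assumptions, not RadEyeVideo's exact rendering or fusion.

```python
import numpy as np

def render_gaze_video(image, fixations, radius=40, n_frames=None):
    """Render an eye-fixation sequence as a cumulative video overlay on a CXR.

    image:     (H, W) grayscale image, float in [0, 1]
    fixations: list of (x, y, duration) fixation points in temporal order
    Returns a (T, H, W, 3) clip; frame t highlights all fixations up to step t.
    """
    H, W = image.shape
    T = n_frames or len(fixations)
    yy, xx = np.mgrid[0:H, 0:W]
    frames = np.repeat(image[None, :, :, None], T, axis=0).repeat(3, axis=3)
    for t in range(T):
        for (x, y, _) in fixations[: t + 1]:
            mask = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
            frames[t][mask, 0] = np.clip(frames[t][mask, 0] + 0.5, 0, 1)  # red-tinted fixation disc
    return frames

# The clip is encoded by a spatio-temporal backbone and combined with image
# features by gated concatenation, e.g. fused = concat(image_feat, g * video_feat)
# with a learned gate g.
```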

3. Mathematical Formalizations and Algorithms

Formalisms are tailored to gaze source and model type:

  • Candidate pose set for robotic VLM navigation: $P = \{p_i = (x_i, y_i, \theta_i)\}$ (Zhu et al., 12 Jul 2024).
  • Gaze heatmaps: $G_t(x,y) = \sum_{i=1}^{N} \exp\!\big(-\tfrac{(x - x_t^{(i)})^2 + (y - y_t^{(i)})^2}{2\sigma^2}\big)$ (Rekimoto, 31 Mar 2025); a numeric sketch follows this list.
  • Patch-wise attention alignment: $D_{\mathrm{KL}}(A_t \,\Vert\, \tilde H_t) = \sum_{i=1}^{P} A_{t,i} \log \frac{A_{t,i}}{\tilde H_{t,i}}$ (Pani et al., 24 Oct 2025).
  • Gaze-shift saliency: $S = \mathrm{Normalize}_{\max\text{-}\min}\big(\sum_{t \in T_r} \max(0,\ \alpha_t^V - \alpha_{t-1}^V)\big)$ (Qi et al., 24 Oct 2025).
  • Object-level AP: $AP_{ob} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} AP_c$, where $AP_c$ is the area under the precision–recall curve for LVIS classes (Mathew et al., 9 Nov 2025).
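
The gaze-heatmap formula above can be computed directly from fixation samples; the sketch below is a direct numeric implementation, with the spread sigma as a hypothetical parameter.

```python
import numpy as np

def gaze_heatmap(fixations, H, W, sigma=25.0):
    """G_t(x, y) = sum_i exp(-((x - x_i)^2 + (y - y_i)^2) / (2 sigma^2)).

    fixations: list of (x_i, y_i) gaze samples for frame t
    sigma:     Gaussian spread in pixels (illustrative; tuned to tracker noise in practice)
    """
    yy, xx = np.mgrid[0:H, 0:W].astype(np.float64)
    G = np.zeros((H, W))
    for (xi, yi) in fixations:
        G += np.exp(-((xx - xi) ** 2 + (yy - yi) ** 2) / (2 * sigma ** 2))
    return G / (G.max() + 1e-8)   # normalized so it can serve as an attention prior

# Example: two fixations on a 480x640 frame
G = gaze_heatmap([(320, 240), (410, 260)], H=480, W=640)
```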

Algorithmic implementations range from two-stage candidate scoring with VLM prompts (Zhu et al., 12 Jul 2024) and early fusion of VLM-extracted cues into transformer person tokens (Gupta et al., 6 Jun 2024) to gated multimodal concatenation after separate vision/video encoders (Kim et al., 12 Jul 2025).
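
A sketch of the two-stage candidate scoring idea is given below. The helper functions render_fn and query_vlm, the prompt wording, and the filter/score split are hypothetical and stand in for the prompting pipeline described in Navi2Gaze.

```python
def score_candidate_poses(poses, render_fn, query_vlm, goal="face the stove"):
    """Two-stage zero-shot scoring of candidate navigation poses.

    poses:     list of (x, y, theta) candidates (the set P from Section 3)
    render_fn: returns the camera view at a pose (hypothetical helper)
    query_vlm: wraps a VLM call returning a short textual answer (hypothetical interface)
    """
    # Stage 1: coarse filtering, e.g. discard poses whose view misses the target object.
    visible = [p for p in poses
               if query_vlm(render_fn(p), "Is the stove visible? Answer yes or no.") == "yes"]
    # Stage 2: fine-grained scoring of the remaining views against the task goal.
    scored = [(p, float(query_vlm(render_fn(p),
                                  f"Rate 0-10 how well this view lets the robot {goal}.")))
              for p in visible]
    return max(scored, key=lambda s: s[1])[0] if scored else None
```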

4. Empirical Evaluation and Key Results

GazeVLM methods exhibit robust quantitative gains across domains:

| Domain | Metric | Baseline | GazeVLM | Relative Gain |
|---|---|---|---|---|
| Robotics | DTG (m) | 1.79 | 0.56 | −68.8% |
| GazeFollow | AUC | 0.928 | 0.929 | +0.1% |
| GazeFollow | AvgDist | 0.122 | 0.131 | small trade-off |
| VQA | Hallucination $C_s$ | 50.2% | 39.8% | −20.7% |
| Medical | CheXbert F1 | ref = 100 | 124 | +24.6% |
| Egocentric | Semantic similarity | 0.6525 | 0.7505 | +15% |

Ablations reveal that RGB+HHA fusion outperforms naive depth inclusion (Mathew et al., 9 Nov 2025), and heatmap-based gaze tokens yield superior grounding versus discrete or bounding-box methods (Yan et al., 2023). Visual prompts with the full image and an annotated ellipse further improve cue extraction in multi-person gaze following (Gupta et al., 6 Jun 2024). Latency and computational overhead are modest for GIFT, at a 13% increase over greedy decoding (Qi et al., 24 Oct 2025); gaze-crop pipelines offer an order-of-magnitude reduction in token count with preserved or improved coverage (Rekimoto, 31 Mar 2025).

5. Applications, Limitations, and Future Directions

GazeVLM underpins diverse applications:

  • Active visual navigation and manipulation (robotics, scanning) by scoring candidate camera poses for accessibility and orientation (Zhu et al., 12 Jul 2024).
  • Semantic event anticipation, wearable intelligence, and collaboration, leveraging human gaze for predictive cues in egocentric video (Pani et al., 24 Oct 2025, Rekimoto, 31 Mar 2025).
  • Medical report generation and diagnosis via radiologist eye-movement video (Kim et al., 12 Jul 2025).
  • Real-world QA and object-referencing in AR/VR, aligning model output with user gaze (Yan et al., 2023).
  • Hallucination mitigation in open-domain VLMs using gaze-shift-inspired saliency (Qi et al., 24 Oct 2025).

Limitations include reliance on high-quality gaze data (misalignment degrades attention regularization (Pani et al., 24 Oct 2025)), inability to exploit temporal continuity in some static-only frameworks (Mathew et al., 9 Nov 2025), susceptibility to prompt-format dependencies, and, occasionally, increased latency or memory footprint. Failure modes noted in counting, high-detail OCR, and certain ambiguous spatial queries remain open challenges.

Anticipated research directions encompass video-level gaze tracking, attention-aware multimodal fusion (e.g., deformable patches, multi-resolution encoding), multi-modal expansion (voice, gestures), efficient model distillation for real-time operation, and self-supervised learning on large unlabelled corpora.

6. Interpretations and Open Questions

GazeVLM systems suggest a unifying abstraction: gaze—whether human, machine-predicted, or model-internal—serves as a proxy for cognitive attention, enhancing selective fusion, semantic grounding, and anticipatory understanding. A plausible implication is that future VLMs will operate in continuous perception–action cycles, with gaze priors dynamically steering modality fusion and language output. The convergence of third-person and egocentric gaze pipelines remains an unresolved frontier, notably in multi-agent social interaction and embodied collaboration scenarios.

7. Comparative Table of Major GazeVLM Paradigms

| Name | Integration Method | Tasks | Key Results |
|---|---|---|---|
| GazeVLM (Mathew et al., 9 Nov 2025) | VLM fusion, RGB+HHA, prompts | Person/gaze/object detection | AUC 0.929 (GazeFollow), $AP_{ob}$ 0.23 |
| Gaze-VLM (Pani et al., 24 Oct 2025) | Gaze-regularized attention | Future event, activity prediction | +11%/7% semantic gain, $C_I$ ↓ 32% |
| GazeLLM (Rekimoto, 31 Mar 2025) | Gaze-crop fusion | Activity parsing, instruction gen. | ≈ full-input performance, ~10× fewer tokens |
| GIFT (Qi et al., 24 Oct 2025) | Gaze-shift saliency | Hallucination mitigation | −20.7% CHAIR $C_s$, +11.7% MMHal |
| Voila-A (Yan et al., 2023) | Gaze heatmap perceiver block | QA grounding, AR/VR agent | Highest helpfulness/grounding |
| RadEyeVideo (Kim et al., 12 Jul 2025) | Gaze video (sequential) | CXR report generation, diagnosis | +24.6% (F1); passes domain LLMs |
| Navi2Gaze (Zhu et al., 12 Jul 2024) | VLM pose scoring, prompt | Navigation, object orientation | DTG −68.8%, SR +0.57 |

Editor's note: the table distinguishes GazeVLM approaches by their gaze-integration method and principal task domains.


GazeVLM research demonstrates that augmenting vision-language models with gaze, in all its forms, improves the selectivity, grounding, and anticipation of AI systems across vision, language, and action tasks. Continued progress in gaze signal acquisition, multimodal fusion, and efficient architecture design is expected to further enhance real-world applications in robotics, medicine, social AI, and human-computer interaction.
