GazeVLM: Gaze-Augmented Vision-Language Models
- GazeVLM is a methodology that integrates human or simulated gaze signals into vision-language models to provide enhanced grounding and spatial reasoning.
- It employs techniques such as gaze-conditioned fusion, regularized attention, and gaze-guided cropping to improve semantic prediction and reduce computational load.
- The integration leads to robust empirical gains in domains like robotics, medical imaging, and AR/VR by effectively anchoring visual attention and mitigating ambiguities.
GazeVLM encompasses a class of methodologies and systems that integrate eye-gaze and visual-attention cues (human, machine-simulated, or derived from a vision-language model's own attention dynamics) into the architecture and training of vision-language models (VLMs) for active perception, multimodal understanding, and embodied AI. These approaches leverage gaze as a privileged signal for grounding, anticipation, hallucination mitigation, and efficient encoding, going beyond traditional bottom-up feature extraction. Distinct paradigms include third-person gaze understanding, egocentric activity analysis, medical imaging integration, robotics navigation, and user-guided perception. GazeVLM systems employ diverse strategies such as gaze-conditioned fusion, gaze-regularized attention, gaze-informed spatial cropping, and gaze-shift-derived saliency computation. Built on transformer VLM backbones (BLIP-2, Qwen2-VL, OpenFlamingo, etc.), these architectures demonstrate robust empirical gains in semantic prediction, interpretability, and efficiency across static imagery, video, and interactive scenarios.
1. Foundations and Problem Taxonomy
GazeVLM originated from independent lines of research seeking to unify person detection, gaze localization, and object-of-attention identification within VLMs (Mathew et al., 9 Nov 2025). The core motivation is that gaze conveys high-value information: in third-person settings, it reveals "who looks where at what," while in egocentric contexts, it exposes intent and future actions (Pani et al., 24 Oct 2025). Conventional vision-only pipelines struggle with occlusion, crop misalignments, and brittle semantic grounding, whereas gaze augments spatial reasoning with cognitive priors. Tasks addressed by GazeVLM systems include:
| Setting | Task Set | Signal Source |
|---|---|---|
| Third-person | Person detection, gaze-point regression, gaze-object identification | Image, depth (HHA maps), text prompt |
| Egocentric | Future event anticipation, activity parsing | Eye-tracking heatmaps |
| Robotics | Navigation pose selection, orientation | VLM zero-shot scoring |
| Medical (CXR) | Report generation, diagnosis | Radiologist gaze video |
| Human-AI UI | QA grounding, object referencing | AR/VR gaze |
The taxonomic feature uniting these systems is their explicit or implicit use of gaze-derived spatial priors to guide model attention, output selection, or efficient representation.
2. Model Architectures and Gaze Integration Techniques
GazeVLM architectures span multiple integration paradigms:
(a) Cross-Modal Fusion with Gaze Signals
GazeVLM for multi-task gaze understanding (Mathew et al., 9 Nov 2025), built on Qwen2-VL-2B, fuses RGB and HHA depth features from a frozen vision encoder and concatenates the outputs for cross-attention in the text decoder. All tasks (detection, localization, identification) are posed as sequence generation conditioned on a textual prompt, with structured outputs such as <box_start>…<box_end> spans for bounding boxes.
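A minimal sketch of this serialization, assuming the <box_start>/<box_end> delimiters above and a hypothetical 0-1000 integer coordinate normalization (not the paper's exact specification):

```python
# Sketch: posing detection/localization as text generation with box tokens.
# The coordinate normalization and helper names are illustrative assumptions.

def box_to_text(box, img_w, img_h):
    """Serialize a bounding box (x1, y1, x2, y2) in pixels into a token span."""
    x1, y1, x2, y2 = box
    coords = [
        round(1000 * x1 / img_w), round(1000 * y1 / img_h),
        round(1000 * x2 / img_w), round(1000 * y2 / img_h),
    ]
    return "<box_start>" + ",".join(str(c) for c in coords) + "<box_end>"

def build_sample(task_prompt, boxes, img_w, img_h):
    """Pair a task prompt with its target sequence for supervised fine-tuning."""
    target = " ".join(box_to_text(b, img_w, img_h) for b in boxes)
    return {"prompt": task_prompt, "target": target}

# Example: a person-detection query with one ground-truth box.
sample = build_sample("Detect all people in the image.",
                      boxes=[(120, 80, 340, 560)], img_w=640, img_h=640)
```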
(b) Gaze-Regularized Attention Blocks
Egocentric GazeVLM (Pani et al., 24 Oct 2025) and Voila-A (Yan et al., 2023) insert gaze into the attention mechanism via heatmap-derived per-patch priors. In training, VLM attention maps are regularized against human gaze distributions using KL-divergence penalties, increasing attention–gaze overlap from ~42% (baseline) to ~68%.
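A minimal PyTorch sketch of such a gaze-regularization term, assuming patch-level attention weights and a gaze heatmap pooled onto the same patch grid; the KL direction and loss weighting are assumptions:

```python
import torch

def gaze_attention_loss(attn, gaze, eps=1e-8):
    """KL(gaze || attention) over image patches.

    attn: (batch, num_patches) model attention weights (rows sum to 1)
    gaze: (batch, num_patches) gaze heatmap pooled to the patch grid (rows sum to 1)
    """
    attn = attn.clamp_min(eps)
    gaze = gaze.clamp_min(eps)
    return (gaze * (gaze.log() - attn.log())).sum(dim=-1).mean()

def total_loss(task_loss, attn, gaze, lam=0.1):
    """Combine the task objective with the gaze-regularization penalty."""
    return task_loss + lam * gaze_attention_loss(attn, gaze)
```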
(c) Gaze-Guided Cropping and Feature Fusion
GazeLLM (Rekimoto, 31 Mar 2025) proposes region decomposition: high-resolution crops around the foveal gaze point (≈10% of the pixels) and a downsampled periphery are encoded separately, then fused by a weighted sum for efficient LLM input. This preserves comprehension comparable to full-resolution input while reducing memory by up to 8×.
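A minimal sketch of the decomposition and fusion steps, assuming a single gaze point per frame; the crop/downsample sizes and the fusion weight are illustrative assumptions:

```python
from PIL import Image

def gaze_decompose(img: Image.Image, gaze_xy, fovea=224, periphery=(112, 112)):
    """Full-resolution crop around the gaze point plus a coarse peripheral view.
    Crop and downsample sizes here are illustrative, not the paper's settings."""
    x, y = gaze_xy
    left = max(0, min(img.width - fovea, int(x - fovea / 2)))
    top = max(0, min(img.height - fovea, int(y - fovea / 2)))
    foveal = img.crop((left, top, left + fovea, top + fovea))
    peripheral = img.resize(periphery)
    return foveal, peripheral

def fuse(f_fovea, f_periphery, alpha=0.7):
    """Weighted sum of the separately encoded streams (alpha is an assumption)."""
    return alpha * f_fovea + (1.0 - alpha) * f_periphery
```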
(d) Gaze Shift–Based Saliency and Hallucination Mitigation
GIFT (Qi et al., 24 Oct 2025) computes "gaze shifts" as positive changes in visual attention over salient query tokens, producing a normalized saliency map that drives both visual and query attention amplification during decoding. This approach reduces caption-level hallucination by up to 20.7% (CHAIR) while maintaining general accuracy.
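A minimal sketch of this saliency computation, assuming per-patch attention vectors from consecutive decoding steps; the amplification rule and the gamma factor are assumptions:

```python
import torch

def gaze_shift_saliency(attn_prev, attn_curr, eps=1e-8):
    """Positive change in visual attention between consecutive decoding steps,
    normalized into a saliency map (an illustrative reading of 'gaze shift')."""
    shift = (attn_curr - attn_prev).clamp_min(0.0)
    return shift / (shift.sum(dim=-1, keepdim=True) + eps)

def amplify(attn, saliency, gamma=1.0):
    """Re-weight attention toward salient positions and renormalize."""
    boosted = attn * (1.0 + gamma * saliency)
    return boosted / boosted.sum(dim=-1, keepdim=True)
```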
(e) Sequential Gaze Video Representation
RadEyeVideo (Kim et al., 12 Jul 2025) embeds both the spatial and temporal dynamics of expert gaze in medical imaging. Eye-fixation sequences are rendered as video overlays, encoded via patch or spatio-temporal transformer backbones, and then fused with image features by gated concatenation for report generation and diagnosis.
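A minimal PyTorch sketch of gated concatenation between image features and gaze-video features; the feature dimensions and gating form are assumptions:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of image features with gaze-video features (sketch only)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_gaze = nn.Linear(dim, dim)

    def forward(self, img_feat, gaze_feat):
        # Gate computed from the concatenated streams, then used to blend them.
        g = torch.sigmoid(self.gate(torch.cat([img_feat, gaze_feat], dim=-1)))
        return g * self.proj_img(img_feat) + (1.0 - g) * self.proj_gaze(gaze_feat)
```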
3. Mathematical Formalizations and Algorithms
Formalisms are tailored to the gaze source and model type; hedged illustrative forms are sketched after the list:
- Candidate pose sets scored for robotic VLM navigation (Zhu et al., 12 Jul 2024).
- Gaze heatmaps constructed from fixation points (Rekimoto, 31 Mar 2025).
- Patch-wise attention–gaze alignment penalties (Pani et al., 24 Oct 2025).
- Gaze-shift saliency derived from changes in visual attention (Qi et al., 24 Oct 2025).
- Object-level AP, computed as the area under the precision–recall curve over LVIS classes (Mathew et al., 9 Nov 2025).
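Since the original equations are not reproduced here, the following are hedged illustrative forms consistent with the descriptions above; the symbols and normalizations are assumptions, not the papers' notation:

```latex
% Illustrative forms only, reconstructed from the textual descriptions above.
\begin{align}
G(p) &= \sum_{k} \exp\!\left(-\frac{\lVert p - g_k \rVert^2}{2\sigma^2}\right)
  && \text{gaze heatmap from fixations } g_k \\
\mathcal{L}_{\text{align}} &= \mathrm{KL}\!\left(G \,\Vert\, A\right)
  && \text{patch-wise attention--gaze alignment} \\
s_i &= \frac{\max\!\left(0,\, a_i^{(t)} - a_i^{(t-1)}\right)}
            {\sum_j \max\!\left(0,\, a_j^{(t)} - a_j^{(t-1)}\right)}
  && \text{gaze-shift saliency over patches } i
\end{align}
```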
Algorithmic implementations range from two-stage candidate scoring with VLM prompts (Zhu et al., 12 Jul 2024) and early fusion of VLM-extracted cues into transformer person tokens (Gupta et al., 6 Jun 2024) to gated multimodal concatenation after separate vision/video encoders (Kim et al., 12 Jul 2025).
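A minimal sketch of the two-stage candidate-scoring pattern, where `render_view` and `query_vlm` are hypothetical helpers rather than an API from the cited work:

```python
# Sketch: score candidate poses with a VLM prompt, then pick the best one.

def select_pose(candidate_poses, goal_text, render_view, query_vlm):
    """Stage 1: render each candidate pose; Stage 2: ask the VLM how well the
    rendered view satisfies the textual goal, then return the top-scoring pose."""
    scored = []
    for pose in candidate_poses:
        view = render_view(pose)
        prompt = (f"On a scale of 0-10, how well does this view satisfy the "
                  f"instruction: '{goal_text}'? Answer with a single number.")
        score = float(query_vlm(view, prompt))
        scored.append((score, pose))
    return max(scored, key=lambda sp: sp[0])[1]
```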
4. Empirical Evaluation and Key Results
GazeVLM methods exhibit robust quantitative gains across domains:
| Domain | Metric | Baseline | GazeVLM | Relative Gain |
|---|---|---|---|---|
| Robotics | DTG (m) | 1.79 | 0.56 | -68.8% |
| GazeFollow | AUC | 0.928 | 0.929 | +0.1% |
| GazeFollow | AvgDist | 0.122 | 0.131 | Small trade-off |
| VQA | Hallucination | 50.2% | 39.8% | -20.7% |
| Medical | CheXbert F1 (baseline normalized to 100) | 100 | 124 | +24.6% |
| Egocentric | Semantic Similarity | 0.6525 | 0.7505 | +15% |
Ablations reveal that RGB+HHA fusion outperforms naive depth inclusion (Mathew et al., 9 Nov 2025), and heatmap-based gaze tokens yield superior grounding versus discrete or bounding-box encodings (Yan et al., 2023). Visual prompts with the full image and an annotated ellipse further improve cue extraction in multi-person gaze following (Gupta et al., 6 Jun 2024). Latency and computational overhead are modest for GIFT (a 13% increase over greedy decoding; Qi et al., 24 Oct 2025), and gaze-crop pipelines offer an order-of-magnitude reduction in token count with preserved or improved coverage (Rekimoto, 31 Mar 2025).
5. Applications, Limitations, and Future Directions
GazeVLM underpins diverse applications:
- Active visual navigation and manipulation (robotics, scanning) by scoring candidate camera poses for accessibility and orientation (Zhu et al., 12 Jul 2024).
- Semantic event anticipation, wearable intelligence, and collaboration, leveraging human gaze for predictive cues in egocentric video (Pani et al., 24 Oct 2025, Rekimoto, 31 Mar 2025).
- Medical report generation and diagnosis via radiologist eye-movement video (Kim et al., 12 Jul 2025).
- Real-world QA and object-referencing in AR/VR, aligning model output with user gaze (Yan et al., 2023).
- Hallucination mitigation in open-domain VLMs using gaze-shift-inspired saliency (Qi et al., 24 Oct 2025).
Limitations include reliance on high-quality gaze data (misalignment degrades attention regularization; Pani et al., 24 Oct 2025), the inability to exploit temporal continuity in static-only frameworks (Mathew et al., 9 Nov 2025), susceptibility to prompt-format dependencies, and, occasionally, increased latency or memory footprint. Failure modes noted in counting, high-detail OCR, and certain ambiguous spatial queries remain open challenges.
Anticipated research directions encompass video-level gaze tracking, attention-aware multimodal fusion (e.g., deformable patches, multi-resolution encoding), multi-modal expansion (voice, gestures), efficient model distillation for real-time operation, and self-supervised learning on large unlabelled corpora.
6. Interpretations and Open Questions
GazeVLM systems suggest a unifying abstraction: gaze—whether human, machine-predicted, or model-internal—serves as a proxy for cognitive attention, enhancing selective fusion, semantic grounding, and anticipatory understanding. A plausible implication is that future VLMs will operate in continuous perception–action cycles, with gaze priors dynamically steering modality fusion and language output. The convergence of third-person and egocentric gaze pipelines remains an unresolved frontier, notably in multi-agent social interaction and embodied collaboration scenarios.
7. Comparative Table of Major GazeVLM Paradigms
| Name | Integration Method | Tasks | Key Results |
|---|---|---|---|
| GazeVLM (Mathew et al., 9 Nov 2025) | VLM fusion, RGB+HHA, prompts | Person/gaze/object detection | AUC 0.929 (GF), AP 0.23 |
| Gaze-VLM (Pani et al., 24 Oct 2025) | Gaze-regularized attention | Future event, activity prediction | +11%/7% semantic gain, C_I ↓ 32% |
| GazeLLM (Rekimoto, 31 Mar 2025) | Gaze-crop fusion | Activity parsing, instruction gen. | ≈ full-input performance, ~10× fewer tokens |
| GIFT (Qi et al., 24 Oct 2025) | Gaze-shift saliency | Hallucination mitigation | −20.7% CHAIR, +11.7% MMHal |
| Voila-A (Yan et al., 2023) | Gaze heatmap perceiver block | QA grounding, AR/VR agent | Highest helpfulness/grounding |
| RadEyeVideo (Kim et al., 12 Jul 2025) | Gaze video (sequential) | CXR report generation, diagnosis | +24.6% (F1); surpasses domain-specific LLMs |
| Navi2Gaze (Zhu et al., 12 Jul 2024) | VLM pose scoring, prompt | Navigation, object orientation | DTG -68.8%, SR +0.57 |
The grouping above is an editor's synthesis; the table distinguishes GazeVLM approaches by integration method and principal task domain.
GazeVLM research demonstrates that augmenting vision-language models with gaze, in all its forms, improves the selectivity, grounding, and anticipation of AI systems in vision, language, and action tasks. Continued progress in gaze-signal acquisition, multimodal fusion, and efficient architecture design is expected to further enhance real-world applications in robotics, medicine, social AI, and human-computer interaction.