
Gaze-Guided Streaming Video Understanding

Updated 2 December 2025
  • Recent work introduces gaze-guided streaming frameworks that integrate eye-tracking data with video streams to improve temporal reasoning and proactive intent prediction.
  • These systems employ architectures such as gaze-augmented transformers and graph-based models to selectively amplify human-attended regions and enhance multimodal alignment.
  • Demonstrated applications span AR/VR assistance, bandwidth optimization, and egocentric monitoring, while open challenges include gaze noise and real-time processing latency.

Gaze-guided streaming video understanding is a research area focused on leveraging human gaze signals to enhance and inform computational interpretation of temporally incoming visual data. By conditioning model inference on both the visual video stream and a synchronized stream of eye gaze data, these systems aim to approach or replicate the selective and proactive behavior characteristic of human attention and intention inference in dynamic tasks and environments. Recent work has positioned gaze guidance as a critical component for addressing the challenges of real-time temporal reasoning, proactive intent prediction, and multimodal alignment in a variety of domains from egocentric assistance to AR/VR and task automation (Lee et al., 1 Dec 2025, Peng et al., 9 Sep 2025, Lall et al., 3 Nov 2025).

1. Formal Definition and Problem Structure

Gaze-guided streaming video understanding extends classical video question answering (QA) by requiring the model $\mathcal{M}$ to condition on a causal window of both video frames $\mathcal{V} = \{v_t\}_{t=1}^T$ and gaze coordinates $\mathcal{G} = \{(x_t, y_t)\}_{t=1}^T$. At any query time $t_q$, the model is tasked with one of the following:

  • Past Reasoning: Use the visual–gaze history $\mathcal{V}_{[1,t_q]}, \mathcal{G}_{[1,t_q]}$ to answer questions about completed events.
  • Present Reasoning: Focus on a fixed, recent window $(t_q - \omega, t_q]$, describing the current perceptual state.
  • Proactive Prediction: Anticipate post-$t_q$ events or user intentions using only past and current observations.

These tasks are formulated as

$$A = \mathcal{M}\big(Q;\ \mathcal{V}_W,\ \mathcal{G}_W\big)$$

where $W$ denotes the relevant window depending on the task (Lee et al., 1 Dec 2025). This structure enables evaluation of a model’s capacity for causal temporal reasoning, short-term contextualization, and forward-looking intention modeling, all under real-time constraints.
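To make the windowing concrete, the following is a minimal sketch of how the three task windows can be selected before invoking a model; the `Task` names, the window length `omega`, and the `answer_query` helper are illustrative assumptions and not the StreamGaze API (the model is treated as a plain callable).

```python
from enum import Enum

class Task(Enum):
    PAST = "past_reasoning"        # reason over V_[1, t_q], G_[1, t_q]
    PRESENT = "present_reasoning"  # reason over the recent window (t_q - omega, t_q]
    PROACTIVE = "proactive"        # anticipate events after t_q from causal observations only

def select_window(task: Task, t_q: int, omega: int) -> tuple[int, int]:
    """Return the 1-based, inclusive frame range of the window W for a query at t_q.

    Nothing after t_q is ever exposed, which enforces the streaming (causal)
    constraint for all three task types.
    """
    if task is Task.PRESENT:
        return max(1, t_q - omega + 1), t_q
    return 1, t_q  # PAST and PROACTIVE both condition on the full history up to t_q

def answer_query(model, question, frames, gaze, task: Task, t_q: int, omega: int = 32):
    """A = M(Q; V_W, G_W): slice both synchronized streams with the same window."""
    start, end = select_window(task, t_q, omega)
    v_w = frames[start - 1:end]  # video frames in W
    g_w = gaze[start - 1:end]    # (x, y) gaze samples in W
    return model(question, v_w, g_w)
```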

2. Gaze–Video Data Acquisition and Alignment

Realizing gaze-guided streaming evaluation requires precise collection and alignment of multimodal data:

  • Eye-tracking hardware operates at high frequencies (up to 1,200 Hz for in-lab setups, e.g., with Tobii Spectrum; 60–144 Hz in AR/VR headsets) and provides frame-synchronized 2D image or 3D world gaze vectors (Ozdel et al., 10 Apr 2024, Lee et al., 1 Dec 2025, Hu et al., 2023).
  • Fixation extraction involves segmenting the raw gaze stream into intervals of stable attention, typically via spatial ($\ell_2$ distance threshold) and temporal (minimum duration) constraints, validated with scene-consistency checks (e.g., HSV histogram correlations) to reject spurious or in-motion gaze events (a dispersion-based sketch is given below).
  • Region-specific visual prompting is performed by defining a foveal region $\mathcal{R}_{i,t}^{\mathrm{fov}}$ centered at the fixation and extracting both foveal and peripheral crops per frame for object-level annotation and feature extraction (Lee et al., 1 Dec 2025).

Scanpaths—ordered sequences of fixated object regions—are constructed and paired with semantically grounded QA regarding both attended (foveal) and unattended (peripheral) content.
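A minimal dispersion-threshold sketch of the fixation-extraction step described above is shown below; the pixel and duration thresholds are illustrative assumptions, and the scene-consistency check is only noted in a comment rather than implemented.

```python
import numpy as np

def extract_fixations(gaze_xy, timestamps, dispersion_px=40.0, min_duration_s=0.10):
    """Dispersion-threshold fixation extraction from a raw gaze stream.

    A window of samples is kept as a fixation when every sample stays within an
    l2 radius of the running centroid (spatial constraint) and the window lasts at
    least `min_duration_s` (temporal constraint). A production pipeline would also
    reject candidates with low scene consistency (e.g., HSV histogram correlation
    between frames) to discard in-motion gaze; that check is omitted here.
    """
    fixations, start, n = [], 0, len(gaze_xy)

    def flush(lo, hi):  # emit samples lo..hi-1 as a fixation if long enough
        if hi - lo >= 2 and timestamps[hi - 1] - timestamps[lo] >= min_duration_s:
            pts = np.asarray(gaze_xy[lo:hi], dtype=float)
            fixations.append({"t_start": timestamps[lo], "t_end": timestamps[hi - 1],
                              "center": tuple(pts.mean(axis=0))})

    for end in range(1, n + 1):
        window = np.asarray(gaze_xy[start:end], dtype=float)
        if np.linalg.norm(window - window.mean(axis=0), axis=1).max() > dispersion_px:
            flush(start, end - 1)   # the last sample broke dispersion; exclude it
            start = end - 1         # restart the window at that sample
    flush(start, n)                 # trailing fixation that never broke dispersion
    return fixations
```

Ordering the resulting fixation centers in time and mapping each to the object region it lands on yields the scanpath used for QA pairing.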

3. Algorithmic Approaches and Model Architectures

Multiple architectures have arisen for leveraging gaze in streaming settings:

  • Gaze-augmented transformers: Visual patch embeddings are modulated by gaze via explicit bias terms in the self-attention computation, as in "Eyes on Target," which injects normalized gaze, depth, direction, and pupil features into multi-head attention, selectively amplifying weights for human-attended regions (Lall et al., 3 Nov 2025); see the attention-bias sketch after this list.
  • Graph-based reasoning: The Gaze-Guided Action Anticipation framework constructs a sparse visual-semantic graph where nodes represent distinct, gaze-fixated scene patches, and temporal/semantic edge attributes encode fixation transitions. Graph Neural Networks (GNNs) with edge-conditioned convolutions propagate context and support intent-conditioned action sequence decoding (Ozdel et al., 10 Apr 2024).
  • Spatio-temporal attention: I3D-based video models are augmented with 3D attention modules trained to mimic human gaze heatmaps, dynamically reweighting feature volumes and empirically improving fine-grained activity recognition, as shown for surgical video (Awale et al., 2022).
  • Foveated MLLM pipelines: Gaze-centered cropping strategies (e.g., 448×448 crops around the gaze point, ≈10% of the original pixels) provide efficient input representations, enabling multimodal LLMs to match or surpass conventional full-frame baselines on comprehension with a substantial reduction in token count and memory (Rekimoto, 31 Mar 2025); a cropping sketch follows the summary table below.
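To illustrate the gaze-bias mechanism in the first bullet, the following is a minimal single-head sketch in which an additive, gaze-distance-derived bias is added to the attention logits before the softmax. The Gaussian bias form, the `sigma` value, and the normalized coordinates are assumptions for illustration; they do not reproduce the exact "Eyes on Target" formulation, which also fuses depth, direction, and pupil features.

```python
import torch
import torch.nn.functional as F

def gaze_biased_attention(q, k, v, patch_centers, gaze_xy, sigma=0.1):
    """Single-head self-attention with an additive gaze bias on the logits.

    q, k, v        : (P, d) query/key/value projections of the P patch embeddings
    patch_centers  : (P, 2) normalized (x, y) centers of the visual patches
    gaze_xy        : (2,) normalized gaze coordinate for the current frame
    sigma          : assumed spread of the gaze prior (illustrative hyperparameter)

    Patches near the gaze point receive a larger additive bias, so attention on
    human-attended regions is selectively amplified.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-1, -2) / d ** 0.5        # standard scaled dot product, (P, P)
    dist2 = ((patch_centers - gaze_xy) ** 2).sum(-1)   # squared distance of each patch to gaze
    gaze_bias = torch.exp(-dist2 / (2 * sigma ** 2))   # Gaussian prior centered on the gaze point
    logits = logits + gaze_bias.unsqueeze(0)           # bias every query's logits toward gazed keys
    return F.softmax(logits, dim=-1) @ v

# Example: 196 patches (14x14 grid) with the gaze at the frame center.
P, d = 196, 64
out = gaze_biased_attention(torch.randn(P, d), torch.randn(P, d), torch.randn(P, d),
                            patch_centers=torch.rand(P, 2),
                            gaze_xy=torch.tensor([0.5, 0.5]))
```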

A table summarizing dominant approach classes:

| Paper/model | Architectural backbone | Gaze integration mechanism |
|---|---|---|
| StreamGaze (Lee et al., 1 Dec 2025) | InternVL-3.5, Qwen2.5-VL | Region-based visual prompts, salience maps for QA grounding |
| Eyes on Target (Lall et al., 3 Nov 2025) | DETR (ResNet + ViT) | Explicit gaze bias in attention, depth/direction fusion |
| Gaze-Guided Action Anticipation (Ozdel et al., 10 Apr 2024) | CLIP + ECC-GNN | Foveal crops as graph nodes, object semantics and message passing |
| GazeLLM (Rekimoto, 31 Mar 2025) | Gemini 1.5 Pro (ViT) | Gaze-driven foveal cropping, patch tokenization |
| Surgical Activity (Awale et al., 2022) | I3D (Inflated 3D CNN) | Gaze-supervised 3D attention map, feature reweighting |
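The foveated-cropping strategy noted in the last bullet above can be sketched in a few lines; the clamping behavior and the 1080p example are assumptions made here for illustration, not details taken from GazeLLM.

```python
import numpy as np

def foveal_crop(frame, gaze_xy, crop=448):
    """Crop a fixed-size window centered on the gaze point, clamped to the frame.

    frame   : (H, W, C) uint8 image
    gaze_xy : (x, y) gaze coordinate in pixels
    crop    : side length of the foveal window (448 px in the GazeLLM setting)
    """
    h, w = frame.shape[:2]
    x = int(np.clip(gaze_xy[0] - crop // 2, 0, max(0, w - crop)))
    y = int(np.clip(gaze_xy[1] - crop // 2, 0, max(0, h - crop)))
    return frame[y:y + crop, x:x + crop]

# For a 1080p frame, 448*448 / (1920*1080) ≈ 9.7% of the original pixels,
# which is where the token and memory savings come from.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
patch = foveal_crop(frame, gaze_xy=(960, 540))
assert patch.shape[:2] == (448, 448)
```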

4. Dataset Construction and Benchmark Design

Cutting-edge datasets for gaze-guided streaming video understanding are characterized by:

  • Rich spatio-temporal annotation: For example, StreamGaze provides 8,521 QA pairs over 285 egocentric videos (mean length 815 s), with object labels carefully aligned to fixations using a generation pipeline comprising projection, fixation extraction, region cropping, object detection (via an MLLM), and LLM-formulated task specification (Lee et al., 1 Dec 2025).
  • Multitask coverage: Tasks include temporal reasoning (object transition, new fixation inference), contextual recall, future action prediction, proactive alerting, and scene/background recall, with sample counts tracked across tasks for rigorous evaluation.
  • Zero-shot and cross-domain splits: A 70/10/20 training/validation/testing protocol is standard, supporting both raw generalization assessment and future fine-tuning experiments (a minimal split sketch follows this list).
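A minimal sketch of such a 70/10/20 split is shown below; splitting at the video level (rather than over individual QA pairs) is an assumption made here so that no frames from a held-out video leak into training, in the spirit of zero-shot and cross-domain evaluation.

```python
import random

def split_by_video(video_ids, ratios=(0.7, 0.1, 0.2), seed=0):
    """Deterministic 70/10/20 train/val/test split at the video level."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n_train, n_val = int(ratios[0] * len(ids)), int(ratios[1] * len(ids))
    return {"train": ids[:n_train],
            "val": ids[n_train:n_train + n_val],
            "test": ids[n_train + n_val:]}

# 285 egocentric videos, as in StreamGaze.
splits = split_by_video([f"video_{i:03d}" for i in range(285)])
```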

Human gaze data is also collected in VR settings, including volumetric video (Hu et al., 2023) and 360-degree streaming (Jin et al., 2022), enabling real-world streaming experiments across immersive and egocentric paradigms.

5. Evaluation Protocols and Baseline Comparisons

Evaluation in this domain is nuanced, centering on a suite of metrics:

  • Accuracy for multiple-choice QA (past/present tasks), computed as the fraction of correctly predicted responses over $N$ samples (a minimal implementation of these metrics follows this list).
  • Precision and recall for proactive alerts, measuring correct decision rates under class imbalance.
  • Error decomposition into false positives/negatives to dissect performance in temporal event detection, intention anticipation, and fixation-driven prediction (Lee et al., 1 Dec 2025, Peng et al., 9 Sep 2025).
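These metrics are standard; the reference implementation below is a minimal sketch with illustrative function names, not code from any of the cited benchmarks.

```python
def accuracy(preds, targets):
    """Fraction of correctly answered multiple-choice questions over N samples."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

def precision_recall(alerts, targets):
    """Precision and recall for binary proactive-alert decisions.

    The tp/fp/fn counts are also the quantities used for the false-positive /
    false-negative error decomposition described above.
    """
    tp = sum(a and t for a, t in zip(alerts, targets))
    fp = sum(a and not t for a, t in zip(alerts, targets))
    fn = sum(t and not a for a, t in zip(alerts, targets))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```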

Recent benchmarks demonstrate a persistent 30–40 point gap between state-of-the-art models (e.g., GPT-4o/InternVL-3.5/Qwen2.5-VL) and human performance, particularly pronounced in temporal and proactive understanding. Salience map aggregation strategies provide incremental improvements but do not surpass gaze-agnostic baselines, suggesting that current MLLM prompt and architecture choices only partially capture human attentional structure. Fine-tuned models with explicit multimodal adapters offer improved adaptation to gaze signals, especially when spatial and temporal cues are fused (Peng et al., 9 Sep 2025).

6. Applications, Extensions, and System-Level Impact

Gaze-guided video understanding underpins multiple application domains:

  • Intent-aware AR/VR agents: StreamGaze and EgoGazeVQA demonstrate scenarios in which gaze signals allow proactive retrieval and guidance by real-world AR assistants, including real-time alerting and context-specific Q/A (Lee et al., 1 Dec 2025, Peng et al., 9 Sep 2025).
  • Bandwidth-optimized streaming: By concentrating bitrate around predicted user gaze in tile-based 360° and volumetric video, systems achieve substantial bandwidth savings (20–40%) and significant reductions in stall rates while preserving or improving perceived QoE (Hu et al., 2023, Jin et al., 2022); a toy tile-allocation sketch follows this list.
  • Egocentric activity monitoring: Eyes on Target leverages gaze to improve online object detection and classification in wearable scenarios (e.g., assessment in simulated environments), introducing gaze-aware attention head importance metrics to interpret model behavior (Lall et al., 3 Nov 2025). Foveated MLLMs (GazeLLM) process only gaze-predicted patches, reducing both pixel count and transformer token count without accuracy loss (Rekimoto, 31 Mar 2025).
  • Semantic 3D mapping: The “Embodied Semantic Fovea” system fuses gaze projections with real-time mapping for object instance tracking in unconstrained, egocentric tasks (Li et al., 2018).
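As a toy illustration of the foveated tile-streaming idea in the second bullet, the following distributes a bitrate budget across tiles with a Gaussian weight around the predicted gaze point; the weighting scheme and all parameter values are assumptions for illustration, not the adaptation policy of the cited systems.

```python
import math

def allocate_tile_bitrates(tile_centers, predicted_gaze, budget_kbps,
                           base_share=0.25, sigma=0.2):
    """Toy foveated bitrate allocation for tile-based 360-degree streaming.

    Every tile gets a small base share of the budget; the remainder is distributed
    in proportion to a Gaussian weight centered on the predicted gaze/viewport
    point, so tiles the user is likely to look at are encoded at higher quality.
    """
    weights = [math.exp(-((cx - predicted_gaze[0]) ** 2 + (cy - predicted_gaze[1]) ** 2)
                        / (2 * sigma ** 2))
               for cx, cy in tile_centers]
    total_w = sum(weights) or 1.0
    base = base_share * budget_kbps / len(tile_centers)
    fovea_pool = (1.0 - base_share) * budget_kbps
    return [base + fovea_pool * w / total_w for w in weights]

# Example: a 4x4 tiling of an equirectangular frame, gaze predicted near the upper left.
tiles = [((i + 0.5) / 4, (j + 0.5) / 4) for j in range(4) for i in range(4)]
rates = allocate_tile_bitrates(tiles, predicted_gaze=(0.2, 0.3), budget_kbps=8000)
```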

7. Open Challenges and Future Research Directions

The field continues to face several limitations and open problems:

  • Temporal memory and proactive inference: Current MLLMs and streaming models lack robust architectures for long-horizon, gaze-conditioned evidence accumulation and struggle to anticipate intent from early peripheral cues (Lee et al., 1 Dec 2025).
  • Model–human gap: Even with explicit gaze supervision and advanced prompting, performance in causal and proactive understanding remains substantially below human levels, with specific deficits in sequence modeling and intention decoding.
  • Integration of multimodal attention: Advances require explicit gaze–visual–textual chain-of-thought, adaptive prompting strategies, and robust fusion at both token and feature levels (Peng et al., 9 Sep 2025, Rekimoto, 31 Mar 2025).
  • Robustness to gaze noise and latency: Spatial reasoning with noisy gaze fails to provide benefit, and salience aggregation introduces expensive latency-accuracy tradeoffs (Peng et al., 9 Sep 2025).
  • Privacy and compute: Persistent gaze logging and real-time inference bring privacy, security, and edge-deployment challenges.

Further progress is expected from architectures combining explicit gaze–attention modules, task-adaptive sub-policies, and scalable multimodal memory systems, facilitating human-aligned, intention-aware streaming agents (Lee et al., 1 Dec 2025). Public releases of datasets and codebases (e.g., StreamGaze at https://github.com/daeunni/StreamGaze) aim to accelerate systematic study and deployment.
