
Gaze-Video QA Generation Pipeline

Updated 2 December 2025
  • Gaze-Video QA Generation Pipeline is a structured method leveraging synchronized video and gaze data to generate temporally aligned question–answer pairs.
  • The pipeline employs techniques like gaze preprocessing, ROI extraction, and multiple encoding strategies to guide robust multimodal QA synthesis.
  • Integration of human validation and fine-tuning methods ensures precise benchmarking and improved reasoning in vision–language models.

A Gaze-Video QA Generation Pipeline is a structured procedure that synthesizes question–answer pairs linked directly to visual content and temporally aligned gaze signals, typically for benchmarking or training multimodal vision–LLMs to reason about what, when, and why humans attend to particular elements in videos. The fundamental concept is to leverage synchronously recorded gaze trajectories—sometimes as raw coordinates, video renderings, or region-of-interest (ROI) annotations—and to use these as explicit, information-rich signals for guiding both QA content and reasoning context in downstream models.

1. Data Acquisition, Gaze Preprocessing, and Fixation Extraction

The process begins by synchronizing an egocentric video $\mathcal{V} = \{v_t\}_{t=1}^T$ with a co-recorded gaze signal $\mathcal{G}$, which may consist of 2D image-plane gaze points $(x_t, y_t)$, 3D gaze rays $(o_t, d_t)$, or higher-level oculomotor events (Peng et al., 9 Sep 2025, Lee et al., 1 Dec 2025).

  • Temporal Alignment: Gaze samples are temporally aligned to video frames via nearest neighbor interpolation or by leveraging hardware timestamps.
  • Coordinate Normalization: Gaze pixel coordinates $(\tilde{x}_t, \tilde{y}_t)$ on frames of size $W \times H$ are normalized as $x_t = \tilde{x}_t / W$, $y_t = \tilde{y}_t / H$.
  • Fixation Extraction: Sequences of gaze points are grouped into fixations $\mathcal{F} = \{f_i\}$ using spatial dispersion, minimum-duration filtering, and scene continuity checks (see the sketch after this list). A window $[t_i^s, t_i^e]$ qualifies as a fixation if:
    • Spatial: $d_t = \|(x_t, y_t) - (\bar{x}_i, \bar{y}_i)\|_2 \leq r_{\text{thresh}}$
    • Duration: $t_i^e - t_i^s \geq \tau_{\text{dur}}$
    • Scene Consistency: for every $t$ in the window, the minimum Pearson correlation $S_{\min}$ between the frame's color histogram and those of adjacent frames exceeds $\tau_{\text{scene}}$, typically 0.9 (Lee et al., 1 Dec 2025).
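
A minimal sketch of this dispersion-and-duration grouping is shown below; the threshold defaults are illustrative assumptions (not values from the cited papers), and the scene-consistency check is omitted for brevity.

```python
import numpy as np

def extract_fixations(gaze, fps=30.0, r_thresh=0.02, tau_dur=0.10):
    """Group normalized gaze samples (array of shape [T, 2]) into fixations.

    A candidate window is kept when every sample lies within r_thresh of the
    window centroid and the window spans at least tau_dur seconds.
    """
    fixations, start = [], 0
    for t in range(1, len(gaze)):
        window = gaze[start:t + 1]
        centroid = window.mean(axis=0)
        # Close the current window as soon as any sample drifts past r_thresh.
        if np.linalg.norm(window - centroid, axis=1).max() > r_thresh:
            if (t - start) / fps >= tau_dur:          # minimum-duration filter
                fixations.append({"t_start": start, "t_end": t - 1,
                                  "centroid": gaze[start:t].mean(axis=0)})
            start = t
    if (len(gaze) - start) / fps >= tau_dur:          # flush the trailing window
        fixations.append({"t_start": start, "t_end": len(gaze) - 1,
                          "centroid": gaze[start:].mean(axis=0)})
    return fixations
```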

2. Eye-Gaze–Video Representation Strategies

Various encoding strategies are employed to represent gaze-grounded cues alongside visual context:

  • Visual Marker Encoding: Overlaying a red dot or circle of fixed radius at each fixation point directly on the video frames. Duration weighting can be applied by allocating a frame count $F_i = t_i \times \text{fps}$ per fixation, yielding a sequence $V_{\text{full}} = \{v_1, \dots, v_{F_{\text{total}}}\}$ that is subsequently uniformly subsampled to fit the model's frame budget $k$ (Kim et al., 12 Jul 2025); a sketch of this variant follows the encoding table below.
  • Textual/Coordinate Embedding: Serializing the gaze trace as text for each frame, e.g., "Frame $i$: Gaze($x_i$, $y_i$)" (Peng et al., 9 Sep 2025).
  • Sequential Salience Map: Collapsing the gaze trace into a single heatmap in which recent fixations receive higher intensity, constructed by summing recency-weighted Gaussians and smoothing with a Gaussian blur (Peng et al., 9 Sep 2025); see the sketch immediately after this list.
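
A minimal sketch of the sequential salience map, assuming normalized fixation centroids and a geometric recency weighting; the decay factor and kernel width are illustrative choices rather than the exact parameters of the cited work.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_salience_map(fixations, height, width, sigma=15.0, decay=0.8):
    """Collapse an ordered fixation list into one recency-weighted heatmap.

    fixations: list of (x, y) centroids in normalized [0, 1] coordinates,
               ordered from oldest to most recent.
    decay:     geometric down-weighting of older fixations (assumed scheme).
    """
    heatmap = np.zeros((height, width), dtype=np.float32)
    n = len(fixations)
    for rank, (x, y) in enumerate(fixations):
        weight = decay ** (n - 1 - rank)      # most recent fixation gets weight 1
        col, row = int(x * (width - 1)), int(y * (height - 1))
        heatmap[row, col] += weight           # deposit a recency-weighted impulse
    # A single Gaussian blur turns the impulses into smooth, overlapping blobs.
    heatmap = gaussian_filter(heatmap, sigma=sigma)
    return heatmap / (heatmap.max() + 1e-8)   # normalize to [0, 1]
```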

In clinical imaging domains such as chest X-ray (CXR) interpretation, the static background image is held fixed and the gaze is rendered as a moving dot, offering an interpretable spatiotemporal map for grounding generation (Kim et al., 12 Jul 2025).

| Encoding Variant | Input Modality | Salient Features |
| --- | --- | --- |
| Visual Marker (GazeV) | Images + markers | Spatiotemporal, localized |
| Textual Coordinates (GazeT) | Image + text | Explicit serial representation |
| Salience Map (GazeS) | Heatmap + images | Temporal progression, recency |

This table summarizes the gaze encoding taxonomy that provides multimodal context for QA synthesis.
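
As a concrete illustration of the visual-marker variant (GazeV) summarized above, the sketch below overlays a fixed-radius red dot per fixation, allocates $F_i = t_i \times \text{fps}$ duplicated frames so longer fixations occupy more of the sequence, and uniformly subsamples to a frame budget $k$; the OpenCV drawing call, radius, and color are illustrative assumptions.

```python
import cv2
import numpy as np

def render_gaze_video(frames, fixations, fps=30.0, budget_k=32, radius=12):
    """Duration-weighted visual-marker encoding (GazeV-style sketch).

    frames:    list of HxWx3 uint8 frames.
    fixations: list of dicts with 't_start', 't_end', 'centroid' (normalized x, y).
    """
    h, w = frames[0].shape[:2]
    marked = []
    for fix in fixations:
        x, y = fix["centroid"]
        center = (int(x * (w - 1)), int(y * (h - 1)))
        frame = frames[fix["t_start"]].copy()
        cv2.circle(frame, center, radius, color=(0, 0, 255), thickness=-1)  # red dot (BGR)
        # Allocate F_i = t_i * fps copies so longer fixations occupy more frames.
        duration_s = (fix["t_end"] - fix["t_start"] + 1) / fps
        marked.extend([frame] * max(1, int(round(duration_s * fps))))
    # Uniformly subsample the duration-weighted sequence to the model's budget k.
    idx = np.linspace(0, len(marked) - 1, num=min(budget_k, len(marked))).astype(int)
    return [marked[i] for i in idx]
```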

3. Object-, Region-, and Scanpath-Driven Prompt Construction

Fixation-centric processing enables extraction of foveated (centered on fixation) and peripheral (outside fixation) ROIs per video frame. This commonly involves:

  • Region-specific object extraction: Each fixation $f_i$ defines a foveal region $\mathcal{R}_{i,t}^{\text{fov}} = \{(x, y) \mid \|(x, y) - (\bar{x}_i, \bar{y}_i)\|_2 \leq \tau_{\text{fov}}\}$ and an out-of-FOV region $\mathcal{R}_{i,t}^{\text{out}} = v_t \setminus \mathcal{R}_{i,t}^{\text{fov}}$.
  • Objects and attributes are extracted per region via prompting powerful multimodal LLMs (e.g., InternVL-3.5-38B) (Lee et al., 1 Dec 2025).
  • Scanpath Construction: The sequence of $(\mathcal{O}_i^{\text{fov}}, \mathcal{O}_i^{\text{out}})$ pairs is temporally ordered, providing a structural representation of perceptual attention throughout the clip (see the sketch after this list).
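
A minimal sketch of the foveal/peripheral split and scanpath assembly under the definitions above; query_objects is a hypothetical wrapper around an MLLM object-extraction prompt, and masking by zeroing out pixels is an illustrative simplification.

```python
import numpy as np

def split_regions(frame, centroid, tau_fov=0.15):
    """Split a frame into a foveal region (inside a circle of radius tau_fov
    around the fixation centroid, in normalized units) and its complement."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = centroid[0] * (w - 1), centroid[1] * (h - 1)
    dist = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2) / max(h, w)
    fov_mask = dist <= tau_fov
    foveal = np.where(fov_mask[..., None], frame, 0)      # zero out the periphery
    peripheral = np.where(fov_mask[..., None], 0, frame)  # zero out the fovea
    return foveal, peripheral

def build_scanpath(frames, fixations, query_objects):
    """Temporally ordered (foveal objects, out-of-FOV objects) pairs per fixation.
    query_objects(image) is assumed to wrap an MLLM object-extraction prompt."""
    scanpath = []
    for fix in fixations:
        frame = frames[fix["t_start"]]
        foveal, peripheral = split_regions(frame, fix["centroid"])
        scanpath.append((query_objects(foveal), query_objects(peripheral)))
    return scanpath
```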

Prompt templates explicitly incorporate gaze either as visual overlays, textual coordinates, or as separate heatmap information, often specifying task context (“Generate a spatial/temporal/causal multiple-choice question…”), and instruct the model to generate QAs grounded in the provided gaze trajectory and visual scene (Peng et al., 9 Sep 2025).
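
Prompt assembly can be sketched as simple string templating; the wording, task labels, and coordinate serialization below are illustrative rather than the exact templates of the cited papers.

```python
def construct_prompt(scanpath, gaze_coords, task_type="spatial"):
    """Assemble a text prompt that grounds QA generation in the gaze trajectory.
    gaze_coords: list of (frame_index, x, y) in normalized coordinates."""
    gaze_text = "\n".join(f"Frame {t}: Gaze({x:.3f}, {y:.3f})" for t, x, y in gaze_coords)
    fixated = ", ".join(obj for fov, _ in scanpath for obj in fov)
    return (
        f"Generate a {task_type} multiple-choice question about this video.\n"
        f"The wearer's gaze trajectory is:\n{gaze_text}\n"
        f"Objects fixated in order: {fixated}\n"
        "Ground the question and its answer in what the wearer actually attended to, "
        "and provide one correct answer plus three plausible distractors."
    )
```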

4. QA Pair Synthesis and Human Verification

The pipeline iteratively generates QAs from scanpaths and region-object pools using large vision–LLMs (LVLMs or MLLMs):

  • Initial Generation: The prompt, including video frames (possibly with gaze markers), gaze tracks/heatmaps/text, and scenario/task context, is submitted to a model such as Qwen2.5-VL, InternVL, or LLaVA-OneVision (Kim et al., 12 Jul 2025, Lee et al., 1 Dec 2025).
  • Task Taxonomy:
    • Past: Queries over the historical trajectory (e.g., which objects have not yet been fixated, predicting the next object, sequence order).
    • Present: Queries about the current fixation (object identity, attribute recognition).
    • Proactive: Future prediction tasks (e.g., gaze-triggered alerts, object appearance alerts) (Lee et al., 1 Dec 2025).
  • Distractor Sampling: Distractor answer options are sampled from non-fixated or peripheral object pools, and ambiguous options are filtered using additional LLM-based ranking (a sketch follows this list).
  • Human-in-the-loop validation: Each QA is manually vetted for relevance, answerability, correctness, fluency, conciseness, and appropriate difficulty by multiple annotators (Peng et al., 9 Sep 2025, Lee et al., 1 Dec 2025). Failure rates are tracked, with inaccuracy, irrelevance, and unanswerability being the dominant error types.
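
A minimal sketch of distractor sampling and multiple-choice assembly, reusing the scanpath object pools from Section 3; the pool construction, option count, and the optional rank_ambiguity filter (a hypothetical LLM-based check) are illustrative.

```python
import random

def sample_distractors(answer, scanpath, k=3, rank_ambiguity=None):
    """Draw k distractors from objects the wearer did NOT fixate (out-of-FOV pools),
    excluding the gold answer; optionally drop options flagged as ambiguous."""
    pool = {obj for _, out in scanpath for obj in out}
    pool -= {obj for fov, _ in scanpath for obj in fov}   # keep only non-fixated objects
    pool.discard(answer)
    candidates = list(pool)
    if rank_ambiguity is not None:                        # hypothetical LLM-based filter
        candidates = [c for c in candidates if not rank_ambiguity(answer, c)]
    return random.sample(candidates, min(k, len(candidates)))

def assemble_mcq(question, answer, distractors):
    """Shuffle the gold answer and distractors into lettered options."""
    options = [answer] + distractors
    random.shuffle(options)
    letters = "ABCD"[: len(options)]
    return {"question": question,
            "options": dict(zip(letters, options)),
            "answer": letters[options.index(answer)]}
```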

5. Model Training, Prompting, and Fine-Tuning

Model adaptation to the gaze-video QA setting varies by implementation:

  • In-context learning/Zero-shot prompting: Models receive as input the completed gaze-video sequence plus textual context/exemplars and generate QAs via standard autoregressive decoding, without further parameter updates (Kim et al., 12 Jul 2025).
  • LoRA-based fine-tuning: Low-rank adaptation modules (e.g., rank 16 or rank 128) are inserted into large MLLMs while all other weights remain frozen (a configuration sketch follows this list). Substantial gains in spatial and temporal QA accuracy are observed after LoRA fine-tuning, confirming the utility of gaze as a supervision signal (Peng et al., 9 Sep 2025, Lee et al., 1 Dec 2025).
  • None of the surveyed pipelines introduces new loss functions for QA beyond the standard autoregressive decoder loss, except where multiple QAs are generated and ranked by negative log-likelihood for answer-set selection.
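
A minimal LoRA configuration sketch using the Hugging Face PEFT library; the checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the settings reported in the cited papers.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base checkpoint; the cited works use MLLMs such as Qwen2.5-VL.
base = AutoModelForCausalLM.from_pretrained("example-org/example-vlm")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank dimension (rank 16 or 128 in the text)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,                     # assumed
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)     # all non-LoRA weights stay frozen
model.print_trainable_parameters()         # only the low-rank adapters are trainable
```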

6. Evaluation Protocols and Benchmarks

Performance is measured through scenario- and task-specific metrics:

  • Accuracy: Multiple-choice exact match for spatial, temporal, and causal QAs; fuzzy matching is applied for synonyms (Lee et al., 1 Dec 2025).
  • Scaled efficacy metrics: In clinical QA/report generation, scores (e.g., CheXbert micro-F1@5, RadGraph-XL F1, RaTEScore) are normalized to a reference baseline such as CheXagent via $\hat{S}_{m,i} = (S_{m,i} / S_{m,\text{reference}}) \times 100$ (Kim et al., 12 Jul 2025); a sketch follows this list. Gains of up to +54.6% over baselines are observed for gaze-video-augmented models.
  • Gaze estimation impact: The relationship between gaze accuracy (mean squared error to ground truth) and the resulting change in QA accuracy $\Delta_{\text{acc}}$ is quantified.
  • Human upper bound: Annotators’ QA performance is measured for reference (e.g., 83.8% in EgoGazeVQA) (Peng et al., 9 Sep 2025).
  • Alert tasks: Multi-checkpoint accuracy is applied to streaming proactive tasks (Lee et al., 1 Dec 2025).
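
A minimal sketch of the two simplest metrics above, exact-match multiple-choice accuracy and the baseline-normalized efficacy score; the fuzzy synonym matching and the clinical scorers themselves are beyond this sketch.

```python
def mcq_accuracy(predictions, ground_truth):
    """Exact-match accuracy over multiple-choice answers (e.g., letters 'A'-'D')."""
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def normalized_efficacy(score, reference_score):
    """S_hat = (S_{m,i} / S_{m,reference}) * 100, e.g., relative to CheXagent."""
    return 100.0 * score / reference_score

# Example: a model scoring 0.41 against a reference of 0.30 yields ~136.7,
# i.e., a +36.7% relative gain over the baseline.
print(normalized_efficacy(0.41, 0.30))
```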

7. Pipeline Pseudocode and System Overview

A unified high-level pseudocode for the gaze–video QA generation procedure is as follows:

for video_clip, gaze_log in dataset:
    # 1. Fixation extraction
    fixation_list = extract_fixations(gaze_log)
    # 2. Gaze-video construction (e.g., overlay fixations or assemble salience map)
    gaze_video = render_gaze_video(video_clip, fixation_list)
    # 3. Region/scanpath assembly and object pool extraction (optional)
    scanpath = []
    for fixation in fixation_list:
        objects_fov = extract_objects(region_of_interest(video_clip, fixation))
        objects_out = extract_objects(complement_region(video_clip, fixation))
        scanpath.append((objects_fov, objects_out))
    # 4. Prompt assembly for LLM/MLLM
    prompt = construct_prompt(video_clip, gaze_video, scanpath, task_context)
    # 5. Model inference: QA generation
    candidate_QA_pairs = model.generate_QA(prompt)
    # 6. Human annotation: select/filter/refine QAs
    validated_QA_pairs = validate_QA(candidate_QA_pairs)
    # 7. Evaluate performance on held-out splits/models
    score = compute_metrics(validated_QA_pairs, ground_truth)

This flow is instantiated in RadEyeVideo for medical LVLMs, EgoGazeVQA and StreamGaze for egocentric and streaming settings, and robustly accommodates QA, report generation, diagnosis, and intention modeling tasks (Kim et al., 12 Jul 2025, Peng et al., 9 Sep 2025, Lee et al., 1 Dec 2025).


The gaze-video QA generation pipeline is foundational for evaluating and advancing multimodal models’ integration of fine-grained attention cues. Across diverse domains—in clinical imaging, egocentric video, and streaming AR scenarios—such pipelines operationalize temporally grounded, spatially aware, and intent-sensitive metrics for both automated and human-evaluated question answering. These advances drive more precise benchmarking of model capabilities in dynamic, human-centered, real-world video understanding.
