Grounded Video Question Answering
- Grounded Video Question Answering is a multi-modal domain that aligns answers with specific visual cues, frames, or temporal segments in a video.
- It employs methods like mask-based attention, scene graphs, and multi-agent architectures to tightly couple answer generation with evidence selection.
- Key innovations enhance interpretability and causal inference, though challenges remain in handling noise, occlusion, and multi-event scenarios.
Grounded Video Question Answering (GVQA) is a specialized multi-modal learning task that answers natural-language questions about videos by explicitly aligning the predicted answers with precise visual or spatio-temporal evidence within the video. Unlike traditional VideoQA, which often relies on abstracted or context-only representations, GVQA requires models to discover, highlight, and justify which specific visual cues, frames, objects, or temporal segments support the answer. The task is motivated by the need for trustworthy, interpretable, evidence-based predictions in video understanding, especially in settings where biased or superficial correlations reduce reliability.
1. Problem Formulation and Task Variants
GVQA builds upon standard VideoQA by enforcing grounding objectives:
- Input: A video $V$ and a natural-language question $Q$, posed in either multiple-choice or open-ended format.
- Output: An answer $A$ and a grounding annotation $G$; typically, $G$ is a segment, bounding box, set of frames, or a structured scene graph indicating which part(s) of the video support $A$.
Formally, models learn a mapping $f_\theta: (V, Q) \mapsto (A, G)$, or equivalently a conditional distribution $p_\theta(A, G \mid V, Q)$; a minimal interface sketch appears below. Depending on dataset and research context:
- Temporal GVQA: $G = [t_s, t_e]$, a time interval
- Spatio-temporal GVQA: $G = \{(t_i, b_i)\}$, frame indices and spatial boxes
- Symbolic grounding: $G = (\mathcal{O}, \mathcal{R})$, a scene graph of objects and relations
- Commonsense reasoning: $G$ can be an entailment-tree-linked set of moments
Benchmarks such as NExT-GQA (Xiao et al., 2023), DeVE-QA (Dang et al., 22 Jun 2025), ViTXT-GQA (Zhou et al., 22 Sep 2024), and REAL-Colon-VQA (Drago et al., 5 Nov 2025) provide temporally or spatially grounded annotations for rigorous evaluation.
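To make the interface concrete, the following sketch types the $(V, Q) \mapsto (A, G)$ formulation with its temporal and spatio-temporal grounding variants; all class and function names here are hypothetical illustrations rather than a benchmark-prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical container types mirroring the formal task definition:
# a model maps a (video, question) pair to an answer plus grounding evidence.

@dataclass
class TemporalGrounding:
    start_sec: float  # t_s: start of the supporting segment
    end_sec: float    # t_e: end of the supporting segment

@dataclass
class SpatioTemporalGrounding:
    frame_index: int                              # t_i: grounded frame
    box_xyxy: Tuple[float, float, float, float]   # b_i: spatial box in that frame

@dataclass
class GVQAPrediction:
    answer: str                                        # A: predicted answer
    temporal: Optional[TemporalGrounding] = None       # G as a time interval
    spatio_temporal: Optional[List[SpatioTemporalGrounding]] = None  # G as frame/box pairs

def grounded_vqa(video_path: str, question: str) -> GVQAPrediction:
    """Placeholder for a GVQA model: f_theta(V, Q) -> (A, G)."""
    raise NotImplementedError("Plug in a concrete GVQA model here.")
```

Symbolic and entailment-style groundings would add further variants, but the shape of the prediction is the same: an answer paired with explicit supporting evidence.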
2. Grounding Architectures and Learning Objectives
GVQA model design centers on two core challenges: (1) identifying the question-critical regions or moments, and (2) tightly coupling answer generation with evidence selection. Prevailing approaches include:
- Mask-Based Grounding: Learnable Gaussian masks parameterize temporal attention (Xiao et al., 2023, Wang et al., 19 Jan 2024); the masked segment drives the QA loss and provides weak grounding supervision (a minimal sketch of this recipe follows the list).
Joint loss (e.g., (Xiao et al., 2023)): $\mathcal{L} = \mathcal{L}_{\text{QA}} + \lambda\,\mathcal{L}_{\text{ground}}$, combining answer prediction over the masked segment with a weakly supervised grounding term.
- Scene Graph and Symbolic Reasoning: Build explicit scene graphs $\mathcal{G} = (\mathcal{O}, \mathcal{R})$, where $\mathcal{O}$ are object nodes and $\mathcal{R}$ are semantic or spatial relations. Models such as SG-VLM (Ma et al., 15 Sep 2025) integrate frozen VLMs with graph encoding and multi-modal fusion Transformers.
- Multi-path Agentic Models: MUPA (Dang et al., 22 Jun 2025) introduces multiple agents (Grounder, Answerer, GQA, Reflection) that reason along three paths (ground-to-answer, answer-to-ground, joint), with a reflection module for reliability. Aggregation combines answer confidences and grounding segments via weighted k-means fusion.
- Contrastive and InfoNCE Objectives: Weakly supervised models apply contrastive losses (InfoNCE) between relevant and irrelevant frames/tokens, e.g., maximizing similarity of question-guided video features to correct answer region (Wang et al., 19 Jan 2024, Zhou et al., 22 Sep 2024).
- Intrinsic Interpretable Grounding: EIGV (Li et al., 2022) and IGV (Li et al., 2022) use SCM causal interventions: partition clips into "causal" and "environment," enforce equivariant/invariant objectives, and penalize answer changes due to non-causal frames.
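As a concrete illustration of the mask-based recipe above, the sketch below (assuming PyTorch; module and function names are hypothetical) predicts a learnable Gaussian mask over frame features and combines a QA loss with a simple grounding regularizer; it mirrors the general idea rather than the exact formulation of any cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianTemporalMask(nn.Module):
    """Predict a Gaussian attention mask over frames from fused video-question features."""

    def __init__(self, dim: int):
        super().__init__()
        # Predict (center, width) of the Gaussian in normalized [0, 1] time.
        self.head = nn.Linear(dim, 2)

    def forward(self, fused: torch.Tensor):
        # fused: (B, T, D) question-conditioned frame features
        B, T, _ = fused.shape
        params = self.head(fused.mean(dim=1))              # (B, 2)
        center = torch.sigmoid(params[:, 0])               # mu in [0, 1]
        width = torch.sigmoid(params[:, 1]) * 0.5 + 1e-3   # sigma, kept positive
        t = torch.linspace(0, 1, T, device=fused.device).unsqueeze(0)  # (1, T)
        mask = torch.exp(-0.5 * ((t - center.unsqueeze(1)) / width.unsqueeze(1)) ** 2)
        return mask, center, width                          # mask: (B, T)

def joint_loss(answer_logits, answer_labels, mask, lam: float = 1.0):
    """L = L_QA + lambda * L_ground (here a sparsity regularizer as weak grounding supervision)."""
    l_qa = F.cross_entropy(answer_logits, answer_labels)
    l_ground = mask.mean()  # encourage compact, selective temporal support
    return l_qa + lam * l_ground
```

In a full model, the mask would reweight frame features before answer decoding, and the predicted (center, width) pair would be read out as the grounded time interval.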
3. Dataset Development and Evaluation Protocols
Essential GVQA datasets incorporate explicit grounding:
- NExT-GQA (Xiao et al., 2023): 10.5K temporal segments tied to QA pairs; annotators label the minimal moments justifying each answer.
- ViTXT-GQA (Zhou et al., 22 Sep 2024): For scene-text grounding; 52K bounding boxes over 2.2K temporal regions, covering 2K questions and 729 videos.
- REAL-Colon-VQA (Drago et al., 5 Nov 2025): Surgical sequences, 5.2K multi-domain clips with both short and long answers, and movement-related grounding.
- Ego4D-NLQ (Di et al., 2023): Egocentric long videos with question-anchored temporal spans.
Evaluation metrics are designed to jointly test answer correctness and grounding precision:
| Metric | Description |
|---|---|
| Acc@QA | Standard QA accuracy |
| Acc@GQA | QA correctness with IoP/IoU ≥ 0.5 for grounded evidence |
| mIoP | Mean Intersection-over-Prediction (temporal) |
| IoU@0.5/0.3 | Bounding box/frame overlap for spatial grounding |
| GQA@0.5 | Joint grounding+QA at IoU = 0.5 |
| Keyword-Accuracy | Specialized (e.g., surgical terms in answers) |
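To make the temporal metrics concrete, the sketch below computes interval IoP/IoU, mIoP, and an Acc@GQA-style score at the 0.5 threshold used in the table; the list-of-intervals data format and function names are assumptions for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)

def interval_iou(pred: Interval, gt: Interval) -> float:
    """Temporal Intersection-over-Union between two intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def interval_iop(pred: Interval, gt: Interval) -> float:
    """Intersection-over-Prediction: fraction of the predicted segment covered by ground truth."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    return inter / pred_len if pred_len > 0 else 0.0

def mean_iop(preds: List[Interval], gts: List[Interval]) -> float:
    """mIoP over a set of predicted / ground-truth segments."""
    vals = [interval_iop(p, g) for p, g in zip(preds, gts)]
    return sum(vals) / len(vals) if vals else 0.0

def acc_at_gqa(preds: List[Interval], gts: List[Interval],
               answers_correct: List[bool], thresh: float = 0.5) -> float:
    """Fraction of questions answered correctly AND grounded with IoP >= thresh."""
    hits = [c and interval_iop(p, g) >= thresh
            for p, g, c in zip(preds, gts, answers_correct)]
    return sum(hits) / len(hits) if hits else 0.0
```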
Human performance on NExT-GQA is roughly Acc@QA ~93% and Acc@GQA ~82%; current best models reach Acc@QA ~71–74% and Acc@GQA ~30% (Dang et al., 22 Jun 2025).
4. Quantitative Results and Comparative Analysis
GVQA systems consistently outperform non-grounded baselines, especially on causal/temporal and complex reasoning questions.
Example performance comparison on NExT-QA, iVQA, and ActivityNet-QA (parentheses denote the reported metric, backbone, or variant where it differs from standard accuracy):
| Model | NExT-QA | iVQA | ActivityNet-QA |
|---|---|---|---|
| VLM only | 79.5% | 69.1% | 46.6% |
| SG-VLM (FrameSel) | 83.6% | 76.9% | 52.7% |
| MUPA-7B (Acc@GQA) | 30.3% | — | — |
| TOGA (mIoU) | 24.4% | — | — |
| EIGV (Co-Mem) | 50.7% | — | — |
| LGQAVE | 66.7% | — | — |
Model ablations reveal that joint answer+grounding training yields 2–4% higher accuracy than two-stage approaches, and that multi-scale or causally-aware frame selection and grounding modules (e.g., scene graphs, STR, moment detectors) deliver the greatest relative improvement on challenging benchmarks.
5. Methodological Innovations and Interpretability
GVQA research advances the field with several methodological innovations:
- Multi-path and agentic architectures: Combining several reasoning paths and explicit reflection agents improves grounding fidelity, enabling models to self-verify predictions (Dang et al., 22 Jun 2025).
- Symbolic scene-graph and object-level reasoning: Directly injects interpretable object and relation representations, supporting causal, spatial, and temporal reasoning (Ma et al., 15 Sep 2025).
- Contrastive and alignment losses: Link question-relevant frames to specific textual events, closing the gap between attention-only and mask-based grounding (Xiao et al., 2023, Wang et al., 19 Jan 2024); a minimal InfoNCE-style sketch follows this list.
- Intrinsic explainability via causal masking: EIGV/IGV enforce groundings that are robust to spurious correlations, rendering the visual-rationale computation transparent (Li et al., 2022, Li et al., 2022).
- Trigger-moment selection via LLM reasoning: Zero-shot LLMs are prompted to pinpoint the single best frame for spatial anchoring, yielding state-of-the-art tracking accuracy (Seo et al., 4 Nov 2025).
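As a minimal sketch of the contrastive-alignment idea above, the function below implements a generic symmetric InfoNCE loss between question-guided video features and matching event/answer text embeddings (assuming PyTorch); it is a simplified stand-in, not the exact objective of the cited works.

```python
import torch
import torch.nn.functional as F

def info_nce(video_feats: torch.Tensor,
             text_feats: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE: the i-th question-guided video feature should match
    the i-th event/answer text embedding; all other pairs act as negatives.

    video_feats: (B, D) pooled, question-conditioned video representations
    text_feats:  (B, D) embeddings of the matching textual events/answers
    """
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                       # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)   # positives on the diagonal
    # Symmetric contrastive loss over video->text and text->video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In weakly supervised settings, the positives would be the frames or tokens selected by the grounding module and the negatives the remaining, question-irrelevant content.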
Qualitative case studies, e.g., STR (Li et al., 2023), show that correct grounding removes answer confusion and boosts performance on long videos or those with overlapping events/objects.
6. Limitations, Controversies, and Future Directions
Despite substantial progress, critical limitations persist:
- Grounding accuracy remains far below human level; spurious correlations and dataset biases still hinder out-of-distribution generalization.
- Noise in pseudo-labels and CLIP-based supervision: Weakly-supervised approaches depend on CLIP or other LLMs for event discovery; imperfect event descriptions or missed moments reduce precision (Wang et al., 19 Jan 2024).
- Restriction to continuous segments: Most models output a single interval per QA; multi-event or multi-argument queries are not handled natively.
- Object tracking and occlusion: Tracking small or occluded objects, which is critical in tasks like surgical QA or container tracking, requires further advances in spatio-temporal association (Seo et al., 4 Nov 2025).
- Scene-text recognition bottleneck: In TextVideoQA, OCR errors dominate, and joint training for detection, recognition, and QA could reduce error rates (Zhou et al., 22 Sep 2024).
- Evaluation metrics: String-match evaluation fails to capture partial correctness or semantic alignment; richer evaluation (human-in-the-loop, relaxed metrics) is needed for real-world adoption.
Emerging directions include end-to-end trainable scene graph and tubelet extraction, multi-modal contrastive learning with audio/subtitle integration, hierarchical entailment trees for commonsense inference (Liu et al., 9 Jan 2025), and query-focused interval selection via foundation models (Rongali et al., 12 Dec 2024).
In summary, Grounded Video Question Answering has rapidly evolved from black-box prediction to interpretable, causally faithful inference with explicit visual evidence. Contemporary research leverages symbolic, agentic, and alignment-driven architectures to bridge the gap between video, language, and grounding, yielding both state-of-the-art accuracy and transparent model rationale. Ongoing challenges around annotation noise, multi-event handling, and evaluation remain active areas for innovation.