VUE-STG: Spatio-Temporal Grounding Benchmark

Updated 30 November 2025
  • VUE-STG is a comprehensive benchmark for spatio-temporal grounding, featuring 982 videos and 1,600 query-tube pairs that enable long-context reasoning.
  • It reformats full-sentence queries into descriptive noun phrases, reducing ambiguity and enhancing precise localization of objects in space and time.
  • The benchmark introduces unified spatio-temporal metrics and rigorous manual annotations, setting a higher standard than prior datasets in video understanding.

VUE-STG is a benchmark for evaluating spatio-temporal grounding (STG) in video understanding, introduced in the context of Vidi2's multimodal modeling framework. It is designed to overcome critical limitations in prior STG datasets, providing longer video contexts, expressively reformatted queries, rigorous manual annotation, and unified spatio-temporal evaluation metrics. The benchmark enables the assessment of systems that jointly localize objects in space and time given natural language queries, supporting comprehensive research on multimodal reasoning in large-scale video.

1. Dataset Composition and Video Statistics

VUE-STG comprises 982 distinct videos sourced from public domains, collectively spanning 204.79 hours. Each video is paired with at least one spatio-temporal tube annotation corresponding to a text query, totaling 1,600 unique query-tube pairs. To ensure the benchmark supports long-context reasoning, video durations range from roughly 10 seconds to 30 minutes. Videos are categorized by duration as follows:

| Category | Duration (hours) |
|---|---|
| Ultra-short (<1 min) | 0.82 |
| Short (1–10 min) | 26.28 |
| Medium (10–30 min) | 177.69 |
| Total | 204.79 |

Tubes are annotated at one frame per second and exhibit a broad spectrum of temporal coverage, predominantly between 3–10 seconds but extending from under 3 seconds to 60 seconds. Object sizes, expressed as the average bounding-box area as a percentage of the frame, are carefully balanced among small (<10%), medium (10–30%), and large (>30%) objects, eliminating scale bias in object localization (Team et al., 24 Nov 2025).
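
The size categories can be computed directly from the per-second normalized boxes. The following sketch applies the thresholds quoted above; the record layout (a list of (x1, y1, x2, y2) tuples) is illustrative and not the released file format:

```python
# Minimal sketch: categorize a tube's object size from its per-second
# normalized boxes, using the small/medium/large thresholds quoted above.
# The input layout is hypothetical, not the official annotation format.

def object_size_category(boxes):
    """boxes: list of (x1, y1, x2, y2) in [0, 1] frame coordinates,
    one per annotated second of the tube."""
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    mean_pct = 100.0 * sum(areas) / len(areas)  # average area as % of frame
    if mean_pct < 10:
        return "small"
    elif mean_pct <= 30:
        return "medium"
    return "large"

# Example: a tube whose box covers ~6% of the frame on average -> "small"
print(object_size_category([(0.1, 0.1, 0.35, 0.34), (0.12, 0.1, 0.36, 0.33)]))
```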

2. Query Reformatting and Expressiveness

VUE-STG transforms original full-sentence queries into richly descriptive noun phrases or head-noun constructions with relative clauses, without sacrificing sentence-level expressiveness. This reformatting both clarifies the annotation target and retains temporal cues, adjectives, and other information crucial for reasoning. For example:

  • Original: “a player is loaded into an ambulance”
    Converted (for ambulance target): “the ambulance which the player is being loaded into”
    Converted (for player target): “the player who is loaded into the ambulance”
  • Original: “a man standing up from a kneeling position”
    Converted: “the man who is standing up from a kneeling position”

These conversions reduce query ambiguity and concentrate model attention on a single referent, crucial for precise grounding (Team et al., 24 Nov 2025).

3. Annotation Methodology and Quality Assurance

Annotations are fully manual, with multi-pass quality control, and no reliance on automatic proposal mechanisms. The annotation process follows several guidelines (illustrated by the example record after the list):

  • Each query undergoes human verification and rephrasing to disambiguate references when multiple objects are mentioned.
  • Each annotated tube corresponds to a single query/object, with unambiguous mapping.
  • Temporal bounds for tubes are defined in seconds (start/end), discretized at 1 Hz.
  • Spatial tubes provide one normalized bounding box per second, denoted in [0,1] relative frame coordinates.
  • Temporally fragmented tubes caused by occlusion or shot transitions are preserved, not forced to be contiguous.
  • All annotation stages are subject to iterative review, yielding “significantly more accurate and consistent labels than existing benchmarks,” though no explicit IAA numbers are published (Team et al., 24 Nov 2025).
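
As a concrete illustration of these conventions, a hypothetical query-tube record might look as follows; the field names are illustrative assumptions, not the released schema:

```python
# Hypothetical query-tube record reflecting the conventions above:
# temporal bounds in seconds, 1 Hz boxes in [0, 1] coordinates, and
# fragmented segments preserved rather than forced to be contiguous.
annotation = {
    "video_id": "example_video",
    "query": "the man who is standing up from a kneeling position",
    # Two segments because the subject is briefly occluded.
    "segments": [[12.0, 18.0], [21.0, 24.0]],
    # One normalized (x1, y1, x2, y2) box per annotated second.
    "boxes": {
        12: [0.31, 0.22, 0.48, 0.85],
        13: [0.30, 0.21, 0.47, 0.86],
        # ... one entry per second inside the segments
    },
}
```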

4. Evaluation Metrics

VUE-STG introduces a unified, rigorous scheme for spatial, temporal, and spatio-temporal evaluation:

Spatial IoU (bIoU):

Given boxes $B_1$ and $B_2$:

$$\mathrm{bIoU}(B_1, B_2) = \frac{\mathrm{Area}(B_1 \cap B_2)}{\mathrm{Area}(B_1 \cup B_2)}$$
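
A minimal Python sketch of this spatial IoU, assuming boxes are given as (x1, y1, x2, y2) tuples in normalized [0, 1] coordinates:

```python
def biou(b1, b2):
    """bIoU between two boxes given as (x1, y1, x2, y2) in [0, 1] coordinates."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # intersection area
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0
```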

Temporal Metrics:

For the predicted interval $T_{\text{pred}}$ and ground-truth interval $T_{\text{gt}}$:

  • Temporal Precision: $tP = |T_{\text{pred}} \cap T_{\text{gt}}| / |T_{\text{pred}}|$
  • Temporal Recall: $tR = |T_{\text{pred}} \cap T_{\text{gt}}| / |T_{\text{gt}}|$
  • Temporal Intersection over Union:

$$\mathrm{tIoU}(T_{\text{pred}}, T_{\text{gt}}) = \frac{|T_{\text{pred}} \cap T_{\text{gt}}|}{|T_{\text{pred}} \cup T_{\text{gt}}|}$$
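
Because the benchmark discretizes tubes at 1 Hz, these temporal quantities can be sketched over sets of one-second timestamps; this representation is an assumption for illustration, not the official evaluation code:

```python
def temporal_metrics(t_pred, t_gt):
    """tP, tR, tIoU over iterables of integer second indices."""
    t_pred, t_gt = set(t_pred), set(t_gt)
    inter = len(t_pred & t_gt)
    tp = inter / len(t_pred) if t_pred else 0.0  # temporal precision
    tr = inter / len(t_gt) if t_gt else 0.0      # temporal recall
    union = len(t_pred | t_gt)
    tiou = inter / union if union else 0.0       # temporal IoU
    return tp, tr, tiou

# Example: prediction covers seconds 10-19, ground truth 15-24 -> tIoU = 5/20
print(temporal_metrics(range(10, 20), range(15, 25)))
```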

Frame-level IoU:

For each timestamp $t \in \mathbb{Z}$ (sampled once per second):

  • $\mathrm{IoU}_t = \mathrm{bIoU}(B_{\text{pred}}(t), B_{\text{gt}}(t))$ if $t \in T_{\text{pred}} \cap T_{\text{gt}}$, else $\mathrm{IoU}_t = 0$.

Spatio-temporal Metrics:

  • Spatio-temporal Precision:

$$vP = \frac{1}{|T_{\text{pred}}|} \sum_{t \in T_{\text{pred}}} \mathrm{IoU}_t$$

  • Spatio-temporal Recall:

$$vR = \frac{1}{|T_{\text{gt}}|} \sum_{t \in T_{\text{gt}}} \mathrm{IoU}_t$$

  • vIoU (over temporal union):

$$\mathrm{vIoU} = \frac{1}{|T_{\text{pred}} \cup T_{\text{gt}}|} \sum_{t \in T_{\text{pred}} \cup T_{\text{gt}}} \mathrm{IoU}_t$$

  • vIoU-Int. (over temporal intersection):

$$\mathrm{vIoU\text{-}Int.} = \frac{1}{|T_{\text{pred}} \cap T_{\text{gt}}|} \sum_{t \in T_{\text{pred}} \cap T_{\text{gt}}} \mathrm{IoU}_t$$

The primary ranking metric is vIoU, offering a stringent and unified measure for joint spatial and temporal alignment (Team et al., 24 Nov 2025).
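
A corresponding sketch of vP, vR, vIoU, and vIoU-Int., reusing the biou helper from the spatial sketch above and representing tubes as mappings from 1 Hz timestamps to boxes (again an illustrative representation, not the official evaluator):

```python
def spatio_temporal_metrics(pred_boxes, gt_boxes):
    """pred_boxes / gt_boxes: dicts mapping second index -> (x1, y1, x2, y2)."""
    t_pred, t_gt = set(pred_boxes), set(gt_boxes)
    inter, union = t_pred & t_gt, t_pred | t_gt
    # Frame-level IoU: biou where both tubes exist, 0 elsewhere.
    iou_t = {t: biou(pred_boxes[t], gt_boxes[t]) for t in inter}

    def mean_over(ts):
        return sum(iou_t.get(t, 0.0) for t in ts) / len(ts) if ts else 0.0

    vp = mean_over(t_pred)        # spatio-temporal precision
    vr = mean_over(t_gt)          # spatio-temporal recall
    viou = mean_over(union)       # vIoU, the primary ranking metric
    viou_int = mean_over(inter)   # vIoU over the temporal intersection
    return vp, vr, viou, viou_int
```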

5. Comparative Advances over Previous STG Benchmarks

VUE-STG introduces four principal improvements:

  1. Extended Video Duration: Supports context windows up to 30 minutes per video, in contrast to prior datasets (Charades-STA, ActivityNet-Caps) which typically cap at 1–2 minutes.
  2. Query Expressiveness: Employs free-form noun-phrase queries with complex structures, whereas previous datasets often restrict queries to keywords or template-based expressions.
  3. Manual Annotation Accuracy: All temporal bounds and bounding boxes undergo meticulous human curation, setting VUE-STG apart from prior efforts relying on automated cues or single-frame supervision.
  4. Refined Spatio-temporal Metrics: Evaluates models using vP, vR, vIoU, and vIoU-Int, jointly reflecting both temporal and spatial alignment, while most historic benchmarks report only temporal IoU or frame-level accuracy (Team et al., 24 Nov 2025).

6. VUE-TR-V2 Companion Benchmark

The release of VUE-STG is accompanied by VUE-TR-V2 (“Temporal Retrieval” version 2), which revises the previous VUE-TR benchmark to further mitigate dataset bias and address diverse user scenarios:

  • Total duration increased from 107.87 to 311.11 hours.
  • Additional long (30–60 min) and ultra-long (>60 min) videos incorporated to counteract short-video overrepresentation.
  • Queries expressed in more natural, free-form language, with coverage of vision-only, audio-only, and combined prompts.
  • Query formats now include both phrases and complete sentences rather than isolated keywords, improving alignment with real-world video understanding needs (Team et al., 24 Nov 2025).

7. Benchmark Results and System Comparison

Evaluation of leading models on VUE-STG and VUE-TR-V2 demonstrates substantial variance among systems (Tables 5–7 in (Team et al., 24 Nov 2025)). For VUE-STG (primary metric: vIoU):

| Model | vIoU (%) | vIoU-Int. (%) | vP (%) | vR (%) |
|---|---|---|---|---|
| Vidi2 | 32.57 | 60.30 | 44.56 | 36.32 |
| Gemini 3 Pro | 4.61 | 16.59 | 13.01 | 8.95 |
| GPT-5 | 5.47 | 18.47 | 13.01 | 6.50 |
| Qwen3-VL-32B | 5.12 | 8.61 | 8.61 | 7.49 |

For VUE-TR-V2:

| Model | IoU-AUC (%) | P-AUC (%) | R-AUC (%) |
|---|---|---|---|
| Vidi2 | 48.75 | 62.45 | 61.38 |
| Gemini 3 Pro | 37.58 | 48.61 | 57.26 |
| GPT-5 | 17.15 | 29.64 | 28.04 |

Vidi2 achieves markedly higher vIoU and overall retrieval performance than Gemini 3 Pro, GPT-5, and Qwen3-VL-32B, highlighting the competitiveness of open-source, large-multimodal systems and the discriminative power of the STG benchmark (Team et al., 24 Nov 2025).

A plausible implication is that VUE-STG’s expanded video duration, refined query formulation, and human-consistent annotations create a higher bar for multimodal video understanding—enabling more granular diagnostic evaluation of model capabilities and facilitating progress toward generalized, long-context video reasoning.
