VUE-STG: Spatio-Temporal Grounding Benchmark
- VUE-STG is a comprehensive benchmark for spatio-temporal grounding, featuring 982 videos and 1,600 query-tube pairs, with video durations long enough to require long-context reasoning.
- It reformats full-sentence queries into descriptive noun phrases, reducing ambiguity and enhancing precise localization of objects in space and time.
- The benchmark introduces unified spatio-temporal metrics and rigorous manual annotations, setting a higher standard over prior datasets in video understanding.
VUE-STG is a benchmark for evaluating spatio-temporal grounding (STG) in video understanding, introduced in the context of Vidi2's multimodal modeling framework. It is designed to overcome critical limitations in prior STG datasets, providing longer video contexts, expressively reformatted queries, rigorous manual annotation, and unified spatio-temporal evaluation metrics. The benchmark enables the assessment of systems that jointly localize objects in space and time given natural language queries, supporting comprehensive research on multimodal reasoning in large-scale video.
1. Dataset Composition and Video Statistics
VUE-STG comprises 982 distinct videos drawn from public sources, collectively spanning 204.79 hours. Each video is paired with at least one spatio-temporal tube annotation corresponding to a text query, totaling 1,600 unique query-tube pairs. To ensure the benchmark supports long-context reasoning, video durations range from roughly 10 seconds to 30 minutes. Videos are categorized by duration as follows:
| Category | Duration (hours) |
|---|---|
| Ultra-short (<1 min) | 0.82 |
| Short (1–10 min) | 26.28 |
| Medium (10–30 min) | 177.69 |
| Total | 204.79 |
Tubes are annotated at one frame per second and exhibit a broad spectrum of temporal coverage, predominantly between 3–10 seconds but ranging from under 3 seconds up to 60 seconds. Object sizes, expressed as the average bounding box area as a percentage of the frame, are balanced across small (<10%), medium (10–30%), and large (>30%) objects, mitigating scale bias in object localization (Team et al., 24 Nov 2025).
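As a concrete illustration of these bins, the following hypothetical helper assigns a video to a duration category and an annotated object to a size category from its mean normalized box area; the function names and exact threshold boundaries are illustrative assumptions, not part of the released benchmark tooling.

```python
# Illustrative binning helpers; names and boundary handling are assumptions,
# mirroring the duration and object-size categories described in the text.

def duration_category(seconds: float) -> str:
    """Duration bin for a video (thresholds as stated above)."""
    if seconds < 60:
        return "ultra-short (<1 min)"
    if seconds < 600:
        return "short (1-10 min)"
    return "medium (10-30 min)"


def object_size_category(boxes: list[tuple[float, float, float, float]]) -> str:
    """Size bin from the mean box area as a fraction of the frame.

    Boxes are normalized (x1, y1, x2, y2) coordinates in [0, 1].
    """
    mean_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes) / len(boxes)
    if mean_area < 0.10:
        return "small (<10%)"
    if mean_area <= 0.30:
        return "medium (10-30%)"
    return "large (>30%)"
```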
2. Query Reformatting and Expressiveness
VUE-STG transforms original full-sentence queries into richly descriptive noun phrases or head-noun constructions with relative clauses, without sacrificing sentence-level expressiveness. This reformatting both clarifies the annotation target and retains temporal cues, adjectives, and other information crucial for reasoning. For example:
- Original: “a player is loaded into an ambulance”
  - Converted (ambulance target): “the ambulance which the player is being loaded into”
  - Converted (player target): “the player who is loaded into the ambulance”
- Original: “a man standing up from a kneeling position”
  - Converted: “the man who is standing up from a kneeling position”
These conversions reduce query ambiguity and concentrate model attention on a single referent, crucial for precise grounding (Team et al., 24 Nov 2025).
3. Annotation Methodology and Quality Assurance
Annotations are fully manual, with multi-pass quality control, and no reliance on automatic proposal mechanisms. The annotation process follows several guidelines:
- Each query undergoes human verification and rephrasing to disambiguate references when multiple objects are mentioned.
- Each annotated tube corresponds to a single query/object, with unambiguous mapping.
- Temporal bounds for tubes are defined in seconds (start/end), discretized at 1 Hz.
- Spatial tubes provide one normalized bounding box per second, expressed in [0,1] relative frame coordinates (an illustrative record following these conventions is sketched after this list).
- Temporally fragmented tubes caused by occlusion or shot transitions are preserved, not forced to be contiguous.
- All annotation stages are subject to iterative review, yielding “significantly more accurate and consistent labels than existing benchmarks,” though no explicit IAA numbers are published (Team et al., 24 Nov 2025).
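To make the annotation format concrete, here is a hypothetical query-tube record consistent with the guidelines above; the field names and layout are illustrative assumptions rather than the released file schema.

```python
# Illustrative query-tube pair following the annotation guidelines above.
# Field names are hypothetical; they do not reflect the released file format.
annotation = {
    "video_id": "example_video",
    "query": "the man who is standing up from a kneeling position",
    # Temporal bounds in seconds; a tube may be split into several segments
    # when the target is occluded or the shot changes.
    "segments": [
        {"start": 12.0, "end": 18.0},
        {"start": 25.0, "end": 29.0},
    ],
    # One normalized (x1, y1, x2, y2) box per second, keyed by timestamp.
    "boxes": {
        12: [0.41, 0.22, 0.58, 0.87],
        13: [0.42, 0.21, 0.59, 0.88],
        # ... one entry per annotated second of every segment
    },
}
```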
4. Evaluation Metrics
VUE-STG introduces a unified, rigorous scheme for spatial, temporal, and spatio-temporal evaluation:
Spatial IoU (bIoU):
Given a predicted box $b_p$ and a ground-truth box $b_g$ in the same frame, $\mathrm{bIoU}(b_p, b_g) = \frac{|b_p \cap b_g|}{|b_p \cup b_g|}$.
Temporal Metrics:
For the predicted temporal extent $T_p$ and the ground-truth extent $T_g$ (each a set of 1 Hz timestamps, possibly fragmented):
- Temporal Precision: $tP = \frac{|T_p \cap T_g|}{|T_p|}$
- Temporal Recall: $tR = \frac{|T_p \cap T_g|}{|T_g|}$
- Temporal Intersection over Union: $tIoU = \frac{|T_p \cap T_g|}{|T_p \cup T_g|}$
Frame-level IoU:
For each timestamp $t$ (sampled once per second):
- $\mathrm{IoU}_t = \mathrm{bIoU}(b_p^t, b_g^t)$ if $t \in T_p \cap T_g$, else $0$.
Spatio-temporal Metrics:
- Spatio-temporal Precision: $vP = \frac{1}{|T_p|} \sum_{t \in T_p \cap T_g} \mathrm{IoU}_t$
- Spatio-temporal Recall: $vR = \frac{1}{|T_g|} \sum_{t \in T_p \cap T_g} \mathrm{IoU}_t$
- vIoU (over temporal union): $vIoU = \frac{1}{|T_p \cup T_g|} \sum_{t \in T_p \cap T_g} \mathrm{IoU}_t$
- vIoU-Int. (over temporal intersection): $vIoU\text{-}Int = \frac{1}{|T_p \cap T_g|} \sum_{t \in T_p \cap T_g} \mathrm{IoU}_t$
The primary ranking metric is vIoU, offering a stringent and unified measure for joint spatial and temporal alignment (Team et al., 24 Nov 2025).
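A minimal sketch of how these metrics can be computed is shown below, assuming each tube is a mapping from integer timestamps (seconds, at 1 Hz) to normalized (x1, y1, x2, y2) boxes; the data representation and function names are assumptions for illustration, not the official evaluation code.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2), normalized to [0, 1]
Tube = Dict[int, Box]                     # timestamp in seconds -> box at that second


def box_iou(a: Box, b: Box) -> float:
    """Spatial IoU (bIoU) between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def stg_metrics(pred: Tube, gt: Tube) -> Dict[str, float]:
    """Temporal and spatio-temporal scores for one query-tube pair (1 Hz sampling)."""
    t_pred, t_gt = set(pred), set(gt)
    t_int = t_pred & t_gt                  # temporal intersection
    t_uni = t_pred | t_gt                  # temporal union

    # Frame-level IoU is non-zero only where prediction and ground truth overlap in time.
    frame_iou_sum = sum(box_iou(pred[t], gt[t]) for t in t_int)

    def safe(num: float, denom: float) -> float:
        return num / denom if denom else 0.0

    return {
        "tP":       safe(len(t_int), len(t_pred)),
        "tR":       safe(len(t_int), len(t_gt)),
        "tIoU":     safe(len(t_int), len(t_uni)),
        "vP":       safe(frame_iou_sum, len(t_pred)),
        "vR":       safe(frame_iou_sum, len(t_gt)),
        "vIoU":     safe(frame_iou_sum, len(t_uni)),
        "vIoU-Int": safe(frame_iou_sum, len(t_int)),
    }
```

Because vIoU normalizes by the temporal union, a prediction with perfect per-frame boxes is still penalized for over- or under-covering the ground-truth interval, which is what makes it the most stringent of the four spatio-temporal scores.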
5. Comparative Advances over Previous STG Benchmarks
VUE-STG introduces four principal improvements:
- Extended Video Duration: Supports context windows up to 30 minutes per video, in contrast to prior datasets (Charades-STA, ActivityNet-Caps), which typically cap videos at 1–2 minutes.
- Query Expressiveness: Employs free-form noun-phrase queries with complex structures, whereas previous datasets often restrict queries to keywords or template-based expressions.
- Manual Annotation Accuracy: All temporal bounds and bounding boxes undergo meticulous human curation, setting VUE-STG apart from prior efforts relying on automated cues or single-frame supervision.
- Refined Spatio-temporal Metrics: Evaluates models using vP, vR, vIoU, and vIoU-Int, jointly reflecting both temporal and spatial alignment, while most historic benchmarks report only temporal IoU or frame-level accuracy (Team et al., 24 Nov 2025).
6. Related Upgrades: VUE-TR-V2 and Enhanced Modalities
The release of VUE-STG is accompanied by VUE-TR-V2 (“Temporal Retrieval” version 2), which revises the previous VUE-TR benchmark to further mitigate dataset bias and address diverse user scenarios:
- Total duration increased from 107.87 to 311.11 hours.
- Additional long (30–60 min) and ultra-long (>60 min) videos incorporated to counteract short-video overrepresentation.
- Queries expressed in more natural, free-form language, with coverage of vision-only, audio-only, and combined prompts.
- Query formats now include both phrases and complete sentences rather than isolated keywords, improving alignment with real-world video understanding needs (Team et al., 24 Nov 2025).
7. Benchmark Results and System Comparison
Evaluation of leading models on VUE-STG and VUE-TR-V2 demonstrates substantial variance among systems (Tables 5–7 in (Team et al., 24 Nov 2025)). For VUE-STG (primary metric: vIoU):
| Model | vIoU (%) | vIoU-Int. (%) | vP (%) | vR (%) |
|---|---|---|---|---|
| Vidi2 | 32.57 | 60.30 | 44.56 | 36.32 |
| Gemini 3 Pro | 4.61 | 16.59 | 13.01 | 8.95 |
| GPT-5 | 5.47 | 18.47 | 13.01 | 6.50 |
| Qwen3-VL-32B | 5.12 | 8.61 | 8.61 | 7.49 |
For VUE-TR-V2:
| Model | IoU-AUC (%) | P-AUC (%) | R-AUC (%) |
|---|---|---|---|
| Vidi2 | 48.75 | 62.45 | 61.38 |
| Gemini 3 Pro | 37.58 | 48.61 | 57.26 |
| GPT-5 | 17.15 | 29.64 | 28.04 |
Vidi2 achieves markedly higher vIoU and overall retrieval performance than Gemini 3 Pro, GPT-5, and Qwen3-VL-32B, highlighting the competitiveness of open-source large multimodal systems and the discriminative power of the STG benchmark (Team et al., 24 Nov 2025).
A plausible implication is that VUE-STG’s expanded video duration, refined query formulation, and human-consistent annotations create a higher bar for multimodal video understanding—enabling more granular diagnostic evaluation of model capabilities and facilitating progress toward generalized, long-context video reasoning.