Open-Vocabulary Spatio-Temporal Video Grounding

Updated 9 December 2025
  • Open-vocabulary spatio-temporal video grounding is a task that localizes objects in both time and space within untrimmed videos based on free-form natural language queries.
  • Recent approaches integrate transformer-based DETR models, MLLM techniques, and modular attention networks to address challenges like unseen categories and complex spatial-temporal relations.
  • Benchmark evaluations using metrics such as m_tIoU and m_vIoU demonstrate significant performance gains from innovations like chain-of-thought prompting, temporal adapters, and training-free pipelines.

Open-vocabulary spatio-temporal video grounding (OV-STVG) is the task of localizing, both in time and space, a target object or entity in untrimmed videos according to free-form natural language queries, without restricting the vocabulary of objects, actions, or relations to a closed set observed during training. OV-STVG requires models to handle unseen object categories, compositional phrases, rare actions, and multi-level referential language at inference without class-specific heads or task-specific retraining. Research in this area has rapidly evolved, driven by new datasets, algorithmic advances, and pre-trained vision–language foundation models.

1. Formal Problem Definition and Benchmarking

The core input to OV-STVG is a video $V = \{f_1, \dots, f_{N_v}\}$ and an open-vocabulary textual query $Q = \{w_1, \dots, w_{N_t}\}$, where $Q$ may reference arbitrary objects, actions, attributes, or spatial/temporal relations. The desired outputs are:

  • A continuous temporal segment $\hat{S} = [\hat{t}_s, \hat{t}_e] \subset [1, N_v]$
  • A spatio-temporal tube $\hat{B} = \{\hat{b}_t \in \mathbb{R}^4 \mid t = \hat{t}_s, \dots, \hat{t}_e\}$, where each $\hat{b}_t$ is a bounding box

Evaluation metrics typically include the following (a minimal reference implementation is sketched after the list):

  • Mean Temporal IoU (m_tIoU): $\mathrm{m\_tIoU} = \frac{1}{M} \sum_{i=1}^{M} \frac{|[\hat{t}_s^i, \hat{t}_e^i] \cap [t_s^i, t_e^i]|}{|[\hat{t}_s^i, \hat{t}_e^i] \cup [t_s^i, t_e^i]|}$
  • Mean Video IoU (m_vIoU): $\mathrm{m\_vIoU} = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{\hat{t}_e^i - \hat{t}_s^i + 1} \sum_{t=\hat{t}_s^i}^{\hat{t}_e^i} \mathrm{IoU}(\hat{b}_t^i, b_t^i)$
  • vIoU@R: fraction of samples with $\mathrm{vIoU} \geq R$
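
The sketch below is a minimal reference implementation of these metrics, assuming inclusive frame-index segments and (x1, y1, x2, y2) boxes stored as dicts keyed by frame index; these conventions are my own, not any benchmark's official toolkit.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def temporal_iou(pred_seg, gt_seg):
    """tIoU between two inclusive frame-index segments (t_s, t_e)."""
    inter = max(0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]) + 1)
    union = (pred_seg[1] - pred_seg[0] + 1) + (gt_seg[1] - gt_seg[0] + 1) - inter
    return inter / union

def video_iou(pred_seg, pred_boxes, gt_boxes):
    """vIoU: mean box IoU over the predicted segment; frames without a
    ground-truth box (outside the annotated segment) contribute zero."""
    ious = [box_iou(pred_boxes[t], gt_boxes[t]) if t in gt_boxes else 0.0
            for t in range(pred_seg[0], pred_seg[1] + 1)]
    return float(np.mean(ious)) if ious else 0.0

def evaluate(samples, thresholds=(0.3, 0.5)):
    """samples: dicts with pred_seg, gt_seg (inclusive index pairs) and
    pred_boxes, gt_boxes (dicts mapping frame index -> box)."""
    t_ious = [temporal_iou(s["pred_seg"], s["gt_seg"]) for s in samples]
    v_ious = [video_iou(s["pred_seg"], s["pred_boxes"], s["gt_boxes"]) for s in samples]
    report = {"m_tIoU": float(np.mean(t_ious)), "m_vIoU": float(np.mean(v_ious))}
    for r in thresholds:
        report[f"vIoU@{r}"] = float(np.mean([v >= r for v in v_ious]))
    return report
```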

These metrics are adopted in major benchmarks such as HC-STVG, VidSTG, and OmniGround (Gao et al., 21 Nov 2025, Gu et al., 3 Jan 2024, Zhang et al., 2020).

OmniGround establishes a large-scale, open-vocabulary evaluation corpus of 3,475 videos and 81 categories, with rigorous metrics for annotation quality and linguistic diversity (e.g., Normalized Entropy Index, Cross-Modal Alignment Score, Verb-Spatial Balance Index, and Foreground Complexity Index), and presents challenges specifically tailored to small/occluded objects and complex queries (Gao et al., 21 Nov 2025).
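
To illustrate what such diversity metrics measure, a Normalized Entropy Index can be computed as the Shannon entropy of the category distribution normalized by its maximum possible value; this is an assumed, illustrative definition for intuition only, and OmniGround's official formulation may differ.

```python
import math
from collections import Counter

def normalized_entropy_index(category_labels):
    """Illustrative NEI: Shannon entropy of the empirical category
    distribution, normalized by log(K) so that a perfectly balanced
    dataset scores 1.0 and a single-category dataset scores 0.0.
    (Assumed definition; consult the OmniGround paper for the official one.)"""
    counts = Counter(category_labels)
    total = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)

# Example: a heavily head-biased label set scores well below 1.0.
print(normalized_entropy_index(["person"] * 90 + ["dog"] * 8 + ["kite"] * 2))
```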

2. Model Architectures and Open-Vocabulary Mechanisms

2.1 Transformer-based DETR-style Models

Recent one-stage models such as STCAT (Jin et al., 2022), CG-STVG (Gu et al., 3 Jan 2024), and VideoGrounding-DINO (Wasim et al., 2023) use transformer encoder–decoder architectures with joint cross-modal attention. STCAT introduces a global/local multi-modal template in the query-guided decoder to enforce consistent tube predictions across frames, directly regressing bounding boxes without proposal heads (Jin et al., 2022). CG-STVG mines and propagates instance context at each decoding stage via modules for context generation and refinement, feeding visual context as cross-attention guidance (Gu et al., 3 Jan 2024). VideoGrounding-DINO leverages pre-trained image–text spatial modules (Grounding DINO) and integrates temporal aggregation adapters, freezing major backbone weights for open-vocabulary transfer (Wasim et al., 2023).
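
A rough sketch of the shared design pattern in these one-stage models, a text-conditioned decoder that emits per-frame boxes plus temporal-membership logits, is given below (PyTorch; module and dimension names are illustrative and do not reproduce any specific paper's architecture).

```python
import torch
import torch.nn as nn

class QueryGuidedSTVGDecoder(nn.Module):
    """Illustrative one-stage STVG head: a learned tube query attends over
    fused video-text memory and regresses one box plus one foreground
    logit per frame (loosely inspired by STCAT/CG-STVG-style decoders)."""
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # single tube query
        self.box_head = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.time_head = nn.Linear(d_model, 1)   # per-frame membership logit

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, D) pooled per-frame visual features
        # text_feats:  (B, L, D) token embeddings of the query sentence
        memory = torch.cat([frame_feats, text_feats], dim=1)   # cross-modal memory
        B, T, _ = frame_feats.shape
        tube_q = self.query.expand(B, T, -1) + frame_feats     # frame-specific queries
        h = self.decoder(tgt=tube_q, memory=memory)            # (B, T, D)
        boxes = self.box_head(h).sigmoid()                     # per-frame boxes
        time_logits = self.time_head(h).squeeze(-1)            # temporal membership
        return boxes, time_logits

# Usage sketch: batch of 2 clips, 32 frames, 12 text tokens, feature dim 256.
model = QueryGuidedSTVGDecoder()
boxes, time_logits = model(torch.randn(2, 32, 256), torch.randn(2, 12, 256))
```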

2.2 MLLM-Based Approaches

Multimodal LLMs (MLLMs) such as SpaceVLLM (Wang et al., 18 Mar 2025), STVG-o1 (Gu et al., 26 Nov 2025), and DEViL (Gao et al., 7 Dec 2025) embed video frames and queries into a joint space using a pre-trained LLM (e.g., Qwen2), sometimes with minimal architecture changes. SpaceVLLM introduces interleaved spatio-temporal aware queries and a Query-Guided Space Decoder, trained on a synthetic 480K-instance dataset (Uni-STG) that fuses temporal, spatial, and joint spatio-temporal tasks (Wang et al., 18 Mar 2025). STVG-o1 employs a bounding-box chain-of-thought prompting scheme with reinforcement learning, optimizing a multi-dimensional reward (format, consistency, temporal, spatial, improvement/“think” reward) for fine-grained, geometry-aware supervision (Gu et al., 26 Nov 2025). DEViL couples the MLLM with an open-vocabulary detector via a Reference-Semantic Token (RST), projecting LLM features into detector-class embeddings, and enforces tube-level temporal regularization (TTReg) for temporally-stable localization (Gao et al., 7 Dec 2025).
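
To make the box-level chain-of-thought and reward ideas concrete, the sketch below parses a hypothetical tagged answer format and scores it with a simple composite reward (format adherence + temporal IoU + mean spatial IoU). The answer format, tag names, and reward weights are assumptions for illustration, not the exact STVG-o1 protocol.

```python
import ast
import re

# Hypothetical answer format the MLLM is prompted to emit after its reasoning:
#   <think> ... free-form chain of thought with intermediate boxes ... </think>
#   <answer>frames=12-48; boxes=[[x1,y1,x2,y2], ...]</answer>
ANSWER_RE = re.compile(r"<answer>\s*frames=(\d+)-(\d+);\s*boxes=(\[.*?\])\s*</answer>", re.S)

def _box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def _t_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return inter / ((a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter)

def parse_answer(text):
    """Extract (t_s, t_e) and the per-frame box list from the model output."""
    m = ANSWER_RE.search(text)
    if m is None:
        return None
    return (int(m.group(1)), int(m.group(2))), ast.literal_eval(m.group(3))

def grounding_reward(text, gt_seg, gt_boxes, w_fmt=0.2, w_t=0.4, w_s=0.4):
    """Composite reward: format adherence + temporal IoU + mean spatial IoU
    over frames that have ground-truth boxes (weights are illustrative)."""
    parsed = parse_answer(text)
    if parsed is None:
        return 0.0                       # malformed output earns nothing
    (t_s, t_e), boxes = parsed
    s_ious = [_box_iou(b, gt_boxes[t])
              for t, b in zip(range(t_s, t_e + 1), boxes) if t in gt_boxes]
    spatial = sum(s_ious) / len(s_ious) if s_ious else 0.0
    return w_fmt + w_t * _t_iou((t_s, t_e), gt_seg) + w_s * spatial
```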

2.3 Weakly-supervised and Modular Attention Models

Earlier systems such as the two-stream modular attention network (Wiriyathammabhum et al., 2019) disentangle appearance and motion through parallel modules, with explicit language–vision matching for subject, location, and relationship. Weakly-supervised frameworks, including WSSTG (Chen et al., 2019) and TubeRMC (Li et al., 13 Nov 2025), rely on instance proposal generation, cross-modal attentive interaction/ranking, and tube-conditioned masked-language reconstruction to align spatio-temporal hypotheses to free-form queries without dense supervision.
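
The reconstruction-as-supervision idea can be sketched as a small decoder that reconstructs masked query tokens conditioned on pooled tube features, so the reconstruction loss doubles as a tube-sentence alignment score (PyTorch; shapes and module names are illustrative, not TubeRMC's exact design).

```python
import torch
import torch.nn as nn

class TubeConditionedReconstructor(nn.Module):
    """Illustrative masked-language reconstruction head: tube features act
    as the conditioning memory for reconstructing masked query tokens.
    Lower reconstruction loss => better tube-sentence alignment."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, masked_tokens, target_tokens, tube_feats):
        # masked_tokens: (B, L) query with some ids replaced by a [MASK] id
        # target_tokens: (B, L) original ids at masked positions, -100 elsewhere
        # tube_feats:    (B, T, D) per-frame features pooled inside one proposal tube
        h = self.decoder(tgt=self.token_emb(masked_tokens), memory=tube_feats)
        logits = self.lm_head(h)                              # (B, L, V)
        return self.loss(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))

# Usage sketch: random token ids stand in for a masked query; one masked
# position per sample carries the reconstruction target.
recon = TubeConditionedReconstructor(vocab_size=30522)
loss = recon(torch.randint(0, 30522, (2, 12)),
             torch.full((2, 12), -100).scatter_(1, torch.tensor([[3], [5]]), 42),
             torch.randn(2, 16, 256))
```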

3. Dataset Construction and Challenges

Open-vocabulary STVG requires datasets that (1) maximize category diversity, (2) minimize label bias and shortcut learning, and (3) support compositional and relational queries. Key benchmarks include HC-STVG, VidSTG, and the open-vocabulary OmniGround corpus described in Section 1.

Benchmarking reveals that closed-set models overfit to head classes and lack robustness on rare/unseen objects, linguistically rich queries, and complex spatial/temporal configurations (Gao et al., 21 Nov 2025). OmniGround’s VSBI metric directly quantifies linguistic balance, while NEI captures category coverage.

4. Training Paradigms, Loss Functions, and Evaluation

Open-vocabulary STVG models leverage:

  • Joint cross-modal regression/classification: DETR-style models minimize combinations of L1 and IoU/GIoU losses for boxes, plus Kullback–Leibler divergence or BCE for temporal membership (Jin et al., 2022, Wasim et al., 2023); a minimal sketch of such a composite loss follows this list.
  • Reconstruction and contrastive objectives: TubeRMC uses three coupled reconstructors for spatial, temporal, and spatio-temporal query masking, paired with inter/intra-proposal contrastive and mutual-consistency losses to promote tube–sentence alignment (Li et al., 13 Nov 2025).
  • Reinforcement learning: STVG-o1 optimizes multi-component geometric rewards via Group Relative Policy Optimization, extracting and aligning chain-of-thought and final tube predictions (Gu et al., 26 Nov 2025).
  • Transfer learning: VideoGrounding-DINO freezes large-scale image–text spatial backbones, and SpaceVLLM and DEViL adapt MLLMs via open-vocabulary detectors, custom queries, and auxiliary modules (Wasim et al., 2023, Wang et al., 18 Mar 2025, Gao et al., 7 Dec 2025).
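
A composite loss of the kind used by the DETR-style models above might look like the following sketch (PyTorch + torchvision; the BCE temporal term and the loss weights are illustrative defaults, not any single paper's configuration).

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def stvg_loss(pred_boxes, pred_time_logits, gt_boxes, gt_inside,
              w_l1=5.0, w_giou=2.0, w_time=1.0):
    """Composite STVG loss sketch: L1 + GIoU on frames inside the ground-truth
    segment, plus per-frame BCE for temporal membership.
    pred_boxes, gt_boxes: (T, 4) in (x1, y1, x2, y2); pred_time_logits: (T,);
    gt_inside: (T,) float mask, 1 for frames inside the annotated segment."""
    inside = gt_inside.bool()
    l1 = F.l1_loss(pred_boxes[inside], gt_boxes[inside])
    giou = generalized_box_iou_loss(pred_boxes[inside], gt_boxes[inside],
                                    reduction="mean")
    time = F.binary_cross_entropy_with_logits(pred_time_logits, gt_inside)
    return w_l1 * l1 + w_giou * giou + w_time * time

# Usage sketch on random tensors for a 32-frame clip with gt segment [8, 24].
def rand_boxes(n):
    xy, wh = torch.rand(n, 2), 0.05 + 0.4 * torch.rand(n, 2)
    return torch.cat([xy, xy + wh], dim=-1)          # valid (x1, y1, x2, y2)

T = 32
gt_inside = ((torch.arange(T) >= 8) & (torch.arange(T) <= 24)).float()
loss = stvg_loss(rand_boxes(T), torch.randn(T), rand_boxes(T), gt_inside)
```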

Zero-shot and cross-domain evaluation are standard; VideoGrounding-DINO and STVG-o1 report substantial gains over closed-set or direct-finetuning baselines in challenging OV-STVG settings (Wasim et al., 2023, Gu et al., 26 Nov 2025).

5. Empirical Results and Performance Comparisons

Table: Representative model performance on HC-STVG-v1 (test split, percentages)

| Model | m_tIoU | m_vIoU | vIoU@0.3 | vIoU@0.5 |
| --- | --- | --- | --- | --- |
| TubeDETR | 43.7 | 32.4 | 49.8 | 23.5 |
| STCAT | 49.4 | 35.1 | 57.7 | 30.1 |
| CG-STVG | 52.8 | 38.4 | 61.5 | 36.3 |
| SpaceVLLM-7B | 56.9 | 39.3 | 66.6 | 36.9 |
| STVG-o1 | 60.3 | 44.1 | 73.3 | 43.5 |
| DEViL (fine-tuned) | 54.7 | 36.2 | - | - |
| TubeRMC (weakly supervised) | - | 19.4 | 23.9 | 6.75 |

Performance gains in m_tIoU and m_vIoU correlate with the architectural innovations surveyed in Section 2 and tabulated in Section 7: instance-context mining (CG-STVG), temporal adaptation of frozen image–text backbones (VideoGrounding-DINO), interleaved spatio-temporal queries (SpaceVLLM), and geometry-aware chain-of-thought supervision (STVG-o1).

PG-TAF, a training-free pipeline that decouples LLM-based temporal inference from CLIP-tracker spatial propagation, reports absolute improvements of +25.6% m_tIoU and +35.6% m_vIoU on OmniGround, with notable robustness to small/occluded objects and long-tail queries (Gao et al., 21 Nov 2025).
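
The decoupled recipe that PG-TAF exemplifies can be outlined as follows; the helper callables (caption_frames, llm_pick_segment, detect_by_text, track) are placeholders for whichever captioner, LLM, open-vocabulary detector, and tracker one plugs in, not PG-TAF's actual components.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]
Frame = Any  # e.g. an H x W x 3 image array

def training_free_stvg(
    frames: List[Frame],
    query: str,
    caption_frames: Callable[[List[Frame]], List[str]],             # frozen captioner
    llm_pick_segment: Callable[[List[str], str], Tuple[int, int]],  # LLM temporal reasoning
    detect_by_text: Callable[[Frame, str], Box],                    # open-vocab detector
    track: Callable[[List[Frame], int, Box], Dict[int, Box]],       # box tracker
) -> Tuple[Tuple[int, int], Dict[int, Box]]:
    """Illustrative decoupled, training-free STVG pipeline:
    (1) ground the query temporally with an LLM over per-frame captions,
    (2) ground it spatially in one anchor frame with an open-vocab detector,
    (3) complete the tube by propagating the anchor box with a tracker."""
    captions = caption_frames(frames)
    t_s, t_e = llm_pick_segment(captions, query)
    anchor = (t_s + t_e) // 2                                 # representative frame
    anchor_box = detect_by_text(frames[anchor], query)
    local_tube = track(frames[t_s:t_e + 1], anchor - t_s, anchor_box)
    tube = {t_s + k: box for k, box in local_tube.items()}   # re-index to video time
    return (t_s, t_e), tube
```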

6. Open Challenges and Future Directions

Three principal challenges define OV-STVG:

  • Category and domain shift: Open-vocabulary generalization exposes models’ tendency to overfit to head/seen classes and collapse on rare categories or complex spatial arrangements (Gao et al., 21 Nov 2025).
  • Linguistic and relational compositionality: Extant architectures struggle with queries containing nested relations, role-based disambiguation (e.g., “the man in blue shirt behind the car on the right”), and chained reasoning (Gao et al., 21 Nov 2025, Gao et al., 7 Dec 2025).
  • Scalability and efficiency: computation and memory grow linearly with frame count, especially in transformer- and MLLM-based pipelines (SpaceVLLM (Wang et al., 18 Mar 2025)).

Future directions suggested in the literature include multi-RST or multi-entity grounding for referential chains, adaptive frame selection for long videos, explicit causal/relational and compositional grounding, dataset design for NEI~1.0, and bridging foundation model vision–text representations with pixel/instance-level localization (Gao et al., 7 Dec 2025, Gao et al., 21 Nov 2025, Wang et al., 18 Mar 2025). Modular and decoupled architectures (e.g., PG-TAF) indicate a practical path for leveraging LLMs’ open-vocabulary capacity with vision models’ spatial fidelity.

7. Representative Innovations and Model Comparisons

| Model/Framework | Key Innovations | Open-Vocab Support | Performance Trend |
| --- | --- | --- | --- |
| STCAT (Jin et al., 2022) | Global/local multi-modal template, self-attention | Text-driven, detector-free | SOTA 2022 |
| CG-STVG (Gu et al., 3 Jan 2024) | Instance context mining/refinement modules | Instance-visual context | +2 m_tIoU vs. prior SOTA |
| VideoGrounding-DINO (Wasim et al., 2023) | Frozen Grounding DINO, temporal adapters | Foundation image–text | +4.88 m_vIoU over STCAT |
| SpaceVLLM (Wang et al., 18 Mar 2025) | Interleaved queries, Query-Guided Space Decoder | MLLM, no class head | SOTA (2025) |
| STVG-o1 (Gu et al., 26 Nov 2025) | RL "think with boxes" chain-of-thought | Direct MLLM, RL rewards | +7.3% m_tIoU over task SOTA |
| DEViL (Gao et al., 7 Dec 2025) | RST, OVD coupling, TTReg | RST-driven OV detector | SOTA, tube stability |
| PG-TAF (Gao et al., 21 Nov 2025) | Training-free, LLM + CLIP-tracker pipeline | LLM text, tracker pixel | +25.6% m_tIoU |
| TubeRMC (Li et al., 13 Nov 2025) | Tube-conditioned multi-task reconstruction | Foundation + reconstruction loss | SOTA weakly supervised (2025) |

The field of open-vocabulary spatio-temporal video grounding is thus defined by the interplay between large-scale pretrained models, spatio-temporal reasoning in both the visual and linguistic domains, and increasingly rigorous open-domain evaluation. Ongoing research aims to further close the gap between generalist and specialist architectures for robust, fine-grained real-world video understanding.
