- The paper introduces Bridge-STG, which decouples spatial and temporal localization through explicit temporal alignment and semantic bridging queries.
- It employs multi-layer query aggregation and contrastive loss to enhance both temporal reasoning and precise spatial decoding in video grounding.
- Experiments show substantial improvements in m_vIoU and m_tIoU across benchmarks, along with faster inference and robustness to distractor frames.
Decoupling Spatio-Temporal Alignment for Fine-Grained Video Grounding
The paper addresses Spatio-Temporal Video Grounding (STVG), the task of localizing a target object in both time and space from a natural language query. Existing Multimodal LLMs (MLLMs) are limited by two fundamental challenges:
- Entangled Spatio-Temporal Alignment: Existing autoregressive MLLM architectures couple temporal and spatial localization within a unified output space. This conflation squanders each component's specialized strengths: MLLMs excel at sequence-level temporal reasoning, whereas spatial localization demands pixel-precise coordinate regression, a task ill-suited to autoregressive LLMs. The joint output space also grows exponentially more complex as event durations, scene transitions, and object appearances fluctuate throughout real-world videos.
- Dual-Domain Visual Token Redundancy: STVG differs from image grounding due to both temporal sparsity (objects appear only in specific frames) and spatial sparsity (relevant objects occupy localized regions). As a result, most visual tokens extracted from dense sampling are irrelevant, degrading multimodal alignment and spatial localization fidelity.
Previous attempts to solve spatio-temporal grounding have largely relied on coupled architectures or task-specific models, often yielding ambiguous temporal boundaries, spatial misalignment, and susceptibility to distractors.
Bridge-STG Architecture: Decoupled Alignment with Semantic Bridging
The proposed Bridge-STG framework explicitly decouples temporal and spatial localization, while semantic bridging keeps the two stages from becoming architecturally and semantically isolated. The design consists of two principal modules:
Spatio-Temporal Semantic Bridging (STSB) with Explicit Temporal Alignment (ETA)
Instead of forcibly intertwining spatial and temporal tasks within an autoregressive sequence, Bridge-STG inserts text-formatted timestamp tokens after each frame pair's visual tokens. These timestamp tokens are projected to virtual spatial coordinates outside the visual token grid, preserving positional-embedding coherence while anchoring event boundaries. STSB then applies a set of learnable bridging queries that distill the MLLM's temporally enriched reasoning context into robust semantic features. These features serve as the interface between temporal localization and spatial decoding, allowing cooperative optimization despite the decoupling.
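As a rough illustration of the interleaving and virtual-coordinate ideas, consider the minimal PyTorch sketch below; the embedding width, grid size, helper names, and the exact virtual-coordinate scheme are assumptions for exposition, not the paper's implementation:

```python
import torch
import torch.nn as nn

D = 256       # embedding width (illustrative assumption)
GRID_W = 16   # visual token grid width (illustrative assumption)

# Stand-in for embedding a text-formatted timestamp such as "1.5s"; a real
# system would tokenize the string with the MLLM's own tokenizer.
timestamp_embed = nn.Linear(1, D)

def build_sequence(frame_tokens, timestamps):
    """Interleave one timestamp token after each frame's visual tokens.

    frame_tokens: list of (N, D) tensors, one per sampled frame
    timestamps:   list of floats (seconds), one per sampled frame
    Returns the flattened MLLM input plus a 2D position for every token;
    the timestamp token sits one column past the grid (x = GRID_W), i.e.
    at a "virtual" coordinate outside the visual token grid.
    """
    parts, positions = [], []
    for f, (tokens, t) in enumerate(zip(frame_tokens, timestamps)):
        parts.append(tokens)                                  # visual tokens
        positions += [(x % GRID_W, f) for x in range(tokens.size(0))]
        parts.append(timestamp_embed(torch.tensor([[t]])))    # (1, D) anchor
        positions.append((GRID_W, f))                         # virtual coordinate
    return torch.cat(parts, dim=0), positions

# Usage: two frames sampled at 0.0 s and 0.5 s (2 fps), 4 visual tokens each.
seq, pos = build_sequence([torch.randn(4, D), torch.randn(4, D)], [0.0, 0.5])
print(seq.shape, pos[4])   # torch.Size([10, 256]) (16, 0)
```

Placing the anchor just outside the grid keeps the timestamp token addressable by the same 2D positional scheme as the visual tokens, which is the coherence property the paper describes.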
Query-Guided Spatial Localization (QGSL) with Multi-Layer Interactive Queries
The spatial grounding stage is handled by a dedicated QGSL module, in which the semantic bridging queries from STSB act as prompts that drive a deformable spatial decoder. To overcome redundancy and ensure fine-grained localization, QGSL aggregates candidate queries from all image encoder layers (not just the last) via cosine-similarity selection, enriching spatial diversity, particularly for small or occluded objects. Training additionally uses positive/negative frame sampling, exposing the decoder to both relevant (positive) and irrelevant (negative) frames so it learns to discriminate against distractors. A contrastive alignment loss supervises image-query selection, ensuring semantic and visual coherence.
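Both training-time mechanisms can be sketched in a few lines. The following minimal PyTorch sketch assumes a top-k cosine selection rule and an InfoNCE-style contrastive form; these are plausible readings for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def aggregate_queries(layer_feats, bridge_query, top_k=8):
    """Pick the top-k tokens per encoder layer by cosine similarity to the
    semantic bridging query, then pool candidates across layers.

    layer_feats:  list of (N, D) token features, one per image-encoder layer
    bridge_query: (D,) semantic bridging query from STSB
    """
    q = F.normalize(bridge_query, dim=-1)
    selected = []
    for feats in layer_feats:
        sims = F.normalize(feats, dim=-1) @ q           # (N,) cosine similarities
        idx = sims.topk(min(top_k, feats.size(0))).indices
        selected.append(feats[idx])                     # most query-relevant tokens
    return torch.cat(selected, dim=0)                   # (layers * top_k, D)

def contrastive_alignment_loss(queries, frame_feats, pos_mask, tau=0.07):
    """InfoNCE-style loss: each bridging query should match its positive
    frames and repel negative (query-irrelevant) frames.

    queries:     (B, D) bridging queries
    frame_feats: (B, F, D) pooled per-frame features
    pos_mask:    (B, F) bool, True where the frame contains the target
    """
    q = F.normalize(queries, dim=-1).unsqueeze(1)       # (B, 1, D)
    v = F.normalize(frame_feats, dim=-1)                # (B, F, D)
    log_p = ((q * v).sum(-1) / tau).log_softmax(dim=-1) # (B, F)
    pos = pos_mask.float()
    # average log-likelihood assigned to the positive frames of each clip
    return -(log_p * pos).sum(-1).div(pos.sum(-1).clamp(min=1)).mean()

# Usage with toy shapes: 3 encoder layers, 2 clips of 6 frames each.
D = 256
cands = aggregate_queries([torch.randn(50, D) for _ in range(3)],
                          torch.randn(D))               # (24, D) decoder candidates
loss = contrastive_alignment_loss(
    torch.randn(2, D), torch.randn(2, 6, D),
    torch.tensor([[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 1, 0]], dtype=torch.bool))
print(cands.shape, loss.item())
```

The per-layer selection is what lets earlier, higher-resolution encoder layers contribute candidates for small or occluded objects, while the contrastive term pushes each bridging query toward its positive frames and away from distractor frames.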
Experiments: Quantitative and Qualitative Evaluations
Extensive experiments validate Bridge-STG across diverse benchmarks:
- On VidSTG, Bridge-STG improves average m_vIoU from 26.4 to 34.3, outperforming all prior MLLM-based approaches and closing the gap to task-specific models.
- On HCSTVG-v2, Bridge-STG achieves 64.1 m_tIoU and 41.5 m_vIoU, with particularly large improvements under strict localization metrics.
- On Video Temporal Grounding and Object Tracking benchmarks (Charades-STA, GOT-10K), the architecture exhibits strong cross-task generalization, outperforming models optimized for pure temporal or spatial tasks.
- Referring Expression Comprehension (REC) and Video Question Answering (VQA) evaluations demonstrate the model's preserved spatial and reasoning capacity.
- Ablation studies confirm that explicit temporal anchoring, semantic bridging queries, positive/negative frame sampling, multi-layer query aggregation, and contrastive image-query alignment each contribute materially to performance.
- Oracle temporal-window experiments show that the QGSL decoder is strong enough that further gains in spatial grounding are attainable simply by improving temporal localization.
Practical and Theoretical Implications
The decoupled architecture yields not only accuracy improvements but also substantial reductions in token and frame processing, enabling faster inference and lower memory usage than MLLM baselines. The modularity of Bridge-STG provides robustness against task-specific dataset biases and facilitates transfer to broader video understanding tasks under unified multi-task training regimes.
From a theoretical perspective, the explicit semantic bridging mechanism establishes a general paradigm for overcoming architectural isolation in multimodal systems. The formal decoupling enables independent optimization of temporal and spatial modules while preserving end-to-end gradient flow, mitigating the traditional tradeoff between specialization and expressivity.
Limitations and Future Directions
Bridge-STG's reliance on fixed frame sampling (2 fps) constrains its effectiveness on fast-motion or short-duration events. Because spatial decoding is conditioned on temporal predictions, errors cascade from the temporal module to the spatial module when temporal localization is inaccurate. Computational overhead, though lower than dense autoregressive decoding, is still increased by the dedicated spatial decoder.
Future research directions include adaptive frame sampling strategies conditioned on motion or event density, joint temporal-spatial decoding mechanisms to reduce cascade errors, and parameter-efficient designs or knowledge distillation for the spatial decoder to facilitate deployment in resource-constrained settings.
Conclusion
The Bridge-STG framework presents a principled solution to entangled spatio-temporal alignment and visual token redundancy in STVG. Through explicit semantic bridging, multi-layer query aggregation, and discriminative frame sampling, it achieves state-of-the-art performance on both spatial and temporal grounding tasks while delivering robust transferability and practical efficiency. Its architectural innovations provide a foundation for future research in modular video understanding, joint reasoning paradigms, and scalable multimodal grounding (2604.08014).