GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

Published 2 Apr 2026 in cs.CV | (2604.02093v1)

Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video LLMs (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Futhermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces a novel query-guided visual token sampling module to enhance video temporal grounding in multimodal LLMs.
It employs a differentiable top-K masking with STE and Gumbel-Softmax relaxation for adaptive, token-level selection based on query relevance.
Experimental results demonstrate significant gains on benchmarks, with improved token efficiency and precise temporal localization.

GroundVTS: Query-Guided Visual Token Sampling for Video Temporal Grounding

Motivation and Problem Statement

The paper "GroundVTS: Visual Token Sampling in Multimodal LLMs for Video Temporal Grounding" (2604.02093) addresses the challenge of Video Temporal Grounding (VTG) in Vid-LLMs. VTG requires precise localization of video segments that correspond to natural-language queries, demanding fine-grained temporal understanding. Existing Vid-LLMs typically utilize uniform frame sampling, distributing input across time without consideration for query-specific relevance, which results in dilution or even omission of salient temporal cues.

Prior works have attempted coarse query-guided frame selection, often relying on external encoders for similarity estimation. Such methods are limited in granularity and adaptability, hindering accurate temporal localization.

Figure 1: Schematic comparison of frame sampling strategies: uniform sampling, external encoder-based query-guided sampling, and GroundVTS’s internal token-level query-guided sampling.

Methodology: Grounded Visual Token Sampling

GroundVTS introduces a differentiable, query-guided Visual Token Sampling (VTS) module integrated into the Vid-LLM pipeline, post visual encoding and multimodal projection. The VTS module evaluates visual token-query relevance at the token level by projecting visual features and query embeddings into lower-dimensional subspaces and computing temperature-scaled dot-product similarity. This produces a relevance distribution over visual tokens.

To select the most informative tokens, GroundVTS employs a differentiable top- $K$ masking operation using Straight-Through Estimator (STE) with Gumbel-Softmax relaxation. This enables end-to-end learning and adaptive, query-specific selection of visual content. Spatial and temporal position encodings are retained only for selected tokens, preserving coherence for token-based temporal reasoning.

Figure 2: GroundVTS architecture illustrating multimodal input processing, VTS module, and token selection pipeline.

Training and Optimization Strategy

A three-stage progressive optimization strategy facilitates stable and effective integration of VTS:

VTS Warm-Up: The VTS module is trained in isolation with all other components frozen, establishing stable query-conditioned token selection.
Joint LoRA Adaptation: Fine-tuning of both VTS and multimodal projector, and adaptation of the LLM via Low-Rank Adaptation (LoRA), leveraging large-scale multimodal video instruction datasets.
Grounding Fine-Tuning: Final stage fine-tuning on a unified VTG dataset (Grounding-FT) constructed by reformulating expert VTG datasets into instruction-style QA pairs, promoting generalizable temporal grounding and reasoning.

Frame Rate Sensitivity Analysis

A preliminary study reveals that VTG performance in Vid-LLMs depends non-linearly on visual token density. When sampling is sparse, temporal cues are lost; when too dense, redundant tokens dilute signal quality and degrade accuracy. Peak performance is observed at an optimal sampling density, motivating adaptive mechanisms for token selection.

Figure 3: Analysis of Qwen2.5VL-7B frame rate sensitivity demonstrates optimal performance at moderate visual token density.

Experimental Results

GroundVTS exhibits substantial improvements in VTG benchmarks. For moment retrieval on Charades-STA, GroundVTS-Q achieves +24.8 [email protected] and +18.4 mIoU over the Qwen2.5VL-7B-G baseline, reaching 57.5 [email protected] and 50.1 mIoU. Similar gains are reported on ActivityNet-Captions and QVHighlights, including highlight detection where GroundVTS-I improves mAP and Hit@1 by significant margins.

Ablation studies show that token-level sampling and the retention of position encodings are essential for performance. Uniform or random sampling severely degrades accuracy, and removal of positional encoding collapses VTG results. Furthermore, GroundVTS maintains high grounding accuracy and token efficiency across a wide range of input densities, outperforming baselines even under sparse sampling conditions.

Figure 4: Impact of visual token density on VTG performance (top) and token efficiency (bottom), demonstrating GroundVTS's stability and robustness.

Figure 5: Qualitative comparison illustrating temporal grounding predictions: GroundVTS-Q aligns token density peaks with ground-truth intervals, outmatching baseline models.

Theoretical and Practical Implications

GroundVTS advances temporal grounding in Vid-LLMs by moving token selection into the LLM pipeline, enabling efficient, query-aware, non-uniform allocation of visual evidence. This mechanism achieves both high precision and robustness, mitigating sensitivity to input density and supporting practical VTG tasks across diverse temporal structures. Its progressive optimization ensures compatibility and stable adaptation in instruction-tuned Vid-LLMs.

Theoretical implications include the importance of token-level relevance modeling and the necessity of temporal positional encoding. GroundVTS’s differentiable sampling strategy opens avenues for efficient long-video reasoning and further exploration into multi-stage or iterative token selection architectures.

Future Directions

Potential future advances include multi-stage token sampling exploiting deeper cross-modal semantics, scaling GroundVTS for hour-long videos, and expanded assessment on broader multimodal tasks. Integration with other token compression or pruning paradigms may yield additional efficiency gains. Research into interpretability and transparency of token selection could provide further insight into grounded video-language reasoning.

Conclusion

GroundVTS provides a principled, query-guided visual token sampling framework, integrated via a progressive optimization strategy for VTG in Vid-LLMs. Its VTS module enables fine-grained, robust temporal grounding, outperforming current state-of-the-art methods and demonstrating resilience across input densities and out-of-distribution settings. This establishes GroundVTS as an effective solution for practical and theoretical challenges in video-language modeling.

Markdown Report Issue