Online Video Temporal Grounding
- Online Video Temporal Grounding (OnVTG) is the task of aligning natural-language queries with live video frames in real time, under causal and low-latency constraints.
- Approaches leverage explicit temporal representations such as interleaved timestamps, absolute time embeddings, and visual progress markers to process streaming data efficiently.
- OnVTG supports applications like video moderation, VideoQA, and ad violation detection by providing continuously updated, precise interval predictions.
Online Video Temporal Grounding (OnVTG) is the online or streaming variant of video temporal grounding: the core problem remains aligning a natural-language query with temporal segments of a video, but the system must ground the query as frames arrive, without seeing the future. In the broader VTG literature, temporal grounding, video moment retrieval, temporal localization, and segment retrieval are used interchangeably; the online setting preserves that language–time alignment objective while adding causal, low-latency, and long-horizon constraints (Li et al., 23 Jun 2025, Qu et al., 2024).
1. Problem formulation and task scope
A standard VTG formulation is given by UniTime: for an untrimmed video with timestamps and a text query, the task is to find all temporal moments that semantically match the query, where each moment is a pair of start and end times (Li et al., 23 Jun 2025). In online form, the same alignment problem is constrained by prefix visibility: at time $t$, only frames up to $t$ are available, the model should be causal, and it may need to output partial or updated predictions incrementally (Qu et al., 2024).
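The prefix-visibility constraint can be illustrated with a minimal sketch. It assumes a hypothetical per-frame query relevance score is already available for each arriving frame (the scoring model itself is out of scope), and the function name `online_ground` and the fixed threshold are illustrative choices, not taken from any cited system. After each frame, the generator emits the intervals supported by the frames seen so far, including an interval that is still open.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Interval = Tuple[float, float]


def online_ground(
    stream: Iterable[Tuple[float, float]],
    threshold: float = 0.5,
) -> Iterator[Tuple[float, List[Interval]]]:
    """Yield (current_time, intervals_so_far) after every frame.

    `stream` yields (timestamp_seconds, relevance) pairs in arrival order.
    `relevance` is a hypothetical query-frame score; only past and current
    frames are ever read, so the procedure is causal.
    """
    closed: List[Interval] = []          # intervals whose end has been observed
    open_start: Optional[float] = None   # start of an interval still in progress
    last_t: Optional[float] = None       # timestamp of the previous frame

    for t, score in stream:
        if score >= threshold:
            if open_start is None:
                open_start = t           # a new candidate interval begins
        elif open_start is not None:
            closed.append((open_start, last_t))  # interval ended at previous frame
            open_start = None
        last_t = t

        current = list(closed)
        if open_start is not None:
            current.append((open_start, t))      # partial, still-growing prediction
        yield t, current


frames = [(0.0, 0.1), (1.0, 0.8), (2.0, 0.9), (3.0, 0.2), (4.0, 0.7)]
outputs = list(online_ground(frames))
```

Each yielded prediction can be revised by later frames, which matches the incremental-update behaviour described above; a real system would replace the threshold rule with a learned grounding model.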
Recent MLLM-based work has broadened this interval-centric view into event-centric formulations. TRACE models a video as a chronological sequence of events $e_k = (t_k, s_k, c_k)$, where $t_k$ denotes timestamps, $s_k$ denotes salient scores, and $c_k$ denotes textual captions; TimeExpert adopts the same structured event representation $(t_k, s_k, c_k)$ for moment retrieval, dense video captioning, and highlight detection (Guo et al., 2024, Yang et al., 3 Aug 2025). UniVTG provides an older but still influential unified label space at clip level, assigning each clip a foreground indicator $f_i$, boundary offsets $d_i$, and a saliency score $s_i$ (Lin et al., 2023). This suggests that OnVTG can be formulated either as causal interval prediction over observed prefixes or as autoregressive event prediction over a growing stream.
The practical scope of OnVTG therefore extends beyond single-moment retrieval. In the surveyed literature, dense captioning, highlight detection, and temporally grounded VideoQA all require timestamp prediction under language conditioning, even when the outputs also include captions, scores, or answers. A plausible implication is that OnVTG is best viewed as a family of streaming sequence- or event-grounding problems rather than a single benchmark format.
2. Temporal representations and causal time cues
A central design question in OnVTG is how temporal information is represented. UniTime makes timestamps explicit by interleaving textual timestamp tokens with visual tokens. For short or refined inputs it builds a frame-level sequence in which each frame's visual tokens are paired with a plain-text timestamp of the form “timestamp: $t$ seconds”; for long videos it switches to segment-level timestamping and a coarse-to-fine hierarchy (Li et al., 23 Jun 2025). Because timestamps are plain text and the model operates on arbitrary subsequences of frames plus their timestamps, the paper explicitly notes conceptual compatibility with online or streaming VTG.
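A minimal sketch of timestamp interleaving follows. It assumes that the timestamp string precedes each frame's tokens and that seconds are printed with one decimal; both are illustrative choices rather than details confirmed by the source. The helper `interleave_timestamps` and the `<f…>` placeholders standing in for visual tokens are hypothetical.

```python
from typing import List, Sequence


def interleave_timestamps(
    frame_tokens: Sequence[str],
    fps: float = 2.0,
    start_index: int = 0,
) -> List[str]:
    """Pair each frame placeholder with a plain-text absolute timestamp.

    `start_index` is the index of the first frame within the whole stream,
    so a window taken from the middle of a stream keeps absolute times.
    Ordering (timestamp before frame) and formatting are assumptions.
    """
    sequence: List[str] = []
    for i, token in enumerate(frame_tokens, start=start_index):
        seconds = i / fps
        sequence.append(f"timestamp: {seconds:.1f} seconds")
        sequence.append(token)
    return sequence


first_window = interleave_timestamps(["<f0>", "<f1>", "<f2>"], fps=2.0)
later_window = interleave_timestamps(["<f4>", "<f5>"], fps=2.0, start_index=4)
```

Because the timestamps are absolute, the same routine applies unchanged to any sub-window of a stream, which is the property that makes this representation attractive for sliding-window processing.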
VTG-LLM encodes time in two complementary ways. On the visual side it adds an absolute time embedding to each visual token, indexed by the integer absolute timestamp of the frame in seconds. On the textual side it introduces dedicated absolute-time tokens, such as <TIME_ZERO> through <TIME_NINE> and <TIME_DOT>, with fixed-length formatting for timestamps (Guo et al., 2024). This design separates timestamps from generic numbers, avoids concept shift, and supports timestamps independently of frame IDs. The same paper combines these cues with slot-based token compression, mapping a variable-length token sequence into a fixed-length set of slots, which is explicitly described as conceptually compatible with incremental updates in an online regime.
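The textual side can be sketched as a small formatter that turns a timestamp in seconds into a fixed-length sequence of dedicated time tokens. The token names `<TIME_ZERO>` … `<TIME_NINE>` and `<TIME_DOT>` come from the text above; the width of four integer digits plus one decimal, and the helper name `to_time_tokens`, are assumptions made for illustration.

```python
from typing import List

DIGIT_NAMES = [
    "ZERO", "ONE", "TWO", "THREE", "FOUR",
    "FIVE", "SIX", "SEVEN", "EIGHT", "NINE",
]


def to_time_tokens(seconds: float, int_digits: int = 4, frac_digits: int = 1) -> List[str]:
    """Format a timestamp as a fixed-length list of absolute-time tokens.

    The 4+1 digit layout is an assumed default, not a confirmed setting.
    """
    if seconds < 0:
        raise ValueError("timestamps must be non-negative")
    width = int_digits + 1 + frac_digits          # digits + dot + decimals
    text = f"{seconds:0{width}.{frac_digits}f}"   # zero-padded, e.g. "0120.5"
    if len(text) != width:
        raise ValueError("timestamp does not fit the fixed-length format")
    return [
        "<TIME_DOT>" if ch == "." else f"<TIME_{DIGIT_NAMES[int(ch)]}>"
        for ch in text
    ]


tokens = to_time_tokens(120.5)
```

Every timestamp maps to the same number of tokens, so time values never collide with ordinary numerals in the vocabulary and the decoder always emits a timestamp of predictable length.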
Other work makes time observable through non-textual cues. VTimeCoT overlays a frame-synchronized progress bar with timestamps in seconds on every frame and adds a highlight tool that draws colored masks over retrieved intervals, then lets an MLLM reason through a visuotemporal chain of thought (Zhang et al., 16 Oct 2025). “MLLMs Know When Before Speaking” identifies sparse Temporal Grounding Heads (TG-Heads) whose prefill attention concentrates on the ground-truth interval, converts this attention into a debiased frame-level relevance signal, and then re-invokes the model on restricted visual context (Du et al., 21 May 2026). These mechanisms suggest that OnVTG can draw temporal evidence from textual timestamps, dedicated time embeddings, visual timelines, or attention-derived saliency.
| Framework | Temporal mechanism | Online relevance |
|---|---|---|
| UniTime | Timestamp-interleaved sequence; adaptive frame scaling | Compatible with sliding-window, clip-based processing |
| VTG-LLM | Absolute time embedding; absolute-time tokens; slot compression | Continuous-time indexing and fixed-size summaries |
| TRACE | Causal event modeling; time/score/text token streams | Event-autoregressive decoding over prefixes |
| VTimeCoT | Progress bar; highlight tool; visuotemporal CoT | Partial progress bars and chunked highlighting |
| TG-Head analysis | Prefill attention read-out; debiased frame relevance | Attention-driven temporal saliency for streaming buffers |
3. Architectural patterns and inference strategies
Most recent MLLM-based VTG systems remain offline in implementation, but they expose several inference patterns that are directly relevant to OnVTG. UniTime uses a fixed 16,384-token context window, samples frames at 2 fps, caps the number of frames per clip, and performs long-video localization by partitioning videos into clips and applying a hierarchical coarse-to-fine strategy (Li et al., 23 Jun 2025). VTG-LLM samples 96 frames, enriches them with timestamp knowledge, and compresses visual tokens into 256 slots so that more frames can fit into the LLM context (Guo et al., 2024). TimeExpert compresses each frame to 8 visual tokens and uses sparse MoE routing so that only a subset of experts is activated per token (Yang et al., 3 Aug 2025). This suggests that clip-wise processing, token compression, and hierarchical retrieval are natural building blocks for sliding-window or buffered online grounding.
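TRACE makes the causal structure explicit at the output level. It models next-event prediction as

$$P(e_k \mid e_{1:k-1}, F, I) = P(t_k, s_k, c_k \mid e_{1:k-1}, F, I),$$

where $e_k = (t_k, s_k, c_k)$ collects the timestamps, salient score, and caption of the $k$-th event, $F$ denotes the visual input, and $I$ the instruction, and implements this with task-interleaved time, score, and text heads driven by a shared LLM (Guo et al., 2024). The paper stresses that this is causal in the event sequence rather than in wall-clock time, but the factorization is well suited to online use because timestamps can be predicted first, scores next, and captions later if latency permits.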
ChatVTG follows a different path. It is training-free and zero-shot, first splitting a full video into equal-length clips, then using a Video Dialogue LLM to generate multi-granularity captions for each clip, then matching those captions to the query with SentenceBERT and cosine similarity, and finally refining with a sliding-window proposal stage (Qu et al., 2024). The current implementation is offline because it uses equal partitioning over the full video and longest consecutive above-threshold spans, but its “caption windows → match to query → refine” pattern is explicitly described as modular and compatible with streaming windows.
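The matching step can be sketched as follows, under stated assumptions: captions and the query are already embedded as vectors (toy 2-D vectors stand in for SentenceBERT embeddings here), and the coarse proposal is the longest run of consecutive clips whose cosine similarity clears a threshold. The function names, the threshold, and the clip length are illustrative, and the sliding-window refinement stage is omitted.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def longest_span(scores: Sequence[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Return (first_clip, last_clip) of the longest above-threshold run."""
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    # A sentinel below any threshold closes a run that reaches the last clip.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def coarse_ground(
    query_vec: Sequence[float],
    clip_vecs: Sequence[Sequence[float]],
    clip_len: float,
    threshold: float = 0.5,
) -> Optional[Tuple[float, float]]:
    """Map the best clip run to a (start_seconds, end_seconds) proposal."""
    scores: List[float] = [cosine(query_vec, v) for v in clip_vecs]
    span = longest_span(scores, threshold)
    if span is None:
        return None
    return (span[0] * clip_len, (span[1] + 1) * clip_len)


clips = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 0.0]]
proposal = coarse_ground([1.0, 0.0], clips, clip_len=5.0)
```

In a streaming variant, the same scoring could be applied to each newly captioned window, with the run-finding step restricted to the clips observed so far.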
Before MLLM-centric VTG, efficiency-oriented and multimodal backbones already exposed motifs that remain relevant to OnVTG. “Text-Visual Prompting for Efficient 2D Temporal Video Grounding” replaces dense 3D features with sparse 2D features plus prompts and reports 5x inference acceleration over 3D features (Zhang et al., 2023). DRFT combines RGB, optical flow, and depth with dynamic fusion and transformer-based inter-modal learning (Chen et al., 2021). A plausible implication is that online systems may benefit from pairing modern MLLM reasoning layers with lighter causal visual encoders and selective multimodal cues.
4. Supervision, data curation, and reinforcement learning
A dominant MLLM trend is to cast timestamps as generated tokens and optimize language-modeling objectives. UniTime uses a standard causal language modeling loss over target tokens in the textual answer and states that there are no explicit regression or contrastive losses for timestamps (Li et al., 23 Jun 2025). VTG-LLM likewise uses a pure autoregressive objective over time tokens, descriptions, and scores (Guo et al., 2024). TRACE trains with token-level cross-entropy over the interleaved sequence, and TimeExpert uses cross-entropy together with a z-loss and a task-dependent auxiliary MoE loss (Guo et al., 2024, Yang et al., 3 Aug 2025). For OnVTG, this generative formulation is attractive because timestamps, scores, and text can all be produced incrementally as the stream evolves.
At the data level, UniVTG demonstrates that heterogeneous temporal labels can be unified. It maps moment retrieval, highlight detection, and query-focused summarization into clip-level tuples $(f_i, d_i, s_i)$, then derives pseudo supervision from point labels, interval labels, and curve labels, yielding a 4.2M-sample pretraining corpus (Lin et al., 2023). This suggests that online data streams with weak labels, narrations, or highlight signals can be transformed into clip-wise temporal supervision even when dense interval annotations are unavailable.
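As a rough illustration of clip-level label unification, the sketch below converts a single interval annotation into per-clip tuples of foreground indicator, boundary offsets, and saliency. The use of clip centers as reference points, the zero offsets for background clips, and the binary saliency are simplifying assumptions for this example, not the exact recipe of the cited work.

```python
from typing import List, Tuple

ClipLabel = Tuple[int, Tuple[float, float], float]


def interval_to_clip_labels(
    interval: Tuple[float, float],
    num_clips: int,
    clip_len: float,
) -> List[ClipLabel]:
    """Turn one (start, end) interval into per-clip (f, d, s) tuples.

    f: 1 if the clip center lies inside the interval, else 0.
    d: (distance to interval start, distance to interval end) for
       foreground clips; (0.0, 0.0) for background clips (assumed).
    s: saliency, here simply 1.0 for foreground and 0.0 otherwise (assumed).
    """
    start, end = interval
    labels: List[ClipLabel] = []
    for i in range(num_clips):
        center = (i + 0.5) * clip_len
        fg = 1 if start <= center <= end else 0
        offsets = (center - start, end - center) if fg else (0.0, 0.0)
        labels.append((fg, offsets, float(fg)))
    return labels


labels = interval_to_clip_labels((2.0, 6.0), num_clips=4, clip_len=2.0)
```

Since each clip's label depends only on its own position and the annotation, labels of this kind can be produced clip by clip as a stream is segmented.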
Reinforcement learning has become a major training paradigm for MLLM-based VTG. RAVEN uses Qwen2.5-VL as a reasoning model and trains it with a three-stage curriculum over precise and coarse annotations, combining a temporal IoU reward, a boundary alignment reward, and a category consistency reward (Ji et al., 18 Oct 2025). “Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning” uses SFT on TVG-Coldstart-13K and GRPO on TVG-RL-18K with a task-specific reward (Chen et al., 24 Jul 2025). VideoTG-R1 adds a Boundary Reflection Agent to detect partially annotated samples and a Difficulty Estimation Agent to identify hard-to-ground samples, then applies curriculum RL with dynamic masking; its abstract states that, with only 10% of the training samples and 21% of the computational budget, it outperforms full-data counterparts under both GRPO and SFT (Dong et al., 27 Oct 2025).
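Temporal IoU, the quantity underlying IoU-based rewards and the mIoU metric, has a standard definition: the length of the overlap between the predicted and ground-truth intervals divided by the length of their union. The sketch below implements only that standard definition; how the cited systems scale it, or combine it with boundary, format, or category terms, is not reproduced here.

```python
from typing import Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


partial = temporal_iou((0.0, 10.0), (5.0, 15.0))   # overlap 5 s, union 15 s
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))
exact = temporal_iou((3.0, 7.0), (3.0, 7.0))
```

Used as a reward, this value lies in [0, 1] and is zero for any non-overlapping prediction, which is one reason RL recipes often pair it with additional shaping terms.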
These results suggest that OnVTG training will likely depend as much on annotation quality control, reward design, and difficulty-aware curricula as on the decoder architecture itself, especially when supervision is noisy, delayed, or incomplete.
5. Applications and empirical behavior
RAVEN is the clearest example of OnVTG in deployed form. It is designed for advertisement video violation detection with temporal grounding and is explicitly integrated into online ad services through a pipeline containing RAVEN review, advertiser appeal, manual review, and model iteration (Ji et al., 18 Oct 2025). In day-long online A/B tests with 20% traffic, the paper reports online sample averages of category precision/recall and grounding mIoU for RAVEN alongside those of a Qwen2.5-VL-7B-SFT baseline. This is a domain-specific instance, but it establishes that temporally grounded MLLM reasoning can support production moderation workflows.
Temporal grounding also acts as an upstream module for downstream reasoning. UniTime is used as a preliminary moment retriever for long-form VideoQA: given a question, it predicts a temporal window, crops the video to that segment, samples 32 frames, and feeds them to Qwen2-VL-7B for multiple-choice QA (Li et al., 23 Jun 2025). The paper reports accuracy gains from 49.60 to 55.51 on QaEgo4D, from 33.87 to 40.30 on CG-Bench, from 60.53 to 66.50 on MLVU, and from 54.82 to 56.47 on LongVideoBench. This suggests that an OnVTG system can function as a temporal filter, routing only relevant buffered intervals to heavier reasoning modules.
Training-free tool use provides another empirical pattern. VTimeCoT augments Qwen2VL-7B or GPT-4o with a progress bar, highlighting, and a visuotemporal chain of thought, and reports mIoU gains on Charades-STA and QVHighlights; for example, from 24.34 to 43.41 on Charades-STA and from 22.77 to 46.21 on QVHighlights for the Qwen2VL-7B backbone (Zhang et al., 16 Oct 2025). ChatVTG, also training-free and zero-shot, surpasses prior zero-shot methods on Charades-STA, ActivityNet-Captions, and TACoS through multi-granularity captioning plus query–caption matching (Qu et al., 2024). Together these results indicate that strong temporal grounding can emerge from explicit temporal tools, caption-based retrieval, or direct timestamp generation, and that the choice among them may depend on latency, data availability, and whether the system must operate continuously.
6. Limitations and research directions
A persistent misconception is that recent MLLM VTG systems are already online because they process long videos or use autoregressive decoders. In the cited literature, most systems remain explicitly offline. UniTime requires a fixed-length input chunk and states that it does not explicitly address online / streaming / causal operation (Li et al., 23 Jun 2025). VTG-LLM samples 96 frames and is trained and evaluated in a purely offline regime with bidirectional attention (Guo et al., 2024). TRACE is causal only in the event sequence, not in wall-clock time (Guo et al., 2024). ChatVTG uses full-video coarse segmentation and non-causal sliding-window refinement (Qu et al., 2024). VTimeCoT assumes access to the entire video to sample frames, run VideoCLIP-XL, and draw a global progress bar (Zhang et al., 16 Oct 2025).
The shared technical bottlenecks are also consistent across papers. Context windows remain limited: UniTime is bounded by a 16,384-token context, and VTG-LLM compresses 96 sampled frames into 256 slots partly to stay within the LLM context (Li et al., 23 Jun 2025, Guo et al., 2024). Compute latency is nontrivial for 7B-parameter MLLMs, especially when repeated window updates or multi-stage inference are required; fixed segment lengths, non-causal visual attention, and the absence of streaming-specific training are repeatedly identified as obstacles (Yang et al., 3 Aug 2025, Ji et al., 18 Oct 2025). Some methods also ignore audio, and several papers note difficulty with long histories, repeated events, or multiple disjoint intervals.
A major recent direction is to recover temporal localization from internal attention rather than only from the generated timestamp string. “MLLMs Know When Before Speaking” argues that MLLMs often know the target interval during prefill but lose this signal during autoregressive decoding, identifies TG-Heads through a Grounding Contribution Score, converts TG-Head attention into a debiased frame-level relevance signal, and then re-invokes the model on restricted visual context; it reports gains of up to +3.5 mIoU on MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B without parameter updates or architectural changes (Du et al., 21 May 2026). This suggests that future OnVTG systems may maintain a causal temporal saliency state derived from attention, then regenerate or mask restricted context only when needed, instead of relying exclusively on direct timestamp generation at every update.
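The idea of reading an interval off an attention-derived relevance signal can be sketched in a few lines. Everything here is an assumption made for illustration: the per-frame attention values and the bias baseline are toy inputs, subtraction is only one possible debiasing rule, and growing a window around the peak is a stand-in for whatever selection rule the cited work uses. The helper name `saliency_to_interval` is hypothetical.

```python
from typing import Optional, Sequence, Tuple


def saliency_to_interval(
    attention: Sequence[float],
    baseline: Sequence[float],
    timestamps: Sequence[float],
    rel_threshold: float = 0.5,
) -> Optional[Tuple[float, float]]:
    """Pick the contiguous window around the peak of a debiased signal.

    `attention` is a per-frame relevance read-out and `baseline` a
    query-independent bias estimate (both hypothetical inputs).
    """
    debiased = [max(a - b, 0.0) for a, b in zip(attention, baseline)]
    if not debiased:
        return None
    peak = max(debiased)
    if peak <= 0.0:
        return None
    cut = rel_threshold * peak
    lo = hi = debiased.index(peak)
    # Grow the window while neighbouring frames stay above the cut.
    while lo > 0 and debiased[lo - 1] >= cut:
        lo -= 1
    while hi < len(debiased) - 1 and debiased[hi + 1] >= cut:
        hi += 1
    return (timestamps[lo], timestamps[hi])


attn = [0.30, 0.10, 0.40, 0.50, 0.10]      # frame 0 carries a position bias
bias = [0.30, 0.05, 0.05, 0.05, 0.05]
window = saliency_to_interval(attn, bias, [0.0, 1.0, 2.0, 3.0, 4.0])
```

In the toy input, the high raw attention on frame 0 is removed by the baseline, so the selected window covers only frames 2 and 3. A streaming system could recompute such a window over its current buffer and re-invoke the model only when the window changes.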
Taken together, the literature suggests that practical OnVTG will likely combine explicit temporal cues, clip- or event-level hierarchical search, compact temporal memory, annotation-quality control, and selective re-grounding. The open problem is not whether MLLMs can localize time in principle, but how to turn strong offline temporal reasoning into causal, low-latency, and continuously updateable grounding over live video streams.