ChronusOmni: DualGround for Temporal Grounding
- ChronusOmni is a framework built on DualGround’s dual-branch design that enables precise audiovisual temporal grounding by aligning video segments with natural language queries.
- It employs a sentence-level branch for capturing global semantics via the [EOS] token and a phrase-level branch for detailed local semantic alignment using recurrent phrase generation and slot-attention.
- The approach demonstrates strong performance on benchmarks like QVHighlights and Charades-STA, with high recall rates and effective integration of multi-scale temporal features.
ChronusOmni is not referenced in the provided source (Kang et al., 23 Oct 2025). The central subject described in this work is DualGround, a dual-branch architecture for structured phrase and sentence-level temporal grounding within the domain of Video Temporal Grounding (VTG). All information below concerns DualGround as defined and examined in the source.
1. Formalization of Video Temporal Grounding
Video temporal grounding involves localizing segments in long, untrimmed videos that correspond to natural language queries. This is instantiated by two primary subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Formally, a video is represented by clip-level features $V = \{v_t\}_{t=1}^{L}$, and a tokenized query yields word embeddings $\{w_j\}_{j=1}^{N}$ plus a special [EOS] token $w_{\text{eos}}$. The objective for MR is to predict start and end indices $(t_s, t_e)$ maximizing overlap with the annotated segment, while HD assigns each clip a saliency score $s_t$ denoting its relevance. The function modeled is
$$f\big(V, \{w_j\}, w_{\text{eos}}\big) \;\rightarrow\; \big(\{(t_s, t_e)\},\ \{s_t\}_{t=1}^{L}\big).$$
MR is evaluated by Recall@1@IoU={0.5,0.7}; HD by mAP and HIT@1.
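As a concrete illustration of the MR metric, the minimal sketch below computes temporal IoU and Recall@1 at the IoU thresholds 0.5 and 0.7. The function and variable names (`temporal_iou`, `recall_at_1`) are illustrative, not taken from the paper's codebase.

```python
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU between a predicted segment and a ground-truth segment (seconds or clip indices)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds: List[Tuple[float, float]],
                gts: List[Tuple[float, float]],
                thresholds=(0.5, 0.7)) -> dict:
    """Recall@1: fraction of queries whose top-1 prediction exceeds each IoU threshold."""
    hits = {t: 0 for t in thresholds}
    for pred, gt in zip(preds, gts):          # one top-1 prediction per query
        iou = temporal_iou(pred, gt)
        for t in thresholds:
            hits[t] += iou >= t
    return {f"R1@{t}": hits[t] / len(gts) for t in thresholds}

# Example: two queries, top-1 predicted spans vs. annotated spans
print(recall_at_1([(10.0, 24.0), (3.0, 7.0)], [(12.0, 26.0), (0.0, 8.0)]))
```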
2. DualGround Architecture: Sentence-Level and Phrase-Level Branches
2.1 Sentence-Level Branch (Global Semantics)
The sentence-level branch isolates the [EOS] embedding $w_{\text{eos}}$ and augments it with trainable dummy tokens $\{d_1, \dots, d_{N_d}\}$, constructing the sequence $S = [w_{\text{eos}}, d_1, \dots, d_{N_d}]$. This sequence is refined by a lightweight Transformer encoder to produce $\tilde{S}$. Video features are projected to queries $Q_v$, while $\tilde{S}$ gives keys $K_s$ and values $V_s$. Cross-modal attention exclusively attends to the [EOS] slot, enforced by an attention mask $M_{\text{eos}}$ that suppresses all other key slots:
$$\mathrm{ACA}(Q_v, K_s, V_s) = \mathrm{softmax}\!\left(\frac{Q_v K_s^{\top}}{\sqrt{d}} + M_{\text{eos}}\right) V_s.$$
Temporal self-attention layers are stacked over the attended clip features to yield the sentence-level representation $F_{\text{sent}}$.
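The following is a minimal PyTorch sketch of this sentence-level cross-attention, assuming the [EOS]-only restriction is realized with an additive attention mask. The class name `SentenceLevelACA`, the dimensions, and the residual update are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceLevelACA(nn.Module):
    """Clip features attend to a refined [EOS]+dummy sequence, masked so that
    only the [EOS] slot contributes (a sketch of the described mechanism)."""

    def __init__(self, dim: int = 256, num_dummy: int = 4):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(num_dummy, dim))    # trainable dummy tokens
        self.refine = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, clips: torch.Tensor, eos: torch.Tensor) -> torch.Tensor:
        # clips: (B, L, dim); eos: (B, dim)
        B, L, dim = clips.shape
        seq = torch.cat([eos.unsqueeze(1), self.dummy.expand(B, -1, -1)], dim=1)  # (B, 1+Nd, dim)
        seq = self.refine(seq)                                    # lightweight refinement

        q = self.q_proj(clips)                                    # (B, L, dim)
        k, v = self.k_proj(seq), self.v_proj(seq)                 # (B, 1+Nd, dim)

        attn = q @ k.transpose(-1, -2) / dim ** 0.5               # (B, L, 1+Nd)
        mask = torch.full_like(attn, float("-inf"))
        mask[..., 0] = 0.0                                        # keep only the [EOS] slot
        attn = F.softmax(attn + mask, dim=-1)
        return clips + attn @ v                                   # residual cross-modal update

# Usage: 2 videos, 75 clips each, 256-d features
x = SentenceLevelACA()(torch.randn(2, 75, 256), torch.randn(2, 256))
print(x.shape)  # torch.Size([2, 75, 256])
```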
2.2 Phrase-Level Branch (Local Semantics)
Word tokens are clustered into semantically coherent phrases for localized alignment. Generation involves Recurrent Phrase Generation (RPG), slot-attention refinement, and phrase-clip context embedding:
- RPG recursively generates phrase representations $p_k$ for $k = 1, \dots, N_p$, each phrase conditioned on the previously generated phrases and the word tokens.
- The initial phrase set is refined through slot-attention and augmented with a learnable token, allowing global context propagation.
- Phrase-clip context is captured by projecting both modalities and interacting the features with a Hadamard product: $C_{k,t} = (W_p \tilde{p}_k) \odot (W_v v_t)$, where $\tilde{p}_k$ is the $k$-th refined phrase and $v_t$ the $t$-th clip feature.
- Aggregation is guided by attending to the refined phrases for temporal fusion: $F^{\text{phrase}}_t = \sum_k \alpha_{k,t}\, C_{k,t}$, with attention weights $\alpha_{k,t}$ over phrases for each clip.
The final phrase-level output $F_{\text{phrase}}$ is produced with temporal self-attention over these aggregated clip features.
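A minimal sketch of the phrase-clip interaction and phrase-guided aggregation steps is shown below, assuming phrase embeddings have already been produced by RPG and slot-attention. The class `PhraseClipAggregation`, the linear scoring of each phrase-clip pair, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseClipAggregation(nn.Module):
    """Hadamard phrase-clip context followed by phrase-guided aggregation (sketch)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.p_proj = nn.Linear(dim, dim)   # projects refined phrases
        self.v_proj = nn.Linear(dim, dim)   # projects clip features
        self.score = nn.Linear(dim, 1)      # scores each phrase-clip pair

    def forward(self, clips: torch.Tensor, phrases: torch.Tensor) -> torch.Tensor:
        # clips: (B, L, dim); phrases: (B, Np, dim) refined phrase embeddings
        p = self.p_proj(phrases).unsqueeze(1)          # (B, 1, Np, dim)
        v = self.v_proj(clips).unsqueeze(2)            # (B, L, 1, dim)
        context = p * v                                # (B, L, Np, dim) Hadamard phrase-clip context

        # Phrase-guided aggregation: each clip attends over its Np phrase contexts
        alpha = F.softmax(self.score(context).squeeze(-1), dim=-1)   # (B, L, Np)
        aggregated = (alpha.unsqueeze(-1) * context).sum(dim=2)      # (B, L, dim)
        return clips + aggregated                      # residual phrase-level clip features

# Usage: 2 videos, 75 clips, 4 phrases
out = PhraseClipAggregation()(torch.randn(2, 75, 256), torch.randn(2, 4, 256))
print(out.shape)  # torch.Size([2, 75, 256])
```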
2.3 Fusion, Temporal Pyramid, and Decoding
The output streams are merged via addition: $F = F_{\text{sent}} + F_{\text{phrase}}$. Prediction heads use a multi-scale temporal pyramid (1D convolutions at several resolutions), with shared heads for moment confidence, normalized start/end regression (MR), and saliency (HD).
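A sketch of the fusion and multi-scale prediction stage follows, under the assumptions that the pyramid is built with strided 1D convolutions and that the same heads are applied at every scale. The class `PyramidHeads`, the number of scales, and the sigmoid-normalized span outputs are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PyramidHeads(nn.Module):
    """Multi-scale temporal pyramid with shared heads for moment confidence,
    start/end regression (MR), and per-clip saliency (HD) -- a sketch."""

    def __init__(self, dim: int = 256, num_scales: int = 3):
        super().__init__()
        # Each level halves the temporal resolution with a strided 1D conv.
        self.downsample = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(num_scales - 1)]
        )
        self.cls_head = nn.Conv1d(dim, 1, kernel_size=3, padding=1)   # moment confidence
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)   # normalized (start, end)
        self.sal_head = nn.Conv1d(dim, 1, kernel_size=3, padding=1)   # HD saliency (finest scale)

    def forward(self, fused: torch.Tensor):
        # fused: (B, L, dim) = sentence-level + phrase-level streams, already summed
        x = fused.transpose(1, 2)                       # (B, dim, L) for 1D convs
        levels = [x]
        for down in self.downsample:
            levels.append(down(levels[-1]))             # coarser temporal scales

        conf = [self.cls_head(l).squeeze(1) for l in levels]          # per-scale confidences
        spans = [self.reg_head(l).sigmoid() for l in levels]          # per-scale spans in [0, 1]
        saliency = self.sal_head(levels[0]).squeeze(1)                # (B, L) highlight scores
        return conf, spans, saliency

# Usage: fuse the two branch outputs by addition, then predict at every scale
f_sent, f_phrase = torch.randn(2, 75, 256), torch.randn(2, 75, 256)
conf, spans, saliency = PyramidHeads()(f_sent + f_phrase)
print([c.shape for c in conf], saliency.shape)
```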
3. Token-Role Aware Interaction and Objective Functions
Role-aware attention mechanisms avoid the documented over-reliance on [EOS] found in previous VTG frameworks, enabling more granular word-level and phrase-level grounding. Core equations:
- Sentence-Level ACA: $\mathrm{ACA}(Q_v, K_s, V_s) = \mathrm{softmax}\!\big(Q_v K_s^{\top}/\sqrt{d} + M_{\text{eos}}\big) V_s$
- Phrase-Clip context: $C_{k,t} = (W_p \tilde{p}_k) \odot (W_v v_t)$
- Phrase-Guided Aggregation: $F^{\text{phrase}}_t = \sum_k \alpha_{k,t}\, C_{k,t}$
The total training loss is a weighted sum
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MR}} + \lambda_{\text{HD}}\,\mathcal{L}_{\text{HD}} + \lambda_{\text{DQA}}\,\mathcal{L}_{\text{DQA}} + \lambda_{\text{eos}}\,\mathcal{L}_{\text{eos}},$$
with MR loss comprising focal classification and regression, HD loss combining ranking and contrastive terms for both saliency scores and sentence attention, DQA loss enforcing phrase orthogonality, and [EOS] reconstruction loss aligning phrase-level and global representations.
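The sketch below shows one plausible way the terms could be combined, together with a simple orthogonality penalty of the kind the DQA loss describes (mean squared off-diagonal cosine similarity between phrase embeddings). The function names, the orthogonality form, and the weights are hypothetical, not the paper's exact formulations.

```python
import torch
import torch.nn.functional as F

def phrase_orthogonality_loss(phrases: torch.Tensor) -> torch.Tensor:
    """One plausible orthogonality constraint: penalize off-diagonal cosine
    similarity between phrase embeddings.  phrases: (B, Np, dim)."""
    p = F.normalize(phrases, dim=-1)
    gram = p @ p.transpose(-1, -2)                                   # (B, Np, Np) cosine similarities
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=-2, dim2=-1))
    return off_diag.pow(2).mean()

def total_loss(l_mr, l_hd, l_dqa, l_eos, w_hd=1.0, w_dqa=0.5, w_eos=0.5):
    """Weighted sum of the four loss families; the weights are hypothetical."""
    return l_mr + w_hd * l_hd + w_dqa * l_dqa + w_eos * l_eos

# Usage with dummy phrase embeddings
print(phrase_orthogonality_loss(torch.randn(2, 4, 256)))
```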
4. Feature Extraction, Implementation, and Evaluation
Experiments utilize the QVHighlights and Charades-STA datasets with CLIP + SlowFast or InternVideo2 backbones. The pipeline caches pretrained features; no fine-tuning of the extractors is performed. Architecturally, DualGround operates at a fixed hidden dimension with post-norm Transformers, 8 attention heads, and AdamW optimization. For moment proposal post-processing, non-maximum suppression is applied at IoU = 0.7.
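The suppression step is standard; below is a minimal greedy 1D temporal NMS over (start, end, score) proposals, using the IoU = 0.7 threshold mentioned above. The function name `temporal_nms` and the tuple layout are illustrative.

```python
from typing import List, Tuple

def temporal_nms(proposals: List[Tuple[float, float, float]],
                 iou_thresh: float = 0.7) -> List[Tuple[float, float, float]]:
    """Greedy 1D NMS over (start, end, score) moment proposals."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    kept = []
    for prop in sorted(proposals, key=lambda p: p[2], reverse=True):  # highest score first
        if all(iou(prop, k) < iou_thresh for k in kept):
            kept.append(prop)
    return kept

# Overlapping proposals: the lower-scored near-duplicate is suppressed
print(temporal_nms([(10.0, 30.0, 0.9), (11.0, 29.0, 0.8), (50.0, 60.0, 0.7)]))
```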
Benchmark evaluations include Recall@1@IoU=0.5/0.7, mAP, VG-Hit@1, and VG-mAP. DualGround demonstrates superior performance on QVHighlights (InternVideo2): [email protected] = 71.87%, [email protected] = 56.94%, mAP = 52.73%. On Charades-STA (InternVideo2), [email protected] = 70.67% and [email protected] = 50.33% (Kang et al., 23 Oct 2025).
| Model | Dataset | [email protected] (%) | [email protected] (%) | mAP (%) | VG-Hit@1 (%) | VG-mAP (%) |
|---|---|---|---|---|---|---|
| DualGround | QVHighlights | 71.87 | 56.94 | 52.73 | 70.80 | 44.02 |
| DualGround | Charades-STA | 70.67 | 50.33 | — | — | — |
| FlashVTG (baseline) | QVHighlights | 70.69 | 53.96 | 52.00 | 71.00 | 44.09 |
5. Ablation Studies and Qualitative Findings
Ablation reveals that the phrase count must be tuned per dataset (different optima for QVHighlights and Charades-STA), with degraded performance at either extreme. RPG yields +1.7% [email protected]; slot-attention and DQA further boost metrics (+0.8% and +1.2%, respectively). Prior approaches show attention strongly correlated with the [EOS] token; disabling word tokens is harmful with CLIP features but variably impactful with InternVideo2, highlighting the importance of role-aware separation.
Qualitative examples underscore DualGround's granularity: queries like "the lady in red jacket comes into the room" are localized precisely, whereas baseline models disproportionately predict broader segments. Visualization of phrase-clip activation norms reveals sharp semantic alignment corresponding to phrase boundaries.
6. Limitations and Prospects
DualGround uses a fixed phrase count that must be empirically tuned per dataset; learning the number of phrases dynamically is an open direction. The current approach does not utilize audio features; future work may incorporate audio via cross-modal ACA over spectrogram tokens to improve multimodal event grounding. While the computational overhead of the dual-branch design is modest, efficiency may improve with learned phrase boundaries. As vision-language encoders strengthen global [EOS] signals, the need for disentangled semantic modeling increases.
This suggests that further decoupling of local/global representations and adaptive phrase grouping will be vital for continued progress as video-LLMs evolve. (Kang et al., 23 Oct 2025)