
ChronusOmni: DualGround for Temporal Grounding

Updated 14 December 2025
  • ChronusOmni is a framework built on DualGround’s dual-branch design that enables precise audiovisual temporal grounding by aligning video segments with natural language queries.
  • It employs a sentence-level branch for capturing global semantics via the [EOS] token and a phrase-level branch for detailed local semantic alignment using recurrent phrase generation and slot-attention.
  • The approach demonstrates strong performance on benchmarks like QVHighlights and Charades-STA, with high recall rates and effective integration of multi-scale temporal features.

ChronusOmni is not referenced in the provided source (Kang et al., 23 Oct 2025). The central subject described in this work is DualGround, a dual-branch architecture for structured phrase and sentence-level temporal grounding within the domain of Video Temporal Grounding (VTG). All information below concerns DualGround as defined and examined in the source.

1. Formalization of Video Temporal Grounding

Video temporal grounding involves localizing segments in long, untrimmed videos that correspond to natural language queries. It is instantiated by two primary subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Formally, a video is represented by $T$ clip-level features $V = \{v_1, \ldots, v_T\}$, $v_i \in \mathbb{R}^d$, and a tokenized query yields $L-1$ word embeddings $\{e_1, \ldots, e_{L-1}\}$ plus a special [EOS] token $e_{[EOS]} \in \mathbb{R}^d$. The objective for MR is to predict start and end indices $(\hat{s}, \hat{e})$ maximizing overlap with the annotated segment, while HD assigns each clip $i$ a saliency score $s_i \in \mathbb{R}$ denoting its relevance. The function modeled is

$$f : (V, \{e_l\}, e_{[EOS]}) \rightarrow ((\hat{s}, \hat{e}), \{s_i\})$$

MR is evaluated by Recall@1@IoU={0.5,0.7}; HD by mAP and HIT@1.
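
To make the MR evaluation concrete, here is a minimal sketch (illustrative code, not from the source; the toy values are made up) of temporal IoU between predicted and annotated segments and the resulting Recall@1 at a chosen IoU threshold:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds or clip indices."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_thresh=0.5):
    """Fraction of queries whose top-1 predicted moment reaches
    IoU >= iou_thresh against the annotation (Recall@1@IoU)."""
    hits = [temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts)]
    return sum(hits) / len(hits)

# Toy example evaluated at the two standard thresholds
preds = [(10.0, 26.0), (4.0, 9.0)]
gts = [(12.0, 30.0), (5.0, 12.0)]
print(recall_at_1(preds, gts, 0.5), recall_at_1(preds, gts, 0.7))  # 1.0 0.5
```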

2. DualGround Architecture: Sentence-Level and Phrase-Level Branches

2.1 Sentence-Level Branch (Global Semantics)

The sentence-level branch isolates the [EOS] embedding and augments it with $L_d$ trainable dummy tokens $D = \{d_1, \ldots, d_{L_d}\}$, constructing $E = [D; e_{[EOS]}] \in \mathbb{R}^{(L_d + 1) \times d}$. This sequence is refined by a lightweight Transformer encoder to produce $E' = [D'; e_{[EOS]}]$. Video features $V$ are projected to queries $Q = \{q_i\}$, while $E'$ gives keys $K = \{k_j\}$ and values $U = \{u_j\}$. Cross-modal attention exclusively attends to the [EOS] slot ($j = L_d + 1$):

$$\alpha_i = \text{softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right)\Big|_{j=L_d+1}, \qquad v_i^s = \alpha_i \cdot u_{L_d+1}$$

Temporal self-attention layers are stacked over $\{v_i^s\}$ to yield $V^s \in \mathbb{R}^{T \times d}$.
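
The following PyTorch sketch illustrates the sentence-level attention above under simplifying assumptions (a single attention head in the cross-modal step; module and parameter names such as SentenceLevelACA and num_dummy are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SentenceLevelACA(nn.Module):
    """Cross-attend video clips to the text sequence [D; e_EOS], but read out
    only the [EOS] slot, as in the sentence-level branch (single-head sketch)."""
    def __init__(self, d=256, num_dummy=4):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(num_dummy, d))    # trainable dummy tokens D
        self.text_enc = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

    def forward(self, video, eos):           # video: (B, T, d), eos: (B, d)
        B, T, d = video.shape
        E = torch.cat([self.dummy.expand(B, -1, -1), eos.unsqueeze(1)], dim=1)  # (B, Ld+1, d)
        E = self.text_enc(E)                                                    # refined [D'; e_EOS]
        q = self.q_proj(video)                                                  # clip-level queries
        k, u = self.k_proj(E), self.v_proj(E)
        logits = torch.einsum("btd,bld->btl", q, k) / d ** 0.5                  # (B, T, Ld+1)
        attn = logits.softmax(dim=-1)
        alpha_eos = attn[..., -1:]              # attention weight on the [EOS] slot only
        v_s = alpha_eos * u[:, -1:, :]          # (B, T, d) sentence-conditioned clip features
        return v_s
```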

2.2 Phrase-Level Branch (Local Semantics)

Word tokens are clustered into $N$ semantically coherent phrases for localized alignment. Generation involves Recurrent Phrase Generation (RPG), slot-attention refinement, and phrase-clip context embedding:

  • RPG recursively generates phrases $p^{(n)}$ for $n = 1, \ldots, N$:

$$g^{(1)} = \varphi(W_q^{(1)} e_{[EOS]}, 0), \quad g^{(n)} = \varphi(W_q^{(n)} e_{[EOS]}, p^{(n-1)})$$

$$p^{(n)} = \sum_{l=1}^{L-1} \text{softmax}\left(\frac{g^{(n)} \cdot e_l}{\sqrt{d}}\right) e_l$$

  • The initial phrase set $P_i$ is refined through slot-attention and augmented with a learnable $P_{[EOS]}$ slot, allowing global context to propagate across phrases.
  • Phrase-clip context is captured by projecting both modalities and interacting them via a Hadamard product:

$$C_{n,t} = f_{ctx}\big(f_p(p^{(n)}) \odot f_v(v_t)\big), \quad (f_p, f_v, f_{ctx}\text{: MLPs with GELU})$$

  • Aggregation is guided by $P_{[EOS]}$ attending to the refined phrases for temporal fusion:

$$v_{p,t} = \sum_{n=1}^{N} \text{softmax}\left(\frac{\langle W_q P_{[EOS]}, W_k p^{(n)} \rangle}{\sqrt{d}}\right) C_{n,t}$$

The final output $V^p \in \mathbb{R}^{T \times d}$ is produced with temporal self-attention.
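
A condensed PyTorch sketch of the phrase-level branch is given below. It covers recurrent phrase generation, the Hadamard phrase-clip context, and $P_{[EOS]}$-guided aggregation, while omitting the slot-attention refinement and the final temporal self-attention for brevity; the GRU-style recurrence for $\varphi$ and all module names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PhraseLevelBranch(nn.Module):
    """Sketch: recurrent phrase generation (RPG), phrase-clip context via
    Hadamard product, and P_EOS-guided aggregation over phrases."""
    def __init__(self, d=256, num_phrases=4):
        super().__init__()
        self.N = num_phrases
        self.Wq = nn.ModuleList([nn.Linear(d, d) for _ in range(num_phrases)])
        self.phi = nn.GRUCell(d, d)                 # recurrence phi(query, previous phrase)
        self.f_p = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.f_v = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.f_ctx = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.p_eos = nn.Parameter(torch.randn(d))   # learnable P_EOS slot
        self.agg_q = nn.Linear(d, d)
        self.agg_k = nn.Linear(d, d)

    def forward(self, words, eos, video):           # words: (B, L-1, d), eos: (B, d), video: (B, T, d)
        B, _, d = words.shape
        phrases, prev = [], torch.zeros_like(eos)
        for n in range(self.N):                      # recurrent phrase generation
            g = self.phi(self.Wq[n](eos), prev)      # g^(n) conditioned on previous phrase
            attn = (g.unsqueeze(1) * words).sum(-1) / d ** 0.5
            p = (attn.softmax(dim=1).unsqueeze(-1) * words).sum(1)   # p^(n): soft word grouping
            phrases.append(p)
            prev = p
        P = torch.stack(phrases, dim=1)              # (B, N, d)

        # Phrase-clip context: Hadamard interaction of projected phrase and clip features
        C = self.f_ctx(self.f_p(P).unsqueeze(2) * self.f_v(video).unsqueeze(1))  # (B, N, T, d)

        # P_EOS-guided aggregation over the N phrases
        q = self.agg_q(self.p_eos).unsqueeze(0).expand(B, -1)          # (B, d)
        k = self.agg_k(P)                                              # (B, N, d)
        w = ((q.unsqueeze(1) * k).sum(-1) / d ** 0.5).softmax(dim=1)   # (B, N)
        v_p = (w.unsqueeze(-1).unsqueeze(-1) * C).sum(dim=1)           # (B, T, d)
        return v_p
```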

2.3 Fusion, Temporal Pyramid, and Decoding

The output streams are merged via addition: $F = V^s + V^p$. Prediction heads use a multi-scale temporal pyramid (1D convolution at several resolutions), sharing heads for moment confidence, normalized start/end regression (MR), and saliency (HD).
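
As a rough illustration of this prediction stage, the sketch below builds a strided 1D-convolution pyramid over the fused features and attaches shared confidence, regression, and saliency heads; the number of pyramid levels, kernel sizes, and layer names are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TemporalPyramidHeads(nn.Module):
    """Sketch: strided 1D convolutions build a temporal pyramid over the fused
    features F = V^s + V^p; shared heads predict moment confidence, normalized
    start/end offsets (MR), and clip saliency (HD)."""
    def __init__(self, d=256, num_levels=3):
        super().__init__()
        self.downsample = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size=3, stride=2, padding=1) for _ in range(num_levels - 1)]
        )
        self.cls_head = nn.Conv1d(d, 1, kernel_size=3, padding=1)   # moment confidence (shared)
        self.reg_head = nn.Conv1d(d, 2, kernel_size=3, padding=1)   # normalized start / end offsets
        self.sal_head = nn.Linear(d, 1)                             # highlight saliency per clip

    def forward(self, fused):                  # fused: (B, T, d)
        x = fused.transpose(1, 2)              # (B, d, T) for Conv1d
        levels = [x]
        for down in self.downsample:           # progressively coarser temporal resolutions
            levels.append(down(levels[-1]))
        cls = [self.cls_head(l).squeeze(1) for l in levels]          # per-level confidence
        reg = [self.reg_head(l).transpose(1, 2) for l in levels]     # per-level (start, end)
        saliency = self.sal_head(fused).squeeze(-1)                  # (B, T) highlight scores
        return cls, reg, saliency
```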

3. Token-Role Aware Interaction and Objective Functions

Role-aware attention mechanisms avoid the documented over-reliance on [EOS] found in previous VTG frameworks, enabling more granular word-level and phrase-level grounding. Core equations:

  • Sentence-Level ACA:

$$\alpha_i = \text{softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right)_{j=L_d+1}, \quad \text{ACA}(v_i) = \alpha_i\, u_{L_d+1}$$

  • Phrase-Clip context:

$$C = f_{ctx}\big(f_p(P) \odot f_v(V)\big)$$

  • Phrase-Guided Aggregation:

$$v_{p,t} = \sum_{n=1}^{N} \text{softmax}\left(\frac{\langle W_q P_{[EOS]}, W_k p^{(n)} \rangle}{\sqrt{d}}\right) C_{n,t}$$

The total training loss is

$$\mathcal{L}_\text{total} = \lambda_\text{mr} \mathcal{L}_\text{mr} + \lambda_\text{hd} \mathcal{L}_\text{hd} + \lambda_\text{phrase} \left( \mathcal{L}_\text{DQA} + \mathcal{L}_\text{EOS} \right)$$

with the MR loss comprising focal classification and $L_1$ regression terms, the HD loss combining ranking and contrastive terms for both saliency scores and sentence attention, the DQA loss enforcing phrase orthogonality, and the [EOS] reconstruction loss aligning phrase-level and global representations.
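
The sketch below shows one plausible way these terms could be combined; the specific forms of the DQA orthogonality and [EOS] reconstruction losses, the projection module proj, and the lambda values are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F_nn

def dqa_orthogonality_loss(P):
    """Encourage distinct phrase slots: penalize off-diagonal cosine similarity
    between the N phrase embeddings (one plausible reading of the DQA term)."""
    P = F_nn.normalize(P, dim=-1)                    # (B, N, d)
    gram = torch.einsum("bnd,bmd->bnm", P, P)        # pairwise cosine similarities
    eye = torch.eye(P.size(1), device=P.device)
    return ((gram - eye) ** 2).mean()

def eos_reconstruction_loss(P, eos, proj):
    """Align an aggregate of the phrase-level representation with the global
    [EOS] embedding (illustrative L2 formulation; proj is an assumed MLP)."""
    return F_nn.mse_loss(proj(P.mean(dim=1)), eos)

def total_loss(l_mr, l_hd, P, eos, proj, lam_mr=1.0, lam_hd=1.0, lam_phrase=0.5):
    """L_total = lam_mr*L_mr + lam_hd*L_hd + lam_phrase*(L_DQA + L_EOS);
    the lambda values here are placeholders, not the paper's settings."""
    return (lam_mr * l_mr + lam_hd * l_hd
            + lam_phrase * (dqa_orthogonality_loss(P) + eos_reconstruction_loss(P, eos, proj)))
```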

4. Feature Extraction, Implementation, and Evaluation

Experiments utilize the QVHighlights and Charades-STA datasets with CLIP + SlowFast or InternVideo2 backbones. The pipeline caches pretrained features; no fine-tuning of the extractors is performed. Architecturally, DualGround operates at a hidden size of $d = 256$, with post-norm Transformers, AdamW optimization, and 8 attention heads. For moment proposal post-processing, non-maximum suppression is applied at IoU = 0.7.
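
For reference, this post-processing step corresponds to standard greedy temporal NMS, sketched below with a generic implementation (function and variable names are illustrative):

```python
def temporal_nms(moments, scores, iou_thresh=0.7):
    """Greedy 1D non-maximum suppression over predicted (start, end) moments:
    keep the highest-scoring proposal and drop later proposals whose temporal
    IoU with any kept proposal exceeds iou_thresh."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s_i, e_i = moments[i]
        suppressed = False
        for j in keep:
            s_j, e_j = moments[j]
            inter = max(0.0, min(e_i, e_j) - max(s_i, s_j))
            union = (e_i - s_i) + (e_j - s_j) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return [moments[i] for i in keep], [scores[i] for i in keep]
```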

Benchmark evaluations include Recall@1@IoU=0.5/0.7, mAP, VG-Hit@1, and VG-mAP. DualGround demonstrates strong performance on QVHighlights (InternVideo2): [email protected] = 71.87%, [email protected] = 56.94%, mAP = 52.73%. On Charades-STA (InternVideo2), it reaches [email protected] = 70.67% and [email protected] = 50.33% (Kang et al., 23 Oct 2025).

Dataset [email protected] (%) [email protected] (%) mAP (%) VG-Hit@1 (%) VG-mAP (%)
QVHighlights 71.87 56.94 52.73 70.80 44.02
Charades-STA 70.67 50.33
FlashVTG (base) 70.69 53.96 52.00 71.00 44.09

5. Ablation Studies and Qualitative Findings

Ablation reveals an optimal phrase count ($N = 4$ for QVHighlights, $N = 3$ for Charades-STA), with degraded performance at the extremes. RPG yields +1.7% [email protected]; slot-attention and DQA further boost metrics (+0.8% and +1.2%, respectively). Prior approaches show strong attention correlation to [EOS] ($\approx 0.97$ Pearson); disabling word tokens is harmful for CLIP features but variably impactful with InternVideo2, highlighting the importance of role-aware separation.

Qualitative examples underscore DualGround's granularity: queries like "the lady in red jacket comes into the room" are localized precisely, whereas baseline models tend to predict overly broad segments. Visualization of phrase-clip activation norms reveals sharp semantic alignment corresponding to phrase boundaries.

6. Limitations and Prospects

DualGround uses a fixed phrase count $N$, which must be empirically tuned per dataset; learning $N$ dynamically is an open direction. The current approach does not utilize audio features; future work may incorporate audio via cross-modal ACA over spectrogram tokens to improve multimodal event grounding. While the computational overhead of the dual-branch approach is modest, efficiency may improve with learned phrase boundaries. As vision-language encoders strengthen global [EOS] signals, the necessity of disentangled semantic modeling increases.

This suggests that further decoupling of local/global representations and adaptive phrase grouping will be vital for continued progress as video-LLMs evolve. (Kang et al., 23 Oct 2025)

