
ChronusOmni: DualGround for Temporal Grounding

Updated 14 December 2025
  • ChronusOmni is a framework built on DualGround’s dual-branch design that enables precise audiovisual temporal grounding by aligning video segments with natural language queries.
  • It employs a sentence-level branch for capturing global semantics via the [EOS] token and a phrase-level branch for detailed local semantic alignment using recurrent phrase generation and slot-attention.
  • The approach demonstrates strong performance on benchmarks like QVHighlights and Charades-STA, with high recall rates and effective integration of multi-scale temporal features.

ChronusOmni is not referenced in the provided source (Kang et al., 23 Oct 2025). The central subject described in this work is DualGround, a dual-branch architecture for structured phrase and sentence-level temporal grounding within the domain of Video Temporal Grounding (VTG). All information below concerns DualGround as defined and examined in the source.

1. Formalization of Video Temporal Grounding

Video temporal grounding involves localizing segments in long, untrimmed videos that correspond to natural language queries. It is instantiated by two primary subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Formally, a video is represented by $T$ clip-level features $V = \{v_1, \ldots, v_T\}$, $v_i \in \mathbb{R}^d$, and a tokenized query yields $L-1$ word embeddings $\{e_1, \ldots, e_{L-1}\}$ plus a special [EOS] token $e_{[EOS]} \in \mathbb{R}^d$. The objective for MR is to predict start and end indices $(\hat{s}, \hat{e})$ maximizing overlap with the annotated segment, while HD assigns each clip $i$ a saliency score $s_i \in \mathbb{R}$ denoting its relevance. The function modeled is

$$f : (V, \{e_l\}, e_{[EOS]}) \rightarrow ((\hat{s}, \hat{e}), \{s_i\})$$

MR is evaluated by Recall@1@IoU={0.5,0.7}; HD by mAP and HIT@1.
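
To make the MR evaluation concrete, here is a minimal sketch (illustrative code, not from the source; the toy values are made up) of temporal IoU between predicted and annotated segments and the resulting Recall@1 at a chosen IoU threshold:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds or clip indices."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_thresh=0.5):
    """Fraction of queries whose top-1 predicted moment reaches
    IoU >= iou_thresh against the annotation (Recall@1@IoU)."""
    hits = [temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts)]
    return sum(hits) / len(hits)

# Toy example evaluated at the two standard thresholds
preds = [(10.0, 26.0), (4.0, 9.0)]
gts = [(12.0, 30.0), (5.0, 12.0)]
print(recall_at_1(preds, gts, 0.5), recall_at_1(preds, gts, 0.7))  # 1.0 0.5
```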

2. DualGround Architecture: Sentence-Level and Phrase-Level Branches

2.1 Sentence-Level Branch (Global Semantics)

The sentence-level branch isolates the [EOS] embedding and augments it with $L_d$ trainable dummy tokens $D = \{d_1, \ldots, d_{L_d}\}$, constructing $E = [D; e_{[EOS]}] \in \mathbb{R}^{(L_d + 1) \times d}$. This sequence is refined by a lightweight Transformer encoder to produce $E' = [D'; e_{[EOS]}]$. Video features $V$ are projected to queries $Q = \{q_i\}$, while $E'$ gives keys $K = \{k_j\}$ and values $U = \{u_j\}$. Cross-modal attention exclusively attends to the [EOS] slot ($j = L_d + 1$):

$$\alpha_i = \text{softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right)\Big|_{j=L_d+1}, \qquad v_i^s = \alpha_i \cdot u_{L_d+1}$$

Temporal self-attention layers are stacked over $\{v_i^s\}$ to yield $V^s \in \mathbb{R}^{T \times d}$.
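
The following PyTorch sketch illustrates the sentence-level attention above under simplifying assumptions (a single attention head in the cross-modal step; module and parameter names such as SentenceLevelACA and num_dummy are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class SentenceLevelACA(nn.Module):
    """Cross-attend video clips to the text sequence [D; e_EOS], but read out
    only the [EOS] slot, as in the sentence-level branch (single-head sketch)."""
    def __init__(self, d=256, num_dummy=4):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(num_dummy, d))    # trainable dummy tokens D
        self.text_enc = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)

    def forward(self, video, eos):           # video: (B, T, d), eos: (B, d)
        B, T, d = video.shape
        E = torch.cat([self.dummy.expand(B, -1, -1), eos.unsqueeze(1)], dim=1)  # (B, Ld+1, d)
        E = self.text_enc(E)                                                    # refined [D'; e_EOS]
        q = self.q_proj(video)                                                  # clip-level queries
        k, u = self.k_proj(E), self.v_proj(E)
        logits = torch.einsum("btd,bld->btl", q, k) / d ** 0.5                  # (B, T, Ld+1)
        attn = logits.softmax(dim=-1)
        alpha_eos = attn[..., -1:]              # attention weight on the [EOS] slot only
        v_s = alpha_eos * u[:, -1:, :]          # (B, T, d) sentence-conditioned clip features
        return v_s
```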

2.2 Phrase-Level Branch (Local Semantics)

Word tokens are clustered into $N$ semantically coherent phrases for localized alignment. Generation involves Recurrent Phrase Generation (RPG), slot-attention refinement, and phrase-clip context embedding:

  • RPG recursively generates phrases $p^{(n)}$ for $n = 1, \ldots, N$:

$$g^{(1)} = \varphi(W_q^{(1)} e_{[EOS]}, 0), \quad g^{(n)} = \varphi(W_q^{(n)} e_{[EOS]}, p^{(n-1)})$$

$$p^{(n)} = \sum_{l=1}^{L-1} \text{softmax}\left(\frac{g^{(n)} \cdot e_l}{\sqrt{d}}\right) e_l$$

  • The initial phrase set $P_i$ is refined through slot-attention and augmented with a learnable $P_{[EOS]}$ slot, allowing global context to propagate across phrases.
  • Phrase-clip context is captured by projecting both modalities and interacting them via a Hadamard product:

$$C_{n,t} = f_{ctx}\big(f_p(p^{(n)}) \odot f_v(v_t)\big), \quad (f_p, f_v, f_{ctx}\text{: MLPs with GELU})$$

  • Aggregation is guided by $P_{[EOS]}$ attending to the refined phrases for temporal fusion:

$$v_{p,t} = \sum_{n=1}^{N} \text{softmax}\left(\frac{\langle W_q P_{[EOS]}, W_k p^{(n)} \rangle}{\sqrt{d}}\right) C_{n,t}$$

The final output $V^p \in \mathbb{R}^{T \times d}$ is produced with temporal self-attention.
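
A condensed PyTorch sketch of the phrase-level branch is given below. It covers recurrent phrase generation, the Hadamard phrase-clip context, and $P_{[EOS]}$-guided aggregation, while omitting the slot-attention refinement and the final temporal self-attention for brevity; the GRU-style recurrence for $\varphi$ and all module names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PhraseLevelBranch(nn.Module):
    """Sketch: recurrent phrase generation (RPG), phrase-clip context via
    Hadamard product, and P_EOS-guided aggregation over phrases."""
    def __init__(self, d=256, num_phrases=4):
        super().__init__()
        self.N = num_phrases
        self.Wq = nn.ModuleList([nn.Linear(d, d) for _ in range(num_phrases)])
        self.phi = nn.GRUCell(d, d)                 # recurrence phi(query, previous phrase)
        self.f_p = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.f_v = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.f_ctx = nn.Sequential(nn.Linear(d, d), nn.GELU())
        self.p_eos = nn.Parameter(torch.randn(d))   # learnable P_EOS slot
        self.agg_q = nn.Linear(d, d)
        self.agg_k = nn.Linear(d, d)

    def forward(self, words, eos, video):           # words: (B, L-1, d), eos: (B, d), video: (B, T, d)
        B, _, d = words.shape
        phrases, prev = [], torch.zeros_like(eos)
        for n in range(self.N):                      # recurrent phrase generation
            g = self.phi(self.Wq[n](eos), prev)      # g^(n) conditioned on previous phrase
            attn = (g.unsqueeze(1) * words).sum(-1) / d ** 0.5
            p = (attn.softmax(dim=1).unsqueeze(-1) * words).sum(1)   # p^(n): soft word grouping
            phrases.append(p)
            prev = p
        P = torch.stack(phrases, dim=1)              # (B, N, d)

        # Phrase-clip context: Hadamard interaction of projected phrase and clip features
        C = self.f_ctx(self.f_p(P).unsqueeze(2) * self.f_v(video).unsqueeze(1))  # (B, N, T, d)

        # P_EOS-guided aggregation over the N phrases
        q = self.agg_q(self.p_eos).unsqueeze(0).expand(B, -1)          # (B, d)
        k = self.agg_k(P)                                              # (B, N, d)
        w = ((q.unsqueeze(1) * k).sum(-1) / d ** 0.5).softmax(dim=1)   # (B, N)
        v_p = (w.unsqueeze(-1).unsqueeze(-1) * C).sum(dim=1)           # (B, T, d)
        return v_p
```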

2.3 Fusion, Temporal Pyramid, and Decoding

The output streams are merged via addition: $F = V^s + V^p$. Prediction heads use a multi-scale temporal pyramid (1D convolution at several resolutions), sharing heads for moment confidence, normalized start/end regression (MR), and saliency (HD).
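
As a rough illustration of this prediction stage, the sketch below builds a strided 1D-convolution pyramid over the fused features and attaches shared confidence, regression, and saliency heads; the number of pyramid levels, kernel sizes, and layer names are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TemporalPyramidHeads(nn.Module):
    """Sketch: strided 1D convolutions build a temporal pyramid over the fused
    features F = V^s + V^p; shared heads predict moment confidence, normalized
    start/end offsets (MR), and clip saliency (HD)."""
    def __init__(self, d=256, num_levels=3):
        super().__init__()
        self.downsample = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size=3, stride=2, padding=1) for _ in range(num_levels - 1)]
        )
        self.cls_head = nn.Conv1d(d, 1, kernel_size=3, padding=1)   # moment confidence (shared)
        self.reg_head = nn.Conv1d(d, 2, kernel_size=3, padding=1)   # normalized start / end offsets
        self.sal_head = nn.Linear(d, 1)                             # highlight saliency per clip

    def forward(self, fused):                  # fused: (B, T, d)
        x = fused.transpose(1, 2)              # (B, d, T) for Conv1d
        levels = [x]
        for down in self.downsample:           # progressively coarser temporal resolutions
            levels.append(down(levels[-1]))
        cls = [self.cls_head(l).squeeze(1) for l in levels]          # per-level confidence
        reg = [self.reg_head(l).transpose(1, 2) for l in levels]     # per-level (start, end)
        saliency = self.sal_head(fused).squeeze(-1)                  # (B, T) highlight scores
        return cls, reg, saliency
```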

3. Token-Role Aware Interaction and Objective Functions

Role-aware attention mechanisms avoid the documented over-reliance on [EOS] found in previous VTG frameworks, enabling more granular word-level and phrase-level grounding. Core equations:

  • Sentence-Level ACA:

$$\alpha_i = \text{softmax}\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right)_{j=L_d+1}, \quad \text{ACA}(v_i) = \alpha_i\, u_{L_d+1}$$

  • Phrase-Clip context:

$$C = f_{ctx}\big(f_p(P) \odot f_v(V)\big)$$

  • Phrase-Guided Aggregation:

$$v_{p,t} = \sum_{n=1}^{N} \text{softmax}\left(\frac{\langle W_q P_{[EOS]}, W_k p^{(n)} \rangle}{\sqrt{d}}\right) C_{n,t}$$

The total training loss is

$$\mathcal{L}_\text{total} = \lambda_\text{mr} \mathcal{L}_\text{mr} + \lambda_\text{hd} \mathcal{L}_\text{hd} + \lambda_\text{phrase} \left( \mathcal{L}_\text{DQA} + \mathcal{L}_\text{EOS} \right)$$

with the MR loss comprising focal classification and $L_1$ regression terms, the HD loss combining ranking and contrastive terms for both saliency scores and sentence attention, the DQA loss enforcing phrase orthogonality, and the [EOS] reconstruction loss aligning phrase-level and global representations.
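
The sketch below shows one plausible way these terms could be combined; the specific forms of the DQA orthogonality and [EOS] reconstruction losses, the projection module proj, and the lambda values are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F_nn

def dqa_orthogonality_loss(P):
    """Encourage distinct phrase slots: penalize off-diagonal cosine similarity
    between the N phrase embeddings (one plausible reading of the DQA term)."""
    P = F_nn.normalize(P, dim=-1)                    # (B, N, d)
    gram = torch.einsum("bnd,bmd->bnm", P, P)        # pairwise cosine similarities
    eye = torch.eye(P.size(1), device=P.device)
    return ((gram - eye) ** 2).mean()

def eos_reconstruction_loss(P, eos, proj):
    """Align an aggregate of the phrase-level representation with the global
    [EOS] embedding (illustrative L2 formulation; proj is an assumed MLP)."""
    return F_nn.mse_loss(proj(P.mean(dim=1)), eos)

def total_loss(l_mr, l_hd, P, eos, proj, lam_mr=1.0, lam_hd=1.0, lam_phrase=0.5):
    """L_total = lam_mr*L_mr + lam_hd*L_hd + lam_phrase*(L_DQA + L_EOS);
    the lambda values here are placeholders, not the paper's settings."""
    return (lam_mr * l_mr + lam_hd * l_hd
            + lam_phrase * (dqa_orthogonality_loss(P) + eos_reconstruction_loss(P, eos, proj)))
```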

4. Feature Extraction, Implementation, and Evaluation

Experiments utilize the QVHighlights and Charades-STA datasets with CLIP + SlowFast or InternVideo2 backbones. The pipeline caches pretrained features; no fine-tuning of the extractors is performed. Architecturally, DualGround operates at a hidden size of $d = 256$, with post-norm Transformers, AdamW optimization, and 8 attention heads. For moment proposal post-processing, non-maximum suppression is applied at IoU = 0.7.
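
For reference, this post-processing step corresponds to standard greedy temporal NMS, sketched below with a generic implementation (function and variable names are illustrative):

```python
def temporal_nms(moments, scores, iou_thresh=0.7):
    """Greedy 1D non-maximum suppression over predicted (start, end) moments:
    keep the highest-scoring proposal and drop later proposals whose temporal
    IoU with any kept proposal exceeds iou_thresh."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s_i, e_i = moments[i]
        suppressed = False
        for j in keep:
            s_j, e_j = moments[j]
            inter = max(0.0, min(e_i, e_j) - max(s_i, s_j))
            union = (e_i - s_i) + (e_j - s_j) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return [moments[i] for i in keep], [scores[i] for i in keep]
```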

Benchmark evaluations include Recall@1@IoU=0.5/0.7, mAP, VG-Hit@1, and VG-mAP. DualGround demonstrates strong performance on QVHighlights (InternVideo2): [email protected] = 71.87%, [email protected] = 56.94%, mAP = 52.73%. On Charades-STA (InternVideo2), it reaches [email protected] = 70.67% and [email protected] = 50.33% (Kang et al., 23 Oct 2025).

Dataset [email protected] (%) [email protected] (%) mAP (%) VG-Hit@1 (%) VG-mAP (%)
QVHighlights 71.87 56.94 52.73 70.80 44.02
Charades-STA 70.67 50.33
FlashVTG (base) 70.69 53.96 52.00 71.00 44.09

5. Ablation Studies and Qualitative Findings

Ablation reveals an optimal phrase count ($N = 4$ for QVHighlights, $N = 3$ for Charades-STA), with degraded performance at the extremes. RPG yields +1.7% [email protected]; slot-attention and DQA further boost metrics (+0.8% and +1.2%, respectively). Prior approaches show strong attention correlation to [EOS] ($\approx 0.97$ Pearson); disabling word tokens is harmful for CLIP features but variably impactful with InternVideo2, highlighting the importance of role-aware separation.

Qualitative examples underscore DualGround's granularity: queries like "the lady in red jacket comes into the room" are localized precisely, whereas baseline models tend to predict overly broad segments. Visualization of phrase-clip activation norms reveals sharp semantic alignment corresponding to phrase boundaries.

6. Limitations and Prospects

DualGround uses a fixed phrase count $N$, which must be empirically tuned per dataset; learning $N$ dynamically is an open direction. The current approach does not utilize audio features; future work may incorporate audio via cross-modal ACA over spectrogram tokens to improve multimodal event grounding. While the computational overhead of the dual-branch approach is modest, efficiency may improve with learned phrase boundaries. As vision-language encoders strengthen global [EOS] signals, the necessity of disentangled semantic modeling increases.

This suggests that further decoupling of local/global representations and adaptive phrase grouping will be vital for continued progress as video-LLMs evolve. (Kang et al., 23 Oct 2025)

