ChronusAV: Audiovisual Temporal Grounding Dataset
- ChronusAV is a large-scale audiovisual dataset with temporally precise, cross-modally aligned annotations for fine-grained temporal grounding.
- It supports six distinct temporal grounding subtasks enabling both explicit and implicit synchronization queries across audio and visual channels.
- Benchmark results highlight significant accuracy and captioning improvements using explicit temporal interleaving and dedicated multi-modal fine-tuning techniques.
ChronusAV is a large-scale audiovisual dataset constructed to facilitate fine-grained temporal grounding within and across visual and audio modalities. It is intended to overcome limitations in prior benchmarks, which either neglect the audio channel entirely or provide only coarse, joint audio–video annotations. ChronusAV advances unified multi-modal learning by offering temporally accurate, modality-complete, and cross-modally aligned annotations across a diverse spectrum of long-form videos. By supporting six distinct temporal grounding subtasks—explicit and implicit—it enables rigorous evaluation and training of omni-modal LLMs with strong temporal and cross-modal reasoning capabilities (Chen et al., 10 Dec 2025).
1. Motivation and Core Properties
ChronusAV directly addresses gaps in existing temporal grounding datasets by explicitly modeling both audio and visual information with temporal precision. The dataset is designed to support six temporal grounding subtasks: four explicit (Video→Time, Time→Video, Audio→Time, Time→Audio) and two implicit (Video→Audio, Audio→Video). These subtasks reflect practical needs in audiovisual synchronization, for example, answering “What do you see when this line is spoken?” or “What is said at the precise moment of a visual event?”
Key properties:
- Temporally accurate: 677,000 annotated segments distributed over 47,000 long-form videos (2,922 hours total), each segment precisely timestamped (average duration 15.5 seconds; mean video length 226 seconds).
- Modality complete: Each segment receives a visual caption (describes sight only) and an audio caption (transcribes speech, describes non-speech sounds and music).
- Cross-modal alignment: Every audiovisual segment is temporally aligned, allowing fully cross-modal queries and answers for both explicit and implicit grounding directions.
A plausible implication is that ChronusAV underpins the development and evaluation of omni-modal models capable of temporally exact event retrieval, fine-grained audio–visual indexing, and robust handling of cross-modal synchronization tasks.
2. Data Acquisition, Segmentation, and Annotation Protocol
ChronusAV sources all videos from Panda-70M, an open-domain English-language corpus. Stringent filtering selects untrimmed, multi-shot videos 60–600 seconds in length, covering 15 diverse real-world categories including “Cooking,” “Interviews,” “Sports,” and “Music Performance.” The emphasis on diversity ensures broad coverage for temporal and multi-modal reasoning.
The data pipeline includes:
- Scene Segmentation: Initial split into visual scenes, followed by merging of semantically similar scenes to produce segments indicative of coherent events (5–30 segments per video).
- Audio–Video Alignment: Scene boundaries imposed identically on both audio and video tracks to guarantee segment-level temporal synchrony.
- Modality-Specific Captions: Visual captions generated via Gemini-2.5-Flash with prompts discouraging auditory references; audio captions generated via Gemini-2.5-Pro with prompts enforcing verbatim speech transcription and identification of distinctive audio events, discouraging interpretation.
- Human Verification: Three annotators rate a random sample of 1,000 segments for semantic accuracy (3-point Likert scale) and cross-modal leakage (e.g., video captions referencing audio events). Fleiss’ κ indicates high inter-annotator agreement. Visual captions are rated Accurate/Acceptable in 96.1% of cases and audio captions in 93.5%; cross-modal leakage is minimal (99.3% of video captions and 97.5% of audio captions show no or only minor leakage).
This annotation pipeline results in reliable cross-modal data, setting an objective standard for evaluating multi-modal temporal models.
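The pipeline can be pictured with the following minimal sketch, assuming injected scene-detection and captioning callables; the function and field names are illustrative placeholders, not the authors' released code.

```python
from collections.abc import Callable
from dataclasses import dataclass

@dataclass
class Segment:
    start: float          # absolute start time in seconds
    end: float            # absolute end time in seconds
    video_caption: str    # sight-only description (no auditory references)
    audio_caption: str    # verbatim speech plus non-speech sounds and music

def build_segments(
    video_path: str,
    detect_scenes: Callable[[str], list[tuple[float, float]]],
    merge_similar: Callable[[list[tuple[float, float]]], list[tuple[float, float]]],
    caption_visual: Callable[[str, float, float], str],  # e.g. a vision-only captioner (Gemini-2.5-Flash-style prompt)
    caption_audio: Callable[[str, float, float], str],   # e.g. a transcription-focused captioner (Gemini-2.5-Pro-style prompt)
) -> list[Segment]:
    """Split one video into coherent event segments and caption each modality separately.

    The same boundaries are applied to the audio and video tracks, so every segment
    is temporally synchronized across modalities by construction.
    """
    raw_scenes = detect_scenes(video_path)   # initial visual scene cuts
    scenes = merge_similar(raw_scenes)       # merge semantically similar neighbours (5-30 segments per video)
    return [
        Segment(start=s, end=e,
                video_caption=caption_visual(video_path, s, e),
                audio_caption=caption_audio(video_path, s, e))
        for s, e in scenes
    ]
```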
3. Segment Structure, Metadata, and Temporal Tokenization
Each segment in ChronusAV retains an absolute interval in seconds, formatted as plain text:
`second{t_i} – second{t_{i+1}}`
Timestamp tokens are provided as natural text, enabling models to ingest time alongside other modalities directly. Each segment has detailed metadata structured in JSON:
```
{
  "video_id": string,
  "duration_seconds": float,
  "segments": [
    {
      "segment_id": int,
      "start_time": float,
      "end_time": float,
      "video_caption": string,
      "audio_caption": string
    }, …
  ]
}
```
The test QA file follows:
```
{ "video_id": string, "segment_id": int, "subtask": string, "question": string, "answer": string }
```
Visual captions have a mean length of 29.6 words (unimodal distribution, peak at 20–25 words); audio captions average 50.6 words (long-tailed, reflecting detailed speech transcription).
4. Temporal Grounding Tasks and Evaluation Methodology
ChronusAV enables the formulation of six temporal grounding QA pairs for each segment (a template-based construction is sketched after this list):
- Video→Time: Given a visual caption, when does it occur?
- Time→Video: Given a time interval, describe the visual content.
- Audio→Time: Given an audio caption, when does it occur?
- Time→Audio: Given a time interval, transcribe/describe the audio.
- Video→Audio: Given a visual caption, what is said or heard at the same time?
- Audio→Video: Given an audio caption, what do you see at the same time?
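As referenced above, a template-based construction of the six QA pairs from one segment record might look as follows; the subtask identifiers and question wording are illustrative assumptions, not the released prompts.

```python
def qa_pairs_for_segment(video_id: str, seg: dict) -> list[dict]:
    """Instantiate the six temporal grounding subtasks for one annotated segment."""
    when = f'second{{{seg["start_time"]:.1f}}} – second{{{seg["end_time"]:.1f}}}'
    v, a = seg["video_caption"], seg["audio_caption"]
    questions = [
        ("video_to_time",  f"When do you see the following? {v}",               when),
        ("time_to_video",  f"Describe the visual content during {when}.",        v),
        ("audio_to_time",  f"When do you hear the following? {a}",               when),
        ("time_to_audio",  f"Transcribe or describe the audio during {when}.",   a),
        ("video_to_audio", f"What is said or heard while you see: {v}?",         a),
        ("audio_to_video", f"What do you see while you hear: {a}?",              v),
    ]
    return [
        {"video_id": video_id, "segment_id": seg["segment_id"],
         "subtask": sub, "question": q, "answer": ans}
        for sub, q, ans in questions
    ]
```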
The dataset is split into 45,000 videos for training (no QA pairs used) and 2,000 for testing, yielding 12,000 QA pairs—balanced across six subtasks.
Evaluation metrics:
- Explicit grounding (Video→Time, Audio→Time):
  - Recall@$\tau$ at temporal IoU threshold $\tau$ (e.g., $\tau = 0.5, 0.7$). A predicted interval $[\hat{t}_s, \hat{t}_e]$ is correct if
    $$\mathrm{tIoU}\big([\hat{t}_s, \hat{t}_e],\,[t_s, t_e]\big) = \frac{\big|[\hat{t}_s, \hat{t}_e] \cap [t_s, t_e]\big|}{\big|[\hat{t}_s, \hat{t}_e] \cup [t_s, t_e]\big|} \geq \tau,$$
    where $[t_s, t_e]$ is the ground-truth segment (a reference implementation is sketched after this list).
- Captioning subtasks (Time→Video, Time→Audio, Video→Audio, Audio→Video):
- BLEU-4, ROUGE-L, METEOR, CIDEr, following video captioning conventions.
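As referenced above, the explicit grounding metric can be computed in a few lines; representing intervals as (start, end) tuples in seconds is the only assumption of this sketch.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tau(preds, gts, tau: float = 0.5) -> float:
    """Fraction of queries whose predicted interval matches the ground truth with tIoU >= tau."""
    hits = sum(temporal_iou(p, g) >= tau for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: a prediction shifted by 2 s on a 15.5 s segment still counts at tau = 0.5 (tIoU ≈ 0.77).
print(recall_at_tau([(12.0, 27.5)], [(10.0, 25.5)], tau=0.5))  # 1.0
```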
5. Baseline Model Performance and Ablation Insights
State-of-the-art comparative results on ChronusAV indicate superior performance for models leveraging explicit cross-modal temporal modeling. The table below reports Recall at temporal IoU thresholds 0.5 and 0.7 (R@0.5 / R@0.7) for grounding, and average CIDEr across the captioning subtasks:
| Model | V2T R@0.5 / R@0.7 | A2T R@0.5 / R@0.7 | Avg. Caption CIDEr |
|---|---|---|---|
| Qwen3-Omni (30B) | 37.9 / 21.8 | 46.7 / 33.1 | ~0.9 |
| ARC-Hunyuan-Video (7B) | 36.1 / 23.2 | 36.8 / 24.3 | ~0.4 |
| ChronusOmni (7B) | 63.2 / 45.9 | 90.5 / 79.9 | 34.3 |
ChronusOmni demonstrates relative gains of +67%/+98% in Video→Time and +94%/+142% in Audio→Time Recall (at IoU 0.5/0.7) over the next-best model. For the captioning subtasks, ChronusOmni yields 3–10× improvements in CIDEr relative to baselines. Ablation studies reveal:
- Removing temporal-interleaved tokenization: Collapses grounding accuracy (e.g., V2T R@0.7 drops from 45.95 to 6.80).
- Omitting supervised fine-tuning (SFT): Reduces caption quality (CIDEr declines by ~80%).
- Omitting reinforcement learning (GRPO): Dramatically impairs implicit grounding (Video→Audio, Audio→Video).
This suggests that explicit temporal alignment and cross-modal interleaving are critical for state-of-the-art audiovisual grounding.
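The relative gains quoted above follow directly from the table; as a quick arithmetic check (values copied from the table, strongest baseline taken per column):

```python
# ChronusOmni vs. the strongest baseline in each grounding column of the table above.
columns = {
    "V2T R@0.5": (63.2, 37.9),  # vs. Qwen3-Omni
    "V2T R@0.7": (45.9, 23.2),  # vs. ARC-Hunyuan-Video
    "A2T R@0.5": (90.5, 46.7),  # vs. Qwen3-Omni
    "A2T R@0.7": (79.9, 33.1),  # vs. Qwen3-Omni
}
for name, (ours, baseline) in columns.items():
    print(f"{name}: +{100 * (ours - baseline) / baseline:.0f}%")
# -> roughly +67%, +98%, +94%, +141% (matching the quoted gains up to rounding)
```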
6. Access, Usage Scenarios, and Future Prospects
ChronusAV is released under an open-source license at https://github.com/YJCX330/Chronus/. Its primary uses include:
- Training and benchmarking of multi-modal temporal grounding models
- Cross-modal event retrieval and indexing
- Fine-grained video/audio event understanding research
- Analysis of implicit audio–visual synchronization phenomena
A plausible implication is the emergence of new paradigms for long-form multi-modal reasoning, including video summarization with explicit temporal hooks.
Future directions indicated are:
- Extending support to hour-long videos and additional modalities (e.g., subtitles).
- Leveraging ChronusAV for downstream temporal summarization tasks.
ChronusAV sets a reference standard for temporally accurate, modality-complete, and cross-genre audiovisual temporal grounding, enabling unified progress in omni-modal large language modeling and related disciplines (Chen et al., 10 Dec 2025).