Daily-Omni Benchmark Overview

Updated 17 September 2025
  • The Daily-Omni Benchmark is a comprehensive framework for assessing synchronous cross-modal reasoning over audio-visual signals in daily-life scenarios.
  • It employs a multi-stage QA generation and annotation pipeline and ships a training-free baseline agent that combines open-source models such as Qwen2.5-VL-7B and Whisper-Large-V2 for temporal grounding and speech transcription.
  • In that baseline agent, targeted ("smart") temporal alignment raises overall accuracy to 61.82%, underscoring the need for robust audio-visual integration in multimodal large language models.

The Daily-Omni Benchmark is a rigorously constructed evaluation framework targeting synchronous cross-modal reasoning in Multimodal LLMs (MLLMs), with a primary emphasis on audio-visual alignment and integration. It probes the still-limited ability of modern MLLMs to combine audio and visual signals in temporally coherent, context-rich daily-life scenarios, quantifying their performance on fine-grained multiple-choice QA tasks. The benchmark also introduces methodologies for large-scale QA generation, annotation, and baseline evaluation, establishing a reference point for assessing MLLMs on semantically and temporally complex everyday tasks.

1. Design and Dataset Composition

Daily-Omni comprises 684 videos drawn from diverse sources, including the AudioSet, Video-MME, and FineVideo datasets. The videos span 11 distinct YouTube categories to maximize both content diversity and realism in daily-life event coverage. Each video is segmented into 30 s or 60 s clips, with QA tasks targeting events inside these temporal windows—647 QAs for 30 s clips and 550 for 60 s clips—for a total of 1,197 multiple-choice questions. The QAs are distributed over six key task types:

  • Audio-Visual Event Alignment: Identifying pairs of co-occurring events across modalities.
  • Event Sequence: Determining the chronological ordering of events.
  • Reasoning: Explaining underlying causes behind audio-visual occurrences.
  • Inference: Extracting implicit information not directly presented in the scene.
  • Comparative: Comparing similarities or differences between various events.
  • Context Understanding: Deciphering the broader situational context.

This multiplicity is designed to challenge both modality-specific and cross-modal reasoning mechanisms.
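To make the composition concrete, a single benchmark item can be pictured as a small record tying a clip to one multiple-choice question of a given task type. The field names below are purely illustrative and are not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DailyOmniQA:
    """Illustrative shape of one Daily-Omni item (hypothetical field names)."""
    video_id: str                 # source clip identifier
    clip_length_s: int            # 30 or 60
    task_type: str                # one of the six task types listed above
    question: str
    choices: list[str] = field(default_factory=list)  # multiple-choice options
    answer_index: int = 0         # index of the correct option

# Invented example instance, for illustration only.
example = DailyOmniQA(
    video_id="clip_000123",
    clip_length_s=30,
    task_type="Audio-Visual Event Alignment",
    question="Which sound co-occurs with the door being opened?",
    choices=["A dog barking", "A phone ringing", "A creaking hinge", "Silence"],
    answer_index=2,
)
```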

2. QA Generation and Annotation Pipeline

The benchmark leverages a multi-stage automatic QA generation and annotation pipeline that sharply reduces human effort while keeping annotation quality scalable. The pipeline consists of:

  • Video Annotation: Gemini 2.0 Flash segments each clip and provides detailed visual and audio stream annotations per segment.
  • Annotation Revision: Consistency checks are applied at the full-clip level to guarantee annotation continuity (e.g., tracking recurring entities across segments).
  • Audio-Visual Event Alignment: Fine-grained temporal alignment identifies which audio and visual events are truly concurrent.
  • QA Generation: Deepseek-R1 generates candidate multi-choice questions corresponding to annotated segments and event pairs, structured to address every task type.
  • QA Optimization: Validity filters (using GPT-4o and Deepseek-V3) remove questions answerable without the multimodal context, tuning distractor difficulty.

This process yields a high-quality QA corpus while enabling efficient manual review (30 h for completion with a 30% acceptance rate), setting a precedent for scalable benchmark curation.
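The stages above can be summarized as a simple sequential sketch. The function names and return values below are stand-ins for the LLM calls noted in each comment, not the authors' released code.

```python
# Hypothetical sketch of the multi-stage QA generation pipeline; every stage
# function is a stub standing in for the model call noted in its comment.

def annotate_segments(clip):
    # Stage 1 (Gemini 2.0 Flash): per-segment visual and audio annotations
    return [{"t": (0, 10), "visual": "a person opens a door", "audio": "a hinge creaks"}]

def revise_for_consistency(segments):
    # Stage 2: clip-level consistency pass (e.g. recurring entities keep one label)
    return segments

def align_audio_visual_events(segments):
    # Stage 3: mark which audio and visual events are truly concurrent
    return [("door opens", "hinge creaks", (2.0, 4.5))]

def generate_questions(segments, aligned_events):
    # Stage 4 (Deepseek-R1): draft multiple-choice QAs covering the six task types
    return [{"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0}]

def filter_and_refine(candidate_qas):
    # Stage 5 (GPT-4o / Deepseek-V3): drop QAs answerable without the
    # audio-visual context and tune distractor difficulty
    return candidate_qas

def build_daily_omni_qas(clip):
    segments = annotate_segments(clip)
    segments = revise_for_consistency(segments)
    aligned = align_audio_visual_events(segments)
    return filter_and_refine(generate_questions(segments, aligned))
```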

3. Baseline Agent and Evaluation Protocol

Daily-Omni introduces the Daily-Omni-Agent, a training-free ensemble baseline composed of the following open-source models:

  • Visual LLM (VLM): Qwen2.5-VL-7B for segment-level visual annotation and event description.
  • Audio LLM (ALM): Qwen2-Audio (7B) for non-speech and auditory event annotation.
  • Automatic Speech Recognition (ASR): Whisper-Large-V2 for high-fidelity speech and singing transcription.
  • Text Reasoning LLM: Qwen2.5-14B-Instruct for integrating multimodal annotations with question and answer choice reasoning.

The agent workflow splits both audio and visual streams into segments, generates separate annotations, and applies targeted temporal alignment: Rather than exhaustively aligning all event pairs, the agent localizes and grounds only those events essential to correct answering (using Qwen2.5-VL-7B as the temporal grounding module), extracting start/end timestamps and pairing critical concurrent event cues.
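A minimal sketch of that workflow is given below, with one-line placeholder wrappers in place of the four underlying models; none of these function names, prompts, or return values come from the released implementation.

```python
# Hypothetical sketch of the Daily-Omni-Agent; the wrappers below are
# placeholders for the actual model calls, with canned return values.

def vlm_describe(video_segment):   return "a person opens a door"     # Qwen2.5-VL-7B
def alm_describe(audio_segment):   return "footsteps, then a creak"   # Qwen2-Audio (7B)
def asr_transcribe(audio_segment): return "'come in', someone says"   # Whisper-Large-V2
def vlm_ground(video, event):      return (2.0, 4.5)                  # temporal grounding
def llm_answer(prompt):            return "C"                         # Qwen2.5-14B-Instruct

def answer_question(video_segments, audio_segments, question, choices):
    visual_notes = [vlm_describe(v) for v in video_segments]
    audio_notes = [alm_describe(a) for a in audio_segments]
    transcript = " ".join(asr_transcribe(a) for a in audio_segments)

    # Targeted alignment (crude keyword match as a stand-in): ground only the
    # events the question needs and attach their start/end timestamps.
    key_events = [e for e in visual_notes
                  if any(word in question.lower() for word in e.split())]
    grounded = [(e, vlm_ground(video_segments, e)) for e in key_events]

    prompt = (f"Visual: {visual_notes}\nAudio: {audio_notes}\nSpeech: {transcript}\n"
              f"Aligned events: {grounded}\nQ: {question}\nChoices: {choices}\n"
              "Answer with the letter of the best choice.")
    return llm_answer(prompt)

# Example call with dummy segment placeholders.
print(answer_question(["seg1", "seg2"], ["seg1", "seg2"],
                      "What sound accompanies the door being opened?",
                      ["A) bell", "B) bark", "C) creak", "D) silence"]))
```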

4. Temporal Alignment and Its Impact

Two levels of temporal alignment are investigated:

  • Naive Alignment: All event pairs with recorded temporal boundaries are matched unsystematically, often resulting in unmanageable annotation complexity and low precision for downstream models.
  • Smart Alignment: For each question, only temporally necessary event pairs (average 1.11 per QA) are aligned by cross-model evidence synthesis—the VLM and ALM highlight critical visual and corresponding audio events, eliminating irrelevant alignments.

Empirical results demonstrate a clear quantitative advantage: Smart alignment improves overall accuracy from 60.65% (no alignment) and 59.65% (naive alignment) to 61.82%. Removing the audio stream produces a marked performance drop, validating that audio-visual fusion—rather than textual cues—is fundamental to high-level question answering in Daily-Omni.

| Alignment Type | Accuracy (%) | Notes |
| --- | --- | --- |
| No alignment | 60.65 | Baseline, segment-wise-only annotation |
| Naive alignment | 59.65 | Overly complex, annotator overload |
| Smart alignment | 61.82 | Targeted critical event fusion |

This suggests the necessity of context-efficient event selection and the criticality of robust temporal grounding components.
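In practice, once the relevant events are grounded, deciding whether a visual and an audio event are concurrent reduces to an interval-overlap test on their timestamps. The snippet below is a generic illustration of that test, not code from the benchmark, and the tolerance value is an arbitrary choice.

```python
def events_concurrent(visual_span, audio_span, tolerance_s=0.5):
    """Return True if two (start, end) spans overlap, allowing a small
    tolerance for imprecise temporal grounding. Illustrative only."""
    (vs, ve), (as_, ae) = visual_span, audio_span
    return vs <= ae + tolerance_s and as_ <= ve + tolerance_s

# Smart alignment pairs only the few question-relevant events (about 1.11
# per QA) rather than exhaustively matching every annotated event pair.
print(events_concurrent((2.0, 4.5), (2.3, 3.1)))   # True: treat as concurrent
print(events_concurrent((2.0, 4.5), (9.0, 10.0)))  # False: ignore this pair
```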

5. Performance Characteristics of MLLM Systems

Daily-Omni reveals core limitations in present MLLMs:

  • Proprietary models with explicit cross-modal alignment (Gemini 2.0 Flash) reach the highest overall accuracy (67.84%) but no model surpasses the 70% threshold, reflecting persistent challenges in temporal multi-modal integration.
  • Earlier-generation multimodal systems (Unified-IO 2, VideoLLaMA 2) may underperform text-only LLMs on Daily-Omni sub-tasks, especially those dependent on complex audio-visual relationships.
  • Removing one modality (audio or visual) produces substantial accuracy declines, confirming each QA’s dependence on integrated multi-channel reasoning.

A plausible implication is that multi-modal generalization in unconstrained temporal domains remains an open problem, and dataset design must continue to challenge weak cross-modal coupling strategies.
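The modality-ablation finding can be reproduced in spirit with a simple re-evaluation loop that withholds one stream's annotations; everything below (the data, the scoring stub, and the field names) is a stand-in for illustration.

```python
# Hypothetical ablation loop: score the same questions with audio withheld.

def model_answer(context, question, choices):
    return choices[0]  # placeholder for an actual MLLM call

def evaluate(qas, use_audio=True, use_visual=True):
    correct = 0
    for qa in qas:
        context = []
        if use_visual:
            context.append(qa["visual_notes"])
        if use_audio:
            context.append(qa["audio_notes"])
        pred = model_answer(context, qa["question"], qa["choices"])
        correct += int(pred == qa["answer"])
    return 100.0 * correct / len(qas)

qa_set = [{"visual_notes": "a door opens", "audio_notes": "a hinge creaks",
           "question": "What sound accompanies the door?",
           "choices": ["A creak", "A bell"], "answer": "A creak"}]
print(evaluate(qa_set), evaluate(qa_set, use_audio=False))
```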

6. Broader Benchmarking Context and Extensions

Daily-Omni’s methodologies have influenced broader benchmarking paradigms. In HumanOmniV2 (Yang et al., 26 Jun 2025), explicit global context understanding and reinforcement learning reward shaping (context, logical, format rewards) are used to improve multimodal reasoning performance; IntentBench and WorldSense collectively reflect evolution from pure synchronous temporal alignment toward holistic assessment of human intention and context-aware social reasoning. In CogGuide (Shou et al., 8 Sep 2025), a plug-and-play three-module “intent sketch” pipeline (Intent Perceiver, Strategy Generator, Strategy Selector) further advances zero-shot cross-modal reasoning through entropy reduction and information gain maximization, yielding improvements up to +9.51 pp on Daily-Omni. Moreover, modular benchmarking toolkits such as OmniEvalKit (Zhang et al., 9 Dec 2024) enable daily, automated model evaluation over multi-domain and multimodal datasets with lightweight extensibility.

7. Future Directions and Outstanding Challenges

Current research emphasizes:

  • Developing more robust and precise video temporal grounding models, enabling extraction and alignment of key event pairs with minimal annotation burden.
  • Joint or end-to-end training strategies for multimodal systems, integrating cross-modal perception and reasoning, potentially improving performance in complex and noisy real-world input domains.
  • Expanding benchmark scope and complexity by increasing dataset size, variety, and inclusion of further context-dependent, temporally rich everyday scenarios.

This suggests that further progress in audio-visual multimodal QA will depend on increasingly adaptive, context-sensitive modeling pipelines, as well as the systematic enhancement of benchmark methodologies to better reflect real-life event complexity and temporality.

8. Evaluation Metrics and Mathematical Definitions

Performance on Daily-Omni is primarily quantified using overall accuracy:

\text{Accuracy} = \left( \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} \right) \times 100\%

Additional metrics, such as alignment precision and annotator acceptance rates, support continuous benchmarking. This rigorous evaluation, coupled with qualitative advancement in temporal alignment and context reasoning, defines the benchmark's lasting impact on the multimodal model assessment landscape.
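As a worked example, a hypothetical system answering 740 of the 1,197 questions correctly would score

\text{Accuracy} = \frac{740}{1197} \times 100\% \approx 61.8\%

The count of 740 correct answers here is chosen purely for illustration.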
