Omni-Detective: Unified Multimodal Captioning
- Omni-Detective is an umbrella term for systems that integrate agentic investigation and cross-modal tool-calling to achieve detailed, low-hallucination perception.
- It employs independent observers and iterative multi-round validation to systematically refine evidence from audio, visual, and multimodal inputs.
- Its models, Omni-Captioner and Audio-Captioner, achieve state-of-the-art captioning results by combining rich factual detail with minimal hallucination.
Omni-Detective is an umbrella term for a class of systems, algorithms, and agentic pipelines in artificial intelligence that achieve unified, highly detailed, low-hallucination fine-grained perception and captioning across audio, visual, and multimodal inputs. The core paradigm reimagines data generation, detail extraction, and evaluation by combining autonomous investigative agent frameworks, iterative cross-modal tool-calling, and multi-phase evaluation on newly designed benchmarks. This approach enables new levels of factual detail, calibration of hallucination, and assessment of omni-modal language models (OLMs) for real-world fine-grained understanding and reasoning (Ma et al., 14 Oct 2025).
1. Agentic Data Generation Pipeline with Tool-Calling
The Omni-Detective data pipeline introduces an agentic investigation process that systematically extracts detailed and verifiable multimodal evidence from audio, video, or audio-visual material. The architecture comprises:
- Detective Agent: A central coordinating agent emulating human investigative behavior, orchestrating multi-round queries and data requests.
- Tool Box: A set of specialist analysis tools, each tailored for specific modalities, such as OCR for on-screen text, ASR for audio speech, and multimodal LLMs (MLLMs) for general perception.
- Independent Observers: Modality-specific observers that focus exclusively on their respective streams, providing grounded evidence.
The agent iteratively “poses” queries to the Tool Box, integrating the resulting evidence from the Observers and updating its internal scene representation. This cyclic “Question–Tool Call–Observation” process continues until the agent accumulates a sufficiently rich set of details. Tool-calling is integral: for instance, when a spoken utterance is ambiguous in a video, the agent may invoke ASR, whereas unclear visual signage prompts an OCR call. The multi-stage process ensures that each new detail is cross-validated by distinct tools or observers, suppressing hallucination.
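The loop below is a minimal Python sketch of this Question–Tool Call–Observation cycle. The tool names, the `Evidence` record, and the stopping policy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the agentic Question-Tool Call-Observation loop.
# Tool names (ocr, asr, mllm_describe) and the stopping rule are
# illustrative assumptions rather than the released pipeline.
from dataclasses import dataclass, field

@dataclass
class Evidence:
    tool: str          # which tool produced the observation
    query: str         # the question the agent posed
    observation: str   # grounded evidence returned by an observer

@dataclass
class DetectiveAgent:
    tools: dict                       # name -> callable(query, clip) -> str
    evidence: list = field(default_factory=list)

    def investigate(self, clip, max_rounds: int = 8) -> list:
        for _ in range(max_rounds):
            query, tool_name = self.next_query(clip)   # decide what to ask next
            if query is None:                          # nothing left to verify
                break
            obs = self.tools[tool_name](query, clip)   # tool call -> observation
            self.evidence.append(Evidence(tool_name, query, obs))
        return self.evidence

    def next_query(self, clip):
        # Placeholder policy: in a real pipeline an LLM proposes the next
        # question and picks the tool (OCR for signage, ASR for speech,
        # an MLLM for general perception). Returning (None, None) stops the loop.
        return None, None
```

Each accumulated `Evidence` entry can then be cross-checked against observations from other tools before it is admitted into the final caption, as described in the next section.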
2. Decoupling Detail and Hallucination
A core challenge in fine-grained perception is the “co-growth” phenomenon, wherein increasing caption detail often introduces spurious, hallucinated information. The Omni-Detective pipeline addresses this with:
- Iterative Multi-Round Validation: Each information-extraction round cross-checks new details against previous observations and resolves any inconsistencies.
- Tool-Observer Cross Verification: Evidence detected by one tool (e.g., an MLLM) is verified by independent observers and, if possible, orthogonal tool calls (e.g., both ASR and OCR).
- Convergence and Correction: The process continues until incremental gains in detail no longer come at the cost of increased hallucination or missing (“not given”) rates. Empirically, each iteration improves the detail/precision trade-off, as shown by the decreasing not-given and hallucination ratios across rounds of iterative inquiry (Fig. 2 of Ma et al., 14 Oct 2025).
This pipeline thus enables accumulation of highly granular detail without proportional growth in incorrect statements—a marked advance over conventional detailed captioning strategies.
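As a schematic illustration of this decoupling logic, the snippet below assumes each candidate detail carries observations from several tools and that the agent records per-round missing and hallucination estimates; the two-source agreement rule and the stopping window are placeholder choices, not the paper's exact criteria.

```python
# Schematic cross-verification and convergence check. The agreement test,
# thresholds, and data shapes are illustrative assumptions.
def verify_detail(detail: str, observations: dict[str, str], agree) -> bool:
    """Accept a detail only if at least two independent sources support it."""
    supporting = [tool for tool, obs in observations.items() if agree(detail, obs)]
    return len(supporting) >= 2

def converged(history: list[dict], window: int = 2) -> bool:
    """Stop once further rounds no longer reduce the missing/hallucination
    ratios (placeholder stopping rule)."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return all(
        later["hallucination"] >= earlier["hallucination"]
        and later["not_given"] >= earlier["not_given"]
        for earlier, later in zip(recent, recent[1:])
    )
```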
3. Omni-Captioner and Audio-Captioner Models
Omni-Detective supplies the training foundation for two detail-oriented captioning models:
- Audio-Captioner: Trained solely on audio inputs, with the visual encoder frozen, focusing on audio cue alignment and detailed audio-only description.
- Omni-Captioner: Trained on both audio and video with all encoders unfrozen (in the second stage), enabling joint, cross-modal captioning that fully exploits the synergy between modalities.
The model backbone is Qwen2.5-Omni-7B. Training follows a two-stage curriculum:
- Audio Alignment (Stage 1): Audio-Captioner is trained with the visual encoder frozen, using audio-only segments from datasets such as VGGSound together with high-fidelity Omni-Detective captions.
- Audio–Visual Alignment (Stage 2): Omni-Captioner is then fully fine-tuned (both audio and visual encoders unfrozen) on audio–visual datasets such as FineVideo and the extended outputs of Omni-Detective, allowing broader evidence integration.
The optimization objective is the standard autoregressive negative log-likelihood over the target caption,

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right),$$

where the input $x$ is drawn from either a single modality or multiple modalities, $y_{1:T}$ is the target detailed caption, and the curriculum ensures low hallucination with maximal factual coverage.
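A condensed sketch of the two-stage curriculum is given below, assuming a PyTorch-style model whose audio and visual encoders are exposed as submodules and whose forward pass returns a caption loss; the attribute names and dataloaders are hypothetical placeholders.

```python
# Sketch of the two-stage curriculum. The attribute names (audio_encoder,
# visual_encoder) and the dataloaders are hypothetical, not the released
# training code.
def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def train_stage(model, dataloader, optimizer, freeze_visual: bool):
    set_trainable(model.visual_encoder, not freeze_visual)
    set_trainable(model.audio_encoder, True)
    model.train()
    for batch in dataloader:
        # Autoregressive NLL over the caption tokens, conditioned on whatever
        # modalities the batch provides (audio-only or audio-visual).
        out = model(**batch["inputs"], labels=batch["caption_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1 (audio alignment): visual encoder frozen, audio-only data.
# train_stage(model, audio_only_loader, optimizer, freeze_visual=True)
# Stage 2 (audio-visual alignment): all encoders unfrozen, audio-visual data.
# train_stage(model, audio_visual_loader, optimizer, freeze_visual=False)
```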
4. Benchmarking: Cascade QA and Omni-Cloze
Cascade Evaluation Protocol:
- Detailed captions are evaluated indirectly: each caption is passed to a text-only LLM (cascade QA), which must answer complex downstream reasoning questions using the caption alone (see the sketch after this list).
- This protocol quantifies whether a model's caption contains enough grounded information to support correct reasoning (e.g., on MMAU, MMAR, Video-MME, Video-Holmes, Daily-Omni).
- Audio-Captioner achieves leading scores on MMAU (70.0%) and MMAR (59.8%), surpassing most open-source and even some proprietary models, and approaching Gemini 2.5 Flash.
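A minimal sketch of this cascade QA scoring follows; `llm_answer` is a hypothetical wrapper around whichever text LLM serves as the downstream reasoner, and the prompt and answer parsing are simplified assumptions.

```python
# Minimal sketch of cascade QA evaluation. `llm_answer` is a hypothetical
# helper wrapping the downstream text LLM; prompt wording and scoring are
# simplified assumptions.
def cascade_qa_accuracy(captions: dict[str, str], questions: list[dict], llm_answer) -> float:
    """Score a captioner by how well a text LLM answers questions from its captions.

    questions: [{"clip_id": ..., "question": ..., "choices": [...], "answer": "B"}, ...]
    """
    correct = 0
    for q in questions:
        caption = captions[q["clip_id"]]
        prompt = (
            "Answer using ONLY the description below.\n"
            f"Description: {caption}\n"
            f"Question: {q['question']}\n"
            f"Choices: {', '.join(q['choices'])}\n"
            "Reply with the letter of the best choice."
        )
        pred = llm_answer(prompt).strip().upper()[:1]   # e.g. "B"
        correct += int(pred == q["answer"])
    return correct / len(questions)
```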
Omni-Cloze: A novel cloze-style (fill-in-the-blank) benchmark tailored to assess fine-grained, fact-based coverage. Key features include:
- Over 70,000 cloze items from 2,000 video clips, spanning audio-only, visual-only, and audio-visual configurations.
- Unified multiple-choice format with a special “Not Given” option for blanks where the information is absent, thus systematically measuring hallucination, coverage, and precision.
- Metrics: accuracy, missing (“not given”) rate, and hallucination rate, scored as sketched below. Omni-Captioner attains 53.5% accuracy overall (50.2% visual, 52.8% audio, 62.7% audio-visual).
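The scoring sketch below reflects one plausible reading of these metrics for a cloze benchmark with a “Not Given” option; the item format and exact denominators are assumptions, not the official evaluation code.

```python
# Schematic Omni-Cloze scoring. The item format and metric definitions are
# plausible assumptions for a cloze benchmark with a "Not Given" option.
def score_omni_cloze(items: list[dict]) -> dict:
    """items: [{"pred": "C", "gold": "C"}, ...] where gold may be "NG" (Not Given)."""
    n = len(items)
    correct = sum(it["pred"] == it["gold"] for it in items)
    # Missing: the model answers "Not Given" although the information was present.
    missing = sum(it["pred"] == "NG" and it["gold"] != "NG" for it in items)
    # Hallucination: the model fills a blank whose information was actually absent.
    hallucinated = sum(it["pred"] != "NG" and it["gold"] == "NG" for it in items)
    return {
        "accuracy": correct / n,
        "missing_rate": missing / n,
        "hallucination_rate": hallucinated / n,
    }
```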
The Omni-Cloze metric correlates robustly with human judgment (Pearson correlation measured against human ratings on the VDC benchmark), validating it as an objective evaluation for detailed perception across modalities.
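Such a correlation check can be reproduced in a few lines; the snippet below uses scipy with placeholder score arrays purely for illustration, not reported numbers.

```python
# How a metric-vs-human agreement check might be computed (illustrative only;
# the scores below are placeholders, not reported results).
from scipy.stats import pearsonr

omni_cloze_scores = [0.42, 0.55, 0.61, 0.48, 0.70]   # per-model benchmark scores
human_ratings     = [3.1, 3.8, 4.2, 3.4, 4.6]        # per-model human judgments

r, p_value = pearsonr(omni_cloze_scores, human_ratings)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```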
5. Empirical Findings and Comparative Performance
Key findings include:
- The agentic, tool-calling, iterative data pipeline yields longer and more informative captions (an average of 1,125 words per audio-visual segment) with lower hallucination rates than previous OLMs (e.g., a 10.9% hallucination rate under the video-SALMONN 2 detailed-captioning evaluation).
- Omni-Captioner sets a new state of the art on VDC (55.0% accuracy / 2.7 score) and provides the best detail/hallucination balance on the video-SALMONN 2 benchmark (17.8% missing / 10.9% hallucination).
- Existing models, whether proprietary or open-source, are outperformed on both cascade QA benchmarks (MMAU, MMAR) and detailed-captioning benchmarks (VDC, Omni-Cloze).
These results demonstrate that the Omni-Detective paradigm significantly advances the ability of omni-modal language models (OLMs) to produce highly factual, finely detailed captions in both unimodal (audio) and multimodal (audio–visual) settings.
6. Implications, Applications, and Future Directions
Omni-Detective and its associated models and benchmarks are directly relevant for domains that demand high-fidelity, grounded, fine-grained multimodal perception:
- Assistive and accessibility technologies: Comprehensive, accurate descriptions for visually or hearing-impaired users.
- Scientific and legal reporting: High-stakes fields requiring precise, low-hallucination event description across modalities (e.g., medical imaging, surveillance analysis, evidence reporting).
- Autonomous agents and robotics: Systems that must interpret and report on complex, real-world environments, where incomplete or inaccurate summaries may be dangerous.
- AI system evaluation: Omni-Cloze and similar benchmarks provide reliable, detail-oriented assessment that correlates with human judgment while reducing computational expense.
Future research outlined in the paper targets further minimization of hallucinations, expansion to additional modalities or tool integrations, greater agent autonomy, and real-time deployment in production environments.
Omni-Detective synthesizes agentic investigation, tool-calling, and cross-modal observer strategies to deliver robust, fact-rich detailed perception and captioning. Through its innovative pipeline, curriculum-based model training, and comprehensive evaluation, it establishes a new foundation for fine-grained, trustworthy multimodal AI understanding (Ma et al., 14 Oct 2025).