Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

Published 27 Apr 2026 in cs.CV | (2604.24191v1)

Abstract: Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.


Authors (8)

Summary

The paper introduces a novel 'Think-with-Omni' paradigm that uses recursive tree search for efficient and deliberative audio-visual reasoning.
It employs a two-stage training process combining SFT and RL, with an MCTS-driven data engine to curate high-quality multi-turn reasoning trajectories.
Reported results show gains across 11 benchmarks, with scores above open-source baselines and, on several reasoning benchmarks, above closed-source systems such as GPT-4o.

Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning: Omni-o3

Motivation and Framework Overview

Omnimodal reasoning exposes multimodal large language models (MLLMs) to a massive, highly redundant search space of cross-modal interactions, which demands focused, deliberative reasoning. Prevailing paradigms, whether sequential step-by-step chain-of-thought (CoT) generation or parallel sample-by-sample rollouts, produce isolated trajectories that cannot share promising intermediate paths or correct earlier steps. The result is inefficient exploration and compounding errors in complex, multi-turn audio-visual tasks.

Omni-o3 introduces a fundamentally distinct paradigm ("Think-with-Omni"), elevating shallow verbal CoT protocols into a deliberative omnimodal tree search structure. By embedding omni skills and recursive deduction within multi-round reasoning, Omni-o3 overcomes the limitations of direct-response and isolated CoT models, inherently sharing omnimodal reasoning prefixes across branches to maximize computational efficiency.
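
The efficiency argument for prefix sharing can be illustrated with a toy count. The sketch below is not taken from the paper: it assumes each reasoning step has unit cost, and the step names and both helper functions are hypothetical. It compares the number of steps generated when every trajectory is produced in isolation against the number generated when common prefixes are stored once in a trie.

```python
from typing import Dict, Sequence


def isolated_cost(paths: Sequence[Sequence[str]]) -> int:
    """Steps generated when every trajectory is decoded independently."""
    return sum(len(path) for path in paths)


def shared_prefix_cost(paths: Sequence[Sequence[str]]) -> int:
    """Steps generated when common prefixes are stored once, as in a trie."""
    root: Dict[str, dict] = {}
    nodes = 0
    for path in paths:
        node = root
        for step in path:
            if step not in node:
                node[step] = {}   # a new step is generated only once
                nodes += 1
            node = node[step]
    return nodes


# Hypothetical reasoning trajectories; two of them share a two-step prefix.
paths = [
    ["listen", "locate speaker", "answer A"],
    ["listen", "locate speaker", "answer B"],
    ["listen", "read caption", "answer A"],
]

print(isolated_cost(paths))       # 9
print(shared_prefix_cost(paths))  # 6
```

In this toy case the tree needs six generated steps instead of nine. The real saving depends on how much of each trajectory is shared and on the cost of re-encoding audio and video inputs, which the sketch does not model.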

Figure 1: Paradigm comparison—Omni-o3’s Think-with-Omni enables deliberative omnimodal reasoning beyond direct response and verbal CoT.

Deep Nested Deduction: Methodology

Omni-o3 formulates its reasoning policy as a dynamic recursive search, structuring the reasoning process into hierarchical trees. The model executes four atomic cognitive actions: expansion (generating new thoughts/sub-goals), selection (branch preference), simulation (invoking intrinsic omnimodal skills), and backpropagation (updating node values). This nested deduction enables iterative exploration, backtracking, and multi-turn self-verification, substantially reducing redundant multimodal encoding and shifting computational focus to actual cognitive exploration.
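
To make the four actions concrete, here is a minimal sketch of a generic Monte Carlo tree search loop with UCB-based selection. It is an illustration under stated assumptions, not the authors' policy: in Omni-o3 the actions are carried out by the model during generation, whereas here they are plain functions. The toy task (recovering a hidden sequence of steps), the reward, the exploration constant, and every name in the code are hypothetical.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    state: Tuple[str, ...]                      # reasoning steps chosen so far
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                          # sum of backed-up rewards


def ucb(child: Node, c: float = 1.4) -> float:
    """Upper confidence bound; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore


def select(node: Node) -> Node:
    """Selection: descend to a leaf, preferring high-UCB branches."""
    while node.children:
        node = max(node.children, key=ucb)
    return node


def expand(node: Node, actions: Sequence[str], depth: int) -> None:
    """Expansion: add one child per candidate next step."""
    if len(node.state) < depth:
        node.children = [Node(node.state + (a,), parent=node) for a in actions]


def simulate(state, actions, target, rng) -> float:
    """Simulation: finish the path at random and score it against the target."""
    rollout = list(state)
    while len(rollout) < len(target):
        rollout.append(rng.choice(actions))
    return sum(a == b for a, b in zip(rollout, target)) / len(target)


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Backpropagation: update visit counts and values up to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def search(target, actions, iterations=200, seed=0) -> Node:
    rng = random.Random(seed)
    root = Node(state=())
    for _ in range(iterations):
        leaf = select(root)
        if leaf.visits > 0:                     # expand only visited leaves
            expand(leaf, actions, len(target))
            if leaf.children:
                leaf = leaf.children[0]
        backpropagate(leaf, simulate(leaf.state, actions, target, rng))
    return root


steps = ["hear", "see", "fuse"]
root = search(target=("hear", "see", "fuse"), actions=steps, iterations=200)
print(root.visits)  # 200, since every iteration backs up to the root
```

Because all children hang off a shared parent, the steps in a prefix are stored once and their statistics are reused by every branch below them, which is the sharing property the text describes.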

The overall pipeline is decoupled into two progressive optimization stages:

Stage 1 (SFT cold-start training): The policy model is fine-tuned to internalize the structural syntax of nested reasoning and to bootstrap intrinsic omnimodal capabilities, using 101K distilled, rationale-grounded trajectories.
Stage 2 (RL exploratory training): The model autonomously explores reasoning trajectories, formulated as dynamic tree search, over 18K complex multi-turn samples. Exploration is guided by a multi-step reward model that provides dense intermediate feedback on four aspects: the final outcome, the structural prefix, child-node logic, and format adherence.

Figure 2: Omni-o3’s Deep Nested Deduction pipeline: SFT cold-start training and RL exploratory optimization with tree-search formulation.
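
One simple way to picture a multi-step reward is as a weighted sum of per-aspect scores. The sketch below is only an assumption about how the four feedback signals named above could be combined. The summary does not state the aggregation rule, so the weights, the [0, 1] score range, and the names `StepScores` and `combined_reward` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StepScores:
    """Per-aspect scores in [0, 1]; the aspects follow the four listed above."""
    outcome: float       # is the final answer correct?
    prefix: float        # is the shared structural prefix sound?
    child_logic: float   # does each child node follow from its parent?
    format_ok: float     # does the output follow the required format?


# Assumed weights for illustration only; they are not from the paper.
ASSUMED_WEIGHTS = {
    "outcome": 0.4,
    "prefix": 0.2,
    "child_logic": 0.2,
    "format_ok": 0.2,
}


def combined_reward(scores: StepScores,
                    weights: Mapping[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted sum of the per-aspect scores."""
    total = 0.0
    for name, weight in weights.items():
        score = getattr(scores, name)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {score}")
        total += weight * score
    return total


perfect = StepScores(outcome=1.0, prefix=1.0, child_logic=1.0, format_ok=1.0)
partial = StepScores(outcome=0.0, prefix=1.0, child_logic=0.5, format_ok=1.0)

print(round(combined_reward(perfect), 3))  # 1.0
print(round(combined_reward(partial), 3))  # 0.5
```

The second example shows why dense feedback matters: a trajectory with a wrong final answer still receives partial credit for a sound prefix and correct format, so the learning signal is not limited to the final outcome.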

Automated Data Engine: MCTS-driven Reasoning Trajectory Curation

To address the scarcity and complexity of multi-turn omnimodal reasoning data, an automated MCTS-driven data engine is deployed. This pipeline iteratively constructs reasoning trees from a vast pool of 3.5M samples, employing selection, expansion, simulation (with specialized solvers), and backpropagation. The result is a high-quality dataset: 101K sampled trajectories for SFT and 18K diverse, challenging cases for RL, spanning deep conversational structures, dense multimodal combinations, and extended context lengths.
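
Once a reasoning tree has been built, training trajectories have to be read off it. The following sketch shows one plausible filtering rule: keep root-to-leaf paths whose leaf answer is verified correct and whose mean node value clears a threshold. The rule, the threshold, the `TreeNode` fields, and the example tree are assumptions for illustration. The summary does not describe the engine's actual selection criteria.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class TreeNode:
    step: str
    value: float                 # assumed backed-up value in [0, 1]
    correct: bool = False        # assumed: leaf answer checked against a label
    children: List["TreeNode"] = field(default_factory=list)


def leaf_paths(node: TreeNode,
               prefix: Tuple[TreeNode, ...] = ()) -> Iterator[Tuple[TreeNode, ...]]:
    """Yield every root-to-leaf path in depth-first order."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


def curate(root: TreeNode, min_mean_value: float = 0.5) -> List[List[str]]:
    """Keep paths that end in a correct leaf and have a high mean value."""
    kept = []
    for path in leaf_paths(root):
        mean_value = sum(n.value for n in path) / len(path)
        if path[-1].correct and mean_value >= min_mean_value:
            kept.append([n.step for n in path])
    return kept


# Hypothetical search tree for a single audio-visual question.
tree = TreeNode("question", 0.6, children=[
    TreeNode("inspect audio", 0.8, children=[
        TreeNode("answer: bell", 0.9, correct=True),
        TreeNode("answer: siren", 0.2),
    ]),
    TreeNode("inspect frames", 0.3, children=[
        TreeNode("answer: horn", 0.1),
    ]),
])

print(curate(tree))  # [['question', 'inspect audio', 'answer: bell']]
```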

Figure 3: MCTS-driven reasoning trajectory curation yields high-quality, diverse audio-visual data essential for SFT and RL training.

The data statistics indicate conversations with many turns, a mix of modality sources (audio, video, image, and their combinations), and long contexts of up to 20K tokens, all of which are intended to support training for system-2 multimodal reasoning.

Figure 4: Omni-o3 training data encompass diverse sources, deep annotation structures, and long-context token statistics.

Experimental Results: Benchmarks and Performance

Omni-o3 exhibits substantial gains across 11 benchmarks in video, audio, and audio-visual modalities:

Audio-visual reasoning: On Video-MME (75.7) and AVE (85.1), Omni-o3 outperforms Qwen3-Omni by 5.2 and 4.3 points, respectively. On complex reasoning benchmarks (Video-Holmes, WorldSense, IntentBench), it consistently surpasses open-source models and even closed-source APIs (e.g., 54.0 on Video-Holmes vs. GPT-4o's 42.0; 67.6 on IntentBench vs. GPT-4o's 60.0).
Video reasoning and grounding: Omni-o3 achieves robust spatial-temporal comprehension (Vript-RR = 75.0, FineVD Most = 53.2), substantially outpacing omnimodal baselines.
Audio reasoning: On QualiSpeech (0.492), NISQA (0.82), and MMSU (70.4), Omni-o3 delivers strong performance, surpassing specialized models and matching or exceeding closed-source systems on reasoning scores.

These results support the claim that nested deduction helps bridge the gap between raw perception and high-level cognitive reasoning, with competitive or leading scores on the reported benchmarks.

Qualitative Visualization: Deliberative Reasoning Trajectories

Visualizations demonstrate Omni-o3’s capability for both foundational and advanced reasoning. In single-turn spatial grounding, the model aligns auditory cues with visual entities and outputs precise coordinates. In complex multi-turn temporal grounding, it dynamically constructs a reasoning tree, iteratively self-corrects based on internal valuations, and refines its deductions until the correct event is localized.

Figure 5: Visualization—Omni-o3 showcases accurate spatial grounding in single-turn reasoning and robust self-correction in multi-turn video tasks.

Practical and Theoretical Implications

Omni-o3 argues for a shift in how MLLMs handle omnimodal input, pairing fine-grained perception and system-2 deduction with interpretable, structured reasoning. Practically, the approach is relevant to embodied agents, audio-visual QA systems, and autonomous exploration in rich multimodal environments. Theoretically, tree-based reasoning and multi-step reward modeling suggest new approaches to long-context exploration, mitigation of modality interference, and traceable agent policy learning.

Limitations include susceptibility to hallucinations in noisy contexts and incomplete support for real-time interaction. The framework provides a robust foundation for future extensions toward dynamic, embodied, and real-time omnimodal agents.

Conclusion

Omni-o3 combines deep nested deduction with MCTS-driven data curation and reports strong results across audio-visual, visual-centric, and audio-centric benchmarks. Its recursive, tree-based reasoning shares prefixes across branches, reduces redundant computation, and enables interpretable multi-turn deduction. The paradigm supports further development of omnimodal LLMs and agentic reasoning frameworks (2604.24191).
