Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

Published 14 Apr 2026 in eess.AS | (2604.12527v2)

Abstract: Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio LLMs (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are often inconsistent and insufficient for complex tasks. To bridge this gap, we introduce Audio-Cogito, a fully open-source solution for deep audio reasoning. We develop Cogito-pipe for high-quality audio reasoning data curation, producing 545k reasoning samples that will be released after review. Based on this dataset, we adopt a self-distillation strategy for model fine-tuning. Experiments on the MMAR benchmark, the only audio benchmark evaluating the CoT process, show that our model achieves the best performance among open-source models and matches or surpasses certain closed-source models in specific metrics. Our approach also ranks among the top-tier systems in the Interspeech 2026 Audio Reasoning Challenge.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces a novel Cogito-Pipe pipeline that systematically curates high-quality audio reasoning data for LALM training.
It employs self-distillation with chain-of-thought tracing to enhance interpretability and accuracy, outperforming existing models on the MMAR benchmark.
The open-source framework bridges the performance gap with proprietary systems, enabling advanced and explainable audio reasoning in varied scenarios.

Audio-Cogito: An Open-Source Framework for Deep Audio Reasoning

Introduction

Audio-Cogito introduces a systematic methodology for constructing and training Large Audio LLMs (LALMs) with enhanced deep reasoning capabilities over audio input. Existing LALMs underperform in complex reasoning, mainly due to the absence of well-annotated reasoning datasets and methodologies closely integrated with model architectures. Audio-Cogito addresses these limitations by proposing a comprehensive data curation pipeline—Cogito-Pipe—and a training regimen leveraging self-distillation with large-scale, high-quality audio reasoning tasks. Results on the MMAR benchmark and Interspeech 2026 Audio Reasoning Challenge confirm this framework as a state-of-the-art open-source system for deep audio reasoning, demonstrating robust performance across diverse domains and superior logical interpretability in Chain-of-Thought (CoT) settings.

Cogito-Pipe: Systematic Data Pipeline for Audio Reasoning

The core of the Audio-Cogito approach is Cogito-Pipe, a four-stage pipeline designed for scalable generation of high-fidelity audio reasoning datasets, which is critical for advancing model reasoning depth and stability.

Cogito-Pipe comprises:

Data Collection: Amasses large-scale, multi-domain audio data (sound events, speech, and music), with a focus on mixed/interleaved acoustic scenes. Metadata enrichment ensures contextual richness for downstream QA tasks.
QA Construction: Utilizes Qwen3-Omni-Instruct for instructive QA generation, guided by a curated seed pool of ~500 diverse, expert-refined questions. The process integrates confusing distractors for robust negative sampling and supports one-to-many QA associations per audio segment.
CoT Generation: Implements a self-distillation strategy via Qwen3-Omni-Thinking to generate free-form reasoning traces. Ground-truth answers are withheld to enforce answer derivation strictly from acoustic evidence, fostering intrinsic logical grounding.
Quality Verification: Enforces quality through dual-stage auditing: a QA-consistency check and LLM-based logic judgment (LLM-as-a-Judge), effectively filtering out hallucinations and inconsistencies.

The interchangeability of pipeline components enables extensibility and compatibility with evolving LALM architectures.

Figure 1: Cogito-Pipe overview, detailing the staged process for generating multimodal, high-quality audio reasoning datasets.

Model Architecture and Training

Audio-Cogito leverages Qwen3-Omni-Thinking (30B parameters) as its backbone, with LoRA-based supervised fine-tuning on the Cogito-Pipe dataset. Each training sample integrates audio input, text query, structured reasoning trace (CoT), and ground truth answer. The training objective maximizes the joint log-likelihood of reasoning trace and answer, explicitly aligning model behavior with human-like problem-solving protocols. This structuring ensures both answer accuracy and transparent, logically coherent intermediate reasoning steps.

Evaluation Protocols

A key innovation in this study is the focus on explainable audio reasoning rather than mere outcome accuracy. The MMAR benchmark is employed, prioritizing evaluation of the intermediate CoT process. Metrics include:

Avg (Average Accuracy): Proportion of QA pairs with correctly predicted answers.
Rubrics Score: Assessment of stepwise CoT validity, based on LLM judge evaluation of rubric criteria derived from ground-truth reasoning.
CRS (Correct Reasoning Score): Average rubric score conditioned on correct answers, quantifying logical rigor in successful predictions.

This framework reveals not just task performance but the validity and reliability of the model’s reasoning chain.

Experimental Results

Audio-Cogito outperforms all open-source models across LALM, OLM, and LARM classes on MMAR, demonstrating robust generalization to single and mixed-domain audio reasoning tasks.

Notably:

Superior accuracy: Audio-Cogito exceeds Qwen3-Omni-Thinking by 5.44% (relative) in average accuracy and delivers especially strong gains in mixed-domain inference.
Narrowing the proprietary gap: Performance matches or exceeds leading closed-source systems (e.g., Gemini 2.0 Flash, Gemini 2.5 Flash, Omni-R1, GPT-4o Audio) in several metrics and domains, and nearly matches Gemini 2.5 Pro in complex scenarios.
Robust logical quality: Achieves SOTA Rubrics and CRS among LARMs, underscoring holistic, logically sound CoT reasoning.

Ablation studies confirm that all Cogito-Pipe components—seed question pool, meta-information enrichment, and stringent quality verification—are indispensable, with the removal of any stage resulting in a quantifiable performance and reasoning degradation.

Implications and Future Directions

Audio-Cogito’s open-source, modular design catalyzes further research by providing a reproducible, extensible pipeline for deep audio reasoning. The demonstrable improvements in reasoning trace quality, particularly with self-distillation, underscore the necessity of closely aligned data construction and training strategies for robust LALM development.

Practically, Audio-Cogito can underpin advanced audio agents requiring interpretable and reliable decision-making—such as clinical audio diagnostics, intelligent assistant systems, and audio scene interpretation. Theoretically, the findings support the centrality of CoT and rigorous data curation in transferring LLM-style reasoning to complex, high-variance acoustic environments.

Moving forward, further scaling of dataset diversity, domain coverage, and model parameterization—combined with improved rubric construction protocols and adversarial QA augmentation—can be expected to enhance LALM robustness and generalizability in unconstrained, real-world audio reasoning tasks.

Conclusion

Audio-Cogito establishes a new state-of-the-art in open-source audio reasoning, showcasing the criticality of systematic, high-quality data pipelines and self-consistent CoT learning frameworks. The approach bridges the gap between proprietary and open models in deep audio understanding, validating the transferability of CoT paradigms to the audio modality. Its practical and theoretical contributions set the foundation for the next generation of explainable, multimodal reasoning systems in speech and audio AI.

Markdown Report Issue