
AudioGenie-Reasoner (AGR): Audio Deep Reasoning

Updated 24 September 2025
  • The paper introduces a training-free multi-agent system that decouples audio perception from high-level reasoning via iterative text refinement.
  • It leverages an audio captioning module and LLM-based agents in a 'diagnose-plan-act' loop to progressively build a robust chain of evidence.
  • Empirical results on MMAU-mini and MMAR benchmarks show that AGR outperforms established models through optimized iterative evidence gathering and tool augmentation.

AudioGenie-Reasoner (AGR) is a unified, training-free multi-agent system for audio deep reasoning that integrates expert-level perceptual analysis with multi-step logical inference by coordinating specialized agents over an evolving chain of textual evidence. AGR introduces a paradigm shift that transforms audio deep reasoning into a complex text understanding task, leveraging the capabilities of LLMs on top of robust audio captioning modules. Its architecture mimics the human coarse-to-fine cognitive process via a proactive iterative document refinement loop, facilitating active exploration, tool-augmented evidence gathering, and continual refinement of reasoning chains until sufficient information is acquired to answer challenging audio-related queries (Rong et al., 21 Sep 2025).

1. System Architecture and Agent Roles

AGR decouples the audio perception task from high-level reasoning operations through a modular, multi-agent system. The architecture consists of:

  • Audio Captioning Module (ALLM): Converts the raw input audio $A$ into a coarse-grained, text-based description $D_0 = \mathcal{A}_{\text{caption}}(A)$. This document provides an initial substrate for later reasoning.
  • Agents: Four principal LLM-based agents operate in an iterative “diagnose-plan-act” loop:
    • Planning Agent: Examines the current document, answer list, and analysis history to decide whether available evidence suffices for the given question.
    • Interaction Agent: Formulates a structured plan $P$ to address evidence gaps if the Planning Agent deems current evidence insufficient.
    • Augmentation Agent: Executes tool-based operations such as audio question-answering, guided re-captioning, or ASR (e.g., using Whisper-Turbo), generating new evidence to update the document.
    • Answering Agent: Once evidence is considered sufficient, synthesizes the final answer $A^*$, a confidence score $S_c$, and a textual rationale $R$.

The core workflow is an iterative loop that incrementally refines the document, integrating new evidence and reevaluating sufficiency at each step.
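
To make this decomposition concrete, here is a minimal structural sketch in Python. The function names, signatures, and prompt strings are our own illustrative assumptions (each agent simply wraps an LLM or ALLM call); this is not the authors' released implementation.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API; swap in a real client."""
    raise NotImplementedError

@dataclass
class Evidence:
    document: str                                      # evolving evidence chain D_i
    history: list[str] = field(default_factory=list)   # analysis history H_i

def caption_agent(audio_path: str) -> str:
    """Audio Captioning Module: maps raw audio A to the initial document D_0."""
    return call_llm(f"[ALLM] Describe the audio file {audio_path} in detail.")

def planning_agent(question: str, options: list[str], ev: Evidence) -> str:
    """Planning Agent: returns 'sufficient' or 'insufficient', updating H."""
    verdict = call_llm(
        f"Question: {question}\nOptions: {options}\n"
        f"Evidence so far:\n{ev.document}\n"
        "Reply 'sufficient' or 'insufficient'."
    )
    ev.history.append(verdict)
    return verdict

def interaction_agent(ev: Evidence) -> str:
    """Interaction Agent: picks the tool (qa | recaption | asr) to fill the gap."""
    return call_llm(f"Given the gaps in:\n{ev.document}\nChoose one tool: qa | recaption | asr.")

def augmentation_agent(tool_plan: str, audio_path: str) -> str:
    """Augmentation Agent: runs the chosen tool (e.g., ASR via a Whisper model)
    and returns the new textual evidence E_new."""
    return call_llm(f"[tool:{tool_plan}] Extract targeted evidence from {audio_path}.")

def answering_agent(question: str, options: list[str], ev: Evidence) -> str:
    """Answering Agent: synthesizes (A*, S_c, R) from the final document D_f."""
    return call_llm(
        f"Answer the question using only this evidence:\n{ev.document}\n"
        f"Question: {question}\nOptions: {options}\n"
        "Return the answer, a confidence score, and a short rationale."
    )
```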

2. Paradigm Shift: Audio Reasoning as Text Understanding

AGR’s principal innovation is redefining audio deep reasoning as a text-centric problem:

  • Audio-to-Text Transformation: Audio signals are mapped to a textual domain using high-performance ALLMs, sidestepping the need for large audio datasets annotated with reasoning chains.
  • Textual Chain-of-Evidence: Perceptual and contextual information extracted in coarse form is progressively augmented into a comprehensive chain of evidence, suitable for LLM-based reasoning. As a result, AGR delegates complex cognitive tasks to established LLMs that possess advanced chain-of-thought and logical reasoning skills.

This strategy leverages the advancements in LLM reasoning without the prohibitively high cost of dedicated audio-reasoning datasets.
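
As a concrete illustration of the reduction, the following sketch frames a question over the evidence document as an ordinary text-understanding prompt; the prompt wording is an assumption for illustration, not taken from the paper.

```python
def as_text_task(question: str, options: list[str], evidence_doc: str) -> str:
    """Frame audio deep reasoning as pure text understanding: the reasoning
    LLM never sees audio, only the accumulated textual chain of evidence."""
    choices = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        "You are answering a question about an audio recording, using only\n"
        "the textual evidence below.\n\n"
        f"Evidence:\n{evidence_doc}\n\n"
        f"Question: {question}\n{choices}\n"
        "Think step by step, then state the best option."
    )
```

Any off-the-shelf chain-of-thought-capable LLM can consume such a prompt, which is what lets AGR remain training-free.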

3. Proactive Iterative Document Refinement

The heart of AGR is its document refinement loop, which maintains and augments a text-based evidence chain:

  • Iteration Dynamics: At each iteration, the Planning Agent evaluates sufficiency using

$(s, H_{i+1}) = \mathcal{A}_{\text{plan}}(Q, L, D_i, H_i)$

where $s \in \{\text{Sufficient}, \text{Insufficient}\}$ and $H_i$ is the analysis history.

  • Augmentation Planning and Execution: If $s = \text{Insufficient}$, the Interaction Agent constructs a plan $P = \mathcal{A}_{\text{interact}}(D_i, H_{i+1})$ dictating which evidence-augmentation tool to apply. The Augmentation Agent then generates new evidence $E_{\text{new}}$, which is integrated as:

$D_{i+1} = D_i \oplus E_{\text{new}}$

where $\oplus$ denotes integration of the new evidence into the evolving document.

  • Termination and Answer Synthesis: The loop terminates when $s = \text{Sufficient}$ or a predefined iteration limit is reached. The Answering Agent then returns:

$(A^*, S_c, R) = \mathcal{A}_{\text{answer}}(D_f, Q, L)$

where $D_f$ is the final evidence chain. The full loop is sketched below.
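
Putting the three formulas together, the refinement loop can be sketched as below, reusing the hypothetical agent functions from Section 1; the iteration cap is a parameter, reflecting the empirical limits discussed next.

```python
def agr_pipeline(audio_path: str, question: str, options: list[str],
                 max_iters: int = 3) -> str:
    # D_0 = A_caption(A): coarse initial document from the ALLM.
    ev = Evidence(document=caption_agent(audio_path))

    for _ in range(max_iters):
        # (s, H_{i+1}) = A_plan(Q, L, D_i, H_i): sufficiency check.
        if planning_agent(question, options, ev) == "sufficient":
            break
        # P = A_interact(D_i, H_{i+1}): choose the evidence-augmentation tool.
        tool_plan = interaction_agent(ev)
        # D_{i+1} = D_i ⊕ E_new: fold fresh evidence into the document.
        ev.document += "\n" + augmentation_agent(tool_plan, audio_path)

    # (A*, S_c, R) = A_answer(D_f, Q, L): final synthesis over D_f.
    return answering_agent(question, options, ev)
```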

Empirical analysis demonstrates that optimal performance is achieved after two iterations on MMAU-mini and three on MMAR, with further refinement beyond these thresholds introducing noise and degrading performance.

4. Benchmark Performance and Ablation

AGR exhibits state-of-the-art performance on established audio reasoning benchmarks:

  • MMAU-mini: With 1,000 closed-form questions spanning sound, music, and speech modalities, AGR outperforms all open-source models and proprietary baselines such as Gemini-2.5-Flash and Gemini-2.0-Flash.
  • MMAR Benchmark: Covering both single and mixed audio type reasoning, AGR significantly surpasses open-source competitors and approaches the performance of leading proprietary systems such as Gemini-2.0-Flash-Lite.

Ablation studies confirm the criticality of the proactive iterative document refinement strategy. Excluding this loop consistently reduces performance, particularly on more challenging tasks. The use of more capable LLMs (e.g., GPT-4o versus GPT-3.5-turbo) also yields substantial benefits.

Performance summary (metrics trace to the original experiments):

Benchmark   | AGR Rank           | Strongest Baselines Outperformed
MMAU-mini   | 1st                | Gemini-2.5-Flash, Gemini-2.0-Flash
MMAR        | 1st (open-source)  | Open-source models (by a significant margin)

5. Technical Methodology

Key mechanisms in AGR’s methodology include:

  • Document Formation: $D_0 = \mathcal{A}_{\text{caption}}(A)$, formalizing the reduction of audio reasoning to the textual domain.
  • Iterative Sufficiency Checking: Each cycle assesses evidence completeness via the Planning Agent, with explicit state tracking through $(s, H_{i+1}) = \mathcal{A}_{\text{plan}}(Q, L, D_i, H_i)$.
  • Tool-Augmented Evidence Gathering: Repeated invocation of ASR, QA, or re-captioning tools keeps evidence gathering flexible and context-sensitive; a sketch of the integration step follows this list.
  • Answer Generation: $(A^*, S_c, R) = \mathcal{A}_{\text{answer}}(D_f, Q, L)$, delivering predictions grounded in a causally traceable chain of evidence.
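
As one concrete reading of the $\oplus$ operator, new evidence can be appended with provenance tags so that every claim in the final document traces back to the step and tool that produced it; the tagging scheme below is our illustrative assumption, not specified by the paper.

```python
def integrate(document: str, new_evidence: str, tool: str, step: int) -> str:
    """D_{i+1} = D_i ⊕ E_new: append tool-attributed evidence so the final
    chain of evidence stays causally traceable."""
    tagged = f"[step {step} | tool: {tool}]\n{new_evidence.strip()}"
    return f"{document}\n\n{tagged}"

# Example: folding an ASR transcript into the evidence chain at iteration 2.
doc = "A crowd cheers while a commentator speaks over stadium noise."
doc = integrate(doc, "ASR transcript: 'and the home team takes the lead!'",
                tool="asr", step=2)
```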

Performance trends indicate diminishing returns and even degradation when the iterative loop is unbounded; optimal iteration counts are empirically determined.

6. Implications and Future Directions

AGR’s transformation of audio reasoning into a text-centric paradigm and its coarse-to-fine, multi-agent refinement loop have several implications:

  • Generalization to New Audio Tasks: By fully leveraging the reasoning capacity of modern LLMs and modular ALLMs, AGR can be readily adapted to emerging domains without prohibitively expensive annotation efforts.
  • Modular Enhancement: The decoupling of perception and cognition enables independent improvements in either subsystem—for example, fine-tuning audio captioners or introducing new evidence-generation tools to enhance acoustic detail extraction.
  • Active Exploration Paradigm: The "diagnose-plan-act" iterative loop provides a systematic, self-improving template for reasoning in other multimodal domains, suggesting potential for applications in embodied intelligence and autonomous systems.
  • Research Foundation: By achieving strong results on standard benchmarks and highlighting the value of proactive, self-correcting refinement, AGR establishes a foundation for future advancements in interpretable, robust audio deep reasoning.

In sum, AudioGenie-Reasoner (AGR) constitutes a substantial advance in the field of audio reasoning by integrating modular audio perception with text-based iterative refinement, coordinated by a suite of LLM-driven agents. This framework achieves high-level performance while offering extensibility for further research and practical deployment in complex audio understanding settings (Rong et al., 21 Sep 2025).

References

  • Rong et al., AudioGenie-Reasoner, 21 Sep 2025.