AuditoryBench++: Auditory Commonsense Evaluation
- AuditoryBench++ is a comprehensive framework that evaluates language models’ auditory reasoning via five rigorously curated tasks.
- It integrates tasks on pitch, duration, loudness, animal sound recognition, and contextual reasoning using systematic validation methods.
- Employing AIR-CoT, the framework injects auditory knowledge directly into the inference process, achieving significant performance gains.
AuditoryBench++ is a comprehensive benchmark and methodological framework designed to evaluate and enhance the auditory commonsense and reasoning abilities of LLMs in text-only settings, simulating the human ability to reason about sound without directly hearing it. It enables fine-grained analysis across basic auditory comparisons and contextually grounded semantic reasoning, serving as both an evaluation suite and a catalyst for the development of novel auditory imagination methods.
1. Definition and Scope
AuditoryBench++ operationalizes auditory commonsense evaluation via five primary tasks (illustrative item formats are sketched after the list):
- Pitch Comparison: Binary judgments of which of two instrument sounds is higher in pitch, with pairs rigorously sourced and filtered to guarantee an objective pitch differential.
- Duration Comparison: Binary judgments of which sound lasts longer, constructed from AudioTime segment annotations with outlier filtering.
- Loudness Comparison: Identification of the louder of two sounds, scored against peak decibel statistics so the correct answer is unambiguous.
- Animal Sound Recognition: Multiple-choice matching of onomatopoeic expressions to animal categories, carefully verified for answer integrity.
- Auditory Context Reasoning: Multiple-choice questions adapted from MMAU audio questions, with the audio recontextualized as text through captions generated by Qwen2-Audio and refined with GPT-4o, probing nuanced, situation-dependent auditory cues.
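To make the task formats concrete, here is a hypothetical sketch of how individual items could be represented; the field names and example values are illustrative assumptions, not the released AuditoryBench++ schema.

```python
# Hypothetical item layouts for two AuditoryBench++ tasks.
# Field names and values are illustrative assumptions, not the released schema.

pitch_item = {
    "task": "pitch_comparison",
    "question": "Which instrument typically produces a higher-pitched sound: a piccolo or a tuba?",
    "options": ["piccolo", "tuba"],               # binary decision
    "answer": "piccolo",
}

animal_item = {
    "task": "animal_sound_recognition",
    "question": "Which animal makes the sound 'moo'?",
    "options": ["cow", "duck", "wolf", "frog"],   # multiple choice
    "answer": "cow",
}
```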
Each benchmark subset is created via a multi-stage curation pipeline of statistical filtering, manual validation, and targeted rewriting, preserving both basic perceptual properties and deeper contextual reasoning. Together, the tasks cover core auditory attributes (pitch, duration, amplitude) as well as semantic and situational inference.
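As one example of what the statistical-filtering stage could look like, a simple interquartile-range rule can drop duration outliers before comparison pairs are built; the specific rule and threshold below are assumptions for illustration, not the paper's documented criteria.

```python
import statistics

def iqr_filter(durations, k=1.5):
    """Drop segment durations outside [Q1 - k*IQR, Q3 + k*IQR].

    Illustrative sketch of a statistical-filtering step; the actual
    pipeline's statistics and thresholds are not specified here.
    """
    q1, _, q3 = statistics.quantiles(durations, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [d for d in durations if lo <= d <= hi]

# Example: the 120.0 s segment is treated as an outlier and removed.
print(iqr_filter([0.8, 1.2, 2.5, 3.0, 4.1, 120.0]))
```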
2. Relationship to Previous Benchmarks and Models
Earlier efforts such as AuditoryBench (Ok et al., 12 Sep 2024) focused on animal sound recognition and pitch comparison using queries constructed from LAION-Audio-630K, but these were limited in scope and primarily assessed elementary associations or single-dimension properties. AuditoryBench++ expands this with multi-task coverage and context-sensitive reasoning, aligning its construction principles with recommendations from auditory modeling literature (Vecchi et al., 2021)—notably, stage standardization, stimulus diversity, and carefully calibrated output conventions for comparability across models.
The benchmark also incorporates methodologies inspired by effective (functional) and biophysical auditory models, retaining, where applicable, the modular evaluation of processing stages such as outer/middle-ear simulation, cochlear filtering, and subcortical neural modeling, so that physiologically detailed systems can be integrated in the future.
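The stage-wise modularity referenced here can be pictured as a chain of interchangeable processing stages sharing a common interface; the stage functions below are illustrative placeholders, not components of AuditoryBench++.

```python
import numpy as np

# Placeholder stages standing in for outer/middle-ear filtering, cochlear
# filtering, and subcortical modeling. Because each stage is swappable behind
# a common interface, models that differ in a single stage remain comparable.

def outer_middle_ear(signal: np.ndarray, fs: int) -> np.ndarray:
    return signal  # e.g., a fixed headphone-to-eardrum transfer function

def cochlear_filterbank(signal: np.ndarray, fs: int) -> np.ndarray:
    # e.g., a gammatone filterbank; here a trivial single-channel passthrough
    return signal[np.newaxis, :]

def subcortical_model(channels: np.ndarray, fs: int) -> np.ndarray:
    return np.maximum(channels, 0.0)  # e.g., half-wave rectification

def run_pipeline(signal, fs, stages):
    out = signal
    for stage in stages:
        out = stage(out, fs)
    return out

fs = 16000
x = np.random.randn(fs)  # 1 s of noise as a dummy stimulus
rep = run_pipeline(x, fs, [outer_middle_ear, cochlear_filterbank, subcortical_model])
```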
3. Auditory Imagination Reasoning: AIR-CoT
AuditoryBench++ introduces AIR-CoT (“Auditory Imagination Reasoning—Chain of Thought”), a method for dynamic auditory knowledge generation and integration during inference:
- Span Detection (Stage 1): The model is fine-tuned via supervised learning to emit special tokens ([imagine], [/imagine]) marking text spans that require auditory knowledge for correct inference. The loss function during this stage operates exclusively on these special tokens, guiding reliable span detection without affecting answer production.
- Knowledge Injection (Stage 2): Upon encountering the closing token, generation halts for auditory embedding injection. A pre-trained audio model (e.g., CLAP) samples and encodes sound representations, which are projected into the LLM's hidden dimension via a two-layer MLP.
These adapted auditory features are inserted at the corresponding span within the chain-of-thought hidden states, enabling end-to-end fusion; the final training loss operates solely on the target answer tokens. A minimal sketch of both stages follows.
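The sketch below is a PyTorch-style illustration of the two stages under stated assumptions about the interfaces (the special-token ids, a 512-dimensional CLAP embedding, and the LLM hidden size); none of these names or dimensions come from the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Stage 1: supervise only the special span tokens ---
def span_detection_loss(logits: torch.Tensor,
                        labels: torch.Tensor,
                        special_token_ids: list[int]) -> torch.Tensor:
    # Keep supervision only where the target is [imagine] or [/imagine], so
    # training teaches where spans open/close without touching answer tokens.
    keep = torch.isin(labels, torch.tensor(special_token_ids, device=labels.device))
    masked = labels.masked_fill(~keep, -100)            # -100 is ignored by CE
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           masked.reshape(-1), ignore_index=-100)

# --- Stage 2: project audio embeddings into the LLM hidden space ---
# Dimensions are assumptions (CLAP embeddings are commonly 512-d; the LLM
# hidden size depends on the backbone).
class AudioProjector(nn.Module):
    def __init__(self, audio_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(                       # the two-layer MLP
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # At inference, when [/imagine] is emitted, the projected embedding
        # would be inserted at the span position in the chain-of-thought
        # hidden states before decoding resumes.
        return self.mlp(audio_emb)
```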
AIR-CoT distinguishes itself from prior cascade-based augmentation (e.g., Imagine to Hear (Yoo et al., 21 Mar 2025)) by integrating span detection and knowledge injection seamlessly rather than as sequential modules, yielding greater coherence and adaptability in multimodal reasoning.
4. Experimental Evaluation and Model Performance
Extensive benchmarking compared AIR-CoT against contemporary LLMs (LLaMA3.1, Qwen2.5, Phi4-mini), multimodal LLMs (Qwen2-Audio, Phi4-MM), and earlier auditory augmentation approaches (AudioBERT (Ok et al., 12 Sep 2024), Imagine to Hear (Yoo et al., 21 Mar 2025)). Key findings include:
- Off-the-shelf text-only models perform at near-random accuracy across most tasks.
- AudioBERT yields incremental improvements through retrieval-based knowledge injection, while Imagine to Hear delivers stronger results via generative auditory knowledge and fusion.
- AIR-CoT outperforms these on pitch comparison, animal sound recognition, and context reasoning tasks, with accuracy gains consistently in the range of +8 to +12 points over previous bests. Performance gains on duration and loudness tasks are less pronounced, attributed to the primarily semantic (not quantitative) encoding of audio features.
All experiments use standardized dataset splits with cross-validation and report absolute accuracy differences across conditions (development, test, and Wiki), substantiating the claims of superior auditory reasoning performance.
5. Methodological Best Practices and Modular Integration
AuditoryBench++ draws on best-practice guidelines from comparative auditory model research (Vecchi et al., 2021) to ensure methodological rigor:
- Processing Stage Standardization: Modular “building blocks” map onto biological analogues, enabling uniform evaluation and model adaptation.
- Benchmarking and Output Calibration: Metrics such as filter tuning (ERB numbers), AC/DC ratios quantifying phase locking, and adaptation curves inform both model evaluation and reporting.
- Level and Unit Calibration: All inputs and outputs conform to well-defined conventions (Pa, dB SPL, spikes/s, microvolts, model units) for interoperability; a small numeric example follows this list.
- Stimulus and Parameter Selection: The choice of stimulus (transient vs. steady-state, tonal vs. noise) and the discretization of filter channels are parameterized to expose model behavior under varied acoustic scenarios.
- Framework Extensibility: Support for impaired-hearing scenarios and modular evaluation facilitate comparison across normal and pathological states.
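To illustrate the calibration conventions above, the following uses the standard Glasberg-Moore ERB-number (Cam) formula and the usual dB SPL definition relative to 20 µPa; this is generic auditory-modeling arithmetic rather than code from AuditoryBench++.

```python
import math

def erb_number(f_hz: float) -> float:
    """ERB-number (Cam scale) after Glasberg & Moore (1990)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def db_spl(pressure_pa: float, p_ref: float = 20e-6) -> float:
    """Sound pressure level in dB SPL relative to 20 micropascals."""
    return 20.0 * math.log10(pressure_pa / p_ref)

print(round(erb_number(1000.0), 2))   # ~15.62 Cams at 1 kHz
print(round(db_spl(1.0), 1))          # 1 Pa RMS is ~94 dB SPL
```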
These practices underpin the reproducibility and reliability of AuditoryBench++, positioning it to serve a broad spectrum of auditory modeling applications, from psychoacoustic investigation to speech/intelligibility prediction.
6. Applications and Prospects
AuditoryBench++ and AIR-CoT demonstrate significant utility for both diagnostic and generative tasks within language modeling:
- Multimodal Interaction: Models enhanced via AIR-CoT can simulate human-like auditory reasoning in customer service, virtual assistants, accessible technology, or content generation, reflecting nuanced sound properties even without direct audio input.
- Narrative and Context Generation: Auditory imagination capacities allow for richer storytelling and environmental simulation, enabling LLMs to produce contextually appropriate auditory descriptions.
- Extensible Task Coverage: The modular task structure can be further expanded to incorporate additional aspects of auditory reasoning—source localization, speech segregation, or environment simulation—critical for both cognitive modeling and practical deployment.
7. Future Research Directions
Key directions proposed for advancing AuditoryBench++ include:
- Enhancing Representation: Addressing current limitations in capturing temporal and amplitude properties through new audio representation methods.
- Broader Multimodal Fusion: Integrating auditory imagination alongside other modalities (e.g., vision) for more sophisticated chain-of-thought reasoning.
- Diverse Task Expansion: Increasing the breadth and nuance of benchmark tasks to further differentiate model capabilities and inform architectural innovation.
- Unified End-to-End Architectures: Exploring joint optimization protocols for harmonized multimodal reasoning that balances semantic, temporal, and quantitative knowledge streams.
This suggests that AuditoryBench++ is both a state-of-the-art evaluation and development framework, catalyzing progress in the integration of auditory commonsense into LLMs and multimodal systems. Its future iterations may become central to the diagnosis and deployment of robust, physiologically plausible, and contextually aware AI agents in domains where auditory reasoning is essential.