RSA-Bench: Evaluating Audio Model Robustness
- RSA-Bench is an ecologically grounded benchmark that simulates realistic, multi-layered acoustic interference by mixing clean speech with up to four energy-matched environmental noise tracks.
- It systematically evaluates Audio Large Models on perceptual tasks (ASR, gender, emotion) and cognitive tasks (math reasoning, QA, instruction following) using quantitative performance metrics.
- Empirical insights reveal a perceptual-cognitive robustness gap, scenario-specific sensitivity to noise types, and a denoising paradox where standard enhancement methods may degrade model performance.
RSA-Bench is a large-scale, ecologically grounded benchmark designed to rigorously evaluate the robustness of Audio Large Models (ALMs) under realistic, multi-layered acoustic interference. Developed in response to the limitations of synthetic noise-based evaluations, RSA-Bench superimposes diverse real-world environmental soundscapes onto clean speech signals, systematically varying the complexity of acoustic interference and quantitatively assessing ALM performance across a spectrum of perceptual and cognitive tasks (Zhang et al., 15 Jan 2026).
1. Definition and Objectives
RSA-Bench (Real-world Scenarios for Audio Large Model Benchmarking) aims to expose and quantify the brittleness of ALMs in authentic acoustic ecologies. Unlike traditional benchmarks using artificial Gaussian noise or simplistic single-source interference, it constructs high-fidelity auditory stimuli by mixing clean speech with up to four distinct, energy-matched environmental sound tracks. This design enables systematic characterization of model degradation as “acoustic stress” increases. The primary metrics include preservation of speech intelligibility (Automatic Speech Recognition, ASR), speaker trait recognition (gender), paralinguistic cue detection (emotion), and high-level spoken reasoning abilities (arithmetic, QA, instruction following), as well as model sensitivity to environmental noise typology and denoising pre-processing (Zhang et al., 15 Jan 2026).
2. Construction Methodology
Each noisy sample consists of a clean speech signal $s[t]$ of length $T$ and $K$ real-world noise clips $n_k[t]$ of lengths $L_k$. The generation protocol encompasses:
- Temporal alignment via modular tiling/truncation: $\tilde{n}_k[t] = n_k[t \bmod L_k]$ for $t = 0, \dots, T-1$, ensuring all noise clips match the target utterance length.
- RMS energy normalization: $\hat{n}_k = \tilde{n}_k \cdot \mathrm{RMS}(s) / \mathrm{RMS}(\tilde{n}_k)$, with $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{T}\sum_t x[t]^2}$, to fix the nominal per-track signal-to-noise ratio (SNR) at approximately $0$ dB.
- Linear superposition and hard clipping: $y[t] = \operatorname{clip}\bigl(s[t] + \sum_{k=1}^{K} \hat{n}_k[t],\, -1,\, 1\bigr)$, or, in continuous time, $y(t) = \operatorname{clip}\bigl(s(t) + \sum_{k=1}^{K} \hat{n}_k(t),\, -1,\, 1\bigr)$.
By systematically varying the interference count $K \in \{1, 2, 3, 4\}$ across four acoustic scenarios, RSA-Bench generates 16 noisy conditions plus the clean reference for every tested utterance, resulting in over 100,000 total test clips (Zhang et al., 15 Jan 2026).
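The three-step protocol above can be sketched in a few lines of NumPy. This is a minimal illustration; the function name and arguments are hypothetical and not taken from the benchmark's released code:

```python
import numpy as np

def mix_noisy_sample(speech, noise_clips, clip_range=(-1.0, 1.0)):
    """Build one RSA-Bench-style noisy mixture from clean speech
    and K noise clips (illustrative sketch, not the official code)."""
    T = len(speech)
    mixture = speech.astype(np.float64).copy()
    speech_rms = np.sqrt(np.mean(speech ** 2))
    for noise in noise_clips:
        # 1. Temporal alignment: tile, then truncate to the utterance length.
        reps = int(np.ceil(T / len(noise)))
        aligned = np.tile(noise, reps)[:T]
        # 2. RMS normalization: match each track's energy to the speech (~0 dB SNR).
        aligned = aligned * (speech_rms / np.sqrt(np.mean(aligned ** 2)))
        # 3. Linear superposition.
        mixture += aligned
    # Hard clipping to the valid amplitude range.
    return np.clip(mixture, *clip_range)
```

Because each track is normalized independently, adding more tracks lowers the effective SNR of the mixture, which is precisely the "acoustic stress" gradient the benchmark exploits.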
3. Environmental Soundscapes
RSA-Bench models interference using four empirically selected environmental scenarios:
| Scenario | Interference Type |
|---|---|
| Pasture | Non-stationary animal vocalizations (cows, dogs, chickens, sheep) |
| Extreme Weather | Broadband mechanical noise (rain, wind), intermittent thunder, wind chimes |
| Classroom | Periodic acoustic gaps (clock ticks, typing, coughing, sipping) |
| Outdoors | Unstructured "vocal-like" sounds (children, birds), along with streams, footsteps |
Each noise source is RMS-aligned, ensuring that measured differences isolate the effects of interference structure rather than energy or loudness. The interference count $K$ controls the degree of ecological “stress.”
4. Task Taxonomy and Scoring Metrics
RSA-Bench covers six core tasks, partitioned into perception/paralinguistics and cognitive reasoning:
A. Perception & Paralinguistics
- Automatic Speech Recognition (ASR): Transcribe speech to text. Metric: Word Error Rate, $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$, $D$, $I$ are the counts of substitutions, deletions, and insertions, respectively, and $N$ is the number of words in the reference transcript.
- Gender Recognition (GR): Binary speaker gender classification (on IEMOCAP). Metric: Accuracy, $\mathrm{Acc} = N_{\mathrm{correct}} / N_{\mathrm{total}}$.
- Emotion Recognition (ER): Prosodic/affective state on MELD. Metric: LLM-judge score (0–100 scale).
B. Cognitive Reasoning
- Mathematical Reasoning (MR): Extract spoken numbers, compute results (SpokenMQA). Metric: Exact match accuracy.
- Speech QA (SQA): Answer factoid questions about passages (SLUE Phase-2). Metric: LLM-judge score or Mean Reciprocal Rank (MRR).
- Speech Instruction Following (SI): Follow open-ended audio instructions (OpenHermes). Metric: LLM-judge compliance.
Semantic tasks are scored using “LLM-as-a-judge” annotation (e.g., GPT-4o-mini), generating a normalized [0,100] metric.
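The WER metric above is a word-level Levenshtein distance normalized by reference length. A minimal reference implementation (a standard dynamic-programming formulation, not the benchmark's own scorer):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N over word tokens,
    computed via Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that because insertions are counted against a fixed reference length, WER can exceed 100% when a model hallucinates extra words, which is exactly what happens under heavy "vocal-like" interference in the findings below.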
5. Empirical Insights and Findings
RSA-Bench yields three pivotal findings (Zhang et al., 15 Jan 2026):
- Perception-Cognition Robustness Gap: Perceptual tasks (ASR, GR, ER) degrade sub-linearly as the interference count $K$ increases; e.g., GR remains above 80% accuracy even at $K = 4$. In contrast, cognitive tasks (MR, SQA, SI) exhibit functional collapse: mathematical accuracy and instruction/QA scores plummet as $K$ rises, displaying a near-exponential performance drop. For example, MR accuracy approaches zero by $K = 4$, and SI/SQA scores fall from roughly 80 to 20.
- Scenario Sensitivity: "Vocal-like" backgrounds (Outdoors) are the most destructive. For Qwen3-Omni at $K = 4$, ASR WER in the Outdoors scenario jumps to 259.56%, while in Classroom it is only 4.77%. Statistical t-tests confirm that Outdoors significantly reduces both ASR and semantic-task scores relative to mechanical or sparser noise ($p < 0.01$).
- The Denoising Paradox: Applying off-the-shelf speech enhancement (Noisereduce, RNNoise, Audio-Denoise, DeepFilterNet) seldom improves and often worsens ALM performance. For most tasks, $\mathrm{score}(\text{denoised}) \leq \mathrm{score}(\text{noisy})$. This is attributed to denoising-induced spectral and temporal distortions, which, while reducing energetic noise, introduce semantic artifacts detrimental to ALMs.
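To see how enhancement can remove noise energy yet distort content, consider the simplest classical method, spectral subtraction (used here purely for illustration; it is not one of the four enhancers evaluated). Hard-flooring the magnitude spectrum at zero deletes low-energy speech content along with the noise, the kind of artifact that can degrade a downstream ALM even when the waveform sounds cleaner:

```python
import numpy as np

def spectral_subtract(noisy, noise_estimate, frame=512):
    """Naive spectral subtraction: subtract an average noise magnitude
    spectrum frame-by-frame, flooring at zero. Illustrative sketch only."""
    out = np.zeros_like(noisy, dtype=np.float64)
    # Average noise magnitude spectrum from a noise-only recording.
    n_frames = len(noise_estimate) // frame
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise_estimate[i * frame:(i + 1) * frame]))
         for i in range(n_frames)], axis=0)
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        # Hard floor at zero: bins where speech is weaker than the noise
        # estimate are erased entirely -- the source of "musical noise" artifacts.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        phase = np.angle(spec)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * phase), n=frame)
    return out
```

The zeroed bins and per-frame phase discontinuities are exactly the sort of spectral and temporal distortions the paradox attributes the degradation to: energetically quieter, semantically lossier.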
6. Design Recommendations and Future Directions
Based on observed model fragility under ecologically valid interference, the following strategies are recommended (Zhang et al., 15 Jan 2026):
- Noise-aware instruction tuning: Fine-tune ALMs using multi-source, ecologically valid noise mixtures to promote modality-invariant representations.
- Adversarial training with real-world noise: Train with mixtures across interference sources to flatten the robustness curve for complex tasks.
- Integrated denoising front-ends: Jointly optimize denoising and downstream ALM objectives, rather than relying on fixed DSP pipelines.
- Scenario-adaptive attention mechanisms: Develop modules that can detect and dynamically downweight "vocal-like" nuisance signals, modulating internal focus appropriately.
A plausible implication is that robust audio understanding in the wild will require not only architectural innovations but also fundamentally new training paradigms exploiting the kind of stress gradients instantiated by RSA-Bench.
7. Significance and Impact
RSA-Bench represents a transition to ecologically valid, stress-driven evaluation of ALMs, revealing brittleness not exposed by conventional synthetic noise protocols. It provides a large-scale, reproducible foundation for both diagnostic and training interventions aimed at achieving intrinsic robustness for models deployed in real-world auditory environments. The comprehensive public dataset, codebase, and systematic methodology facilitate broad adoption for model development and cross-benchmarking within the speech and audio research community (Zhang et al., 15 Jan 2026).