RSA-Bench: Evaluating Audio Model Robustness
- RSA-Bench is an ecologically grounded benchmark that simulates realistic, multi-layered acoustic interference by mixing clean speech with up to four energy-matched environmental noise tracks.
- It systematically evaluates Audio Large Models on perceptual tasks (ASR, gender, emotion) and cognitive tasks (math reasoning, QA, instruction following) using quantitative performance metrics.
- Empirical insights reveal a perceptual-cognitive robustness gap, scenario-specific sensitivity to noise types, and a denoising paradox where standard enhancement methods may degrade model performance.
RSA-Bench is a large-scale, ecologically grounded benchmark designed to rigorously evaluate the robustness of Audio Large Models (ALMs) under realistic, multi-layered acoustic interference. Developed in response to the limitations of synthetic noise-based evaluations, RSA-Bench superimposes diverse real-world environmental soundscapes onto clean speech signals, systematically varying the complexity of acoustic interference and quantitatively assessing ALM performance across a spectrum of perceptual and cognitive tasks (Zhang et al., 15 Jan 2026).
1. Definition and Objectives
RSA-Bench (Real-world Scenarios for Audio Large Model Benchmarking) aims to expose and quantify the brittleness of ALMs in authentic acoustic ecologies. Unlike traditional benchmarks using artificial Gaussian noise or simplistic single-source interference, it constructs high-fidelity auditory stimuli by mixing clean speech with up to four distinct, energy-matched environmental sound tracks. This design enables systematic characterization of model degradation as “acoustic stress” increases. The primary metrics include preservation of speech intelligibility (Automatic Speech Recognition, ASR), speaker trait recognition (gender), paralinguistic cue detection (emotion), and high-level spoken reasoning abilities (arithmetic, QA, instruction following), as well as model sensitivity to environmental noise typology and denoising pre-processing (Zhang et al., 15 Jan 2026).
2. Construction Methodology
Each noisy sample consists of a clean speech signal $s[t]$ of length $T$ and $K$ real-world noise clips $n_k[t]$ of lengths $L_k$. The generation protocol encompasses:
- Temporal alignment via modular tiling/truncation: $\tilde{n}_k[t] = n_k[t \bmod L_k]$ for $t = 0, \dots, T-1$, ensuring all noise clips match the target utterance length.
- RMS energy normalization: $\hat{n}_k = \tilde{n}_k \cdot \mathrm{RMS}(s) / \mathrm{RMS}(\tilde{n}_k)$, with $\mathrm{RMS}(x) = \sqrt{\tfrac{1}{T}\sum_t x[t]^2}$, to fix the nominal per-track signal-to-noise ratio (SNR) at approximately $0$ dB.
- Linear superposition and hard clipping: $y[t] = \operatorname{clip}\bigl(s[t] + \sum_{k=1}^{K} \hat{n}_k[t],\, -1,\, 1\bigr)$, or, in continuous time, $y(t) = \operatorname{clip}\bigl(s(t) + \sum_{k=1}^{K} \hat{n}_k(t),\, -1,\, 1\bigr)$.
By systematically varying the interference count $K \in \{1, 2, 3, 4\}$ across four acoustic scenarios, RSA-Bench generates 16 noisy conditions plus the clean reference for every tested utterance, resulting in over 100,000 total test clips (Zhang et al., 15 Jan 2026).
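The three-step protocol above can be sketched in a few lines of NumPy. This is a minimal illustration; the function name and arguments are hypothetical and not taken from the benchmark's released code:

```python
import numpy as np

def mix_noisy_sample(speech, noise_clips, clip_range=(-1.0, 1.0)):
    """Build one RSA-Bench-style noisy mixture from clean speech
    and K noise clips (illustrative sketch, not the official code)."""
    T = len(speech)
    mixture = speech.astype(np.float64).copy()
    speech_rms = np.sqrt(np.mean(speech ** 2))
    for noise in noise_clips:
        # 1. Temporal alignment: tile, then truncate to the utterance length.
        reps = int(np.ceil(T / len(noise)))
        aligned = np.tile(noise, reps)[:T]
        # 2. RMS normalization: match each track's energy to the speech (~0 dB SNR).
        aligned = aligned * (speech_rms / np.sqrt(np.mean(aligned ** 2)))
        # 3. Linear superposition.
        mixture += aligned
    # Hard clipping to the valid amplitude range.
    return np.clip(mixture, *clip_range)
```

Because each track is normalized independently, adding more tracks lowers the effective SNR of the mixture, which is precisely the "acoustic stress" gradient the benchmark exploits.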
3. Environmental Soundscapes
RSA-Bench models interference using four empirically selected environmental scenarios:
| Scenario | Interference Type |
|---|---|
| Pasture | Non-stationary animal vocalizations (cows, dogs, chickens, sheep) |
| Extreme Weather | Broadband mechanical noise (rain, wind), intermittent thunder, wind chimes |
| Classroom | Periodic acoustic gaps (clock ticks, typing, coughing, sipping) |
| Outdoors | Unstructured "vocal-like" sounds (children, birds), along with streams, footsteps |
Each noise source is RMS-aligned, ensuring that measured differences isolate the effects of interference structure rather than energy or loudness. The interference count $K$ controls the degree of ecological “stress.”
4. Task Taxonomy and Scoring Metrics
RSA-Bench covers six core tasks, partitioned into perception/paralinguistics and cognitive reasoning:
A. Perception & Paralinguistics
- Automatic Speech Recognition (ASR): Transcribe speech to text. Metric: Word Error Rate, $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$, $D$, $I$ are the counts of substitutions, deletions, and insertions, respectively, and $N$ is the number of words in the reference transcript.
- Gender Recognition (GR): Binary speaker gender classification (on IEMOCAP). Metric: Accuracy, $\mathrm{Acc} = N_{\mathrm{correct}} / N_{\mathrm{total}}$.
- Emotion Recognition (ER): Prosodic/affective state on MELD. Metric: LLM-judge score (0–100 scale).
B. Cognitive Reasoning
- Mathematical Reasoning (MR): Extract spoken numbers, compute results (SpokenMQA). Metric: Exact match accuracy.
- Speech QA (SQA): Answer factoid questions about passages (SLUE Phase-2). Metric: LLM-judge score or Mean Reciprocal Rank (MRR).
- Speech Instruction Following (SI): Follow open-ended audio instructions (OpenHermes). Metric: LLM-judge compliance.
Semantic tasks are scored using “LLM-as-a-judge” annotation (e.g., GPT-4o-mini), generating a normalized [0,100] metric.
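The WER metric above is a word-level Levenshtein distance normalized by reference length. A minimal reference implementation (a standard dynamic-programming formulation, not the benchmark's own scorer):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (S + D + I) / N over word tokens,
    computed via Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that because insertions are counted against a fixed reference length, WER can exceed 100% when a model hallucinates extra words, which is exactly what happens under heavy "vocal-like" interference in the findings below.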
5. Empirical Insights and Findings
RSA-Bench yields three pivotal findings (Zhang et al., 15 Jan 2026):
- Perception-Cognition Robustness Gap: Perceptual tasks (ASR, GR, ER) degrade sub-linearly as the interference count $K$ increases; e.g., GR remains above 80% accuracy even at $K = 4$. In contrast, cognitive tasks (MR, SQA, SI) exhibit functional collapse: mathematical accuracy and instruction/QA scores plummet as $K$ rises, displaying a near-exponential performance drop. For example, MR accuracy approaches zero by $K = 4$, and SI/SQA scores fall from roughly 80 to 20.
- Scenario Sensitivity: "Vocal-like" backgrounds (Outdoors) are the most destructive. For Qwen3-Omni at $K = 4$, ASR WER in the Outdoors scenario jumps to 259.56%, while in Classroom it is only 4.77%. Statistical t-tests confirm that Outdoors significantly reduces both ASR and semantic-task scores relative to mechanical or sparser noise ($p < 0.01$).
- The Denoising Paradox: Applying off-the-shelf speech enhancement (Noisereduce, RNNoise, Audio-Denoise, DeepFilterNet) seldom improves and often worsens ALM performance. For most tasks, $\mathrm{score}(\text{denoised}) \leq \mathrm{score}(\text{noisy})$. This is attributed to denoising-induced spectral and temporal distortions, which, while reducing energetic noise, introduce semantic artifacts detrimental to ALMs.
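To see how enhancement can remove noise energy yet distort content, consider the simplest classical method, spectral subtraction (used here purely for illustration; it is not one of the four enhancers evaluated). Hard-flooring the magnitude spectrum at zero deletes low-energy speech content along with the noise, the kind of artifact that can degrade a downstream ALM even when the waveform sounds cleaner:

```python
import numpy as np

def spectral_subtract(noisy, noise_estimate, frame=512):
    """Naive spectral subtraction: subtract an average noise magnitude
    spectrum frame-by-frame, flooring at zero. Illustrative sketch only."""
    out = np.zeros_like(noisy, dtype=np.float64)
    # Average noise magnitude spectrum from a noise-only recording.
    n_frames = len(noise_estimate) // frame
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise_estimate[i * frame:(i + 1) * frame]))
         for i in range(n_frames)], axis=0)
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        # Hard floor at zero: bins where speech is weaker than the noise
        # estimate are erased entirely -- the source of "musical noise" artifacts.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        phase = np.angle(spec)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * phase), n=frame)
    return out
```

The zeroed bins and per-frame phase discontinuities are exactly the sort of spectral and temporal distortions the paradox attributes the degradation to: energetically quieter, semantically lossier.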
6. Design Recommendations and Future Directions
Based on observed model fragility under ecologically valid interference, the following strategies are recommended (Zhang et al., 15 Jan 2026):
- Noise-aware instruction tuning: Fine-tune ALMs using multi-source, ecologically valid noise mixtures to promote modality-invariant representations.
- Adversarial training with real-world noise: Train with mixtures across interference sources to flatten the robustness curve for complex tasks.
- Integrated denoising front-ends: Jointly optimize denoising and downstream ALM objectives, rather than relying on fixed DSP pipelines.
- Scenario-adaptive attention mechanisms: Develop modules that can detect and dynamically downweight "vocal-like" nuisance signals, modulating internal focus appropriately.
A plausible implication is that robust audio understanding in the wild will require not only architectural innovations but also fundamentally new training paradigms exploiting the kind of stress gradients instantiated by RSA-Bench.
7. Significance and Impact
RSA-Bench represents a transition to ecologically valid, stress-driven evaluation of ALMs, revealing brittleness not exposed by conventional synthetic noise protocols. It provides a large-scale, reproducible foundation for both diagnostic and training interventions aimed at achieving intrinsic robustness for models deployed in real-world auditory environments. The comprehensive public dataset, codebase, and systematic methodology facilitate broad adoption for model development and cross-benchmarking within the speech and audio research community (Zhang et al., 15 Jan 2026).