MUSHRA Listening Test Overview

Updated 26 November 2025
  • MUSHRA Listening Test is a controlled protocol for assessing perceived audio quality using hidden reference and anchor stimuli.
  • It enables fine-grained discrimination of audio systems such as codecs, TTS models, and spatial renderers through rigorous statistical analysis.
  • Variants like MUSHRA-NMR and MUSHRA-DG address biases and improve reliability, fostering innovations in audio evaluation.

The MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) listening test is a controlled perceptual evaluation protocol, originally adopted in ITU-R BS.1534, for quantifying perceived audio quality across multiple systems or conditions. It has become the reference standard for fine-grained discrimination among high-quality audio codecs, neural text-to-speech (TTS) models, spatial audio renderers, and various speech enhancement pipelines. MUSHRA's core structure, reliance on reference and anchor stimuli, and design for batch evaluation of many systems underpin its widespread adoption for scientific benchmarking. However, as state-of-the-art generative systems now rival or occasionally surpass human references—particularly in speech synthesis and coding—recent research has scrutinized the statistical sensitivity, experimental biases, and practical limits of the MUSHRA paradigm, prompting substantial methodological innovation.

1. Standard MUSHRA Test: Protocol and Statistical Framework

The canonical MUSHRA test presents listeners with a set of N+2 audio stimuli of the same content per trial page:

  • Explicit Reference (REF): A labeled sample of the original, unprocessed audio or human speech.
  • Low-Quality Anchor (ANC): A deliberately degraded version (e.g., band-limited or resampled) to define the bottom of the subjective scale.
  • N Anonymized System Outputs: Outputs from candidate systems under test, randomized in order.

Listeners rate each sample on a continuous 0–100 scale, segmented as follows: 100–80 (Excellent), 80–60 (Good), 60–40 (Fair), 40–20 (Poor), 20–0 (Bad). No direct pairwise comparisons are required; each trial enables simultaneous evaluation of all systems.
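
As a rough illustration of this trial-page structure, the sketch below assembles one MUSHRA page by shuffling the hidden reference, the anchor, and the candidate system outputs for a single utterance; the function and file names are hypothetical, not part of the standard.

```python
import random

def build_mushra_page(utterance_id, system_outputs, reference, anchor, seed=None):
    """Assemble one MUSHRA trial page: the N system outputs plus the hidden
    reference and the low-quality anchor, shuffled into a random order.
    system_outputs maps system names to (hypothetical) audio file paths."""
    stimuli = [("REF_hidden", reference), ("ANC", anchor)]
    stimuli += [(f"SYS_{name}", path) for name, path in system_outputs.items()]
    random.Random(seed).shuffle(stimuli)  # anonymize presentation order
    return {
        "utterance": utterance_id,
        "labeled_reference": reference,   # shown explicitly as REF
        "stimuli": stimuli,               # rated blind on the 0-100 scale
    }

page = build_mushra_page(
    "utt_001",
    {"codec_a": "utt_001_codec_a.wav", "tts_b": "utt_001_tts_b.wav"},
    reference="utt_001_ref.wav",
    anchor="utt_001_anchor.wav",
    seed=42,
)
```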

Scoring and Aggregation:

For each system, the standard outputs are the mean μ, standard deviation σ, and the 95% confidence interval

$$\mathrm{CI} = 1.96 \cdot \frac{\sigma}{\sqrt{L \cdot U}}$$

where L is the number of listeners and U is the number of utterances or items per system (Varadhan et al., 19 Nov 2024).
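
A minimal numpy sketch of this aggregation, assuming the ratings for one system are stored as a listeners × utterances matrix:

```python
import numpy as np

def mushra_summary(ratings):
    """Summarize 0-100 MUSHRA ratings for one system.
    ratings: array of shape (L, U) -- L listeners x U utterances (assumed layout)."""
    L, U = ratings.shape
    mu = ratings.mean()
    sigma = ratings.std(ddof=1)
    ci = 1.96 * sigma / np.sqrt(L * U)   # 95% CI half-width, as in the formula above
    return mu, sigma, ci

# Example with simulated scores: 30 listeners, 20 utterances
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(75, 10, size=(30, 20)), 0, 100)
mu, sigma, ci = mushra_summary(scores)
print(f"mean={mu:.1f}  sd={sigma:.1f}  95% CI=±{ci:.2f}")
```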

Statistical Analysis:

Typical inferential methods include t-tests or repeated-measures ANOVA on per-utterance means, with explicit reporting of p-values for significant differences (Gaznepoglu et al., 2023).
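
For instance, a paired t-test between two systems' per-utterance means can be run with scipy; the scores below are invented for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical per-utterance mean scores (averaged over listeners)
# for two systems rated on the same utterances.
system_a = np.array([82.1, 79.4, 85.0, 77.8, 80.3, 83.6])
system_b = np.array([78.9, 76.2, 81.7, 75.0, 79.1, 80.4])

# Paired t-test on per-utterance means, as described above.
t_stat, p_value = stats.ttest_rel(system_a, system_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```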

Participant Screening:

Rater exclusion is commonly applied: listeners who rate the hidden REF below a set threshold (e.g., under 90 on more than 15% of trials) are removed to ensure data quality (Torcoli et al., 2023, Joubaud et al., 4 Jun 2025, Ostan et al., 1 Aug 2025).
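
A screening rule of this kind might be implemented as follows; the threshold values and array layout are assumptions rather than requirements of the standard.

```python
import numpy as np

def screen_listeners(ref_scores, threshold=90, max_fail_rate=0.15):
    """Flag listeners to keep based on their hidden-reference ratings.
    ref_scores: array of shape (L, T) -- each listener's score for the
    hidden reference on T trials (assumed layout). Returns a boolean mask."""
    fail_rate = (ref_scores < threshold).mean(axis=1)
    return fail_rate <= max_fail_rate

rng = np.random.default_rng(1)
ref_scores = np.clip(rng.normal(95, 6, size=(12, 20)), 0, 100)
keep = screen_listeners(ref_scores)
print(f"Keeping {keep.sum()} of {len(keep)} listeners")
```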

2. Limitations of Classical MUSHRA With Modern Generative Systems

Recent evaluations of near-human and super-human generative models have revealed substantive flaws in standard MUSHRA when applied to high-performance TTS, speech codecs, and enhancement systems.

  • Reference-Matching Bias:

Listeners implicitly judge system outputs by their proximity to the labeled REF, which artificially lowers the score when a system is prosodically or perceptually different from the reference, even if it is subjectively superior (Varadhan et al., 19 Nov 2024). For example, VITS TTS achieved a CMOS close to REF (CMOS ≈ –0.10), yet showed a MUSHRA gap of over 16 points (REF 84.2, VITS 67.7).

  • Judgment Ambiguity:

A single 0–100 "quality" slider leaves raters uncertain whether to prioritize intelligibility, prosody, absence of artifacts, or pronunciation, yielding high inter- and intra-rater variance (Hindi/Tamil TTS, Figure 1 in (Varadhan et al., 19 Nov 2024)).

  • Listener Fatigue and Scaling:

Presenting more than 5–12 items per page degrades reliability. Absolute Category Rating (ACR), while more scalable, saturates at the top of the scale and cannot discriminate fine detail (Lechler et al., 23 Sep 2025).

Quantitative Sensitivity:

Performance plateaus: adding listeners improves score reliability more substantially than increasing utterances, with >95% rank correlation achievable with as few as 20 utterances or 40 listeners (Varadhan et al., 19 Nov 2024).
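
One way to probe this sensitivity is to subsample listeners and measure how well the reduced panel reproduces the full-panel ranking. The sketch below, assuming a listeners × utterances × systems score array, estimates the mean Spearman correlation over random subsamples.

```python
import numpy as np
from scipy.stats import spearmanr

def panel_rank_correlation(ratings, n_listeners, n_boot=200, seed=0):
    """Estimate how well a reduced listener panel reproduces the full ranking.
    ratings: array of shape (L, U, S) -- listeners x utterances x systems
    (assumed layout). Returns the mean Spearman correlation between system
    means from random listener subsamples and the full panel."""
    rng = np.random.default_rng(seed)
    full_means = ratings.mean(axis=(0, 1))  # one mean score per system
    corrs = []
    for _ in range(n_boot):
        idx = rng.choice(ratings.shape[0], size=n_listeners, replace=False)
        rho, _ = spearmanr(ratings[idx].mean(axis=(0, 1)), full_means)
        corrs.append(rho)
    return float(np.mean(corrs))

# Example with simulated data: 60 listeners, 30 utterances, 6 systems
rng = np.random.default_rng(1)
ratings = np.clip(rng.normal(70, 12, size=(60, 30, 6)) +
                  np.linspace(-10, 10, 6), 0, 100)
print(panel_rank_correlation(ratings, n_listeners=40))
```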

3. Variants and Modern Modifications of MUSHRA

MUSHRA-NMR (No Mentioned Reference)

Hides the explicit human reference from the interface, mitigating reference-matching bias. All other protocol elements (anchor, utterance count, randomization) are retained, but the REF is included only for CI computation and is reported as an anonymized "System X" (Varadhan et al., 19 Nov 2024). The gap between the best TTS and REF shrank by up to 6 points in Hindi/Tamil experiments; the REF's own score fell, reflecting stricter, less biased judgments. Reliability (e.g., Spearman rank correlation > 0.95) is robust with moderate sample sizes.

MUSHRA-DG (Detailed Guidelines)

Implements multidimensional error annotation: listeners tally specific error categories (mild/severe pronunciation, speed alteration, artifacts, word skips) alongside 0–100 ratings for liveliness, voice quality, and rhythm. A deterministic formula combines these into a single MUSHRA score:

$$S_M = \frac{L + VQ + R}{3} - \min(MP, 15)\times 5 - \min(SP, 7)\times 10 - US\times 5 - DA\times 5 - SEF\times 5 - WS\times 25$$

where L, VQ, and R are the liveliness, voice-quality, and rhythm ratings and the remaining abbreviations correspond to tallied perceptual error classes (Varadhan et al., 19 Nov 2024). Inter-rater standard deviation drops by 43%–53%, and system means converge toward CMOS, isolating faults such as word skips and energy fluctuations.
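
A direct transcription of this formula into code might look as follows; the parameter names expanding the abbreviations are illustrative guesses rather than the paper's exact terminology.

```python
def mushra_dg_score(liveliness, voice_quality, rhythm,
                    mild_pron, severe_pron, unusual_speed,
                    artifacts, energy_fluct, word_skips):
    """Combine MUSHRA-DG annotations into a single score following the
    deterministic formula above. The three 0-100 attribute ratings map to
    L, VQ, R; the error tallies map to MP, SP, US, DA, SEF, WS (assumed
    expansions of the abbreviations)."""
    score = (liveliness + voice_quality + rhythm) / 3.0
    score -= min(mild_pron, 15) * 5
    score -= min(severe_pron, 7) * 10
    score -= unusual_speed * 5
    score -= artifacts * 5
    score -= energy_fluct * 5
    score -= word_skips * 25
    return score

# Example: strong attribute ratings with one mild pronunciation error
print(mushra_dg_score(85, 88, 82, mild_pron=1, severe_pron=0,
                      unusual_speed=0, artifacts=0, energy_fluct=0,
                      word_skips=0))   # -> 80.0
```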

MUSHRA-DG-NMR (Combined Variant)

Simultaneously removes the reference and introduces detailed error guidelines. This approach yields low-variance, high-precision ratings (σ ≈ 10–12), with system means closely aligning and all scores residing in the "Excellent" range for advanced models.

MUSHRA-1S (Single-Stimulus)

Each trial page presents only the test system, a fixed anchor, and a fixed hidden reference. Listeners position a single slider on the 0–100 scale, where the endpoints are explicitly tied to the anchor and reference (Lechler et al., 23 Sep 2025). MUSHRA-1S matches the sensitivity of standard MUSHRA in distinguishing small artifacts, even among top-tier systems, while offering the scalability and throughput of ACR. Absolute differences from canonical MUSHRA ratings are typically below 1.5 points, and the method is robust against range-equalizing biases.

4. MUSHRA Beyond Speech: Applications and Adaptations

MUSHRA has been extended to multiple domains:

  • Audio Quality and Codec Evaluation:

Used for both legacy DSP and neural codecs, often with expert panels and broader stimulus sets (Torcoli et al., 2023, Lanzendörfer et al., 24 Nov 2025).

  • Spatial Audio in Virtual Reality:

VR MUSHRA platforms allow assessment across multiple spatial positions and dimensions (localizability, spatial and timbral quality) in immersive environments. Listeners freely teleport among measurement points and adjust multiple sliders per attribute, while the platform logs their behavior (Ostan et al., 1 Aug 2025).

  • Speech Enhancement and Sensor Evaluation:

Protocols tailored for body-conducted sensors compare raw vs. enhanced signals from multiple device types (accelerometer, in-ear, throat), using anchor/reference calibration and detailed statistical reporting (mean, 95% CI, gender separation) (Joubaud et al., 4 Jun 2025).

Crowdsourced adaptations with non-expert raters implement rigorous training, qualification (e.g., digits-in-noise, gold-standard practice pages), and pre/post-screening for reliability. Cross-platform bias studies show that Prolific closely matches expert-lab absolute ratings, while MTurk exhibits a systematic upward bias but preserves ranking. Strong test–retest reliability (ICC > 0.99) is achievable with 15–25 judgements per item (Lechler et al., 1 Jun 2025).
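
An agreement statistic of this kind can be computed, in one common formulation (two-way random effects, average measures, i.e. ICC(2,k)), roughly as sketched below; the array layout and the choice of ICC variant are assumptions.

```python
import numpy as np

def icc_2k(x):
    """ICC(2,k): two-way random-effects, average-measures intraclass correlation.
    x: array of shape (n_items, k) -- e.g. per-item mean scores from k
    repeated crowdsourced panels (assumed layout)."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-item mean square
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-panel mean square
    sse = ((x - grand) ** 2).sum() - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                        # residual mean square
    return (msr - mse) / (msr + (msc - mse) / n)

rng = np.random.default_rng(2)
item_quality = rng.uniform(30, 95, size=50)                # latent per-item quality
panels = item_quality[:, None] + rng.normal(0, 2, size=(50, 3))
print(icc_2k(panels))   # close to 1 when repeated panels agree
```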

5. Statistical Formulations, Data Analysis, and Metric Correlation

Core Statistical Metrics:

  • Mean score: $\bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  • Confidence interval: 95% CI $\approx \bar{X} \pm t_{0.975,\,N-1} \cdot \mathrm{SE}$, with $\mathrm{SE} = \sigma / \sqrt{N}$
  • Paired t-test: $t = (\bar{\Delta} - 0) / (s_\Delta / \sqrt{N})$
  • Pearson correlation coefficient: used for aligning subjective and objective metrics (Lanzendörfer et al., 24 Nov 2025).
  • Inter-lab offsets: observed on low-quality anchors but converging at higher quality levels.

Metric Alignment:

MUSHRA results provide the ground truth against which objective measurements are calibrated. For neural codecs, classic PESQ (ρ = 0.886), SCOREQ (ρ = 0.937 for speech-only), and to a lesser extent STOI and ViSQOL-Speech demonstrate strong correlations with mean MUSHRA scores, whereas neural-MOS predictors (NISQA, DNSMOS) underperform, especially for generative models (Lanzendörfer et al., 24 Nov 2025, Lechler et al., 1 Jun 2025, Torcoli et al., 2023).
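
Such correlations are straightforward to compute once per-condition MUSHRA means and objective scores are aligned; the values below are invented purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-condition values: mean MUSHRA scores and an objective
# metric (e.g. PESQ) computed on the same processed conditions.
mushra_means = np.array([88.2, 74.5, 61.0, 42.3, 30.1])
metric_scores = np.array([4.3, 3.8, 3.1, 2.2, 1.6])

rho, p = pearsonr(metric_scores, mushra_means)
print(f"Pearson rho = {rho:.3f} (p = {p:.3g})")
```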

Table: MUSHRA Score Ranges and Variance Reduction (from Varadhan et al., 19 Nov 2024)

Variant          Hindi σ    Tamil σ    Bias correction
Standard         ~22        ~18        None
MUSHRA-NMR       ~20        ~17        Hides reference
MUSHRA-DG        ~12        ~8         Error categories
MUSHRA-DG-NMR    ~10–12     ~10–12     Both of the above

Reducing both reference bias and rater ambiguity is essential for robust, actionable TTS and audio evaluation.

6. Best Practices and Implementation Guidelines

  • Anchor Selection:

Use an unambiguously sub-"Poor" anchor, but omit high-quality anchors to avoid downward bias.

  • Reference Handling:

Under NMR and DG-NMR, hide the explicit reference so listeners judge absolute quality, not similarity.

  • Detailed Criteria:

Employ explicit error categories and penalties for variance reduction and unified ratings.

  • Sampling:

Target ≥ 30 listeners and ≥ 30 utterances for 95%+ rank correlation; prioritize listener count over utterance count if constrained (Varadhan et al., 19 Nov 2024).

  • Rater Training:

Pilot with feedback for calibration; include gold-standard and catch-trial mechanisms for quality control (Lechler et al., 1 Jun 2025).

  • Session Management:

Split long test sets, encourage breaks, and avoid continuous exposure to prevent fatigue-induced bias.

  • Score Reporting:

Always report μ, σ, 95% CI, detailed demographics, instructions, anchor/reference design, and rater exclusions for reproducibility.

  • Crowdsourcing:

Utilize comprehensive screening, qualification, and interface simplification for robust, cost-effective data collection. Use platform-specific bias correction as needed (Lechler et al., 1 Jun 2025).

MUSHRA-1S enables scalable, high-precision large-scale benchmarking with fixed anchor/reference context and is recommended for longitudinal or batch studies of top-tier systems (Lechler et al., 23 Sep 2025).

7. Future Directions and Open Research Questions

Methodological research on MUSHRA continues to evolve. Key frontiers include:

  • Automated Preference Prediction:

Pairwise preference networks (e.g., PrefNet) trained on MUSHRA data automate subjective ranking, potentially reducing experimental cost and screening "obvious" pairs (Valentini-Botinhao et al., 2022).

  • Probabilistic Modeling:

Conditional distribution modeling (Generative Machine Listener) enables simulation of virtual panels, confident prediction of score CIs, and adaptive allocation of rater effort (Jiang et al., 2023).

  • Domain-Adapted Protocols:

Extensions to spatial audio, VR immersion, and body-conducted speech all underscore the flexibility of the MUSHRA protocol and the necessity for context-sensitive modifications (Ostan et al., 1 Aug 2025, Joubaud et al., 4 Jun 2025).

  • Objective Metric Generalization:

As generative systems surpass traditional codec boundaries, benchmarking of new metrics against MUSHRA ground truth remains critically important (Lanzendörfer et al., 24 Nov 2025, Lechler et al., 1 Jun 2025).

MUSHRA, in its evolving forms, remains foundational for perceptual audio evaluation. Ongoing innovation addresses its limitations, ensuring continued utility across the full spectrum of generative and signal-processing audio technologies.
