Mean Opinion Scores (MOS)
- The Mean Opinion Score (MOS) is a subjective metric, defined as the arithmetic mean of listener ratings on a fixed 5-point scale, that quantifies speech and media quality.
- Data quality control techniques like validation clips, outlier rejection, and bias adjustments ensure reliable aggregation of MOS at both utterance and system levels.
- Recent advancements extend MOS to multivariate and distributional models, enabling automatic prediction and improved handling of human rating biases.
Mean Opinion Scores (MOS) are the principal subjective metric for evaluating perceived quality in synthesized speech, telecommunication, and other media systems. MOS aggregates categorical human ratings into a scalar, serving as the de facto standard for benchmarking generative audio and speech systems, and is foundational to both large-scale subjective evaluations and the training and calibration of non-intrusive automatic predictors.
1. Formal Definition and Subjective Protocols
MOS is defined as the arithmetic mean of $N$ human-assigned ratings, typically on a five-point absolute category rating (ACR) scale. For a given stimulus $x$ (e.g., a speech utterance) with listener responses $r_1, \dots, r_N$:

$$\mathrm{MOS}(x) = \frac{1}{N} \sum_{i=1}^{N} r_i$$
MOS is collected for either individual utterances or aggregated at the system level (e.g., a single text-to-speech system) by averaging across multiple utterances (Maniati et al., 2022). Human listeners assign ratings according to fixed scale anchors, such as:
| Value | Anchor (Speech Naturalness) |
|---|---|
| 1 | very unnatural |
| 2 | somewhat unnatural |
| 3 | neither natural nor unnatural |
| 4 | somewhat natural |
| 5 | completely natural |
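The two aggregation levels can be illustrated with a minimal sketch. The tuple layout `(system_id, utterance_id, score)`, the function names, and the toy ratings are assumptions made for illustration only; they are not taken from SOMOS or any other dataset.

```python
from collections import defaultdict
from statistics import mean


def utterance_mos(ratings):
    """Average all listener scores that belong to the same utterance.

    ratings: iterable of (system_id, utterance_id, score) tuples
    (a hypothetical layout chosen for this sketch).
    """
    per_utterance = defaultdict(list)
    for system_id, utterance_id, score in ratings:
        per_utterance[(system_id, utterance_id)].append(score)
    return {key: mean(scores) for key, scores in per_utterance.items()}


def system_mos(utt_mos):
    """Average utterance-level MOS values across each system's utterances."""
    per_system = defaultdict(list)
    for (system_id, _utterance_id), value in utt_mos.items():
        per_system[system_id].append(value)
    return {system_id: mean(values) for system_id, values in per_system.items()}


# Toy ratings on the 1-5 ACR scale (made-up numbers).
ratings = [
    ("tts_a", "u1", 4), ("tts_a", "u1", 5), ("tts_a", "u2", 3), ("tts_a", "u2", 4),
    ("tts_b", "u1", 2), ("tts_b", "u1", 3), ("tts_b", "u2", 3), ("tts_b", "u2", 3),
]
utt = utterance_mos(ratings)
sys_level = system_mos(utt)
print(utt)
print(sys_level)
```

Averaging utterance-level means, as done here, weights every utterance equally. Averaging all raw ratings of a system instead gives the same result only when every utterance has the same number of ratings.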
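Listeners are instructed to use headphones in quiet environments, and listening protocols typically require every clip to be played at least once. Reliability is monitored by embedding ground-truth natural speech controls and explicit validation clips (Maniati et al., 2022). The variability of ratings is captured by the sample standard deviation $s$, and confidence intervals are estimated as:

$$\mathrm{CI} = \mathrm{MOS} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{N}}$$

where $N$ is the number of ratings and $z_{1-\alpha/2} \approx 1.96$ for a 95% interval under the normal approximation.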
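As a small worked example of the standard deviation and confidence interval, the sketch below uses the normal approximation with a critical value of 1.96 for a 95% interval. That critical value, the function name `mos_with_ci`, and the sample scores are assumptions of this sketch; with few ratings, a Student-t critical value would give a wider interval.

```python
import math
from statistics import mean, stdev


def mos_with_ci(scores, z=1.96):
    """Return (MOS, sample std, (lower, upper)) for one stimulus.

    z=1.96 assumes a 95% interval under the normal approximation.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two ratings are needed for a sample std")
    mos = mean(scores)
    s = stdev(scores)  # sample standard deviation (N - 1 denominator)
    half_width = z * s / math.sqrt(n)
    return mos, s, (mos - half_width, mos + half_width)


scores = [4, 5, 3, 4, 4, 5, 3, 4]  # made-up ratings for one utterance
mos, s, (lower, upper) = mos_with_ci(scores)
print(f"MOS={mos:.2f}, s={s:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```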
2. Data Quality Control and Biases
Crowdsourcing has enabled large-scale MOS annotation (e.g., SOMOS: >20,000 utterances, >350,000 ratings (Maniati et al., 2022)), but it introduces quality risks that require robust outlier rejection. Typical rejection criteria are:
- Validation clip failures
- Low ratings on ground-truth natural speech
- Improbable uniformity of synthetic ratings
- Synthetic MOS exceeding that of the natural reference minus a margin
Pages failing any criterion are excluded from "clean" evaluation sets. Inter-rater reliability is assessed by bootstrap resampling, with Pearson and Spearman correlations reported at both the utterance level and the system level after such cleaning.
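The four rejection criteria can be sketched as a page-level filter. Everything concrete here is hypothetical: the page dictionary layout, the check names, and the thresholds (`natural_floor`, `max_identical_fraction`, `margin`) are illustrative assumptions, not the values used in any published protocol.

```python
from collections import Counter
from statistics import mean


def failed_checks(page, natural_floor=4, max_identical_fraction=0.9, margin=0.1):
    """Return the names of the quality checks that a rating page fails.

    page: dict with 'validation_passed' (bool), 'natural_scores' (list of
    ratings given to natural speech) and 'synthetic_scores' (list of ratings
    given to synthetic speech). All thresholds are hypothetical.
    """
    failures = []
    # 1. Validation clip failure.
    if not page["validation_passed"]:
        failures.append("validation_clip")
    # 2. Low rating on ground-truth natural speech.
    if min(page["natural_scores"]) < natural_floor:
        failures.append("low_natural_rating")
    # 3. Improbable uniformity: nearly all synthetic clips get the same score.
    synthetic = page["synthetic_scores"]
    most_common_count = Counter(synthetic).most_common(1)[0][1]
    if most_common_count / len(synthetic) > max_identical_fraction:
        failures.append("uniform_synthetic_ratings")
    # 4. Synthetic mean exceeds the natural reference minus a margin.
    if mean(synthetic) > mean(page["natural_scores"]) - margin:
        failures.append("synthetic_above_natural")
    return failures


pages = {
    "good": {"validation_passed": True, "natural_scores": [5, 4],
             "synthetic_scores": [3, 4, 2, 4, 3]},
    "uniform": {"validation_passed": True, "natural_scores": [5],
                "synthetic_scores": [3, 3, 3, 3, 3]},
    "inverted": {"validation_passed": False, "natural_scores": [3],
                 "synthetic_scores": [5, 4, 5, 4]},
}
clean_pages = [name for name, page in pages.items() if not failed_checks(page)]
print(clean_pages)
```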
A critical phenomenon is range-equalizing bias: listeners tend to spread their ratings over the full 1–5 range regardless of the test context (Cooper et al., 2023). This non-anchored scale use can shift MOS by up to a full point for identical material, so absolute comparisons require fixed anchors or a combination of MOS with pairwise methods.
3. Extensions Beyond Scalar MOS: Multivariate and Distributional Approaches
While traditional MOS is a univariate summary, perceptual quality is multidimensional (e.g., noisiness, coloration, discontinuity, and loudness in NISQA). Recent models use multivariate Gaussian posteriors to jointly model MOS and orthogonal quality dimensions, predicting not only the mean vector but also the full covariance structure. The posterior $p(\mathbf{y} \mid \mathbf{x})$, where $\mathbf{y}$ denotes the vector of ratings and $\mathbf{x}$ the input signal, allows uncertainty quantification and correlation analyses, enabling diagnosis of which factors co-vary with low MOS (Cumlin et al., 5 Jun 2025).
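One way to read a predicted covariance structure is to convert it into a correlation matrix and look at the row for MOS. The sketch below does only that. The three dimensions shown and the covariance values are made up for illustration; they are not model outputs or figures from the cited work.

```python
import math

# Hypothetical subset of quality dimensions, in matrix order.
DIMENSIONS = ["mos", "noisiness", "discontinuity"]


def covariance_to_correlation(cov):
    """Normalise a covariance matrix by the per-dimension standard deviations."""
    std = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [
        [cov[i][j] / (std[i] * std[j]) for j in range(len(cov))]
        for i in range(len(cov))
    ]


# Made-up predicted covariance for a single utterance.
cov = [
    [0.25, 0.20, 0.05],
    [0.20, 0.36, 0.03],
    [0.05, 0.03, 0.16],
]
corr = covariance_to_correlation(cov)

# Which dimension co-varies most strongly with MOS in this example?
mos_row = corr[0]
strongest = max(range(1, len(DIMENSIONS)), key=lambda j: abs(mos_row[j]))
print(DIMENSIONS[strongest], round(mos_row[strongest], 3))
```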
Additionally, distributional approaches, such as quantized distribution fitting, model the latent continuous perception underlying discrete ratings by fitting a Gaussian and projecting onto observed categories, providing a continuous-valued latent "peak" as the training target for MOS regression and mitigating quantization and range bias effects (Kondo et al., 23 Jun 2025).
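The idea of fitting a latent Gaussian to discrete ratings can be sketched with a coarse maximum-likelihood grid search. This is a simplified illustration, not the procedure of the cited paper: the bin edges at half-integers, the open-ended outer bins, the grid ranges, and all function names are assumptions of this sketch.

```python
import math
from collections import Counter

LEVELS = (1, 2, 3, 4, 5)


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def category_probs(mu, sigma):
    """Probability mass a latent N(mu, sigma^2) assigns to each rating level.

    Assumes level k covers (k - 0.5, k + 0.5], with the outer bins open-ended.
    """
    probs = []
    for k in LEVELS:
        lo = -math.inf if k == LEVELS[0] else k - 0.5
        hi = math.inf if k == LEVELS[-1] else k + 0.5
        mass = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
        probs.append(max(mass, 1e-12))  # guard against log(0)
    return probs


def fit_latent_gaussian(ratings):
    """Grid-search the (mu, sigma) that maximises the multinomial likelihood."""
    counts = Counter(ratings)
    best_ll, best_mu, best_sigma = -math.inf, None, None
    for i in range(0, 121):          # mu in [0.0, 6.0], step 0.05
        mu = i * 0.05
        for j in range(4, 41):       # sigma in [0.2, 2.0], step 0.05
            sigma = j * 0.05
            probs = category_probs(mu, sigma)
            ll = sum(counts[k] * math.log(p) for k, p in zip(LEVELS, probs))
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma


ratings = [5, 5, 5, 5, 5, 5, 4, 4, 4, 3]  # made-up ratings, arithmetic mean 4.5
latent_mu, latent_sigma = fit_latent_gaussian(ratings)
print(f"arithmetic mean = {sum(ratings) / len(ratings):.2f}, latent peak = {latent_mu:.2f}")
```

In this toy case the ratings pile up at the top of the scale, so the fitted latent peak lies above the arithmetic mean: the open-ended top bin lets the model account for the ceiling of the 5-point scale.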
4. Automatic MOS Prediction: Architectures and Learning Paradigms
Non-intrusive MOS prediction systems have evolved from simple regression models on spectral features to complex neural architectures leveraging self-supervised audio representations and tailored pooling mechanisms. Representative methods include:
| Model | Key Mechanism | Input | Highlights |
|---|---|---|---|
| MOSNet | CNN → BLSTM → Pooling → Regression | STFT magnitude | Baseline |
| LDNet | Listener-dependent embeddings | Spectral frames + listener | Bias adaptation |
| MBNet | Mean and judge-bias subnets | STFT magnitude + judge ID | Bias modeling, leverages all scores |
| SSL-MOS | wav2vec/SSL → Pooling → Regression | Self-supervised features | Robust to domains |
| NORESQA-MOS | Pairwise with non-matching references | Audio pairs | Relative calibration, OOD robustness |
| DRASP | Dual-resolution attentive statistics pooling | Frame embeddings | Improved ranking, SRCC |
Newer pooling mechanisms, such as dual-resolution attentive statistics pooling (DRASP), fuse global and segmental attentional statistics and yield improvements in system-level SRCC over classic average pooling (Yang et al., 29 Aug 2025). Pairwise and distributional frameworks (NORESQA-MOS, MOSPC) address the relatively poor ranking sensitivity of mean-based regression, boosting ranking metrics (Kendall's $\tau$) and fine-grained segment ordinal accuracy (Manocha et al., 2022, Wang et al., 2023).
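The ranking metrics mentioned here, Spearman's rank correlation (SRCC) and Kendall's tau, can be computed as in the following standard-library sketch. The system-level MOS values are made up for illustration, and the tau variant shown is tau-b, which is one common choice when ties may occur.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)  # undefined if either input is constant


def spearman(x, y):
    """SRCC: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominator terms
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom


true_mos = [4.2, 3.1, 3.8, 2.5, 4.6]       # made-up system-level MOS
predicted_mos = [4.0, 3.3, 3.2, 2.9, 4.4]  # made-up predictor outputs
srcc = spearman(true_mos, predicted_mos)
tau = kendall_tau_b(true_mos, predicted_mos)
print(round(srcc, 3), round(tau, 3))
```

The two systems with true MOS 3.1 and 3.8 are swapped by the hypothetical predictor, which is the single discordant pair that lowers both metrics below 1.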
5. Addressing Human Bias and Reliability in MOS Labels
The reliability of MOS is fundamentally limited by human rating behaviors:
- Inter-individual bias: Judge-specific leniency or severity can drive variance. Mean-bias architectures (MBNet) and listener-specific embeddings (LDNet) directly model these effects, improving both utterance- and system-level correlation (Leng et al., 2021, Huang et al., 2021).
- Systematic context bias: Range-equalizing (rubber-ruler) distortions can change mean MOS by up to a full point based on the context of the test (Cooper et al., 2023). Best practice includes the use of absolute anchors, explicit cross-test calibration, and tied-rank corrections.
- Ranking ambiguity and CI overlap: When applying rank-based statistics (e.g., Spearman, Kendall), the usual "equal value = tie" rule is inadequate, because most MOS values are numerically unique yet statistically indistinguishable once their confidence intervals are taken into account. Overlap-based transformations assign tied ranks wherever 95% CIs overlap, resulting in more robust and statistically valid correlation and significance tests (Naderi et al., 2020).
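The overlap-based tie rule from the last bullet can be sketched as follows. This is a simplified greedy grouping, not the exact transformation of Naderi et al. (2020): conditions are sorted by MOS, and a condition joins the current group when its lower CI bound does not exceed the upper CI bound of the group's lowest member. The function name and the example values are hypothetical.

```python
def overlap_tied_ranks(mos, ci_half_width):
    """Assign average ranks, treating conditions with overlapping CIs as tied.

    Simplified greedy sketch: each group is anchored on its lowest-MOS member,
    and later conditions join while their CI overlaps the anchor's CI.
    """
    order = sorted(range(len(mos)), key=lambda i: mos[i])
    ranks = [0.0] * len(mos)
    start = 0
    while start < len(order):
        anchor = order[start]
        anchor_upper = mos[anchor] + ci_half_width[anchor]
        end = start
        while end + 1 < len(order):
            nxt = order[end + 1]
            if mos[nxt] - ci_half_width[nxt] <= anchor_upper:
                end += 1
            else:
                break
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1  # average rank of the group
        start = end + 1
    return ranks


# Made-up system-level MOS values with 95% CI half-widths.
mos = [3.10, 3.15, 3.90, 4.40, 4.45]
half_widths = [0.10, 0.10, 0.10, 0.10, 0.10]
ranks = overlap_tied_ranks(mos, half_widths)
print(ranks)
```

Because CI overlap is not transitive, different grouping rules can produce different ties. Anchoring on the lowest member is only one possible convention.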
Bias correction can also use leave-one-out and linear scaling methods to debias rater contributions, with measurable improvements in model performance (Akrami et al., 2022).
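A leave-one-out bias estimate can be sketched as the average gap between a rater's score and the mean of the other raters' scores on the same clip. This is a minimal illustration under stated assumptions, not the procedure of Akrami et al. (2022): the tuple layout `(rater_id, clip_id, score)`, the function name, and the toy ratings are hypothetical, and each rater is assumed to rate a clip at most once.

```python
from collections import defaultdict
from statistics import mean


def leave_one_out_bias(ratings):
    """Estimate each rater's offset relative to the other raters.

    ratings: list of (rater_id, clip_id, score) tuples (hypothetical layout).
    Clips with a single rating are skipped because no comparison is possible.
    """
    by_clip = defaultdict(list)
    for rater, clip, score in ratings:
        by_clip[clip].append((rater, score))

    gaps = defaultdict(list)
    for entries in by_clip.values():
        if len(entries) < 2:
            continue
        total = sum(score for _, score in entries)
        for rater, score in entries:
            others_mean = (total - score) / (len(entries) - 1)
            gaps[rater].append(score - others_mean)
    return {rater: mean(values) for rater, values in gaps.items()}


# Toy data: rater "b" is lenient, rater "c" is severe (made-up scores).
ratings = [
    ("a", "x", 4), ("b", "x", 5), ("c", "x", 3),
    ("a", "y", 3), ("b", "y", 4), ("c", "y", 2),
]
bias = leave_one_out_bias(ratings)
print(bias)
```

Note that the offset is measured against the other raters only, so with n raters per clip it is larger by a factor of n/(n-1) than an offset measured against the full clip mean. Any correction that subtracts it should account for that scaling.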
6. MOS in Broader Quality of Experience (QoE) and Future Methodological Directions
While MOS remains the core scalar for media quality benchmarking, it is increasingly viewed as a single element in richer, contract-based or multidimensional quality characterizations. A contract-driven QoE auditing framework recasts MOS as a degenerate contract and enables stability across aggregation views and enhanced sample efficiency via human-interpretable Boolean predicates (Du, 4 Dec 2025). Multivariate and contract-aware models capture rating heterogeneity, uncertainty, and user-specific thresholds not addressed by traditional MOS. For high-end streaming and bitrate ladder optimization, the linkage between MOS and perceptual thresholds (e.g., JND, SUR) is being systematized, but ambiguities in reverse mapping persist due to subjective overlap in MOS bands (Zhu et al., 19 Feb 2026).
Researchers continue to pursue automatic MOS predictors with improved generalization, rater modeling, bias correction, and explainability. Open challenges include robust OOD transfer, MOS prediction for new modalities (e.g., TTM, singing voice), accurate modeling of rating distributions rather than only their mean, and reliable cross-context comparisons under variable test conditions.
References
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis (Maniati et al., 2022)
- Multivariate Probabilistic Assessment of Speech Quality (Cumlin et al., 5 Jun 2025)
- Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting (Kondo et al., 23 Jun 2025)
- DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction (Yang et al., 29 Aug 2025)
- MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network (Leng et al., 2021)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech (Huang et al., 2021)
- Investigating Range-Equalizing Bias in Mean Opinion Score Ratings of Synthesized Speech (Cooper et al., 2023)
- Transformation of Mean Opinion Scores to Avoid Misleading of Ranked based Statistical Techniques (Naderi et al., 2020)
- Speech MOS multi-task learning and rater bias correction (Akrami et al., 2022)
- Contract-Driven QoE Auditing for Speech and Singing Services (Du, 4 Dec 2025)
- Is there a relationship between Mean Opinion Score (MOS) and Just Noticeable Difference (JND)? (Zhu et al., 19 Feb 2026)
- Speech Quality Assessment through MOS using Non-Matching References (Manocha et al., 2022)
- MOSPC: MOS Prediction Based on Pairwise Comparison (Wang et al., 2023)