VoiceMOS Challenge 2022 Overview
- VoiceMOS Challenge 2022 is a benchmark initiative that evaluates automatic MOS prediction models for synthetic speech using comprehensive in-domain and out-of-domain datasets.
- Top-performing submissions leverage state-of-the-art SSL speech encoders and metadata fusion, trained with regression and contrastive losses, to overcome the limitations of traditional non-intrusive metrics.
- The challenge drives robust TTS and VC evaluation through balanced in-domain and out-of-domain test sets, with leading entries relying on pseudo-labeling and domain-adaptation techniques to boost prediction accuracy.
VoiceMOS Challenge 2022 is a scientific benchmark initiative focused on automatic prediction of the Mean Opinion Score (MOS) of synthetic speech, addressing limitations of traditional non-intrusive instrumental metrics and aiming to advance fully data-driven evaluation protocols for text-to-speech (TTS) and voice conversion (VC) systems. The challenge provides large-scale listening-test datasets and a standardized evaluation platform that supports rigorous measurement of both in-domain and out-of-domain generalization, with the explicit goal of pushing research toward robust, accurate, and generalizable MOS prediction.
1. Motivation and Dataset Design
The core objective of VoiceMOS Challenge 2022 is to evaluate and compare automatic MOS prediction models capable of estimating human-rated speech quality in a reference-free, signal-based regime. Historically, MOS assessment has relied on labor-intensive listening tests, while objective metrics such as PESQ require a high-fidelity reference signal and are intolerant of prosodic variation, making them unsuitable for synthetic speech evaluation. VoiceMOS 2022 directly addresses two core issues: (1) providing a comprehensive dataset for training and fair comparison, and (2) introducing tasks that require adaptation to substantial domain shifts.
The main track dataset comprises 7,106 utterances from 187 synthesis systems sourced from the Blizzard Challenge (2008–2016), VCC (2016–2020), and ESPnet-TTS. Each utterance is rated for naturalness on a 1–5 Likert scale by 8 listeners, yielding over 56,000 MOS labels. The out-of-domain (OOD) track uses 676 samples from the 2019 Blizzard Challenge (Mandarin TTS; Beijing accent), with 10–17 MOS ratings per sample. Both tracks are partitioned into train/dev/test splits, ensuring matching MOS distribution and holding out unseen systems, speakers, and listeners for evaluation.
| Track | Language | Train | Unlabeled | Dev | Test |
|---|---|---|---|---|---|
| Main | English | 4,974 | – | 1,066 | 1,066 |
| OOD | Chinese | 136 | 540 | 136 | 540 |
Main-track distribution ensures diversity across synthesis technology, speakers, and listener demographics, while the OOD track aggressively tests cross-lingual and low-resource adaptation. Listener IDs are grouped into fixed octets, yielding a rating panel of 304 raters across both splits.
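To make the label structure concrete, the following minimal sketch aggregates individual listener ratings into per-utterance MOS and per-system means; the tuple layout and example values are illustrative, not the challenge's release format.

```python
from collections import defaultdict

# Illustrative ratings: (system_id, utterance_id, listener rating on a 1-5 scale).
# The field layout is an assumption for this sketch, not the official data format.
ratings = [
    ("sysA", "utt_0001", 4), ("sysA", "utt_0001", 5), ("sysA", "utt_0002", 3),
    ("sysB", "utt_0003", 2), ("sysB", "utt_0003", 3), ("sysB", "utt_0004", 2),
]

# Per-utterance MOS: average the individual listener scores for each utterance.
per_utt = defaultdict(list)
for system, utt, score in ratings:
    per_utt[(system, utt)].append(score)
utt_mos = {key: sum(scores) / len(scores) for key, scores in per_utt.items()}

# System-level MOS: average the per-utterance MOS values belonging to each system.
per_sys = defaultdict(list)
for (system, _), mos in utt_mos.items():
    per_sys[system].append(mos)
sys_mos = {system: sum(values) / len(values) for system, values in per_sys.items()}

print(utt_mos)
print(sys_mos)
```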
2. Evaluation Metrics and Baseline Systems
Quantitative comparison is provided at both the utterance and system level, using four primary metrics (a small computation sketch follows this list):
- Mean Squared Error (MSE): $\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$, where $y_i$ is the listener-rated MOS and $\hat{y}_i$ the prediction.
- Linear Correlation Coefficient (LCC): Pearson's $r$, measuring linear association between predicted and true MOS.
- Spearman's Rank Correlation Coefficient (SRCC): Pearson's $r$ computed on the ranks of predicted and true MOS; the primary ranking metric.
- Kendall's Tau (KTAU): pairwise rank concordance.
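A minimal sketch of these metrics using NumPy and SciPy is shown below; system-level scores are obtained by applying the same functions to per-system mean MOS rather than raw utterance scores.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def mos_metrics(y_true, y_pred):
    """Return the four challenge metrics for arrays of true and predicted MOS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return {
        "MSE": float(np.mean((y_true - y_pred) ** 2)),
        "LCC": float(pearsonr(y_true, y_pred)[0]),
        "SRCC": float(spearmanr(y_true, y_pred)[0]),
        "KTAU": float(kendalltau(y_true, y_pred)[0]),
    }

# Utterance-level usage; for system-level scores, average per system first
# and pass the per-system means instead.
print(mos_metrics([3.1, 4.2, 2.5, 3.8], [3.0, 4.0, 2.9, 3.6]))
```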
Baseline systems:
- B01 SSL-MOS: Fine-tunes wav2vec 2.0 Base end to end with mean pooling and a linear output head (see the sketch below); utterance MSE = 0.277, SRCC = 0.869; system MSE = 0.148, SRCC = 0.921.
- B02 MOSA-Net: Multi-objective regression on cross-domain features.
- B03 LDNet: Listener-dependent network that models per-listener ratings.
System-level SRCC is the primary challenge metric, reflecting ranking fidelity across synthesis methods.
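The following sketch illustrates the B01-style recipe (SSL encoder, temporal mean pooling, linear output head), assuming the HuggingFace `transformers` wav2vec 2.0 Base checkpoint; it approximates the baseline design rather than reproducing the official implementation.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumes the HuggingFace checkpoint below

class SSLMOSPredictor(nn.Module):
    """wav2vec 2.0 encoder + mean pooling + linear head, fine-tuned end to end."""
    def __init__(self, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        frames = self.encoder(waveform).last_hidden_state  # (batch, frames, dim)
        pooled = frames.mean(dim=1)                         # temporal mean pooling
        return self.head(pooled).squeeze(-1)                # predicted MOS per clip

model = SSLMOSPredictor()
scores = model(torch.randn(2, 16000))  # two one-second dummy clips
print(scores.shape)  # torch.Size([2])
```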
3. Modeling Approaches, Metadata, and Feature Fusion
Top-ranked systems universally deploy self-supervised learning (SSL) speech encoders—primarily wav2vec 2.0 or HuBERT—as primary acoustic feature sources. Two broad strategies emerge: (a) direct end-to-end fine-tuning of SSL encoders for MOS regression, and (b) model ensembling and stacking on top of diverse SSL-based predictors.
Chinen et al. (Chinen et al., 2022) show that integrating rater- and system-identifier metadata as one-hot vectors, concatenated with SSL-based acoustic embeddings, explains a significant fraction of the variance in ratings: an utterance-level SRCC of 0.787 is achievable using only metadata. The most effective fusion combines a pooled 64-dimensional acoustic embedding with one-hot encodings for rater groups and system IDs (with “unknown” classes for robustness to novel categories). A feature-concatenation operation $\mathbf{z} = [\mathbf{a};\, \mathbf{m}]$ is used, where $\mathbf{a}$ is the pooled acoustic embedding and $\mathbf{m}$ is the concatenated metadata vector, followed by a stack of fully connected layers. Dropout-style replacement of specific ID vectors with an “unknown” token ensures generalization to unseen raters or systems.
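A minimal sketch of this fusion is given below; the 64-dimensional pooled acoustic feature matches the description above, while the number of rater groups and systems, the ID-dropout rate, and the layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetadataFusionHead(nn.Module):
    """Concatenate a pooled acoustic embedding with one-hot rater/system IDs."""
    def __init__(self, acoustic_dim=64, n_raters=32, n_systems=200, id_dropout=0.2):
        super().__init__()
        # Reserve the last index of each one-hot block as an "unknown" class.
        self.n_raters, self.n_systems = n_raters + 1, n_systems + 1
        self.id_dropout = id_dropout
        fused_dim = acoustic_dim + self.n_raters + self.n_systems
        self.mlp = nn.Sequential(nn.Linear(fused_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, acoustic, rater_id, system_id):
        if self.training and self.id_dropout > 0:
            # Randomly replace IDs with "unknown" so the model generalizes
            # to raters/systems never seen during training.
            mask = torch.rand(rater_id.shape, device=rater_id.device) < self.id_dropout
            rater_id = torch.where(mask, torch.full_like(rater_id, self.n_raters - 1), rater_id)
            mask = torch.rand(system_id.shape, device=system_id.device) < self.id_dropout
            system_id = torch.where(mask, torch.full_like(system_id, self.n_systems - 1), system_id)
        z = torch.cat([
            acoustic,
            F.one_hot(rater_id, self.n_raters).float(),
            F.one_hot(system_id, self.n_systems).float(),
        ], dim=-1)
        return self.mlp(z).squeeze(-1)

head = MetadataFusionHead()
mos = head(torch.randn(4, 64), torch.randint(0, 32, (4,)), torch.randint(0, 200, (4,)))
```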
In UTMOS (Saeki et al., 2022), strong learners use end-to-end fine-tuning with BLSTM layers and integrate listener and domain embeddings, phoneme sequence encoding, and data augmentation in the feature stream. Weak learners apply regression algorithms (ridge, SVR, LightGBM, GP) to mean-pooled SSL features, supporting ensemble diversity. Yang et al. (Yang et al., 2022) fuse seven independently fine-tuned SSL models using a linear model with residuals, demonstrating that direct aggregation of multiple SSL MOS predictors suffices to reach state-of-the-art rank correlation.
In all high-performing submissions, fusing side metadata—especially listener-specific and system-specific signals—closes the gap between acoustic-only and full-system performance.
4. Training Protocols and Loss Functions
The standard protocol across top systems is to fine-tune pre-trained SSL encoders on the VoiceMOS labeled MOS data. Training objectives include:
- L1 or L2 (MSE) loss on MOS, e.g., $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$; sometimes combined with a clipped variant or mixed with a contrastive loss on MOS differences (as in UTMOS).
- Contrastive loss between pairs: $\mathcal{L}_{\mathrm{con}}^{(i,j)} = \max\left(0,\; \left|(\hat{y}_i - \hat{y}_j) - (y_i - y_j)\right| - \alpha\right)$ for margin $\alpha$ (see the sketch after this list).
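The sketch below implements a clipped MSE term and a pairwise contrastive term of this form, in the spirit of UTMOS; the clipping tolerance, margin, and loss weighting are illustrative values.

```python
import torch

def clipped_mse(pred, target, tau=0.25):
    """MSE that ignores errors already smaller than a tolerance tau."""
    err = (pred - target) ** 2
    return torch.where(err > tau ** 2, err, torch.zeros_like(err)).mean()

def pairwise_contrastive(pred, target, margin=0.1):
    """Penalize pairs whose predicted MOS difference deviates from the true
    difference by more than the margin."""
    d_pred = pred.unsqueeze(0) - pred.unsqueeze(1)      # (N, N) predicted differences
    d_true = target.unsqueeze(0) - target.unsqueeze(1)  # (N, N) ground-truth differences
    return torch.clamp((d_pred - d_true).abs() - margin, min=0.0).mean()

def total_loss(pred, target, alpha=0.5):
    # Illustrative weighting between the regression and contrastive terms.
    return clipped_mse(pred, target) + alpha * pairwise_contrastive(pred, target)

pred = torch.tensor([3.2, 4.1, 2.4], requires_grad=True)
target = torch.tensor([3.0, 4.5, 2.5])
total_loss(pred, target).backward()
```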
Stacking and ensemble meta-learners are trained on cross-validated out-of-fold predictions, typically using regression algorithms over the base predictors’ outputs.
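A minimal stacking sketch follows, assuming the out-of-fold predictions of each base model are already available as a feature matrix; random placeholders stand in for real base-model outputs so the sketch runs standalone.

```python
import numpy as np
from sklearn.linear_model import Ridge

# oof_preds: (n_utterances, n_base_models) out-of-fold MOS predictions from the
# base SSL predictors; y: ground-truth utterance-level MOS. Random placeholders.
rng = np.random.default_rng(0)
y = rng.uniform(1.0, 5.0, size=200)
oof_preds = y[:, None] + rng.normal(0.0, 0.4, size=(200, 5))  # 5 hypothetical base models

# Meta-learner: a regularized linear combination of base predictors, fitted on
# out-of-fold predictions so it never sees a base model's overfit in-fold outputs.
meta = Ridge(alpha=1.0).fit(oof_preds, y)

# At test time, the base models' test-set predictions are combined the same way.
test_preds = rng.uniform(1.0, 5.0, size=(50, 5))
final_scores = meta.predict(test_preds)
print(final_scores[:5])
```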
Several systems adopt two-step fine-tuning (e.g., ZevoMOS (Stan, 2022)): first training on synthetic vs. natural speech classification, then fine-tuning for regression to MOS using the challenge data.
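The head-swapping pattern behind such two-step schemes can be sketched as follows; the checkpoint, pooling, and loss choices are illustrative assumptions, and the optimization loops are omitted.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumed checkpoint, as below

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
hidden = encoder.config.hidden_size

def pooled(waveform: torch.Tensor) -> torch.Tensor:
    """Mean-pool the encoder's frame-level outputs into one vector per clip."""
    return encoder(waveform).last_hidden_state.mean(dim=1)

# Stage 1: pretext task, classifying natural (0) vs. synthetic (1) speech.
cls_head = nn.Linear(hidden, 2)
cls_loss = nn.CrossEntropyLoss()(cls_head(pooled(torch.randn(2, 16000))),
                                 torch.tensor([0, 1]))

# Stage 2: discard the classification head, keep the adapted encoder, and
# fine-tune a fresh regression head on the challenge MOS labels.
reg_head = nn.Linear(hidden, 1)
mos_loss = nn.MSELoss()(reg_head(pooled(torch.randn(2, 16000))).squeeze(-1),
                        torch.tensor([3.4, 4.1]))
```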
For OOD adaptation, semi-supervised pseudo-labeling of unlabeled samples is effective: e.g., Yang et al. (Yang et al., 2022) first fine-tune on labeled data, generate pseudo-labels for the held-out OOD set, then re-train on both real and pseudo-labeled examples.
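The pseudo-labeling loop can be sketched with a stand-in regressor over fixed embeddings (actual submissions fine-tune SSL encoders end to end); the array shapes mirror the 136 labeled and 540 unlabeled OOD samples.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder features stand in for pooled SSL embeddings of OOD-track audio.
rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(136, 64)), rng.uniform(1.0, 5.0, 136)
X_unlabeled = rng.normal(size=(540, 64))

# Step 1: fit on the small labeled OOD set.
model = Ridge(alpha=1.0).fit(X_labeled, y_labeled)

# Step 2: pseudo-label the unlabeled pool with the current model.
pseudo_y = model.predict(X_unlabeled)

# Step 3: retrain on the union of real and pseudo-labeled examples.
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_y])
model = Ridge(alpha=1.0).fit(X_all, y_all)
```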
5. Results, Experimental Analysis, and Ablations
Challenge results indicate global advances in non-intrusive MOS prediction:
- Main-track leaders achieve system SRCC ≈ 0.936, MSE ≈ 0.090 (UTMOS team T17 and others), and top utterance-level MSE ≈ 0.165, SRCC ≈ 0.897.
- Out-of-domain adaptation: with only 136 labeled Mandarin utterances (plus 540 unlabeled), UTMOS achieves system SRCC ≈ 0.988, MSE ≈ 0.030—validated by the pseudo-labeling and listening-test augmentation strategy.
- Model ablations reveal that listener-aware modeling, contrastive losses, and model ensembling dramatically boost ranking metrics (SRCC/KTAU), with data augmentation and pseudo-labeling crucial for OOD generalization, especially in low-sample settings.
Performance is highly sensitive to unbalanced sampling:
- When per-system utterance counts are unbalanced (with many systems represented by 1–2 utterances), system-level metrics (MSE, SRCC) can become dominated by high-variance estimates.
- Utterance-level metrics are more robust and interpretable in such imbalanced settings.
- A plausible implication is that system-level evaluation should weight systems by utterance count or require a minimum number of samples per system to stabilize the mean (see the sketch below).
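Both remedies can be sketched directly; the minimum per-system count and the count-based weighting below are illustrative choices rather than challenge rules.

```python
import numpy as np
from scipy.stats import spearmanr

def system_level_srcc(systems, y_true, y_pred, min_count=4):
    """System-level SRCC over systems with at least min_count utterances,
    so that few-sample systems do not dominate the ranking."""
    systems = np.asarray(systems)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    true_means, pred_means = [], []
    for system in np.unique(systems):
        idx = systems == system
        if idx.sum() >= min_count:
            true_means.append(y_true[idx].mean())
            pred_means.append(y_pred[idx].mean())
    return spearmanr(true_means, pred_means)[0]

def weighted_system_mse(systems, y_true, y_pred):
    """System-level MSE with each system weighted by its utterance count."""
    systems = np.asarray(systems)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    errors, weights = [], []
    for system in np.unique(systems):
        idx = systems == system
        errors.append((y_true[idx].mean() - y_pred[idx].mean()) ** 2)
        weights.append(idx.sum())
    return float(np.average(errors, weights=weights))

systems = ["s1"] * 5 + ["s2"] * 5 + ["s3"] * 5
truth = np.linspace(2.0, 4.5, 15)
pred = truth + np.random.default_rng(1).normal(0.0, 0.2, 15)
print(system_level_srcc(systems, truth, pred), weighted_system_mse(systems, truth, pred))
```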
A summary of main-track system-level results for the leading teams:
| Team | MSE | SRCC | Rank |
|---|---|---|---|
| T17 (UTMOS) | 0.090 | 0.936 | 1 |
| T11 | 0.101 | 0.939 | 2 |
| T19 | 0.091 | 0.938 | 3 |
| Baseline B01 | 0.148 | 0.921 | – |
On the OOD track, a similar performance stratification is observed, with pseudo-labeling and self-labeled adaptation critical for low MSE.
6. Lessons, Recommendations, and Future Directions
Challenge outcomes support several methodological insights:
- Fine-tuned SSL encoders anchor the current state-of-the-art for MOS prediction on both in-domain and cross-domain benchmarks.
- Metadata (listener and system IDs) explain a large portion of apparent variance, making their inclusion—especially via properly regularized one-hot or embedding schemes—mandatory for high accuracy.
- Contrastive loss objectives and model stacking/ensembling mitigate overfitting and yield monotonic improvements in both MSE and ranking metrics.
- Ensuring balanced sampling across conditions (a minimum number of utterances per system) in evaluation design is critical; otherwise, system-level metrics are dominated by noisy per-system mean estimates computed from too few samples.
Major open challenges remain in robust OOD adaptation, modeling fine-grained listener or domain shift, and constructing unified metrics that capture both error and rank correlation, particularly as synthetic speech approaches human-parity MOS regimes.
Recommended best practices include:
- Use utterance-level metrics for imbalanced or low-sample-per-system evaluations.
- Weight system-level metrics by utterance count when necessary.
- Fuse side-informative metadata with acoustic embeddings and apply regularization for generalization to novel listeners/systems.
- Collect additional labels or pseudo-labels for new domains to support robust adaptation.
Continued development is expected toward sample-efficient adaptation methods, improved modeling of per-listener biases, and differentiable MOS objectives tightly integrated with synthesis model training. There is substantial interest in advancing semi-supervised, multi-rater, and cross-lingual modeling beyond the current SSL-dominated paradigm.