VoiceMOS Challenge Overview

Updated 6 March 2026
  • VoiceMOS Challenge is an international initiative for advancing automatic, reference-free prediction of human mean opinion scores (MOS) across diverse speech domains.
  • It evaluates data-driven models using rigorous metrics such as MSE, PLCC, and SRCC on multi-domain, multi-language listening tests.
  • The challenge fosters innovative developments in self-supervised learning, hybrid, and ensemble architectures to overcome domain shift and low-resource adaptation.

The VoiceMOS Challenge is an international scientific initiative focused on advancing automatic, reference-free prediction of human mean opinion scores (MOS) for speech quality assessment across diverse domains, including text-to-speech (TTS), voice conversion (VC), singing voice synthesis/conversion (SVS/SVC), and speech enhancement (SE). Running annually since 2022, the challenge series systematically benchmarks data-driven approaches for emulating human listening test ratings, with particular emphasis on generalization, low-resource adaptation, and robust multi-domain performance. It occupies a central position in the empirical landscape of subjective speech evaluation research, establishing de facto community standards for both task formulation and system-level performance reporting.

1. Evolution and Scope of the VoiceMOS Challenge

The VoiceMOS Challenge emerged to address fundamental bottlenecks in speech quality assessment: human MOS tests are resource-intensive, slow, and fraught with inter-listener, inter-domain, and inter-protocol variability. The overarching goals are to stimulate the development and rigorous benchmarking of non-intrusive, data-driven models for MOS prediction, and to foster generalization to unseen systems, languages, and rater populations (Huang et al., 2022, Cooper et al., 2023, Huang et al., 2024).

Challenge editions are marked by progressive complexity:

  • 2022 introduced a dual-track format: a main "in-domain" track using English synthetic speech and an "out-of-domain" (OOD) track for domain adaptation using limited Chinese labeled data (Huang et al., 2022).
  • 2023 emphasized zero-shot MOS prediction, explicitly withholding MOS labels for most evaluation splits and expanding to speech enhancement and singing voice scenarios (Cooper et al., 2023).
  • 2024 further diversified with three tracks: (1) "zoomed-in" high-quality TTS system discrimination, (2) multilingual, multi-system singing synthesis/conversion, and (3) semi-supervised speech quality prediction of noisy/clean/enhanced samples under strict data scarcity (Huang et al., 2024).

Each edition expands the empirical focus to broader domains, more challenging generalization settings, and new evaluation protocols.

2. Dataset Construction and Listening Test Protocols

Datasets underpinning VoiceMOS tracks are curated for diversity in system type, linguistic domain, and rater demographics:

  • Main Track Datasets: Large-scale, lab-based listening tests from Blizzard TTS Challenges, Voice Conversion Challenges, ESPnet-TTS, and voicebank datasets—typically involving hundreds of systems and thousands of utterances rated on a 1–5 MOS scale (Huang et al., 2022, Ragano et al., 2022, Baba et al., 2024, Huang et al., 2024).
  • Out-of-domain and Multilingual Data: OOD splits in Chinese (BC2019), French, and multilingual singing (SingMOS, Japanese/Chinese) extend the coverage to new listening-test languages, system architectures, and recording conditions (Cooper et al., 2023, Huang et al., 2024).
  • Noisy/Enhanced Speech: ITU-T P.835 protocols with SIG (signal distortion), BAK (background noise), and OVRL (overall quality) provide multidimensional ratings under adverse and enhancement conditions (Huang et al., 2024, Kunešová, 31 May 2025).

New listening tests are routinely commissioned for "zoomed-in" subsets (top 12–25% of systems by MOS) and cross-system scenarios to probe fine-grained quality distinctions. Rater assignments are controlled, with each utterance rated by multiple listeners (typically 5–8), and, in some protocols, detailed listener and system metadata are captured to facilitate advanced modeling of individual and group differences (Chinen et al., 2022, Qi et al., 2023).
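To illustrate how raw listening-test judgments become the per-utterance and per-system MOS labels the challenge trains and evaluates on, here is a minimal aggregation sketch. The column names (`system_id`, `utterance_id`, `listener_id`, `rating`) and the toy values are illustrative assumptions, not the challenge's actual data schema.

```python
import pandas as pd

# Hypothetical ratings table: one row per (listener, utterance) judgment
# on the 1-5 absolute category rating scale used in VoiceMOS listening tests.
ratings = pd.DataFrame({
    "system_id":    ["sysA", "sysA", "sysA", "sysB", "sysB", "sysB"],
    "utterance_id": ["u1",   "u1",   "u1",   "u2",   "u2",   "u2"],
    "listener_id":  ["l1",   "l2",   "l3",   "l1",   "l2",   "l3"],
    "rating":       [4,      5,      4,      2,      3,      2],
})

# Utterance-level MOS: mean over the listeners assigned to each utterance
# (typically 5-8 per utterance in the challenge protocols).
utt_mos = ratings.groupby("utterance_id")["rating"].mean()

# System-level MOS: mean of the utterance-level scores for each system.
sys_mos = (ratings.groupby(["system_id", "utterance_id"])["rating"].mean()
                  .groupby("system_id").mean())

print(utt_mos)
print(sys_mos)
```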

3. Evaluation Metrics and Benchmarking

The challenge standardizes a rigorous set of performance metrics at both the utterance and system levels:

  • Mean Squared Error (MSE):

\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (\hat y_i - y_i)^2

  • Pearson Linear Correlation Coefficient (PLCC/LCC):

\mathrm{PLCC} = \frac{\sum_i (\hat y_i - \bar{\hat y})(y_i - \bar y)}{\sqrt{\sum_i (\hat y_i - \bar{\hat y})^2}\,\sqrt{\sum_i (y_i - \bar y)^2}}

  • Spearman Rank Correlation Coefficient (SRCC):

\mathrm{SRCC} = 1 - \frac{6 \sum_i d_i^2}{N(N^2-1)},\quad d_i = \operatorname{rank}(\hat y_i) - \operatorname{rank}(y_i)

  • Kendall’s Tau (KTAU):

\mathrm{KTAU} = \frac{C - D}{0.5\,N(N-1)}

with C and D the numbers of concordant and discordant pairs, respectively.
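All four metrics are straightforward to compute with standard scientific-Python tooling; the following minimal sketch (variable names and values are illustrative) evaluates a set of predictions against ground-truth MOS labels:

```python
import numpy as np
from scipy import stats

# y_true: ground-truth MOS labels; y_pred: model predictions (illustrative).
y_true = np.array([3.2, 4.1, 2.5, 4.8, 3.9])
y_pred = np.array([3.0, 4.3, 2.9, 4.6, 3.5])

mse  = np.mean((y_pred - y_true) ** 2)      # Mean squared error
plcc = stats.pearsonr(y_pred, y_true)[0]    # Pearson linear correlation
srcc = stats.spearmanr(y_pred, y_true)[0]   # Spearman rank correlation
# Note: scipy computes the tie-adjusted tau-b variant, which coincides
# with the formula above when there are no tied ranks.
ktau = stats.kendalltau(y_pred, y_true)[0]

print(f"MSE={mse:.3f} PLCC={plcc:.3f} SRCC={srcc:.3f} KTAU={ktau:.3f}")
```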

Where applicable (notably in enhancement tracks), metrics are reported for all three ITU-T P.835 categories (SIG, BAK, OVRL). Primary challenge rankings use system-level SRCC to emphasize accurate system ranking—a practical priority in quality benchmarking (Huang et al., 2022, Huang et al., 2024).

4. Modeling Approaches and System Architectures

The trajectory of leading systems highlights convergence on self-supervised learning (SSL) backbones, feature fusion, retrieval, and meta-learning:

  • Fine-tuned SSL Predictors: Dominant across all editions are models based on wav2vec 2.0, HuBERT, and WavLM, fine-tuned for MOS regression using MSE or multi-task objectives. Frame-wise pooling strategies, listener/domain metadata embeddings, and contrastive losses are recurrent enhancements (Saeki et al., 2022, Cooper et al., 2023, Huang et al., 2024); a minimal sketch of this recipe follows this list.
  • Hybrid Feature and Ensemble Models: Fused representations from SSL and spectrogram-image encoders (e.g., EfficientNetV2 on mel-spectrograms) capture complementary cues: SSL layers encode global structure and system-level ranking, while spectrogram-based features excel at detecting local artifacts and improving calibration (low MSE) (Baba et al., 2024).
  • Retrieval and Non-parametric Augmentation: Retrieval-augmented models integrate k-nearest-neighbor ranking or prior-net weighting for robust zero-/few-shot generalization, particularly excelling in "zoom-in" discrimination among top-tier TTS systems (Huang et al., 2024).
  • Task-specific Pre-training: Multi-stage pre-training (e.g., SNR→BAK mapping, spoof/natural for SIG) facilitates label-efficient adaptation to SE and noisy speech scenarios (Kunešová, 31 May 2025, Huang et al., 2024).
  • Metadata and Listener Modeling: Listener-dependent architectures and the explicit use of system/rater metadata provide regularization against domain shift and permit per-listener prediction (Chinen et al., 2022, Qi et al., 2023, Huang et al., 2024).
  • Model Ensembling: Multi-stream, multi-backbone ensembles, and stacking of "strong" and "weak" learners deliver significant reductions in utterance- and system-level MSE (Saeki et al., 2022, Baba et al., 2024).
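The sketch below shows the dominant recipe in its simplest form: an SSL backbone fine-tuned for MOS regression with frame-wise mean pooling and an MSE objective, using the Hugging Face transformers API. The checkpoint name, pooling choice, and hyperparameters are illustrative assumptions, not any team's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLMOSPredictor(nn.Module):
    """wav2vec 2.0 backbone + frame-wise mean pooling + linear regression head."""
    def __init__(self, ckpt="facebook/wav2vec2-base"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(ckpt)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, wav):                    # wav: (batch, samples) at 16 kHz
        frames = self.backbone(wav).last_hidden_state   # (batch, T, H)
        pooled = frames.mean(dim=1)            # mean over the frame axis
        return self.head(pooled).squeeze(-1)   # predicted MOS per utterance

# One fine-tuning step with the standard MSE regression objective.
model = SSLMOSPredictor()
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
wav   = torch.randn(2, 16000)                  # dummy 1-second, 2-utterance batch
mos   = torch.tensor([3.5, 4.2])               # dummy ground-truth MOS labels
loss  = nn.functional.mse_loss(model(wav), mos)
loss.backward()
optim.step()
```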

A summary of distinctive model components from recent editions:

| Approach            | Key Feature                                    | Integration Context                        |
|---------------------|------------------------------------------------|--------------------------------------------|
| SSL Fine-tuning     | wav2vec 2.0/HuBERT MOS regression              | All tracks/editions                        |
| Hybrid Fusion       | SSL + spectrogram/image (EfficientNetV2)       | High-quality TTS, T05 (Baba et al., 2024)  |
| Retrieval-based     | SSL backbone + kNN non-parametric head         | Zero/few-shot, zoom-in SRCC                |
| Listener Modeling   | Rater embeddings, multi-task regression        | Out-of-domain, singing, noisy speech       |
| Metadata            | System/rater ID one-hot/binary injection       | Analysis, regularization                   |
| Ensembling/Stacking | Multiple SSL backbones, handcrafted features   | System-level MSE/SRCC                      |
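To make the "Retrieval-based" row above concrete, here is a hedged sketch of kNN-augmented scoring: a parametric prediction is interpolated with the mean label of the nearest labeled training embeddings. The cosine similarity metric, interpolation weight, and function signature are illustrative assumptions, not a specific team's method.

```python
import numpy as np

def knn_augmented_mos(query_emb, pred_mos, train_embs, train_mos, k=5, alpha=0.5):
    """Blend a parametric MOS prediction with a non-parametric kNN estimate.

    query_emb:  (H,) SSL embedding of the test utterance
    pred_mos:   scalar prediction from the fine-tuned regression head
    train_embs: (N, H) embeddings of labeled training utterances
    train_mos:  (N,) their ground-truth MOS labels
    """
    # Cosine similarity from the query to every training embedding.
    sims = train_embs @ query_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    nearest = np.argsort(-sims)[:k]          # indices of the k most similar
    knn_mos = train_mos[nearest].mean()      # non-parametric MOS estimate
    return alpha * pred_mos + (1 - alpha) * knn_mos
```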

5. Major Findings and Comparative Results

Robust findings consistently validated across editions include:

  • SSL-based fine-tuned models dominate both in-domain and out-of-domain tracks, with system-level SRCC routinely exceeding 0.93 on main tracks and reaching 0.979 in favorable OOD settings (Huang et al., 2022, Cooper et al., 2023).
  • Fusion of SSL and spectral/image representations provides complementary gains, crucial for both absolute-scoring (MSE) and high-MOS system discrimination (Baba et al., 2024, Huang et al., 2024).
  • Retrieval-augmented and task-specific pre-training approaches enable strong generalization with limited or no in-domain MOS labels, especially in challenging zero-/few-shot and semi-supervised tracks (Huang et al., 2024, Kunešová, 31 May 2025).
  • Listener-dependent modeling is necessary for robust cross-domain generalization in tracks where rater bias and distributional differences are substantial (Qi et al., 2023).

Significant performance gains are achieved by carefully matching training and evaluation data distributions and by leveraging domain adaptation or self-training where allowed (Cooper et al., 2023). In data-scarce conditions (fewer than 100 labeled utterances), simulation-driven proxy tasks and multi-task architectures sustain competitive correlation and ranking metrics (Kunešová, 31 May 2025).
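Since the primary rankings cited above are computed at the system level, utterance-level predictions and labels are first averaged per system before correlating the system means; a minimal sketch of this aggregation (column names and values are illustrative assumptions):

```python
import pandas as pd
from scipy import stats

# Hypothetical utterance-level predictions and labels tagged with system IDs.
df = pd.DataFrame({
    "system_id": ["A", "A", "B", "B", "C", "C"],
    "true_mos":  [4.1, 4.3, 3.0, 3.2, 2.1, 2.4],
    "pred_mos":  [4.0, 4.5, 3.3, 3.1, 2.0, 2.2],
})

# Aggregate to one score per system, then correlate the system-level ranks.
sys_level = df.groupby("system_id").mean()
srcc = stats.spearmanr(sys_level["pred_mos"], sys_level["true_mos"])[0]
print(f"system-level SRCC = {srcc:.3f}")
```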

6. Open Challenges and Lessons Learned

The challenge series lays bare several persistent technical problems:

  • Domain Shift and Zero-shot Generalization: No single model/dataset combination achieves universal performance across all domains (TTS, VC, SVS/SVC, enhancement, noisy speech). Generalization to ultra-high-quality, multi-lingual, and new architecture scenarios (e.g., zero-shot voice cloning, multi-sampling-rate inputs) remains unsolved (Huang et al., 2024, Nishikawa et al., 19 Jul 2025).
  • Calibrated Utterance-level Prediction: System-level ranking is reliably attainable, but utterance-level reliability (SRCC, MSE) and credible uncertainty estimates lag behind (Chinen et al., 2022).
  • Dataset and Metric Design: System-level performance metrics are sensitive to imbalance in utterance counts per condition; sufficient per-system data (>30 utterances) is recommended for robust error bounds (Chinen et al., 2022).
  • Label Efficiency & Data Augmentation: Semi-supervised/adaptively pre-trained models confirm that strong generalization is possible without large labeled datasets, but model and augmentation strategies must be tailored for small-data regimes (Kunešová, 31 May 2025, Saeki et al., 2022).

Best practices arising from challenge analysis include:

  • Leveraging diverse, high-quality MOS corpora and maintaining domain tags.
  • Utilizing multi-stage and multi-task learning regimens.
  • Combining ranking (contrastive) and regression (MSE) losses for stable optimization (a sketch follows this list).
  • Preferring utterance-level metrics for small-N or highly imbalanced test sets.
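One way to realize the combined ranking-plus-regression objective from the list above is a margin-based pairwise penalty added to MSE, similar in spirit to the contrastive losses reported by challenge teams (Saeki et al., 2022). The margin, weighting, and function signature below are illustrative assumptions, not challenge-prescribed values.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, margin=0.1, lam=0.5):
    """MSE regression loss plus a pairwise ranking (contrastive) penalty.

    pred, target: (batch,) predicted and ground-truth MOS.
    The ranking term penalizes pairs whose predicted difference disagrees
    with the sign of the ground-truth difference by more than `margin`.
    Diagonal (i == i) pairs contribute a constant with zero gradient.
    """
    mse = F.mse_loss(pred, target)
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # pairwise prediction diffs
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # pairwise label diffs
    rank = torch.relu(margin - torch.sign(dt) * dp).mean()
    return mse + lam * rank

# Example: one backward pass over a batch of four predictions.
pred   = torch.tensor([3.1, 4.0, 2.5, 3.8], requires_grad=True)
target = torch.tensor([3.0, 4.2, 2.2, 3.9])
combined_loss(pred, target).backward()
```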

7. Prospective Directions

VoiceMOS organizers and participants highlight several future directions:

  • Multi-task and Multi-domain Prediction Architectures: Simultaneously predict MOS, SIG/BAK/OVRL, and possibly finer-grained perceptual judgments (e.g., pairwise preference) (Huang et al., 2024).
  • Expansion to Broader Audio Domains: Generalization across sampling rates, audio types (e.g., music, environmental sound), and zero-shot voice cloning is an active area (Huang et al., 2024, Nishikawa et al., 19 Jul 2025).
  • Uncertainty Quantification: Improved modeling of confidence in utterance- and system-level predictions.
  • Active/Domain-adaptive Learning: Incorporating principled domain adaptation and active sampling under strict label budgets.
  • Advanced Feature Engineering: SFI convolutional layers and neural analog filters have demonstrated promise for sampling-frequency invariance (Nishikawa et al., 19 Jul 2025).

The VoiceMOS Challenge has thus defined the state of the art in MOS prediction, systematically pushed model development toward robust, domain-agnostic evaluators, and set high standards for reproducible, data-driven speech quality assessment (Huang et al., 2024, Cooper et al., 2023, Huang et al., 2022, Nishikawa et al., 19 Jul 2025).
