General-purpose MOS prediction across diverse speech domains

Determine whether a single automatic mean opinion score (MOS) prediction model, trained on one dataset without any per-domain adaptation, can achieve consistently high performance across heterogeneous speech domains and listening test contexts such as French text-to-speech synthesis (Blizzard Challenge 2023), singing voice conversion (SVCC 2023), and noisy/enhanced speech (TMHINT-QI(S)), thereby establishing truly general-purpose MOS prediction.
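As a concrete illustration of what "consistently high performance across domains" would mean, the sketch below computes the per-domain utterance-level and system-level LCC/SRCC typically reported in VoiceMOS-style evaluations. The data layout, domain names, system counts, and synthetic numbers are illustrative assumptions, not results or code from the paper.

```python
from collections import defaultdict

import numpy as np
from scipy.stats import pearsonr, spearmanr


def evaluate_domain(records):
    """records: list of (system_id, predicted_mos, listener_mos) tuples."""
    pred = np.array([p for _, p, _ in records])
    true = np.array([t for _, _, t in records])

    # Utterance-level agreement between predicted and human MOS.
    utt_lcc, utt_srcc = pearsonr(pred, true)[0], spearmanr(pred, true)[0]

    # System-level agreement: average per system first, then correlate.
    by_sys = defaultdict(lambda: ([], []))
    for sys_id, p, t in records:
        by_sys[sys_id][0].append(p)
        by_sys[sys_id][1].append(t)
    sys_pred = np.array([np.mean(v[0]) for v in by_sys.values()])
    sys_true = np.array([np.mean(v[1]) for v in by_sys.values()])
    sys_lcc = pearsonr(sys_pred, sys_true)[0]
    sys_srcc = spearmanr(sys_pred, sys_true)[0]

    return {"utt_LCC": utt_lcc, "utt_SRCC": utt_srcc,
            "sys_LCC": sys_lcc, "sys_SRCC": sys_srcc}


# Synthetic stand-in data for the three 2023 tracks (illustrative only).
rng = np.random.default_rng(0)
domains = {}
for name in ("BC2023", "SVCC2023", "TMHINT-QI(S)"):
    recs = []
    for s in range(5):                        # 5 systems x 20 utterances each
        base = rng.uniform(1.5, 4.5)
        for _ in range(20):
            t = float(np.clip(base + rng.normal(0, 0.4), 1, 5))
            p = float(np.clip(t + rng.normal(0, 0.7), 1, 5))
            recs.append((f"sys{s}", p, t))
    domains[name] = recs

# A predictor is "general purpose" only if every domain's numbers stay high,
# not just the average across domains.
for name, records in domains.items():
    print(name, evaluate_domain(records))
```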

Background

The 2023 edition of the VoiceMOS Challenge deliberately emphasized out-of-domain evaluation by using three distinct tracks: French text-to-speech (BC2023), singing voice conversion (SVCC2023), and noisy/enhanced speech (TMHINT-QI(S)). No official training data was provided for two tracks, and the third used a separate listening test, creating realistic generalization conditions.

Results showed that team performances varied markedly across tracks, and no team achieved high scores on all three tracks with a single model trained on the same data, indicating that a general-purpose MOS predictor that generalizes across diverse speech tasks and listening test contexts remains an open problem.
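For context on what "a single model trained on the same data" looks like in practice, participants and baselines in this space commonly build MOS predictors on self-supervised speech representations. The minimal sketch below assumes a torchaudio wav2vec 2.0 backbone with a mean-pooling regression head; it is a generic stand-in, not any specific system described in the paper.

```python
import torch
import torchaudio


class MOSPredictor(torch.nn.Module):
    """Minimal SSL-feature MOS regressor (illustrative, not the paper's model)."""

    def __init__(self):
        super().__init__()
        bundle = torchaudio.pipelines.WAV2VEC2_BASE   # 768-dim SSL features
        self.ssl = bundle.get_model()
        self.head = torch.nn.Linear(768, 1)

    def forward(self, waveforms):
        # waveforms: (batch, samples) at 16 kHz
        feats, _ = self.ssl.extract_features(waveforms)
        pooled = feats[-1].mean(dim=1)         # mean-pool frames of last layer
        return self.head(pooled).squeeze(-1)   # one MOS estimate per utterance


model = MOSPredictor().eval()
with torch.inference_mode():
    scores = model(torch.randn(2, 16000))      # two dummy 1-second clips
print(scores.shape)                            # torch.Size([2])
```

Training such a model on one corpus and then scoring BC2023, SVCC2023, and TMHINT-QI(S) without per-domain adaptation is the setting the question above targets; the domain gaps (singing voice, noise and enhancement artifacts, different languages and listener pools) are what make that no-adaptation constraint difficult.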

References

The most important result was that most teams' scores for the different tracks were very different, and no team had high scores on all tracks using the same model trained on the same data, indicating that general-purpose MOS prediction can still be considered an open research problem.

Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities (2508.00317 - Huang, 1 Aug 2025) in Subsection "Results and insights" under "The VoiceMOS Challenge 2023"