- The paper demonstrates that SSL-finetuned models outperform others in MOS prediction for synthetic speech.
- It analyzes 22 teams across main and OOD tracks, highlighting challenges with unseen systems and speaker variability.
- The findings underscore the need for enhanced domain adaptation to replace costly human listening tests with automated evaluations.
An Analysis of the VoiceMOS Challenge 2022
The research presented in "The VoiceMOS Challenge 2022" paper addresses the nascent but crucial area of automatic mean opinion score (MOS) prediction for synthetic speech. The challenge represents a pivotal step towards developing robust, generalized models capable of assessing the output quality of speech synthesis systems without direct human involvement. It is an ambitious undertaking that brings together researchers from academia and industry, underscoring the collaborative nature of advancing machine learning in speech processing.
Overview of the Challenge
The VoiceMOS Challenge 2022 drew 22 participating teams to predict the MOS of synthesized speech in both a main track and an out-of-domain (OOD) track. The primary goal was to develop models that replicate human perception of speech naturalness, as judged through MOS in listening tests. The datasets spanned 187 text-to-speech (TTS) and voice conversion (VC) systems collected over ten years, along with a more recent set of systems for the OOD track, posing substantial challenges in generalizing to unseen speakers, listeners, and systems.
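To make the prediction target concrete, the sketch below compares a set of predicted scores against human MOS labels at both the utterance and system level. The metric set (mean squared error plus linear and Spearman rank correlation) follows common practice in MOS prediction work rather than the challenge's exact scoring scripts, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def mos_metrics(pred, true, system_ids):
    """Compare predicted MOS against human MOS at utterance and system level."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    sys_ids = np.asarray(system_ids)

    # Utterance-level agreement between predictions and listener ratings.
    utterance = {
        "mse": float(np.mean((pred - true) ** 2)),
        "lcc": float(pearsonr(pred, true)[0]),
        "srcc": float(spearmanr(pred, true)[0]),
    }

    # System-level agreement: average per system, then correlate the averages.
    systems = np.unique(sys_ids)
    pred_sys = np.array([pred[sys_ids == s].mean() for s in systems])
    true_sys = np.array([true[sys_ids == s].mean() for s in systems])
    system = {
        "mse": float(np.mean((pred_sys - true_sys) ** 2)),
        "lcc": float(pearsonr(pred_sys, true_sys)[0]),
        "srcc": float(spearmanr(pred_sys, true_sys)[0]),
    }
    return {"utterance": utterance, "system": system}
```

System-level rank correlation tends to matter most when the goal is to compare synthesis systems, since it reflects whether the predictor orders systems the same way human listeners do.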
Methodological Insights
Participating teams employed a wide range of methodologies, and fine-tuning pretrained self-supervised learning (SSL) speech models emerged as particularly effective. This observation aligns with the broader trend in machine learning, where SSL models perform strongly across domains thanks to their ability to learn rich representations from large amounts of unannotated data. The challenge also underscored the difficulty of predicting MOS for unseen systems and listeners, especially in the OOD context, identifying generalization as a major area requiring improvement.
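As a concrete illustration of what fine-tuning an SSL model for MOS prediction involves, the sketch below places mean pooling and a linear regression head on top of a pretrained wav2vec 2.0 encoder (loaded here via Hugging Face transformers, a tooling assumption) and fine-tunes the whole stack against human MOS labels with an L1 loss. This mirrors the general recipe of SSL-based entries rather than any specific team's or baseline's implementation.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLMOSPredictor(nn.Module):
    """Pretrained SSL encoder + mean pooling + linear head -> scalar MOS."""
    def __init__(self, ssl_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(ssl_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz audio
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                        # average over time
        return self.head(pooled).squeeze(-1)               # predicted MOS per utterance

# Fine-tuning step (sketch): regress predicted scores onto listener MOS labels.
model = SSLMOSPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = nn.L1Loss()

def train_step(waveforms: torch.Tensor, mos_labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(waveforms), mos_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Entries naturally varied the backbone, the pooling strategy, and the training objective; the point here is only the overall pattern of a large pretrained encoder paired with a lightweight regression head.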
Key Findings and Implications
Two primary takeaways emerged from the results. First, SSL-finetuned models generally outperformed other approaches, demonstrating the benefit of leveraging large-scale pretrained models even in a specialized task like MOS prediction. Second, unseen conditions, such as new synthesis systems and unfamiliar speakers, remain significant hurdles, with a measurable performance drop when models are exposed to data distributions absent from training.
From a theoretical perspective, this work highlights the necessity of improving domain adaptation techniques and developing strategies to mitigate the impact of training-test distribution mismatches. Practically, the challenge pushes the boundaries of speech synthesis evaluation, aiming to replace costly and time-consuming human listening tests with automated, reliable systems.
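One simple form such adaptation can take is continued fine-tuning: reuse a predictor trained on the main-track data and fine-tune it further on the small amount of labeled OOD data at a reduced learning rate, so the new distribution is absorbed without discarding what was learned from the larger corpus. The helper below is a hypothetical sketch building on the SSLMOSPredictor defined earlier; it is not a method prescribed by the paper.

```python
def adapt_to_ood(model: SSLMOSPredictor, ood_batches, lr: float = 1e-6, epochs: int = 5):
    """Continue fine-tuning a main-track MOS predictor on limited labeled OOD data."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # small lr: gentle adaptation
    loss_fn = nn.L1Loss()
    model.train()
    for _ in range(epochs):
        for waveforms, mos_labels in ood_batches:  # (batch, samples), (batch,)
            optimizer.zero_grad()
            loss_fn(model(waveforms), mos_labels).backward()
            optimizer.step()
    return model
```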
Looking Ahead
The VoiceMOS Challenge establishes a foundation for future research, advocating for enhanced generalization in MOS prediction models. These efforts will need to address the limitations identified, particularly within OOD settings, to further refine model accuracy and applicability. As synthetic speech systems continue to evolve, the ability of evaluation frameworks to keep pace with these advances will be crucial, not only for academic investigations but also for real-world applications where synthetic voices are increasingly commonplace.
In summary, the VoiceMOS Challenge is a critical endeavor in the ongoing dialogue between advancing synthetic speech technologies and their evaluation frameworks. The results and insights from this challenge not only provide a benchmark for current capabilities but also illuminate the path forward for future research and development in automatic speech assessment methodologies.