UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tests. Our system is based on ensemble learning of strong and weak learners. Strong learners incorporate several improvements to the previous fine-tuning models of self-supervised learning (SSL) models, while weak learners use basic machine-learning methods to predict scores from SSL features. In the Challenge, our system had the highest score on several metrics for both the main and OOD tracks. In addition, we conducted ablation studies to investigate the effectiveness of our proposed methods.
Practical Applications
Overview
Based on the UTMOS system for predicting mean opinion scores (MOS) of synthetic speech, the paper's findings and methods (ensemble learning with strong and weak learners, contrastive loss, listener-dependent modeling, phoneme encoding, data augmentation, and domain-aware stacking) enable concrete applications in product quality assurance, research acceleration, multilingual deployment, and governance of AI-generated audio.
Below are actionable use cases grouped into Immediate Applications (deployable now) and Long-Term Applications (requiring further research, scaling, or development). Each item includes sector links, potential tools/workflows, and assumptions or dependencies that affect feasibility.
Immediate Applications
- Automated TTS quality assurance in CI/CD
- Sectors: software, media, customer service
- Tool/workflow: Integrate a UTMOS-style MOS predictor as an API or CI plugin to gate releases; use ensemble stacking for robust scores; track SRCC/MSE across builds and systems (see the gate sketch after this item)
- Assumptions/dependencies: Domain-specific calibration data; availability of pretrained SSL models (e.g., wav2vec2.0, HuBERT, WavLM); compute resources for inference; acceptable correlation to human MOS in the target domain
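A minimal sketch of such a CI gate, assuming a hypothetical `utmos_predict` wrapper around a UTMOS-style checkpoint; the `MOS_THRESHOLD` value is illustrative and would be tuned per domain:

```python
# Minimal CI quality-gate sketch. `utmos_predict` is a hypothetical
# wrapper around a UTMOS-style ensemble; the threshold is illustrative.
import sys
from pathlib import Path

MOS_THRESHOLD = 3.8  # hypothetical release gate, tuned per domain

def utmos_predict(wav_path: Path) -> float:
    """Placeholder for an ensemble MOS predictor (strong + weak learners)."""
    raise NotImplementedError("plug in your UTMOS-style model here")

def main(audio_dir: str) -> int:
    scores = [utmos_predict(p) for p in sorted(Path(audio_dir).glob("*.wav"))]
    mean_mos = sum(scores) / len(scores)
    print(f"predicted system-level MOS: {mean_mos:.3f}")
    # Fail the build if predicted quality regresses below the gate.
    return 0 if mean_mos >= MOS_THRESHOLD else 1

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```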
- A/B benchmarking for speech conversion/vocoder models
- Sectors: academia, research labs, speech tech vendors
- Tool/workflow: Batch-compute MOS for model variants; use contrastive-loss-inspired ranking to detect regressions, since pairwise comparisons are sensitive to rank metrics such as SRCC and Kendall's tau (see the metric sketch after this item)
- Assumptions/dependencies: Comparable test sets; consistent preprocessing (16 kHz, volume normalization); stacking benefits rely on model diversity
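A sketch of the metric side of this workflow using `scipy.stats`; the score arrays are random stand-ins for per-utterance predictions from two model variants:

```python
# Compare two model variants with rank-focused metrics (SRCC, Kendall)
# plus MSE. The arrays are illustrative stand-ins for real predictions.
import numpy as np
from scipy.stats import spearmanr, kendalltau

rng = np.random.default_rng(0)
human = rng.uniform(1, 5, size=50)           # reference ratings (if available)
mos_a = human + rng.normal(0, 0.3, size=50)  # variant A predictions
mos_b = human + rng.normal(0, 0.6, size=50)  # variant B predictions

for name, pred in [("A", mos_a), ("B", mos_b)]:
    srcc, _ = spearmanr(human, pred)
    ktau, _ = kendalltau(human, pred)
    print(f"variant {name}: SRCC={srcc:.3f}  Kendall={ktau:.3f}  "
          f"MSE={np.mean((human - pred) ** 2):.3f}")
```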
- VoIP/call-center synthetic agent QoE monitoring
- Sectors: telecom, finance, customer support
- Tool/workflow: Real-time or near-real-time MOS estimation at utterance and system levels; dashboards with SRCC/MSE trends; alerts on quality dips (a rolling-alert sketch follows this item)
- Assumptions/dependencies: Latency constraints and model compression for streaming; domain mismatch handling; listener-dependent training replaced by mean-listener inference
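A rolling-alert sketch, assuming utterance-level MOS predictions arrive as a stream; the window size and alert threshold are illustrative:

```python
# Near-real-time QoE monitoring: rolling mean over recent utterance-level
# MOS predictions with a simple dip alert. Window/threshold are assumptions.
from collections import deque

class MosMonitor:
    def __init__(self, window: int = 50, alert_below: float = 3.5):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def observe(self, utterance_mos: float) -> None:
        self.scores.append(utterance_mos)
        rolling = sum(self.scores) / len(self.scores)
        if len(self.scores) == self.scores.maxlen and rolling < self.alert_below:
            print(f"ALERT: rolling MOS {rolling:.2f} below {self.alert_below}")

monitor = MosMonitor(window=3)
for mos in [4.1, 4.0, 3.2, 3.1]:  # stand-in predictions
    monitor.observe(mos)
```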
- Dataset curation and triage for TTS training
- Sectors: machine learning ops, data engineering
- Tool/workflow: Predict MOS for large audio corpora; filter low-quality utterances; prioritize annotation; balance systems and speakers by predicted quality (a triage sketch follows this item)
- Assumptions/dependencies: MOS predictor correlates with perceived quality for synthetic speech; robust cross-system generalization
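A triage sketch using illustrative percentile cut-offs; `predicted` stands in for corpus-level MOS predictions:

```python
# Percentile-based triage of a corpus by predicted MOS. The 10%/30%
# cut-offs (discard / review-for-annotation) are illustrative choices.
import numpy as np

predicted = np.random.default_rng(1).uniform(1, 5, size=1000)  # stand-in MOS
lo, mid = np.percentile(predicted, [10, 30])

discard = predicted < lo                        # likely unusable
review = (predicted >= lo) & (predicted < mid)  # prioritize for annotation
keep = predicted >= mid
print(f"keep={keep.sum()}  review={review.sum()}  discard={discard.sum()}")
```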
- Intelligibility QA via phoneme/reference mismatch detection
- Sectors: education (e-learning), audiobooks, content production
- Tool/workflow: Use the phoneme encoder with ASR-derived sequences and DBSCAN-based reference clustering to flag mismatches; report intelligibility risk per clip (a clustering sketch follows this item)
- Assumptions/dependencies: ASR accuracy (xlsr-53 or similar) for target language; repeated prompts or clustered content; language coverage and phoneme set alignment
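A clustering sketch assuming phoneme sequences have already been produced by ASR; the normalized Levenshtein distance and the `eps`/`min_samples` values are illustrative choices, not the paper's exact configuration:

```python
# DBSCAN over pairwise phoneme-sequence distances: clips that fall
# outside any cluster of repeated prompts are flagged as mismatch risks.
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

phonemes = ["h ə l oʊ", "h ə l oʊ", "h ɛ l oʊ", "g ʊ d b aɪ"]  # stand-ins
n = len(phonemes)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = levenshtein(phonemes[i], phonemes[j]) / max(len(phonemes[i]), len(phonemes[j]))
        dist[i, j] = dist[j, i] = d

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(dist)
flagged = [p for p, l in zip(phonemes, labels) if l == -1]  # outliers
print("flagged:", flagged)
```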
- Crowdsourcing-efficient MOS labeling to expand unlabeled data
- Sectors: academia, industry R&D
- Tool/workflow: Conduct small listening tests (e.g., ~2 ratings/utterance); calibrate externally collected scores to internal scales (see the sketch after this item); blend into training (semi-supervised)
- Assumptions/dependencies: Strong correlation between small-scale ratings and ground truth; culturally/language-appropriate listener pools; ethical crowdsourcing practices
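A calibration sketch assuming a small overlap set rated on both scales; a least-squares linear map is one simple choice:

```python
# Linear calibration of externally collected ratings onto the internal
# MOS scale via least squares. The overlap set below is a stand-in.
import numpy as np

external = np.array([2.0, 2.5, 3.0, 3.5, 4.0])  # external-scale scores
internal = np.array([2.3, 2.9, 3.3, 3.9, 4.4])  # same clips, internal scale

slope, intercept = np.polyfit(external, internal, deg=1)
calibrated = slope * external + intercept  # blend these into training
print(f"calibration: internal ≈ {slope:.2f} * external + {intercept:.2f}")
```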
- Multilingual voice quality evaluation with domain-aware stacking
- Sectors: localization, global product operations
- Tool/workflow: Train per-domain weak learners (main/OOD/external), stack predictions; include domain IDs and mean-listener embeddings to reduce bias across languages/tests (a stacking sketch follows this item)
- Assumptions/dependencies: Sufficient per-domain SSL features; phoneme encoders for target languages; labeled or partially labeled domain data
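A stacking sketch with a ridge meta-learner over weak-learner predictions plus a one-hot domain ID; the data and feature layout are illustrative, and the paper's final stacking stage may differ:

```python
# Domain-aware stacking: meta-learner over weak-learner predictions
# concatenated with a one-hot domain ID. All data here is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n = 200
weak_preds = rng.uniform(1, 5, size=(n, 3))   # 3 weak learners
domain_id = rng.integers(0, 2, size=n)        # 0=main, 1=OOD
domain_onehot = np.eye(2)[domain_id]
X = np.hstack([weak_preds, domain_onehot])
y = weak_preds.mean(axis=1) + rng.normal(0, 0.2, size=n)  # stand-in targets

stacker = Ridge(alpha=1.0).fit(X, y)
print("stacked predictions:", stacker.predict(X[:3]))
```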
- Robustness improvements via safe audio augmentation
- Sectors: research, model training
- Tool/workflow: Apply speaking-rate and pitch-shift augmentations (WavAugment) within tuned rate and pitch-cent ranges to stabilize training in low-data regimes (see the sketch after this item)
- Assumptions/dependencies: Augmentation ranges that preserve perceived MOS; pipeline supports augmentation; no unintended artifacts
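An augmentation sketch using torchaudio's sox bindings as a stand-in for WavAugment; the speed factor, cent value, and file path are illustrative, since it is the paper's tuned ranges that preserve perceived MOS:

```python
# Speaking-rate and pitch-shift augmentation via sox effects.
# Values and "sample.wav" are illustrative assumptions.
import torchaudio

waveform, sr = torchaudio.load("sample.wav")  # hypothetical 16 kHz clip

speed = 1.05  # illustrative speaking-rate factor
cents = 50    # illustrative pitch shift in cents
effects = [
    ["speed", f"{speed}"],  # change speaking rate
    ["rate", f"{sr}"],      # resample back to the original rate
    ["pitch", f"{cents}"],  # shift pitch by `cents`
]
augmented, _ = torchaudio.sox_effects.apply_effects_tensor(waveform, sr, effects)
```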
- Regression test harness using rank-focused evaluation
- Sectors: software QA for audio systems
- Tool/workflow: Create pairwise comparison suites; leverage contrastive loss margins to detect sign inversions (wrong rank order) between versions (a loss sketch follows this item)
- Assumptions/dependencies: Stable test sets; pair selection strategy; margin hyperparameters calibrated to domain
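A sketch of a pairwise margin loss in the spirit of the paper's contrastive objective: pairs whose predicted score difference deviates from the ground-truth difference by more than a margin incur loss. The margin value is an illustrative assumption:

```python
# Pairwise contrastive margin loss: penalize pairs whose predicted
# score gap deviates from the true gap by more than a margin.
import torch

def contrastive_margin_loss(pred: torch.Tensor, target: torch.Tensor,
                            margin: float = 0.1) -> torch.Tensor:
    pred_diff = pred.unsqueeze(0) - pred.unsqueeze(1)      # all pairs
    target_diff = target.unsqueeze(0) - target.unsqueeze(1)
    return torch.clamp((pred_diff - target_diff).abs() - margin, min=0).mean()

pred = torch.tensor([3.2, 4.1, 2.5])
target = torch.tensor([3.0, 4.3, 2.6])
print(contrastive_margin_loss(pred, target))  # near zero => ranks preserved
```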
- Vendor certification and procurement due diligence for synthetic voice solutions
- Sectors: policy/compliance, procurement, media platforms
- Tool/workflow: Establish quality thresholds using system-level SRCC/MSE; require third-party MOS prediction reports in RFPs; audit OOD performance
- Assumptions/dependencies: Acceptance of automated MOS proxies; agreed-upon standards and reference datasets; periodic recalibration to context
- QC for speech restoration pipelines (noise suppression, dereverberation)
- Sectors: conferencing/meeting software, telecom
- Tool/workflow: MOS-based gates post-enhancement; continuous monitoring of pipeline health across versions and environments
- Assumptions/dependencies: MOS predictor trained or adapted to enhanced speech distributions; potential domain drift in real-world conditions
- Content platform quality gates for AI voice ads and announcements
- Sectors: advertising tech, transportation/retail PA systems
- Tool/workflow: Automated MOS check for uploaded synthetic audio; flag low-intelligibility pieces before publication
- Assumptions/dependencies: Platform-specific intelligibility criteria; multilingual ASR/phoneme support
Long-Term Applications
- General-purpose MOS prediction model across diverse domains and languages
- Sectors: software, standards bodies
- Tool/workflow: Large-scale training with domain IDs, listener-dependent modeling, phoneme encoders; public benchmarks beyond Blizzard/VC datasets
- Assumptions/dependencies: Expanded, diverse labeled datasets; robust cross-lingual ASR; ongoing bias audits
- Personalized MOS prediction aligned to target user cohorts
- Sectors: UX/localization, consumer products
- Tool/workflow: Listener-dependent embeddings tailored to demographic or preference profiles; adaptive calibration during inference
- Assumptions/dependencies: Privacy-preserving collection of listener attributes; models generalize beyond mean-listener; fairness considerations
- Real-time, on-device MOS inference for edge devices
- Sectors: robotics, IoT, wearables
- Tool/workflow: Compress strong learners (SSL+BLSTM) with quantization/distillation; enable local quality monitoring and adaptive voice adjustments
- Assumptions/dependencies: Hardware constraints; latency targets; accuracy retention under compression
- Regulatory standards for AI-generated voice quality and intelligibility
- Sectors: policy/regulation, media governance
- Tool/workflow: Standardized test suites and metrics (SRCC/Kendall/MSE) across domains; interoperability guidelines for MOS predictors
- Assumptions/dependencies: Broad stakeholder agreement; impartial validation; inclusion of accent/language fairness
- Closed-loop adaptive TTS that optimizes prosody to maximize predicted MOS
- Sectors: software, accessibility
- Tool/workflow: Use predicted MOS as feedback to adjust speaking rate/pitch/prosody within safe ranges; deploy auto-tuning for different content types (a tuning sketch follows this item)
- Assumptions/dependencies: Differentiable or iterative control over TTS parameters; robust generalization; user preference modeling
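A sketch of a simple, non-differentiable tuner that grid-searches rate and pitch offsets to maximize predicted MOS; `synthesize` and `predict_mos` are hypothetical hooks into a TTS engine and a MOS predictor:

```python
# Closed-loop prosody tuning by grid search over predicted MOS.
# Both hooks and the search ranges are illustrative assumptions.
import itertools

def synthesize(text: str, rate: float, pitch_cents: int):
    raise NotImplementedError("hook into your TTS engine")

def predict_mos(waveform) -> float:
    raise NotImplementedError("hook into a UTMOS-style predictor")

def tune(text: str):
    rates = [0.95, 1.0, 1.05]  # illustrative safe ranges
    pitches = [-50, 0, 50]     # cents
    best = max(itertools.product(rates, pitches),
               key=lambda rp: predict_mos(synthesize(text, *rp)))
    return best  # (rate, pitch) with the highest predicted MOS
```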
- Healthcare applications in speech assessment (e.g., dysarthria intelligibility proxies)
- Sectors: healthcare, assistive tech
- Tool/workflow: Adapt MOS predictors to natural impaired speech using phoneme encoders and domain-aware training; monitor therapy progress
- Assumptions/dependencies: Clinical validation; careful domain shift from synthetic to clinical speech; ethical and privacy safeguards
- Automated pronunciation grading in language learning
- Sectors: education, EdTech
- Tool/workflow: Phoneme-based encoding and mismatch analysis for learner speech; combine MOS-like scores with ASR alignment for feedback
- Assumptions/dependencies: Pedagogically valid scoring; robust multilingual phoneme models; fairness across accents
- Quality-based marketplaces and pricing for synthetic voice assets
- Sectors: finance, media platforms
- Tool/workflow: MOS-driven tiers for voice models/clips; buyers informed by standardized quality metrics
- Assumptions/dependencies: Trust in scoring and transparency; anti-gaming measures; cross-vendor comparability
- Cross-lingual intelligibility benchmarking services
- Sectors: localization, international education
- Tool/workflow: Multilingual phoneme encoders and ASR coverage; domain- and listener-aware evaluation across languages
- Assumptions/dependencies: Extensive language support; varying script/phoneme systems; labeled data for calibration
- Data-efficient MOS labeling frameworks (semi-supervised + contrastive learning)
- Sectors: ML research, data ops
- Tool/workflow: Blend small human ratings with unlabeled audio using ranking-based objectives; scale with stacking and weak learners
- Assumptions/dependencies: Reliable small-scale ratings; robust semi-supervised procedures; domain drift management