- The paper introduces MOSNet, a deep learning model that uses CNN, BLSTM, and hybrid CNN-BLSTM architectures to predict MOS ratings with high system-level correlation.
- It improves on traditional metrics such as Mel-cepstral distortion (MCD) by minimizing both utterance-level and frame-level MSE, yielding predictions that track human evaluations more closely.
- Experimental results on VCC 2018 and VCC 2016 datasets demonstrate MOSNet’s strong performance and potential for automating VC evaluation.
MOSNet: Deep Learning-based Objective Assessment for Voice Conversion
The paper introduces MOSNet, a deep learning-based approach for the objective assessment of voice conversion (VC) systems, aiming to predict human ratings of converted speech more accurately than traditional objective measures. The authors address the limitations of metrics like Mel-cepstral distortion (MCD), which align poorly with human perception of speech quality. The proposed system adopts convolutional and recurrent neural network architectures to build a Mean Opinion Score (MOS) predictor that correlates well with subjective human evaluations.
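As described in the methodology below, MOSNet consumes raw magnitude spectrograms rather than hand-crafted distance features. The following is a minimal numpy sketch of such a feature extractor; the frame length, hop size, and Hann window are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def magnitude_spectrogram(wav, n_fft=512, hop=256):
    """Frame the waveform, apply a Hann window, and take the magnitude
    of the real FFT per frame. Parameter values are illustrative."""
    window = np.hanning(n_fft)
    n_frames = 1 + max(0, (len(wav) - n_fft) // hop)
    frames = np.stack(
        [wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # shape: (n_frames, n_fft // 2 + 1), one spectral slice per time frame
    return np.abs(np.fft.rfft(frames, axis=1))
```

Each row of the output is one time frame; a CNN or BLSTM front end then reads this (time, frequency) matrix directly, with no vocoder-style feature alignment required.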
Summary of Methodology
The MOSNet model takes raw magnitude spectrograms as input features and employs three neural network architectures: CNN, BLSTM, and a hybrid CNN-BLSTM. These architectures extract features for predicting MOS, with the CNN-BLSTM variant achieving the best overall performance. The model is trained to minimize both utterance-level and frame-level mean squared error (MSE), aligning its predictions closely with human ratings. This objective function, which incorporates frame-level errors alongside the utterance-level error, notably improves the model's utterance-level MOS predictions.
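The combined objective can be sketched as follows. This is a minimal numpy sketch written from the paper's description, not the authors' code: each frame prediction is regressed toward the utterance's single human MOS, and `alpha` is a hypothetical name for the frame-level weighting.

```python
import numpy as np

def mosnet_loss(frame_preds, true_mos, alpha=1.0):
    """Combined utterance- and frame-level MSE for one utterance.

    frame_preds : 1-D array of per-frame MOS predictions
    true_mos    : scalar human-rated MOS for the whole utterance
    alpha       : weight on the frame-level term (illustrative name)
    """
    utt_pred = frame_preds.mean()                       # utterance score = mean over frames
    utt_mse = (utt_pred - true_mos) ** 2                # utterance-level error
    frame_mse = np.mean((frame_preds - true_mos) ** 2)  # every frame targets the utterance MOS
    return utt_mse + alpha * frame_mse
```

The frame-level term penalizes predictions that average out correctly but fluctuate wildly across frames, which is why it helps utterance-level accuracy.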
Experimental Validation and Results
The experiments leverage data from the Voice Conversion Challenge (VCC) 2018, which includes large-scale human evaluations of converted voice samples. Results indicate that MOSNet's predictions achieve a high linear correlation coefficient (LCC) with human scores at the system level (up to 0.957) and a fair correlation at the utterance level (LCC up to 0.642), outperforming existing methods. The paper also demonstrates generalization by applying the model trained on VCC 2018 to the VCC 2016 data, where it maintains strong correlations.
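The reported LCC is the Pearson correlation; the system-level figure averages each system's utterance scores before correlating, which smooths out per-utterance noise and explains why it is so much higher than the utterance-level figure. A small sketch (variable names are illustrative, not from the paper):

```python
import numpy as np

def lcc(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

def system_level_lcc(pred, human, system_ids):
    """Average predicted and human MOS per system, then correlate the means."""
    systems = sorted(set(system_ids))
    pred_means = [np.mean([p for p, s in zip(pred, system_ids) if s == sid])
                  for sid in systems]
    human_means = [np.mean([h for h, s in zip(human, system_ids) if s == sid])
                   for sid in systems]
    return lcc(pred_means, human_means)
```

Utterance-level LCC is simply `lcc(pred, human)` over all utterances, with no per-system averaging.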
Furthermore, MOSNet's architecture is slightly modified to predict similarity scores between converted and target speech samples with fair correlation results, indicating its versatility in assessing both naturalness and similarity in VC systems.
Implications and Future Work
The introduction of MOSNet provides a meaningful advancement for automating VC evaluation, which traditionally depends on resource-intensive human evaluations. This model lays the groundwork for deploying non-intrusive evaluation systems capable of consistently predicting perceptual quality metrics, thereby reducing evaluation costs and potentially accelerating developments in VC technology.
Theoretical implications center on aligning machine learning models with human perception, encouraging further research into integrating perceptual theory with neural architectures. Future work may explore alternative ways of aligning computational models with human perception, potentially incorporating psychoacoustic principles into MOSNet's architecture to further narrow the gap between machine predictions and human evaluations.
The paper opens avenues for refining objective speech assessment methodologies, possibly extending MOSNet’s applicability to other domains of speech processing, such as speech synthesis or speech enhancement. The potential to refine model components, improve generalization to diverse datasets, and incorporate additional acoustic features remains a subject of future exploration for both academia and industry stakeholders.