- The paper introduces MOSNet, a deep learning model that uses CNN, BLSTM, and hybrid CNN-BLSTM architectures to predict MOS ratings with high system-level correlation.
- It improves on traditional metrics such as Mel-cepstral distortion (MCD) by minimizing both utterance-level and frame-level MSE, yielding predictions that track human evaluations more closely.
- Experimental results on VCC 2018 and VCC 2016 datasets demonstrate MOSNet’s strong performance and potential for automating VC evaluation.
MOSNet: Deep Learning-based Objective Assessment for Voice Conversion
The paper introduces MOSNet, a deep learning-based approach for the objective assessment of voice conversion (VC) systems, aiming to predict human ratings of converted speech more accurately than traditional objective measures. The authors address the limitations of metrics like Mel-cepstral distortion (MCD), which align poorly with human perception of speech quality. The proposed system adopts convolutional and recurrent neural network architectures to build a Mean Opinion Score (MOS) predictor that correlates well with subjective human evaluations.
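As described in the methodology below, MOSNet consumes raw magnitude spectrograms rather than hand-crafted distance features. The following is a minimal numpy sketch of such a feature extractor; the frame length, hop size, and Hann window are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def magnitude_spectrogram(wav, n_fft=512, hop=256):
    """Frame the waveform, apply a Hann window, and take the magnitude
    of the real FFT per frame. Parameter values are illustrative."""
    window = np.hanning(n_fft)
    n_frames = 1 + max(0, (len(wav) - n_fft) // hop)
    frames = np.stack(
        [wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # shape: (n_frames, n_fft // 2 + 1), one spectral slice per time frame
    return np.abs(np.fft.rfft(frames, axis=1))
```

Each row of the output is one time frame; a CNN or BLSTM front end then reads this (time, frequency) matrix directly, with no vocoder-style feature alignment required.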
Summary of Methodology
The MOSNet model takes raw magnitude spectrograms as input features and employs three neural network architectures: CNN, BLSTM, and a hybrid CNN-BLSTM. These architectures extract features for predicting MOS, with the CNN-BLSTM variant achieving the best overall performance. The model is trained to minimize both utterance-level and frame-level mean squared error (MSE), aligning its predictions closely with human ratings. This objective function, which incorporates frame-level errors alongside the utterance-level error, notably improves the model's utterance-level MOS predictions.
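The combined objective can be sketched as follows. This is a minimal numpy sketch written from the paper's description, not the authors' code: each frame prediction is regressed toward the utterance's single human MOS, and `alpha` is a hypothetical name for the frame-level weighting.

```python
import numpy as np

def mosnet_loss(frame_preds, true_mos, alpha=1.0):
    """Combined utterance- and frame-level MSE for one utterance.

    frame_preds : 1-D array of per-frame MOS predictions
    true_mos    : scalar human-rated MOS for the whole utterance
    alpha       : weight on the frame-level term (illustrative name)
    """
    utt_pred = frame_preds.mean()                       # utterance score = mean over frames
    utt_mse = (utt_pred - true_mos) ** 2                # utterance-level error
    frame_mse = np.mean((frame_preds - true_mos) ** 2)  # every frame targets the utterance MOS
    return utt_mse + alpha * frame_mse
```

The frame-level term penalizes predictions that average out correctly but fluctuate wildly across frames, which is why it helps utterance-level accuracy.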
Experimental Validation and Results
The experiments leverage data from the Voice Conversion Challenge (VCC) 2018, which includes large-scale human evaluations of converted voice samples. Results indicate that MOSNet's predictions achieve a high linear correlation coefficient (LCC) with human scores at the system level (up to 0.957) and a fair correlation at the utterance level (LCC up to 0.642), outperforming existing methods. The paper also demonstrates generalization by applying the model trained on VCC 2018 to the VCC 2016 data, where it maintains strong correlations.
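The reported LCC is the Pearson correlation; the system-level figure averages each system's utterance scores before correlating, which smooths out per-utterance noise and explains why it is so much higher than the utterance-level figure. A small sketch (variable names are illustrative, not from the paper):

```python
import numpy as np

def lcc(x, y):
    """Pearson linear correlation coefficient between two score vectors."""
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

def system_level_lcc(pred, human, system_ids):
    """Average predicted and human MOS per system, then correlate the means."""
    systems = sorted(set(system_ids))
    pred_means = [np.mean([p for p, s in zip(pred, system_ids) if s == sid])
                  for sid in systems]
    human_means = [np.mean([h for h, s in zip(human, system_ids) if s == sid])
                   for sid in systems]
    return lcc(pred_means, human_means)
```

Utterance-level LCC is simply `lcc(pred, human)` over all utterances, with no per-system averaging.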
Furthermore, MOSNet's architecture is slightly modified to predict similarity scores between converted and target speech samples with fair correlation results, indicating its versatility in assessing both naturalness and similarity in VC systems.
Implications and Future Work
The introduction of MOSNet provides a meaningful advancement for automating VC evaluation, which traditionally depends on resource-intensive human evaluations. This model lays the groundwork for deploying non-intrusive evaluation systems capable of consistently predicting perceptual quality metrics, thereby reducing evaluation costs and potentially accelerating developments in VC technology.
Theoretical implications center on aligning machine learning models with human perception, encouraging further research into integrating perceptual theory with neural architectures. Future work may explore alternative ways of aligning computational models with human perception, potentially incorporating psychoacoustic principles into MOSNet's architecture to further narrow the gap between machine predictions and human evaluations.
The paper opens avenues for refining objective speech assessment methodologies, possibly extending MOSNet’s applicability to other domains of speech processing, such as speech synthesis or speech enhancement. The potential to refine model components, improve generalization to diverse datasets, and incorporate additional acoustic features remains a subject of future exploration for both academia and industry stakeholders.