Utilizing Self-supervised Representations for MOS Prediction (2104.03017v3)

Published 7 Apr 2021 in eess.AS, cs.LG, and cs.SD

Abstract: Speech quality assessment has been a critical issue in speech processing for decades. Existing automatic evaluations usually require clean references or parallel ground-truth data, which is infeasible as the amount of data soars. Subjective tests, on the other hand, do not need any additional clean or parallel data and correlate better with human perception. However, such tests are expensive and time-consuming because crowd work is necessary. It is thus highly desirable to develop an automatic evaluation approach that correlates well with human perception while not requiring ground-truth data. In this paper, we use self-supervised pre-trained models for MOS prediction. We show that their representations can distinguish between clean and noisy audio. We then fine-tune these pre-trained models, followed by simple linear layers, in an end-to-end manner. The experimental results show that our framework outperforms the two previous state-of-the-art models by a significant margin on the Voice Conversion Challenge 2018 and achieves comparable or superior performance on the Voice Conversion Challenge 2016. We also conduct an ablation study to further investigate how each module benefits the task. The experiments are implemented and reproducible with publicly available toolkits.

Citations (59)

Summary

  • The paper introduces a novel framework for MOS prediction using self-supervised representations that eliminates reliance on clean reference data.
  • It employs segmental embeddings and attention pooling to capture audio quality nuances and align more closely with human perception.
  • Experimental results on VCC 2016 and 2018 datasets demonstrate superior performance, underscoring the importance of segmental embeddings and a bias network.

Utilizing Self-supervised Representations for MOS Prediction

This paper addresses the long-standing issue of speech quality assessment, a critical aspect in speech processing that traditionally relies on either clean reference data or subjective human evaluations to predict the Mean Opinion Score (MOS). By leveraging self-supervised learning models, the authors propose a novel approach that eliminates the need for clean reference data and aligns closely with human perception.

The research utilizes self-supervised representations derived from models such as wav2vec 2.0, CPC, TERA, and APC to predict MOS. These models are pre-trained on extensive datasets such as LibriSpeech and Libri-Light, allowing them to capture nuances of audio quality. The work demonstrates that such models can inherently distinguish between high- and low-quality audio even before any fine-tuning.
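To make the pipeline concrete, here is a minimal sketch of extracting frame-level self-supervised representations with torchaudio's pretrained wav2vec 2.0 bundle. The paper relies on publicly available toolkits; the specific loading code, file name, and layer choice below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 (trained on LibriSpeech), used here as a frozen
# feature extractor before any MOS fine-tuning.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")  # hypothetical input file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # Returns one tensor per encoder layer, each shaped (batch, frames, feat_dim).
    features, _ = model.extract_features(waveform)

frame_features = features[-1]  # last layer as the frame-level representation
```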

Key Components of the Framework

  1. Segmental Embeddings: The paper evaluates audio at the segment level rather than at the individual frame level, which aligns better with how human listeners judge quality and significantly improves model performance.
  2. Attention Pooling: Attention pooling aggregates frame-level features into segment-level representations, weighting frames by learned relevance rather than averaging them uniformly.
  3. Bias Network: A bias network accounts for the subjective rating biases of individual judges, improving the model's ability to generalize across different evaluation systems.
  4. Range Clipping: Constraining predicted scores to the valid MOS range keeps the outputs within practical boundaries, maintaining stability and predictability. A sketch combining these components follows below.
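The following sketch puts the pieces above together as a simple prediction head on top of the SSL features: attention pooling, a linear regressor, a per-judge bias term, and range clipping. The hyperparameters, the embedding-based form of the bias network, and the clamp-based clipping are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MOSHead(nn.Module):
    """Illustrative MOS prediction head over frame-level SSL features."""

    def __init__(self, feat_dim: int, n_judges: int,
                 mos_min: float = 1.0, mos_max: float = 5.0):
        super().__init__()
        self.attn = nn.Linear(feat_dim, 1)        # attention-pooling scores
        self.regressor = nn.Linear(feat_dim, 1)   # unbiased MOS prediction
        # Bias network (assumption): one learned offset per judge.
        self.judge_bias = nn.Embedding(n_judges, 1)
        self.mos_min, self.mos_max = mos_min, mos_max

    def forward(self, frames: torch.Tensor, judge_id: torch.Tensor | None = None):
        # frames: (batch, n_frames, feat_dim) from the SSL encoder.
        weights = torch.softmax(self.attn(frames), dim=1)   # (batch, n_frames, 1)
        pooled = (weights * frames).sum(dim=1)              # (batch, feat_dim)
        score = self.regressor(pooled).squeeze(-1)          # (batch,)
        if judge_id is not None:
            # Add the per-judge bias during training on individual ratings.
            score = score + self.judge_bias(judge_id).squeeze(-1)
        # Range clipping: keep predictions inside the valid MOS scale.
        return score.clamp(self.mos_min, self.mos_max)
```

At inference time the per-judge bias can simply be omitted, so the head returns the unbiased mean prediction.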

Experimental Evaluation

The paper presents robust results on both the VCC 2016 and VCC 2018 datasets. The self-supervised models not only surpassed previous state-of-the-art methods but also showed superior performance in both utterance-level and system-level MOS prediction, with wav2vec 2.0 emerging as the top performer.
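For context, system-level evaluation is typically done by averaging utterance-level predictions per system and correlating those means with the ground-truth system means. The sketch below uses the standard Pearson/Spearman correlations common in MOS-prediction work; treating this as the paper's exact protocol is an assumption.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def system_level_corr(pred: np.ndarray, true: np.ndarray, system_ids: np.ndarray):
    """Average utterance scores per system, then correlate the system means."""
    systems = np.unique(system_ids)
    pred_sys = np.array([pred[system_ids == s].mean() for s in systems])
    true_sys = np.array([true[system_ids == s].mean() for s in systems])
    return pearsonr(pred_sys, true_sys)[0], spearmanr(pred_sys, true_sys)[0]
```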

The ablation studies revealed each module's contribution to overall performance, with segmental embeddings identified as the most crucial factor: removing them markedly degraded performance, underscoring their importance in capturing audio quality effectively.

Implications and Future Directions

The findings suggest significant implications for the development of automated speech quality assessment systems. By reducing dependence on clean reference data and human evaluations, this methodology could lead to more efficient and scalable evaluation systems in various speech processing applications.

The paper opens avenues for future research, particularly in exploring other self-supervised models and expanding the framework to more diverse and complex datasets. Further investigation into improving the bias network could also yield even more accurate predictions.

In conclusion, the paper presents a comprehensive methodology that combines the strengths of self-supervised learning with specialized mechanisms to enhance MOS prediction. This work represents a meaningful advancement in the field of automated speech quality assessment, providing a promising foundation for further innovations.
