A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality (2204.02249v2)
Abstract: Speech synthesis quality prediction has made remarkable progress with the development of supervised and self-supervised learning (SSL) MOS predictors, but some aspects of the data remain unclear and require further study. In this paper, we evaluate several MOS predictors based on wav2vec 2.0 and the NISQA speech quality prediction model to explore the role of the training data, the influence of the system type, and the role of cross-domain features in SSL models. Our evaluation is based on the VoiceMOS challenge dataset. Results show that SSL-based models achieve the highest correlation and lowest mean squared error compared to supervised models. The key finding of this study is that benchmarking the statistical performance of MOS predictors alone is not sufficient to rank models, since issues hidden in the data could bias the measured performance.
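For readers unfamiliar with SSL-based MOS prediction, the sketch below shows a common design of the kind the abstract describes: a wav2vec 2.0 encoder whose hidden states are mean-pooled over time and fed to a linear regression head that outputs a scalar MOS estimate per clip. The checkpoint name, pooling choice, and head are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a wav2vec 2.0-based MOS predictor (assumed architecture,
# not the paper's exact model): SSL encoder -> mean pooling -> linear head.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLMOSPredictor(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, waveform):  # waveform: (batch, samples) at 16 kHz
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, dim)
        pooled = hidden.mean(dim=1)                        # average over time
        return self.head(pooled).squeeze(-1)               # one MOS score per clip

model = SSLMOSPredictor()
dummy = torch.randn(2, 16000)  # two hypothetical 1-second clips
print(model(dummy).shape)      # torch.Size([2])
```

In practice such a model is fine-tuned on listening-test scores; whether the encoder is frozen or updated is one of the design choices that varies across the predictors compared here.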
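The abstract's evaluation criteria, correlation and mean squared error, are conventionally computed at both the utterance level and the system level (after averaging scores per synthesis system) in MOS prediction work. The sketch below is a minimal, hypothetical illustration of those metrics; the function name and data are invented, and the paper's exact evaluation protocol may differ.

```python
# Minimal sketch of utterance- and system-level Pearson correlation and MSE
# between ground-truth and predicted MOS. All names and data are hypothetical.
import numpy as np
from scipy.stats import pearsonr

def mos_metrics(true_mos, pred_mos, system_ids):
    """Return utterance- and system-level Pearson r and MSE."""
    true_mos = np.asarray(true_mos, dtype=float)
    pred_mos = np.asarray(pred_mos, dtype=float)

    # Utterance level: compare per-file predictions directly.
    utt_r, _ = pearsonr(true_mos, pred_mos)
    utt_mse = np.mean((true_mos - pred_mos) ** 2)

    # System level: average scores per synthesis system, then compare.
    sys_true, sys_pred = [], []
    for sid in sorted(set(system_ids)):
        mask = np.array([s == sid for s in system_ids])
        sys_true.append(true_mos[mask].mean())
        sys_pred.append(pred_mos[mask].mean())
    sys_r, _ = pearsonr(sys_true, sys_pred)
    sys_mse = np.mean((np.array(sys_true) - np.array(sys_pred)) ** 2)

    return {"utt_r": utt_r, "utt_mse": utt_mse, "sys_r": sys_r, "sys_mse": sys_mse}

# Hypothetical example: six utterances from three TTS systems.
print(mos_metrics(
    [3.2, 3.8, 2.1, 2.5, 4.4, 4.0],
    [3.0, 4.0, 2.3, 2.4, 4.5, 3.9],
    ["A", "A", "B", "B", "C", "C"],
))
```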
- Alessandro Ragano
- Emmanouil Benetos
- Michael Chinen
- Helard B. Martinez
- Chandan K. A. Reddy
- Jan Skoglund
- Andrew Hines