The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains (2310.02640v3)
Abstract: We present the second edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthesized and processed speech. This year, we emphasize real-world and challenging zero-shot out-of-domain MOS prediction with three tracks for three different voice evaluation scenarios. Ten teams from industry and academia in seven different countries participated. Surprisingly, we found that the two sub-tracks of French text-to-speech synthesis had large differences in their predictability, and that singing voice-converted samples were not as difficult to predict as we had expected. Use of diverse datasets and listener information during training appeared to be successful approaches.