Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction (2312.15616v1)

Published 25 Dec 2023 in cs.SD, eess.AS, and stat.ML

Abstract: Predicting audio quality in voice synthesis and conversion systems is a critical yet challenging task, especially when traditional methods like Mean Opinion Scores (MOS) are cumbersome to collect at scale. This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings where extensive MOS data from large-scale listening tests may be unavailable. We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning (SSL) models, such as wav2vec, correlate with MOS scores. These findings are based on data from the 2022 and 2023 VoiceMOS challenges. We explore the extent of this correlation across different models and language contexts, revealing insights into how inherent uncertainties in SSL models can serve as effective proxies for audio quality assessment. In particular, we show that the contrastive wav2vec models are the most performant in all settings.
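
The core idea lends itself to a short illustration. The sketch below, in Python with torchaudio (a plausible toolkit given the models involved), scores each utterance by the mean frame-level entropy of an off-the-shelf wav2vec 2.0 model's output distribution and then computes a Spearman correlation against MOS labels. The specific checkpoint (WAV2VEC2_ASR_BASE_960H), the entropy-based uncertainty proxy, and the wav_paths/mos_labels placeholders are illustrative assumptions, not the paper's exact setup, which may instead derive uncertainty directly from the contrastive wav2vec models it highlights.

# Minimal sketch (see assumptions above): score utterances with the mean
# per-frame entropy of a pretrained wav2vec 2.0 model, then check how that
# uncertainty proxy correlates with MOS labels.
import torch
import torchaudio
from scipy.stats import spearmanr

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H  # off-the-shelf pretrained model
model = bundle.get_model().eval()

@torch.inference_mode()
def utterance_uncertainty(wav_path: str) -> float:
    """Mean frame-level entropy (in nats) of the model's output distribution."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
    emissions, _ = model(waveform)  # (batch, frames, vocab) logits
    entropy = torch.distributions.Categorical(logits=emissions).entropy()
    return entropy.mean().item()

# Hypothetical placeholders for a VoiceMOS-style evaluation set:
# one audio file and one listener MOS per utterance.
wav_paths = ["sys01_utt001.wav", "sys01_utt002.wav"]
mos_labels = [3.8, 2.1]

scores = [utterance_uncertainty(p) for p in wav_paths]
rho, _ = spearmanr(scores, mos_labels)
print(f"Spearman correlation between uncertainty and MOS: {rho:.3f}")

In this framing, a higher mean entropy signals that the model is less certain about the audio, which the paper's findings suggest should track lower perceived quality; the sign and strength of the correlation are what the VoiceMOS data are used to probe.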
