
Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech (2402.16321v1)

Published 26 Feb 2024 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which requires time-consuming and expensive label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The VQ-VAE is trained only on clean speech; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training, with no label requirements. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.
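
The idea of scoring speech by its quantization error can be illustrated with a small sketch. This is not the authors' implementation: the names `quantize` and `quantization_score`, the random codebook standing in for one learned from clean speech, the synthetic "frames", and the use of negative mean squared error as the score are all assumptions made for illustration. The abstract does not specify the exact error measure or the encoder.

```python
import numpy as np

def quantize(z, codebook):
    """Map each frame embedding in z (T, D) to its nearest code in codebook (K, D)."""
    # Squared Euclidean distance from every frame to every code: shape (T, K).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def quantization_score(z, codebook):
    """Hypothetical quality proxy: negative mean squared quantization error.

    Higher (closer to zero) means the embeddings sit nearer to the codebook.
    """
    z_q, _ = quantize(z, codebook)
    return -((z - z_q) ** 2).sum(axis=-1).mean()

# Toy data. The codebook is random here; in a VQ-VAE it would be learned
# from clean speech only (assumption for illustration).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))

# "Clean" frames lie close to codebook entries; "distorted" frames are the
# same frames with strong additive perturbation in the embedding space.
clean = codebook[rng.integers(0, 16, size=50)] + 0.01 * rng.normal(size=(50, 4))
distorted = clean + 1.0 * rng.normal(size=(50, 4))

clean_score = quantization_score(clean, codebook)
distorted_score = quantization_score(distorted, codebook)
print(f"clean: {clean_score:.4f}  distorted: {distorted_score:.4f}")
```

In this toy setting the distorted frames receive a lower score than the clean ones, mirroring the abstract's expectation that distortion increases quantization error. No labels are used at any point.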
