Perceive and predict: self-supervised speech representation based loss functions for speech enhancement (2301.04388v3)
Abstract: Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models rather than the earlier feature encodings, and the use of self-supervised representations in this way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance over an STFT-spectrogram-distance-based loss, as well as other loss functions common in the speech enhancement literature, is demonstrated using objective measures such as Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI).
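The core idea described above, taking the early feature-encoder outputs of a self-supervised speech model for a clean reference and an enhanced (or noisy) signal and using the distance between them as a training loss, can be sketched roughly as follows. This is a minimal illustration assuming PyTorch and torchaudio's pretrained wav2vec 2.0 bundle; the specific self-supervised model, encoder layer, and distance used in the paper may differ, and the waveforms below are random placeholders.

```python
import torch
import torchaudio

# Load a pretrained wav2vec 2.0 model. Only its convolutional feature
# encoder (the earliest representation) is used here, not the deeper
# transformer layers.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()
for p in ssl_model.parameters():
    # The self-supervised model acts as a frozen reference; gradients
    # still flow through its inputs to an enhancement model's output.
    p.requires_grad_(False)


def feature_encoder_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """L1 distance between feature-encoder outputs of enhanced and clean speech.

    Both inputs are (batch, samples) waveforms at 16 kHz.
    """
    feats_enh, _ = ssl_model.feature_extractor(enhanced, None)
    feats_cln, _ = ssl_model.feature_extractor(clean, None)
    return torch.nn.functional.l1_loss(feats_enh, feats_cln)


# Toy usage: the same quantity can be read as a distance measure between
# noisy and clean speech (placeholder signals, 1 second at 16 kHz).
clean = torch.randn(1, 16000)
noisy = clean + 0.1 * torch.randn(1, 16000)
print(feature_encoder_loss(noisy, clean).item())
```

In a training setup, `enhanced` would be the output of the speech enhancement network, so that minimising this distance pushes its feature encoding towards that of the clean target.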