Hallucination in Perceptual Metric-Driven Speech Enhancement Networks (2403.11732v2)
Abstract: In speech enhancement, there is ongoing interest in neural systems that explicitly aim to improve the perceptual quality of the processed audio. A related topic is non-intrusive (i.e. without a clean reference) speech quality prediction, in which neural networks are trained to predict human-assigned quality labels directly from distorted audio. Combined, these areas enable powerful new speech enhancement systems that can leverage large real-world datasets of distorted audio by using the output of a pre-trained speech quality predictor as the sole loss function of the enhancement system. This paper aims to identify a potential pitfall of this approach, namely hallucinations introduced when the enhancement system 'tricks' the speech quality predictor.
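The training setup described in the abstract, where a frozen quality predictor is the only loss, can be illustrated with a toy sketch. Everything below is a hypothetical stand-in chosen for illustration and is not the paper's method: `quality_predictor` is a hand-written smoothness score rather than a trained network, `enhance` is a one-parameter low-pass filter rather than a neural enhancement system, and the gradient is estimated by finite differences instead of backpropagation.

```python
import math
import random

random.seed(0)

# Toy "distorted recording": a slow sine (stand-in for speech) plus white noise.
# No clean reference is used anywhere in the optimisation below.
N = 400
noisy = [math.sin(2 * math.pi * n / 100) + random.gauss(0.0, 0.5) for n in range(N)]


def quality_predictor(signal):
    """Hypothetical frozen predictor (NOT a real pre-trained model).

    Returns a score in (0, 1]; smoother signals score higher.
    """
    diffs = [(b - a) ** 2 for a, b in zip(signal, signal[1:])]
    return 1.0 / (1.0 + sum(diffs) / len(diffs))


def enhance(signal, alpha):
    """Toy enhancement system: one-pole low-pass filter with one parameter."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * prev + (1.0 - alpha) * x
        out.append(prev)
    return out


def loss(alpha):
    """Sole training loss: the negated predictor score of the enhanced audio."""
    return -quality_predictor(enhance(noisy, alpha))


def train(alpha=0.1, lr=0.5, steps=50, eps=1e-4):
    """Gradient descent on alpha using a central finite-difference gradient."""
    for _ in range(steps):
        grad = (loss(alpha + eps) - loss(alpha - eps)) / (2.0 * eps)
        alpha = min(max(alpha - lr * grad, 0.0), 0.99)  # keep the filter stable
    return alpha


alpha_start = 0.1
alpha_final = train(alpha_start)
print(f"alpha: {alpha_start:.2f} -> {alpha_final:.2f}")
print(f"predicted quality: {-loss(alpha_start):.3f} -> {-loss(alpha_final):.3f}")
```

In this toy setting the predicted score should rise as `alpha` is pushed toward its upper bound, where the filter also suppresses much of the sine itself. Because no reference constrains the output, the optimisation has no way to notice. This is only a loose analogy for the pitfall the abstract names, not a reproduction of the paper's hallucination findings.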
- George Close
- Thomas Hain
- Stefan Goetze