Multi-Task Pseudo-Label Learning for Non-Intrusive Speech Quality Assessment Model (2308.09262v3)
Abstract: This study proposes MTQ-Net, a non-intrusive speech quality assessment model based on multi-task pseudo-label learning (MPL). MPL consists of two stages: obtaining pseudo-label scores from a pretrained model and performing multi-task learning. The 3QUEST metrics, namely Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General-MOS (G-MOS), are the assessment targets. The pretrained MOSA-Net model is used to estimate three pseudo labels: perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI). MTQ-Net is then trained with multi-task learning by combining a supervised loss (derived from the difference between the estimated score and the ground-truth label) and a semi-supervised loss (derived from the difference between the estimated score and the pseudo label), with the Huber loss as the loss function. Experimental results first demonstrate the advantages of MPL over training a model from scratch and over a direct knowledge transfer mechanism. Second, they verify that the Huber loss improves the predictive ability of MTQ-Net. Finally, MTQ-Net with the MPL approach achieves higher overall predictive power than other self-supervised learning (SSL)-based speech assessment models.
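The combined objective described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the dictionary structure, the metric names used as keys, and the `weight` balancing the supervised and semi-supervised terms are all assumptions, since the abstract does not specify how the two losses are combined.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones,
    so it is less sensitive to outlier labels than squared error."""
    r = np.abs(pred - target)
    return np.where(r <= delta,
                    0.5 * r ** 2,
                    delta * (r - 0.5 * delta))

def mpl_loss(est, true_labels, pseudo_labels, weight=1.0):
    """Sketch of the MPL objective: a supervised Huber loss against the
    ground-truth 3QUEST labels plus a semi-supervised Huber loss against
    pseudo labels from the pretrained model. `weight` is a hypothetical
    balancing coefficient not given in the abstract."""
    supervised = sum(huber(est[k], v).mean() for k, v in true_labels.items())
    semi = sum(huber(est[k], v).mean() for k, v in pseudo_labels.items())
    return supervised + weight * semi

# Toy example: one utterance, with estimates for each target and pseudo metric.
est = {"S-MOS": np.array([3.2]), "N-MOS": np.array([2.9]), "G-MOS": np.array([3.0]),
       "PESQ": np.array([2.1]), "STOI": np.array([0.85]), "SDI": np.array([0.4])}
truth = {"S-MOS": np.array([3.5]), "N-MOS": np.array([3.0]), "G-MOS": np.array([3.1])}
pseudo = {"PESQ": np.array([2.0]), "STOI": np.array([0.9]), "SDI": np.array([0.5])}
loss = mpl_loss(est, truth, pseudo)
```

Because all toy residuals fall below `delta`, both terms reduce to scaled squared errors here; the linear branch only engages for residuals larger than `delta`, which is what gives the Huber loss its robustness to noisy pseudo labels.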
- W.-C. Huang, E. Cooper, Y. Tsao, H.-M. Wang, T. Toda, and J. Yamagishi, “The VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4536–4540.
- P. C. Loizou, Speech enhancement: theory and practice, CRC press, 2007.
- J. Barker et al., “The 1st Clarity Prediction Challenge: A machine learning challenge for hearing aid intelligibility prediction,” in Proc. Interspeech, 2022, pp. 3508–3512.
- “ConferencingSpeech 2022 Challenge: Non-intrusive objective speech quality assessment (NISQA) challenge for online conferencing applications,” in Proc. Interspeech, 2022, pp. 3308–3312.
- Y. Leng, X. Tan, S. Zhao, F. Soong, X.-Y. Li, and T. Qin, “MBNet: MOS prediction for synthesized speech with mean-bias network,” in Proc. ICASSP, 2021, pp. 391–395.
- E. Cooper, W.-C. Huang, T. Toda, and J. Yamagishi, “Generalization ability of MOS prediction networks,” in Proc. ICASSP, 2022.
- R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep learning-based non-intrusive multi-objective speech assessment model with cross-domain features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54–70, 2023.
- Y.-W. Chen and Y. Tsao, “InQSS: a speech intelligibility assessment model using a multi-task learning network,” in Proc. Interspeech, 2022, pp. 3088–3092.
- R. E. Zezario, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “MBI-Net: A non-intrusive multi-branched speech intelligibility prediction model for hearing aids,” in Proc. Interspeech, 2022, pp. 3944–3948.
- “Fusion of self-supervised learned models for MOS prediction,” in Proc. Interspeech, 2022, pp. 5443–5447.
- T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525.
- W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
- P. J. Huber, “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.
- J. Chen, J. Benesty, Y. Huang, and S. Doclo, “New insights into the noise reduction Wiener filter,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1218–1234, 2006.
- HEAD acoustics Application Note, “3QUEST: 3-fold quality evaluation of speech in telecommunications systems,” 2008.
- M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform with SincNet,” in Proc. SLT, 2018.
- S. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- A. Rezayee and S. Gazor, “An adaptive KLT approach for speech enhancement,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 2, pp. 87–95, 2001.
- Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
- S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, “Raw waveform-based speech enhancement by fully convolutional networks,” in Proc. APSIPA ASC, 2017.
- X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Proc. Interspeech, 2013, pp. 436–440.
- J. Kim, M. El-Khamy, and J. Lee, “T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement,” in Proc. ICASSP, 2020, pp. 6649–6653.
- C. Spearman, “The proof and measurement of association between two things,” The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.