NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment (2309.16284v2)
Abstract: This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed as the Euclidean distance between their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio tasks, both for audio analysis (e.g., quality assessment) and for generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics the human ability to assess audio quality with non-matching references, learning perceptual embeddings without the need for human-generated labels.
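The two mechanisms named in the abstract, a triplet loss over embeddings and a Euclidean distance at inference, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the standard triplet margin loss, and it assumes that a score against several non-matching references is obtained by averaging the per-reference distances. The function names, the margin value, and the toy 2-D embeddings are hypothetical; in practice the embeddings would come from a trained network.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`. Which sample counts as positive would be
    decided by a perceptual measure such as NSIM; that selection step is
    outside this sketch.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def nonmatching_distance(test_embedding, reference_embeddings):
    """Mean distance from one test embedding to a set of reference embeddings.

    Averaging over references is an assumption made for illustration.
    """
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    total = sum(euclidean(test_embedding, ref) for ref in reference_embeddings)
    return total / len(reference_embeddings)


if __name__ == "__main__":
    # Toy embeddings standing in for network outputs.
    anchor, near, far = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]
    print(triplet_loss(anchor, near, far))   # correctly ordered triplet
    print(triplet_loss(anchor, far, near))   # violated ordering is penalized
    print(nonmatching_distance(near, [anchor, far]))
```

Because the distance is built only from differentiable operations when the same formula is written in an autodiff framework, the same quantity could in principle be used as a training loss, which is the role the abstract describes for speech enhancement.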