Speech foundation models on intelligibility prediction for hearing-impaired listeners (2401.14289v1)

Published 24 Jan 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Speech foundation models (SFMs) have been benchmarked on many speech processing tasks, often achieving state-of-the-art performance with minimal adaptation. However, the SFM paradigm has been significantly less explored for applications of interest to the speech perception community. In this paper we present a systematic evaluation of 10 SFMs on one such application: speech intelligibility prediction. We focus on the non-intrusive setup of the Clarity Prediction Challenge 2 (CPC2), where the task is to predict the percentage of words correctly perceived by hearing-impaired listeners from speech-in-noise recordings. We propose a simple method that learns a lightweight specialized prediction head on top of frozen SFMs to approach the problem. Our results reveal statistically significant differences in performance across SFMs. Our method resulted in the winning submission to the CPC2, demonstrating its promise for speech perception applications.
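The core idea of the method, per the abstract, is to keep the speech foundation model frozen and train only a small prediction head that maps its features to an intelligibility score in [0, 100]. The following is a minimal sketch of that setup, not the authors' code: the frozen SFM embeddings are simulated with random utterance-level feature vectors, and the head is a single linear layer with a sigmoid output scaled to the 0–100% word-correct range, trained by plain gradient descent on MSE. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for FROZEN SFM features: in the paper's setup these would come
# from a pretrained backbone (e.g. a Whisper/WavLM-style model) that is not
# updated; here they are simulated utterance-level embeddings.
n_utts, feat_dim = 256, 64
feats = rng.normal(size=(n_utts, feat_dim))

# Synthetic targets: percentage of words correctly perceived, in [0, 100].
true_w = rng.normal(size=feat_dim)
targets = 100.0 / (1.0 + np.exp(-feats @ true_w))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Lightweight head: one linear layer + sigmoid, scaled to the 0-100 range.
w, b, lr = np.zeros(feat_dim), 0.0, 1e-3
for _ in range(2000):
    pred = 100.0 * sigmoid(feats @ w + b)
    err = (pred - targets) / n_utts          # gradient of 0.5*MSE w.r.t. pred
    dz = err * pred * (1.0 - pred / 100.0)   # chain rule through scaled sigmoid
    w -= lr * (feats.T @ dz)                 # only the head's weights move;
    b -= lr * dz.sum()                       # the "SFM" features stay frozen

pred = 100.0 * sigmoid(feats @ w + b)
r = np.corrcoef(pred, targets)[0, 1]         # trained head should track targets
```

The scaled sigmoid guarantees predictions stay inside the valid 0–100% range by construction, which is one common reason to prefer it over an unconstrained linear output for bounded intelligibility scores.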
