Exploring Pathological Speech Quality Assessment with ASR-Powered Wav2Vec2 in Data-Scarce Context (2403.20184v1)

Published 29 Mar 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Automatic speech quality assessment has attracted increasing attention as an alternative or support to traditional perceptual clinical evaluation. However, most research so far has only achieved good results on simple tasks such as binary classification, largely due to data scarcity. To deal with this challenge, current works tend to segment patients' audio files into many samples to augment the datasets. Nevertheless, this approach has limitations, as it indirectly relates overall audio scores to individual segments. This paper introduces a novel approach in which the system learns at the audio level rather than the segment level despite data scarcity. It proposes using the pre-trained Wav2Vec2 architecture, in both its SSL and ASR variants, as a feature extractor for speech assessment. Evaluated on the HNC dataset, our ASR-driven approach establishes a new baseline, obtaining average $MSE=0.73$ and $MSE=1.15$ for the prediction of intelligibility and severity scores respectively, using only 95 training samples. The ASR-based Wav2Vec2 model yields the best results, suggesting a strong correlation between ASR and speech quality assessment. We also measure its behaviour across variable segment durations and speech content, exploring the factors that influence its decisions.
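As a concrete illustration of the audio-level setup described in the abstract, the sketch below uses a pre-trained Wav2Vec2 encoder as a frozen feature extractor, pools its frame-level representations over the whole recording, and regresses a single perceptual score with an MSE objective. This is a minimal sketch assuming the HuggingFace `transformers` Wav2Vec2 checkpoints; the checkpoint name, mean pooling, and two-layer head are illustrative stand-ins, not the authors' exact architecture.

```python
# Minimal sketch: frozen Wav2Vec2 encoder + small regression head that predicts
# one audio-level quality score (e.g. intelligibility) per recording.
# The checkpoint name and head design are assumptions for illustration only.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AudioLevelScorer(nn.Module):
    def __init__(self, checkpoint="facebook/wav2vec2-base-960h"):
        super().__init__()
        # ASR-fine-tuned checkpoint; an SSL-only checkpoint such as
        # "facebook/wav2vec2-base" could be swapped in for comparison.
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        for p in self.encoder.parameters():    # keep the pre-trained encoder frozen
            p.requires_grad = False
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, waveform):               # waveform: (batch, samples) at 16 kHz
        frames = self.encoder(waveform).last_hidden_state   # (batch, frames, hidden)
        pooled = frames.mean(dim=1)             # pool over the whole recording
        return self.head(pooled).squeeze(-1)    # one score per audio file

# One training step against clinician-rated scores with an MSE objective.
model = AudioLevelScorer()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)

audio = torch.randn(2, 16000 * 5)               # two 5-second dummy recordings
targets = torch.tensor([7.2, 4.5])              # illustrative perceptual scores
optimizer.zero_grad()
loss = criterion(model(audio), targets)
loss.backward()
optimizer.step()
```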
