
Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children (2403.08187v1)

Published 13 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation issues in children with speech sound disorders (SSDs), replacing manual transcription in clinical procedures. Because general-purpose ASR models are trained to map input speech to real words, employing a well-known high-performance ASR model to evaluate pronunciation in children with SSDs is impractical. We fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as it is pronounced rather than as existing words. The model was fine-tuned on a speech dataset of 137 children with inadequate speech production pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the words' pronunciations matched human annotations with about 90% accuracy. While the model still needs improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation-error diagnosis in clinical settings.
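The paper's approach relies on a CTC-trained output head (wav2vec 2.0 XLS-R) whose vocabulary is pronunciation units rather than words, so the decoded sequence reflects what the child actually said. The central decoding step is standard CTC greedy decoding: collapse repeated frame-level predictions and drop blanks. A minimal sketch of that step (all names here are illustrative, not taken from the paper; the actual vocabulary would be Korean jamo-level units):

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse a per-frame CTC prediction into a token sequence.

    frame_ids: per-frame argmax token ids from a CTC head
    (e.g. a wav2vec 2.0 model fine-tuned with a jamo-level vocabulary).
    Repeated ids are merged, then blank ids are removed, per the
    standard CTC decoding rule.
    """
    decoded = []
    prev = None
    for t in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the CTC blank symbol.
        if t != prev and t != blank_id:
            decoded.append(t)
        prev = t
    return decoded


# Frames: blank, blank, 5, 5, blank, 3, 3, 3, blank, 5
# decodes to the token sequence [5, 3, 5]
print(ctc_greedy_decode([0, 0, 5, 5, 0, 3, 3, 3, 0, 5]))
```

Mapping the decoded ids back through a jamo vocabulary would yield the pronounced-form transcription that is compared against human annotations.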

