Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO (2411.00980v2)

Published 1 Nov 2024 in cs.CL, cs.HC, cs.SD, and eess.AS

Abstract: Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and atypical speech patterns. In healthcare settings, the resulting communication breakdowns reduce the quality of care. While building an augmentative and alternative communication (AAC) tool to enable fluid communication, we found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology such as Whisper and Wav2vec2.0 marginalizes atypical speakers, largely due to a lack of training data. Our work leverages SOTA ASR followed by domain-specific error correction. English dysarthric ASR performance is often evaluated on the TORGO dataset, which has a well-known prompt-overlap issue: phrases are shared between training and test speakers. We propose an algorithm to break this prompt overlap. Once the overlap is reduced, SOTA ASR models produce extremely high word error rates (WERs) for speakers with both mild and severe dysarthria. To improve ASR further, we study the impact of n-gram language models and LLM-based multi-modal generative error-correction algorithms such as Whispering-LLaMA in a second ASR pass. Our work highlights how much more needs to be done to improve ASR for atypical speakers and to enable equitable healthcare access both in person and in e-health settings.
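The prompt-overlap issue can be made concrete with a small sketch. The function below shows one naive way to enforce a split with no shared prompt text between training and test speakers; it is an illustration only, not the paper's proposed algorithm, and the tuple layout of the utterance records is a hypothetical choice.

```python
# Illustrative only: the paper's actual overlap-breaking algorithm is not
# reproduced here. Utterance records are hypothetical (speaker, prompt, path).

def split_without_prompt_overlap(utterances, test_speakers):
    """Build train/test lists so no prompt text is shared across the split.

    utterances: iterable of (speaker_id, prompt_text, audio_path) tuples.
    test_speakers: set of speaker IDs reserved for testing.
    """
    utterances = list(utterances)
    test_prompts = {p for spk, p, _ in utterances if spk in test_speakers}
    train, test = [], []
    for spk, prompt, audio in utterances:
        if spk in test_speakers:
            test.append((spk, prompt, audio))
        elif prompt not in test_prompts:
            # Drop any training utterance whose prompt also occurs in the
            # test set, so test phrases are never seen during training.
            train.append((spk, prompt, audio))
    return train, test
```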
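For the first-pass evaluation, a minimal word-error-rate harness might look like the following. It assumes the openai-whisper and jiwer packages; the model size is an arbitrary choice, not the configuration used in the paper.

```python
import jiwer    # edit-distance based WER scoring
import whisper  # openai-whisper

model = whisper.load_model("small")  # illustrative; not the paper's choice

def corpus_wer(test_set):
    """test_set: list of (audio_path, reference_transcript) pairs."""
    refs, hyps = [], []
    for audio_path, reference in test_set:
        hyps.append(model.transcribe(audio_path)["text"])
        refs.append(reference)
    # In practice both sides should be text-normalized (case, punctuation)
    # before scoring; jiwer aggregates errors over the whole corpus.
    return jiwer.wer(refs, hyps)
```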
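The n-gram second pass can be approximated by rescoring an n-best list from the first-pass decoder with an in-domain language model. The sketch below uses the kenlm Python bindings; the model path "healthcare.arpa" and the interpolation weight are hypothetical, and this generic recipe stands in for, rather than reproduces, the paper's setup (Whispering-LLaMA instead conditions an LLM on audio features for the second pass).

```python
import kenlm  # Python bindings for the KenLM n-gram toolkit

lm = kenlm.Model("healthcare.arpa")  # hypothetical in-domain n-gram LM

def rescore_nbest(nbest, lm_weight=0.5):
    """Pick the best hypothesis from a first-pass n-best list.

    nbest: list of (hypothesis_text, acoustic_log_score) pairs.
    lm.score returns a log10 probability; lm_weight balances the
    acoustic and language-model evidence (value here is illustrative).
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * lm.score(h[0]))[0]
```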

References (27)
  1. C. Handberg and A. K. Voss, “Implementing augmentative and alternative communication in critical care settings: Perspectives of healthcare professionals,” Journal of Clinical Nursing, vol. 27, no. 1-2, pp. 102–114, 2018.
  2. A. Mohan, M. Chakraborti, K. Eng, N. Kushaeva, M. Prpa, J. Lewis, T. Zhang, V. Geisler, and C. Geisler, “A powerful and modern AAC composition tool for impaired speakers,” Interspeech 2024: Show and Tell Demo, 2024.
  3. D. Mulfari, L. Carnevale, A. Galletta, and M. Villari, “Edge computing solutions supporting voice recognition services for speakers with dysarthria,” in 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW). IEEE, 2023, pp. 231–236.
  4. R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” arXiv preprint arXiv:1912.06670, 2019.
  5. G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021.
  6. Y. Li, P. Chen, P. Bell, and C. Lai, “Crossmodal ASR error correction with discrete speech units,” arXiv preprint arXiv:2405.16677, 2024.
  7. F. Rudzicz, A. K. Namasivayam, and T. Wolff, “The TORGO database of acoustic and articulatory speech from speakers with dysarthria,” Language Resources and Evaluation, vol. 46, pp. 523–541, 2012.
  8. X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, “The Nemours database of dysarthric speech,” in Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP ’96), vol. 3. IEEE, 1996, pp. 1962–1965.
  9. M. Hasegawa-Johnson, “Universal access automatic speech recognition project,” 2006.
  10. M. Nicolao, H. Christensen, S. Cunningham, P. Green, and T. Hain, “A framework for collecting realistic recordings of dysarthric speech: the homeService corpus,” in Proceedings of LREC 2016. European Language Resources Association, 2016.
  11. E. Hermann and M. M. Doss, “Dysarthric speech recognition with lattice-free MMI,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6109–6113.
  12. Z. Yue, F. Xiong, H. Christensen, and J. Barker, “Exploring appropriate acoustic and language modelling choices for continuous dysarthric speech recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6094–6098.
  13. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
  14. S. Radhakrishnan, C.-H. Yang, S. Khan, R. Kumar, N. Kiani, D. Gomez-Cabrero, and J. Tegnér, “Whispering LLaMA: A cross-modal generative error correction framework for speech recognition,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 10007–10016. [Online]. Available: https://aclanthology.org/2023.emnlp-main.618
  15. C. Espana-Bonet and J. A. Fonollosa, “Automatic speech recognition with deep neural networks for impaired speech,” in Advances in Speech and Language Technologies for Iberian Languages: Third International Conference, IberSPEECH 2016, Lisbon, Portugal, November 23–25, 2016, Proceedings 3. Springer, 2016, pp. 97–107.
  16. N. M. Joy and S. Umesh, “Improving acoustic models in TORGO dysarthric speech database,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 26, no. 3, pp. 637–645, 2018.
  17. S. Dutta, S. Jain, A. Maheshwari, S. Pal, G. Ramakrishnan, and P. Jyothi, “Error correction in ASR using sequence-to-sequence models,” arXiv preprint arXiv:2202.01157, 2022.
  18. Y. Leng, X. Tan, W. Liu, K. Song, R. Wang, X.-Y. Li, T. Qin, E. Lin, and T.-Y. Liu, “SoftCorrect: Error correction with soft detection for automatic speech recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13034–13042.
  19. C. Park, Y. Jang, S. Lee, J. Seo, K. Yang, and H.-S. Lim, “PicTalky: Augmentative and alternative communication for language developmental disabilities,” in Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations, 2022, pp. 17–27.
  20. L. Perron and V. Furnon, “OR-Tools,” Google, 2023. [Online]. Available: https://developers.google.com/optimization/
  21. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
  22. A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” arXiv preprint arXiv:2006.13979, 2020.
  23. A. Hernandez, P. A. Pérez-Toro, E. Nöth, J. R. Orozco-Arroyave, A. Maier, and S. H. Yang, “Cross-lingual self-supervised speech representations for improved dysarthric speech recognition,” arXiv preprint arXiv:2204.01670, 2022.
  24. A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
  25. K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011, pp. 187–197.
  26. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
  27. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
