Speaker-Independent Dysarthria Severity Classification using Self-Supervised Transformers and Multi-Task Learning (2403.00854v1)
Abstract: Dysarthria, a condition resulting from impaired control of the speech muscles due to neurological disorders, significantly impacts patients' communication and quality of life. The condition's complexity, reliance on human scoring, and varied presentations make its assessment and management challenging. This study presents a transformer-based framework for automatically assessing dysarthria severity from raw speech data. It offers an objective, repeatable, accessible, standardised and cost-effective alternative to traditional methods that require human expert assessors. We develop a transformer framework, called Speaker-Agnostic Latent Regularisation (SALR), incorporating a multi-task learning objective and contrastive learning for speaker-independent multi-class dysarthria severity classification. The multi-task framework is designed to reduce reliance on speaker-specific characteristics and to address the intrinsic intra-class variability of dysarthric speech. Evaluated on the Universal Access Speech dataset using leave-one-speaker-out cross-validation, our model demonstrated superior performance over traditional machine learning approaches, with an accuracy of $70.48\%$ and an F1 score of $59.23\%$. Our SALR model also exceeded the previous benchmark for AI-based classification, which used support vector machines, by $16.58\%$. We open the black box of our model by visualising the latent space, where we observe that the model substantially reduces speaker-specific cues and amplifies task-specific ones, demonstrating its robustness. In conclusion, SALR establishes a new benchmark in speaker-independent multi-class dysarthria severity classification using generative AI. Our findings have potential implications for broader clinical applications in automated dysarthria severity assessment.
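The abstract describes combining a severity-classification objective with a contrastive term that pulls together same-severity utterances from different speakers and pushes apart different-severity ones. The following is a minimal NumPy sketch of such a combined objective, not the authors' implementation: the function names and the weighting parameter `lam` are hypothetical, and a triplet-style hinge loss is used as one common choice of contrastive term.

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for a single sample's severity prediction.
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Anchor and positive: embeddings of same-severity utterances from
    # DIFFERENT speakers; negative: an utterance of a different severity.
    # Minimising this pulls severity classes together across speakers.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(logits, label, anchor, positive, negative, lam=0.5):
    # Multi-task objective: classification loss plus a weighted
    # contrastive regulariser on the latent embeddings.
    return cross_entropy(logits, label) + lam * triplet_loss(anchor, positive, negative)
```

In a setup like this, sampling positives from a different speaker than the anchor is what discourages the encoder from keeping speaker-specific cues in the latent space, which is consistent with the speaker-invariant latent structure the paper reports visualising.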
Authors:
- Lauren Stumpf
- Balasundaram Kadirvelu
- Sigourney Waibel
- A. Aldo Faisal