
Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition (2407.13782v1)

Published 3 Jul 2024 in eess.AS, cs.AI, and cs.SD

Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
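
As a rough illustration of the three integration approaches listed in the abstract, the minimal Python/NumPy sketch below shows (a) frame-level fusion of standard fbank features with domain fine-tuned SSL representations, (b) frame-level joint decoding as a log-linear interpolation of two systems' frame posteriors, and (c) N-best rescoring with a fine-tuned pre-trained ASR model. All array shapes, frame rates, and the 0.5 interpolation weights are illustrative assumptions, not values taken from the paper.

    import numpy as np

    def fuse_features(fbank, ssl_feats):
        # (a) Input feature fusion: repeat the SSL features (assumed 20 ms
        # stride) up to the assumed 10 ms fbank frame rate, then
        # concatenate the two representations frame by frame.
        ssl_up = np.repeat(ssl_feats, 2, axis=0)
        T = min(len(fbank), len(ssl_up))
        return np.concatenate([fbank[:T], ssl_up[:T]], axis=-1)  # (T, 80 + D)

    def joint_decode(logp_a, logp_b, weight=0.5):
        # (b) Frame-level joint decoding: log-linear interpolation of the
        # frame posteriors of two separately trained TDNN systems.
        return weight * logp_a + (1.0 - weight) * logp_b

    def rescore_nbest(hyps, first_pass_scores, ssl_model_scores, lam=0.5):
        # (c) Multi-pass decoding: combine first-pass TDNN/Conformer scores
        # with scores from a domain fine-tuned pre-trained ASR model and
        # return the best-scoring hypothesis.
        combined = [(1.0 - lam) * a + lam * s
                    for a, s in zip(first_pass_scores, ssl_model_scores)]
        return hyps[int(np.argmax(combined))]

In the actual systems these combinations operate on HMM state posteriors and recognition lattices inside the TDNN/Conformer pipelines; the sketch only conveys the combination arithmetic.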

Authors (11)
  1. Shujie Hu
  2. Xurong Xie
  3. Mengzhe Geng
  4. Zengrui Jin
  5. Jiajun Deng
  6. Guinan Li
  7. Yi Wang
  8. Mingyu Cui
  9. Tianzi Wang
  10. Helen Meng
  11. Xunying Liu