Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition (2407.13782v1)
Abstract: Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems trained on standard acoustic features alone and those augmented with domain fine-tuned SSL features; and c) multi-pass decoding, in which TDNN/Conformer system outputs are rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models, producing statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks, respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
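As a concrete illustration, the sketch below shows one way approach a) could be realized: frame-by-frame concatenation of a standard log Mel filterbank frontend with the outputs of an SSL speech encoder. This is a minimal sketch under stated assumptions, not the authors' implementation: the HuBERT checkpoint name and the frame-rate alignment strategy are illustrative, and the paper fuses features from domain fine-tuned rather than off-the-shelf SSL models.

```python
# Minimal sketch (not the authors' code) of input feature fusion between
# a filterbank frontend and SSL speech representations. Checkpoint name
# and frame alignment are illustrative assumptions.
import torch
import torchaudio
from transformers import HubertModel

wav, sr = torchaudio.load("utt.wav")                    # (1, num_samples)
wav = torchaudio.functional.resample(wav, sr, 16000)

# 80-dim log Mel filterbank, 25 ms window / 10 ms shift -> (T, 80)
fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80)

# SSL encoder; HuBERT emits one 768-dim vector per 20 ms of audio
ssl = HubertModel.from_pretrained("facebook/hubert-base-ls960")
ssl.eval()
with torch.no_grad():
    hidden = ssl(wav).last_hidden_state.squeeze(0)      # (T', 768)

# Upsample the 20 ms SSL frames to the 10 ms filterbank rate, then
# truncate both streams to a common length before concatenation
hidden = hidden.repeat_interleave(2, dim=0)
n = min(fbank.size(0), hidden.size(0))
fused = torch.cat([fbank[:n], hidden[:n]], dim=-1)      # (n, 80 + 768)
```

In the paper's setup, such fused features would feed a hybrid TDNN or end-to-end Conformer acoustic model; approach b) then combines, at the frame level during decoding, the acoustic scores of a system trained on the standard features alone with those of a system trained on the fused features.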
Authors: Shujie Hu, Xurong Xie, Mengzhe Geng, Zengrui Jin, Jiajun Deng, Guinan Li, Yi Wang, Mingyu Cui, Tianzi Wang, Helen Meng, Xunying Liu