Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation (2401.00662v1)
Abstract: Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities make large-scale data collection for ASR system development difficult. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of data augmentation approaches for improving the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation, of normal control speech based on its time alignment against parallel dysarthric speech; and c) a novel spectral-basis GAN-based adversarial data augmentation approach operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest that GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using either no data augmentation or speed perturbation across different data expansion operating points, with statistically significant word error rate (WER) reductions of up to 2.01% and 0.96% absolute (9.03% and 4.63% relative), respectively, on the UASpeech test set of 16 dysarthric speakers. After cross-system output rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
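As a rough illustration of the speaker-dependent speed perturbation described in (b), the sketch below estimates a per-speaker tempo factor from the durations of time-aligned parallel control/dysarthric utterances and uses it to slow control speech toward the target dysarthric speaker's tempo. The function names, example values, and the use of librosa's phase-vocoder time stretch (in place of the WSOLA-style modification referenced by the paper) are illustrative assumptions, not the authors' implementation.

```python
import librosa

def estimate_tempo_factor(control_durations, dysarthric_durations):
    # Per-speaker factor from the duration ratio of time-aligned parallel
    # utterances; a factor < 1 slows control speech toward the (typically
    # slower) tempo of the target dysarthric speaker.
    return sum(control_durations) / sum(dysarthric_durations)

def perturb_control_speech(wav_path, factor, sr=16000):
    # Load a control utterance and time-stretch it by the speaker-dependent
    # factor (rate < 1 lengthens the audio). librosa's phase-vocoder stretch
    # stands in here for WSOLA-based time-scale modification.
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.effects.time_stretch(y, rate=factor)

# Hypothetical usage: durations (in seconds) of the same word set spoken by
# one control speaker and one target dysarthric speaker.
factor = estimate_tempo_factor([0.52, 0.61, 0.48], [0.81, 0.95, 0.70])
augmented = perturb_control_speech("control_utt.wav", factor)
```

A single global factor per speaker pair is the simplest choice; finer-grained variants could estimate factors per phoneme or per word from the same alignments before applying the stretch.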
Authors: Huimeng Wang, Zengrui Jin, Mengzhe Geng, Shujie Hu, Guinan Li, Tianzi Wang, Haoning Xu, Xunying Liu