
Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation (2401.00662v1)

Published 1 Jan 2024 in cs.SD and eess.AS

Abstract: Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities make large-scale data collection for ASR system development difficult. Adapting SSL pre-trained ASR models to limited dysarthric speech via data-intensive parameter fine-tuning leads to poor generalization. To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. These include: a) conventional speaker-independent perturbation of impaired speech; b) speaker-dependent speed perturbation, or GAN-based adversarial perturbation of normal, control speech based on their time alignment against parallel dysarthric speech; c) novel spectral-basis GAN-based adversarial data augmentation operating on non-parallel data. Experiments conducted on the UASpeech corpus suggest GAN-based data augmentation consistently outperforms fine-tuned Wav2vec2.0 and HuBERT models using either no data augmentation or speed perturbation across different data expansion operating points, with statistically significant word error rate (WER) reductions of up to 2.01% and 0.96% absolute (9.03% and 4.63% relative), respectively, on the UASpeech test set of 16 dysarthric speakers. After cross-system output rescoring, the best system produced the lowest published WER of 16.53% (46.47% on very low intelligibility) on UASpeech.
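To illustrate the speaker-dependent speed perturbation described in b), the sketch below (a simplified assumption, not the paper's implementation) resamples control speech by a per-speaker tempo factor derived from the duration ratio between control and dysarthric utterances of the same content; the function names and the toy durations are hypothetical.

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resampling-based speed perturbation (in the style of Ko et al., 2015).
    factor > 1 shortens (speeds up) the signal; factor < 1 lengthens (slows) it."""
    n_out = int(round(len(waveform) / factor))
    # Source positions sampled for each output point; linear interpolation.
    src = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(src, np.arange(len(waveform)), waveform)

def speaker_tempo_factor(ctl_dur, dys_dur):
    """Hypothetical speaker-dependent factor: stretch control speech toward
    the (typically slower) dysarthric speaker's duration for matched content."""
    return ctl_dur / dys_dur  # < 1 when the dysarthric speaker talks slower

# Toy 1-second control utterance at 16 kHz.
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
f = speaker_tempo_factor(ctl_dur=1.0, dys_dur=1.6)  # assumed durations
y = speed_perturb(x, f)
print(len(y))  # 25600 samples: slowed to mimic the dysarthric tempo
```

Resampling changes tempo and pitch jointly; WSOLA-style time-scale modification (reference 7) would instead preserve pitch, and the GAN-based variants in the paper learn the perturbation rather than fixing it analytically.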

Authors (8)
  1. Huimeng Wang
  2. Zengrui Jin
  3. Mengzhe Geng
  4. Shujie Hu
  5. Guinan Li
  6. Tianzi Wang
  7. Haoning Xu
  8. Xunying Liu
