
Towards Automatic Data Augmentation for Disordered Speech Recognition (2312.08641v1)

Published 14 Dec 2023 in eess.AS and cs.SD

Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and end-to-end Conformer ASR systems on such data. The handcrafted temporal and spectral mask operations of the standard SpecAugment method, which are task and system dependent, together with additionally introduced minimum and maximum cut-offs of these time-frequency masks, are now automatically learned using an RNN-based policy controller that is tightly integrated with ASR system training. Experiments on the UASpeech corpus suggest the proposed RL-based data augmentation approach consistently produced performance superior to, or comparable with, that obtained using expert or handcrafted SpecAugment policies. Our RL auto-augmented PyChain TDNN system produced an overall WER of 28.79% on the UASpeech test set of 16 dysarthric speakers.
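The mechanism the abstract describes can be made concrete with a small sketch. The snippet below is an illustrative PyTorch reconstruction under stated assumptions, not the authors' implementation: an LSTM-cell policy controller samples discretised SpecAugment-style parameters (including minimum and maximum time/frequency mask cut-offs), a simple masking function applies them on the fly, and the controller is updated with the REINFORCE policy gradient (reference 37) against an ASR-derived reward. The names PolicyController and apply_specaugment, the parameter discretisation, and the random reward proxy are all hypothetical; in the paper the reward would come from the PyChain TDNN or Conformer system being trained.

# Minimal sketch (assumptions noted above); not the paper's released code.
import torch
import torch.nn as nn


class PolicyController(nn.Module):
    """RNN controller that emits one categorical choice per augmentation parameter."""

    def __init__(self, num_params: int, num_bins: int, hidden: int = 64):
        super().__init__()
        self.num_params = num_params
        self.num_bins = num_bins
        self.rnn = nn.LSTMCell(num_bins, hidden)
        self.head = nn.Linear(hidden, num_bins)

    def sample(self):
        """Autoregressively sample one discretised value per parameter."""
        h = torch.zeros(1, self.rnn.hidden_size)
        c = torch.zeros(1, self.rnn.hidden_size)
        x = torch.zeros(1, self.num_bins)  # start token
        log_probs, choices = [], []
        for _ in range(self.num_params):
            h, c = self.rnn(x, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            a = dist.sample()
            log_probs.append(dist.log_prob(a))
            choices.append(a.item())
            x = torch.nn.functional.one_hot(a, self.num_bins).float()
        return choices, torch.stack(log_probs).sum()


def apply_specaugment(feats, t_mask, f_mask, t_min=0, f_min=0):
    """Zero out up to t_mask frames and f_mask frequency bins, with lower cut-offs."""
    T, F = feats.shape
    t_len = torch.randint(t_min, max(t_min + 1, t_mask + 1), (1,)).item()
    f_len = torch.randint(f_min, max(f_min + 1, f_mask + 1), (1,)).item()
    t0 = torch.randint(0, max(1, T - t_len), (1,)).item()
    f0 = torch.randint(0, max(1, F - f_len), (1,)).item()
    out = feats.clone()
    out[t0:t0 + t_len, :] = 0.0
    out[:, f0:f0 + f_len] = 0.0
    return out


# Toy REINFORCE loop around the controller (reward is a random stand-in here;
# in the paper it would be derived from the ASR system trained with the masks).
controller = PolicyController(num_params=4, num_bins=10)  # e.g. time/freq mask cut-offs
opt = torch.optim.Adam(controller.parameters(), lr=1e-3)
baseline = 0.0

for _ in range(5):
    choices, log_prob = controller.sample()
    t_min, t_max, f_min, f_max = choices
    feats = torch.randn(200, 80)  # stand-in for filterbank features
    aug = apply_specaugment(feats, t_mask=t_max, f_mask=f_max,
                            t_min=min(t_min, t_max), f_min=min(f_min, f_max))
    reward = -torch.rand(1).item()                 # proxy for negative validation loss/WER
    baseline = 0.9 * baseline + 0.1 * reward       # moving-average baseline
    loss = -(reward - baseline) * log_prob         # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

The baseline-subtracted REINFORCE update keeps the gradient variance manageable; the actual search space, discretisation, and reward definition used in the paper are not reproduced here.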

References (45)
  1. H. Christensen et al., “A Comparative Study of Adaptive, Automatic Recognition of Disordered Speech,” in INTERSPEECH, 2012.
  2. H. Christensen et al., “Combining In-Domain and Out-of-Domain Speech Data for Automatic Recognition of Disordered Speech,” in INTERSPEECH, 2013.
  3. S. Sehgal et al., “Model Adaptation and Adaptive Training for the Recognition of Dysarthric Speech,” in SLPAT, 2015.
  4. J. Yu et al., “Development of the CUHK Dysarthric Speech Recognition System for the UA Speech Corpus,” in INTERSPEECH, 2018.
  5. M. Geng et al., “Investigation of Data Augmentation Techniques for Disordered Speech Recognition,” in INTERSPEECH, 2020.
  6. S. Hu et al., “Exploiting Cross Domain Acoustic-to-articulatory Inverted Features for Disordered Speech Recognition,” in ICASSP, 2022.
  7. W. Verhelst et al., “An Overlap-add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-scale Modification of Speech,” in ICASSP, 1993.
  8. N. Kanda et al., “Elastic Spectral Distortion for Low Resource Speech Recognition with Deep Neural Networks,” in ASRU, 2013.
  9. N. Jaitly et al., “Vocal Tract Length Perturbation (VTLP) Improves Speech Recognition,” in ICML WDLASL, 2013.
  10. T. Ko et al., “Audio Augmentation for Speech Recognition,” in INTERSPEECH, 2015.
  11. T. Ko et al., “A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition,” in ICASSP, 2017.
  12. X. Cui et al., “Data Augmentation for Deep Neural Network Acoustic Modeling,” IEEE/ACM TASLP, 2015.
  13. T. Hayashi et al., “Back-Translation-Style Data Augmentation for End-to-End ASR,” in IEEE SLT, 2018.
  14. D. S. Park et al., “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in INTERSPEECH, 2019.
  15. D. S. Park et al., “SpecAugment on Large Scale Datasets,” in ICASSP, 2020.
  16. T.-Y. Hu et al., “SapAugment: Learning a Sample Adaptive Policy for Data Augmentation,” in ICASSP, 2021.
  17. A. Jain et al., “SPLICEOUT: A Simple and Efficient Audio Augmentation Method,” in INTERSPEECH, 2022.
  18. R. Li et al., “A Policy-based Approach to the SpecAugment Method for Low Resource E2E ASR,” in APSIPA ASC, 2022.
  19. X. Song et al., “TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty,” in ICASSP, 2022.
  20. G. Wang et al., “G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR,” in IEEE SLT, 2023.
  21. S. Zaiem et al., “Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations,” in INTERSPEECH, 2023.
  22. F. Xiong et al., “Phonetic Analysis of Dysarthric Speech Tempo and Applications to Robust Personalised Dysarthric Speech Recognition,” in ICASSP, 2019.
  23. Z. Jin et al., “Adversarial Data Augmentation for Disordered Speech Recognition,” in INTERSPEECH, 2021.
  24. J. Harvill et al., “Synthesis of New Words for Improved Dysarthric Speech Recognition on an Expanded Vocabulary,” in ICASSP, 2021.
  25. H. Kim et al., “Dysarthric Speech Database for Universal Access Research,” in INTERSPEECH, 2008.
  26. Y. Shao et al., “PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR,” in INTERSPEECH, 2020.
  27. S. Watanabe et al., “ESPnet: End-to-End Speech Processing Toolkit,” in INTERSPEECH, 2018.
  28. E. D. Cubuk et al., “AutoAugment: Learning Augmentation Strategies from Data,” in CVPR, 2019.
  29. S. Lim et al., “Fast AutoAugment,” in NeurIPS, 2019.
  30. E. D. Cubuk et al., “RandAugment: Practical Automated Data Augmentation with a Reduced Search Space,” in CVPR Workshops, 2020.
  31. R. Hataya et al., “Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation,” in ECCV, 2020.
  32. X. Zhang et al., “Adversarial AutoAugment,” in ICLR, 2020.
  33. V. Panayotov et al., “LibriSpeech: An ASR Corpus based on Public Domain Audio Books,” in ICASSP, 2015.
  34. J. J. Godfrey et al., “SWITCHBOARD: Telephone Speech Corpus for Research and Development,” in ICASSP, 1992.
  35. B. Vachhani et al., “Data Augmentation Using Healthy Speech for Dysarthric Speech Recognition,” in INTERSPEECH, 2018.
  36. S. Liu et al., “Recent Progress in the CUHK Dysarthric Speech Recognition System,” IEEE/ACM TASLP, 2021.
  37. R. J. Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Machine Learning, 1992.
  38. L. Gillick et al., “Some Statistical Issues in the Comparison of Speech Recognition Algorithms,” in ICASSP, 1989.
  39. S. Young et al., “The HTK book,” Cambridge University Engineering Department, vol. 3, 2006.
  40. M. Geng et al., “Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition,” in INTERSPEECH, 2021.
  41. F. Xiong et al., “Source Domain Data Selection for Improved Transfer Learning Targeting Dysarthric Speech Recognition,” in ICASSP, 2020.
  42. D. Wang et al., “Improved End-to-End Dysarthric Speech Recognition via Meta-Learning Based Model Re-Initialization,” in ISCSLP, 2021.
  43. L. Violeta et al., “Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition,” in INTERSPEECH, 2022.
  44. Z. Jin et al., “Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition,” in ICASSP, 2023.
  45. H. Wang et al., “DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model,” in INTERSPEECH, 2023.
Authors (8)
  1. Zengrui Jin (30 papers)
  2. Xurong Xie (38 papers)
  3. Tianzi Wang (37 papers)
  4. Mengzhe Geng (42 papers)
  5. Jiajun Deng (75 papers)
  6. Guinan Li (23 papers)
  7. Shujie Hu (36 papers)
  8. Xunying Liu (92 papers)
Citations (2)
