UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization (2401.14664v1)
Abstract: Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by this neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech compared with GAN-based (Generative Adversarial Network) approaches, but they remain limited by the training inefficiency of the cascaded pipeline and of the content encoder's auxiliary tasks, which may in turn degrade reconstruction quality. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT to improve training efficiency and uses speech units to constrain dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system consists only of a speech unit normalizer and a Unit HiFi-GAN vocoder, a considerably simpler design without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in content restoration, achieving a 28.2% relative reduction in average word error rate compared with the original dysarthric speech, and that it is robust to speed perturbation and noise.
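The abstract builds on the standard discrete-speech-unit recipe: continuous self-supervised frame features (e.g. from HuBERT) are quantized against a k-means codebook, and consecutive duplicate indices are collapsed into a unit sequence. The sketch below illustrates only that quantization step with NumPy; the random "features" and "centroids" are stand-ins for real HuBERT outputs and a trained codebook, and the function names are illustrative, not from the paper's code.

```python
import numpy as np

def quantize_to_units(features, centroids):
    """Map each frame feature (T, D) to the index of its nearest
    centroid (K, D); each index is one discrete 'speech unit'."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def dedupe(units):
    """Collapse runs of repeated units, as is common in unit-based
    speech pipelines, yielding a shorter pseudo-text sequence."""
    out = [units[0]]
    for u in units[1:]:
        if u != out[-1]:
            out.append(u)
    return out

# Stand-ins: a 100-entry codebook and 50 frames of 768-dim features
# (HuBERT-Base uses 768-dim frames; the codebook size is an assumption).
rng = np.random.default_rng(0)
centroids = rng.normal(size=(100, 768))
feats = rng.normal(size=(50, 768))
units = dedupe(quantize_to_units(feats, centroids).tolist())
print(units[:10])
```

In the full system, a unit sequence like this would be produced by the speech unit normalizer from dysarthric input and then rendered back to a waveform by the Unit HiFi-GAN vocoder.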