Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
The paper titled "Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages" presents an innovative approach to improving automatic speech recognition (ASR) for dysarthric speech, particularly in non-English and low-resource languages. Dysarthric speech poses significant challenges to ASR systems because of its atypical acoustics and the scarcity of annotated dysarthric datasets. This work addresses the data scarcity by proposing a method that leverages voice conversion (VC) techniques to generate dysarthric-like speech, thereby enriching the training data for multilingual ASR models.
Methodology
The core of the methodology is fine-tuning a voice conversion model on English dysarthric speech (from the UASpeech dataset) so that it captures and encodes both speaker characteristics and the prosodic distortions typical of dysarthric speech. This model is then used to convert healthy speech in other languages (drawn from the FLEURS dataset for Spanish, Italian, and Tamil) into dysarthric-like speech. The augmented datasets are subsequently used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), to improve its performance on dysarthric speech recognition across these languages.
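The data flow described above can be sketched schematically. The function names below (`convert_to_dysarthric`, `augment`) are illustrative placeholders, not the paper's actual code or any real library API; the stand-in conversion step simply tags each utterance so the pipeline's shape, healthy originals plus converted copies feeding the ASR fine-tuning set, is visible end to end.

```python
def convert_to_dysarthric(utterance: str) -> str:
    # Placeholder for the VC model fine-tuned on UASpeech, which would
    # re-synthesise the waveform with dysarthric speaker and prosody traits.
    return f"dysarthric-like({utterance})"

def augment(healthy_corpus: list[str]) -> list[str]:
    # One plausible scheme: keep the healthy originals and add their
    # converted counterparts, doubling the ASR fine-tuning data.
    return healthy_corpus + [convert_to_dysarthric(u) for u in healthy_corpus]

# Toy stand-ins for FLEURS utterance IDs (illustrative only).
fleurs_spanish = ["utt_0001", "utt_0002", "utt_0003"]
augmented = augment(fleurs_spanish)
print(len(augmented))  # 6: three healthy utterances plus three converted
```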
Experimental Setup and Results
The authors evaluate their approach on multiple datasets: PC-GITA for Spanish, EasyCall for Italian, and SSNCE for Tamil. The study shows that using VC to transfer both speaker and prosodic characteristics significantly enhances ASR performance compared to baseline models and traditional augmentation techniques such as speed and tempo perturbation. Specifically, the proposed speaker-prosody VC approach yielded marked reductions in character error rate (CER), indicating better transcription accuracy, especially for more severe dysarthria.
Implications and Future Directions
The paper provides compelling evidence that voice conversion can serve as an effective augmentation strategy in the context of ASR for dysarthric speech. This work not only underscores the potential of VC in simulating the acoustic traits of dysarthria but also highlights the limitations of current ASR systems in addressing these speech challenges without sufficient data.
The implications of this research are twofold. Practically, the approach could significantly improve speech technology access for individuals with dysarthria, especially in underrepresented linguistic communities. Theoretically, it opens pathways for further research into cross-linguistic application of VC techniques, potentially bridging the gap between high- and low-resource languages in ASR technology.
Looking forward, there are several avenues for future exploration. One critical area is the refinement of VC models to better maintain linguistic naturalness while embedding dysarthric characteristics, as current models tend to compromise on language-specific nuances. Expanding the approach to more languages and integrating additional dysarthric datasets could also enhance generalizability. Ultimately, this research lays a promising foundation for developing more inclusive ASR systems, encouraging further innovations in the synthesis and recognition of atypical speech.