- The paper introduces FLEURS-R, a multilingual corpus enhanced by Miipher that improves audio fidelity for speech generation tasks.
- The study leverages a modified Miipher model with USM to restore 102 languages, significantly reducing noise and reverberation.
- Rigorous evaluations show FLEURS-R maintains similar ASR accuracy while improving TTS naturalness, aiding low-resource language research.
Overview of FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
The paper introduces FLEURS-R, a speech-restoration-enhanced version of the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) corpus. FLEURS-R retains the N-way parallel structure of FLEURS, spanning 102 languages, but substantially improves audio fidelity through the Miipher speech restoration model. The work aims to advance speech technology for a diverse range of languages, particularly low-resource ones, and to foster further research on text-to-speech (TTS) and other speech generation tasks.
Key Features and Contributions
The FLEURS-R corpus holds several distinctive qualities:
- Multilingual Parallel Speech: Encompasses 102 languages, 80% of them low-resource, making the corpus well suited to a broad range of speech generation tasks.
- Enhanced Audio Quality: Achieved through the Miipher restoration model, resulting in significantly reduced noise and reverberation and a higher sampling rate (24 kHz).
- Public Accessibility: The corpus is released via Hugging Face, supporting broader research and development.
Methodology and Innovations
Speech Restoration Pipeline
- Speech Restoration Model:
  - Miipher Model: Used for audio enhancement, Miipher extracts acoustic features with w2v-BERT and cleans them with a DF-Conformer feature cleaner conditioned on speaker identity and transcript text; the WaveFit neural vocoder then synthesizes the restored audio.
  - Model Extension for Multilingual Support: Because Miipher was originally limited to English, the model was modified to use the Universal Speech Model (USM), pre-trained on 12 million hours of speech spanning over 300 languages. This adjustment enabled effective multilingual speech restoration.
- Data Processing:
  - The restoration pipeline was applied to the entire FLEURS corpus, yielding the enhanced version termed FLEURS-R.
  - Restored samples were filtered with ASR to discard utterances in which restoration introduced artifacts that corrupted the content; a list of the filtered samples was also released for transparency.
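The ASR-based filtering step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `difflib`'s similarity ratio stands in for a proper error-rate metric, and both the `min_similarity` threshold and the function names are assumptions for illustration.

```python
import difflib

def keep_sample(reference: str, asr_hypothesis: str,
                min_similarity: float = 0.8) -> bool:
    """Decide whether a restored sample survives ASR-based filtering.

    Compares the ASR transcript of the restored audio against the reference
    text; samples whose transcript diverges too far are dropped, on the
    assumption that restoration corrupted the content. The 0.8 threshold is
    illustrative, not the paper's value.
    """
    sim = difflib.SequenceMatcher(None, reference, asr_hypothesis).ratio()
    return sim >= min_similarity

# Illustrative usage with made-up transcripts:
print(keep_sample("hello world", "hello world"))  # kept
print(keep_sample("hello world", "zzzz"))         # dropped
```

In practice the filter would run a multilingual ASR model over every restored utterance and compare against the FLEURS transcript; the released list of filtered samples makes the outcome of this step reproducible.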
Evaluation and Results
The FLEURS-R corpus was subjected to rigorous evaluations, addressing both intelligibility and naturalness of speech.
ASR Evaluation
Character Error Rates (CERs) were computed across all 102 languages using the Maestro-U ASR model. The results demonstrated that FLEURS-R maintained semantic integrity with CERs very close to the original FLEURS corpus (9.74% vs. 9.67%). Notable improvements in CERs were observed in several languages, particularly Xhosa and Macedonian, suggesting enhanced speech clarity due to the restoration process.
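CER is the character-level Levenshtein (edit) distance between the ASR hypothesis and the reference transcript, normalized by the reference length. A minimal implementation (not the paper's evaluation code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(ref), len(hyp)
    # Single-row dynamic programming over the edit-distance table.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars -> 0.5
```

A corpus-level CER like the 9.74% reported for FLEURS-R is then an aggregate of such per-utterance (or pooled per-language) scores.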
Speech Naturalness
Naturalness was assessed using the SQuId model, which estimates speech MOS on a 5-point scale. On average, FLEURS-R achieved a 0.2 point higher score in naturalness than FLEURS. This improvement was consistent across nearly all languages, particularly notable in South Asian languages.
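The corpus-level comparison above amounts to averaging per-utterance predicted MOS and taking the difference between the two corpora. A minimal sketch, with made-up scores for illustration (not the paper's data):

```python
from statistics import mean

def mos_delta(scores_restored: list[float],
              scores_original: list[float]) -> float:
    """Mean predicted-MOS difference between restored and original audio;
    positive values favor the restored corpus."""
    return mean(scores_restored) - mean(scores_original)

# Illustrative per-utterance SQuId-style scores (fabricated for the example).
fleurs_r_scores = [3.9, 4.1, 3.8]
fleurs_scores = [3.7, 3.9, 3.6]
print(round(mos_delta(fleurs_r_scores, fleurs_scores), 2))
```

The same per-language aggregation underlies the observation that the gain held across nearly all languages rather than being driven by a few outliers.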
TTS Model Evaluations
The authors also trained TTS models on both FLEURS and FLEURS-R. Training on FLEURS-R improved synthesized speech naturalness, with an overall score of 3.89 versus 3.79 for FLEURS, while ASR evaluation of the TTS-generated speech showed comparable intelligibility across models trained on either variant.
Implications and Future Directions
FLEURS-R represents a meaningful step towards improving speech generation capabilities in a multilingual context, emphasizing language inclusivity. The enhancement of audio quality without compromising semantic content broadens the utility of the corpus for various speech generation tasks such as TTS, speech-to-speech translation, and voice conversion.
Future work could explore:
- Optimized ASR models: Tailored to better handle multilingual datasets and the unique challenges posed by restored audio.
- Further Refinement: Improving TTS model architectures for closer alignment with natural speech to reduce error rates further.
- Expanding Multilingual Resources: Enhancing the corpus with additional low-resource languages and exploring zero-shot and few-shot learning paradigms in more depth.
In conclusion, FLEURS-R, with its superior audio quality and extensive multilingual coverage, sets a benchmark for future research and development in multilingual speech technologies. The paper's detailed methodologies and comprehensive evaluations underscore its significance as a resource for advancing speech generation tasks in diverse linguistic landscapes.