- The paper introduces FLEURS-R, a multilingual corpus enhanced by Miipher that improves audio fidelity for speech generation tasks.
- The study leverages a modified Miipher model with USM to restore 102 languages, significantly reducing noise and reverberation.
- Rigorous evaluations show FLEURS-R maintains similar ASR accuracy while improving TTS naturalness, aiding low-resource language research.
Overview of FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks
The paper introduces FLEURS-R, a speech-restoration-enhanced version of the Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS) corpus. FLEURS-R retains the N-way parallel structure of FLEURS, spanning 102 languages, but substantially improves audio fidelity through the Miipher speech restoration model. The work aims to advance speech technology for a diverse range of languages, particularly low-resource ones, and to foster further research on text-to-speech (TTS) and other speech generation tasks.
Key Features and Contributions
The FLEURS-R corpus holds several distinctive qualities:
- Multilingual Parallel Speech: Encompasses 102 languages, 80% of them low-resource, making the corpus well suited to a broad range of speech generation tasks.
- Enhanced Audio Quality: Achieved through the Miipher restoration model, resulting in significantly reduced noise and reverberation and a higher sampling rate (24 kHz).
- Public Accessibility: The corpus is released via Hugging Face, supporting broader research and development.
Methodology and Innovations
Speech Restoration Pipeline
- Speech Restoration Model:
  - Miipher Model: Used for audio enhancement, Miipher extracts acoustic features with w2v-BERT and cleans them with a DF-Conformer feature cleaner conditioned on speaker identity and transcript text; the WaveFit neural vocoder then synthesizes the restored audio.
  - Model Extension for Multilingual Support: Because Miipher was originally limited to English, the model was modified to use the Universal Speech Model (USM), pre-trained on 12 million hours of speech spanning over 300 languages. This adjustment enabled effective multilingual speech restoration.
- Data Processing:
  - The restoration pipeline was applied to the entire FLEURS corpus, yielding the enhanced version termed FLEURS-R.
  - Restored samples were filtered with ASR to discard utterances in which restoration introduced artifacts that corrupted the content; a list of the filtered samples was also released for transparency.
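The ASR-based filtering step can be sketched as follows. This is a minimal illustration, not the paper's pipeline: `difflib`'s similarity ratio stands in for a proper error-rate metric, and both the `min_similarity` threshold and the function names are assumptions for illustration.

```python
import difflib

def keep_sample(reference: str, asr_hypothesis: str,
                min_similarity: float = 0.8) -> bool:
    """Decide whether a restored sample survives ASR-based filtering.

    Compares the ASR transcript of the restored audio against the reference
    text; samples whose transcript diverges too far are dropped, on the
    assumption that restoration corrupted the content. The 0.8 threshold is
    illustrative, not the paper's value.
    """
    sim = difflib.SequenceMatcher(None, reference, asr_hypothesis).ratio()
    return sim >= min_similarity

# Illustrative usage with made-up transcripts:
print(keep_sample("hello world", "hello world"))  # kept
print(keep_sample("hello world", "zzzz"))         # dropped
```

In practice the filter would run a multilingual ASR model over every restored utterance and compare against the FLEURS transcript; the released list of filtered samples makes the outcome of this step reproducible.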
Evaluation and Results
The FLEURS-R corpus was subjected to rigorous evaluations, addressing both intelligibility and naturalness of speech.
ASR Evaluation
Character Error Rates (CERs) were computed across all 102 languages using the Maestro-U ASR model. The results demonstrated that FLEURS-R maintained semantic integrity with CERs very close to the original FLEURS corpus (9.74% vs. 9.67%). Notable improvements in CERs were observed in several languages, particularly Xhosa and Macedonian, suggesting enhanced speech clarity due to the restoration process.
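CER is the character-level Levenshtein (edit) distance between the ASR hypothesis and the reference transcript, normalized by the reference length. A minimal implementation (not the paper's evaluation code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(ref), len(hyp)
    # Single-row dynamic programming over the edit-distance table.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(cer("kitten", "sitting"))  # 3 edits over 6 reference chars -> 0.5
```

A corpus-level CER like the 9.74% reported for FLEURS-R is then an aggregate of such per-utterance (or pooled per-language) scores.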
Speech Naturalness
Naturalness was assessed using the SQuId model, which estimates speech MOS on a 5-point scale. On average, FLEURS-R achieved a 0.2 point higher score in naturalness than FLEURS. This improvement was consistent across nearly all languages, particularly notable in South Asian languages.
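The corpus-level comparison above amounts to averaging per-utterance predicted MOS and taking the difference between the two corpora. A minimal sketch, with made-up scores for illustration (not the paper's data):

```python
from statistics import mean

def mos_delta(scores_restored: list[float],
              scores_original: list[float]) -> float:
    """Mean predicted-MOS difference between restored and original audio;
    positive values favor the restored corpus."""
    return mean(scores_restored) - mean(scores_original)

# Illustrative per-utterance SQuId-style scores (fabricated for the example).
fleurs_r_scores = [3.9, 4.1, 3.8]
fleurs_scores = [3.7, 3.9, 3.6]
print(round(mos_delta(fleurs_r_scores, fleurs_scores), 2))
```

The same per-language aggregation underlies the observation that the gain held across nearly all languages rather than being driven by a few outliers.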
TTS Model Evaluations
The authors also trained TTS models on both FLEURS and FLEURS-R. Training on FLEURS-R improved synthesized speech naturalness, with an overall score of 3.89 versus 3.79 for FLEURS, while ASR evaluation of the TTS-generated speech showed comparable intelligibility across models trained on either variant.
Implications and Future Directions
FLEURS-R represents a meaningful step towards improving speech generation capabilities in a multilingual context, emphasizing language inclusivity. The enhancement of audio quality without compromising semantic content broadens the utility of the corpus for various speech generation tasks such as TTS, speech-to-speech translation, and voice conversion.
Future work could explore:
- Optimized ASR models: Tailored to better handle multilingual datasets and the unique challenges posed by restored audio.
- Further Refinement: Improving TTS model architectures for closer alignment with natural speech to reduce error rates further.
- Expanding Multilingual Resources: Enhancing the corpus with additional low-resource languages and exploring zero-shot and few-shot learning paradigms in more depth.
In conclusion, FLEURS-R, with its superior audio quality and extensive multilingual coverage, sets a benchmark for future research and development in multilingual speech technologies. The paper's detailed methodologies and comprehensive evaluations underscore its significance as a resource for advancing speech generation tasks in diverse linguistic landscapes.