- The paper introduces a 24 kHz, sentence-aligned corpus that preserves key textual features and filters out noise for improved TTS performance.
- The authors demonstrate the corpus's efficacy using GMVAE-Tacotron models and WaveRNN vocoders, achieving higher naturalness scores, especially for female speakers.
- At 585 hours of speech from 2,456 speakers, with balanced gender representation, the dataset lays a robust foundation for advancing neural TTS research.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
The paper "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech" by Heiga Zen et al. introduces the LibriTTS corpus, a dataset specifically designed to address the shortcomings of the LibriSpeech corpus for text-to-speech (TTS) research. Although LibriSpeech, originally intended for automatic speech recognition (ASR), has been extensively employed in TTS research, it has several limitations that impede its use for TTS. This paper methodically addresses these limitations and presents experimental results highlighting the benefits of the LibriTTS corpus.
Key Contributions
The primary contributions of this paper include:
- Higher Sampling Rate: The LibriTTS audio is sampled at 24 kHz, as opposed to LibriSpeech's 16 kHz, enabling higher-quality synthesized speech.
- Sentence-Aligned Splitting: Speech is split at sentence boundaries, preserving long-term speech characteristics essential for natural-sounding synthesis.
- Preservation of Textual Features: The corpus retains capitalization, punctuation, and contextual information, which are pivotal for learning prosodic characteristics in TTS models.
- Noise Filtering: Utterances with significant background noise have been filtered out, unlike the LibriSpeech "clean" subsets, which still include some noisy samples.
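The noise-filtering idea can be sketched with a crude energy-based SNR estimate. This is an illustrative stand-in, not the authors' actual pipeline (the paper relies on more sophisticated SNR estimation); the `estimate_snr_db` helper and the 20 dB threshold are assumptions made for this demo.

```python
import numpy as np

def estimate_snr_db(wav, frame_len=1024):
    """Crude SNR estimate: treat the quietest 10% of frames as noise."""
    n = len(wav) // frame_len
    frames = wav[: n * frame_len].reshape(n, frame_len)
    energies = np.mean(frames ** 2, axis=1)
    noise = np.mean(np.sort(energies)[: max(1, n // 10)])
    signal = np.mean(energies)
    return 10.0 * np.log10(signal / max(noise, 1e-12))

def filter_clean(utterances, threshold_db=20.0):
    """Keep only utterances whose estimated SNR exceeds the threshold."""
    return [u for u in utterances if estimate_snr_db(u) > threshold_db]

# Synthetic demo: a tone with leading/trailing silence, clean vs. noisy.
rng = np.random.default_rng(0)
t = np.arange(24000) / 24000.0            # 1 second at 24 kHz
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
env = np.zeros_like(tone)
env[6000:18000] = 1.0                     # tone only in the middle half
speech = env * tone
clean = speech + 0.001 * rng.standard_normal(speech.size)
noisy = speech + 0.5 * rng.standard_normal(speech.size)

kept = filter_clean([clean, noisy])       # only the clean utterance survives
```

The quietest frames of a real recording approximate its noise floor, which is why the heuristic needs some silence in each utterance to work at all; a production filter would use a dedicated SNR estimator instead.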
Dataset Statistics
The LibriTTS dataset comprises 585 hours of speech from 2,456 speakers, organized into seven subsets spanning development, test, and training splits (e.g., dev-clean, test-clean, train-clean-360). Its balanced gender representation and large size make it a robust resource for TTS research.
Experimental Evaluation
The authors evaluated the corpus by training Gaussian mixture variational auto-encoder (GMVAE) Tacotron models coupled with WaveRNN-based neural vocoders. Subjective listening tests measured naturalness via mean opinion scores (MOS). Key findings include:
- Superior Naturalness: Models trained on the 24 kHz LibriTTS corpus achieved higher MOS, particularly for female speakers, underlining the benefit of the higher sampling rate.
- Comparison with LibriSpeech: LibriTTS down-sampled to 16 kHz outperformed LibriSpeech (16 kHz), indicating the advantages of sentence-aligned splitting and preserved textual features.
- Gender Disparity: Interestingly, naturalness scores for synthesized speech from male speakers were lower than those for female speakers, suggesting room for further model adjustments to close the gap.
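The 16 kHz comparison above implies down-sampling the 24 kHz audio by a 2/3 ratio. A rough sketch using linear interpolation is shown below; a production pipeline would band-limit with a proper polyphase resampler first (e.g., `scipy.signal.resample_poly`), so treat this only as an illustration of the rate change.

```python
import numpy as np

def downsample(wav, sr_in=24000, sr_out=16000):
    """Rough down-sampler via linear interpolation onto the target grid.

    Real pipelines should low-pass filter before decimating to avoid
    aliasing; this sketch only illustrates the 24 kHz -> 16 kHz change.
    """
    duration = len(wav) / sr_in
    n_out = int(round(duration * sr_out))
    t_in = np.arange(len(wav)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, wav)

wav_24k = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000.0)  # 1 s at 24 kHz
wav_16k = downsample(wav_24k)  # 16,000 samples, same 1 s duration
```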
Implications and Future Work
The introduction of LibriTTS marks a significant step in advancing TTS research by providing a high-quality, publicly available dataset. Practical implications include:
- Enhanced Model Training: The corpus's structure and quality enable more effective training of end-to-end neural TTS models.
- Prosodic and Stylization Improvements: The preserved textual features allow models to learn and synthesize nuanced prosodic elements more naturally.
- Expanded Research Frontiers: Future work can leverage LibriTTS for a multitude of challenging tasks such as style transfer, prosody modeling, and multi-lingual TTS systems.
Further research based on the LibriTTS corpus can explore:
- Speaker Imbalance: Addressing the imbalance in audio duration per speaker to ensure more uniform model training.
- Extended Linguistic Features: Investigating the impact of additional linguistic features on speech synthesis quality.
- Broader Language Coverage: Expanding the corpus to include more languages and additional speakers to foster broader applicability in global TTS research.
Conclusion
LibriTTS constitutes a robust and meticulously curated dataset that meets the specific needs of TTS research. The authors demonstrate through rigorous experimentation that the LibriTTS corpus significantly enhances the naturalness of synthesized speech, thereby setting a new standard for dataset quality in the TTS community. This corpus is anticipated to spur future innovations and facilitate advanced research in neural TTS systems. The dataset is made available for public use, providing a valuable resource for researchers aiming to push the boundaries of TTS technology.