LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning (2406.07969v1)

Published 12 Jun 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset. Our corpus, LibriTTS-P, is available at https://github.com/line/LibriTTS-P.

Authors (5)
  1. Masaya Kawamura (14 papers)
  2. Ryuichi Yamamoto (34 papers)
  3. Yuma Shirahata (10 papers)
  4. Takuya Hasumi (6 papers)
  5. Kentaro Tachibana (17 papers)
Citations (3)

Summary

An Analytical Overview of "LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning"

The paper "LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning" presents the development and evaluation of a new dataset designed to advance text-to-speech (TTS) and style captioning systems. The proposed dataset, LibriTTS-P, is built on the existing LibriTTS-R corpus and adds a diverse set of prompt annotations. These annotations are useful for training systems that can differentiate and reproduce nuanced speaking styles and speaker identities.

Core Contributions and Methodological Approach

LibriTTS-P distinguishes itself by providing richly annotated prompts that capture both speaking styles and speaker identities. This bi-level annotation is not present in many existing datasets, which tend to focus predominantly on style-related prompts without fully encapsulating the breadth of speaker characteristics. To achieve this comprehensive annotation, a hybrid methodological approach is adopted, which involves:

  1. Manual Annotations: Skilled annotators provide speaker-level prompts that reflect human perceptions of speaker characteristics, such as gender, age, and vocal demeanor. This manual process ensures that intricate and subjective nuances of human speech are captured accurately.
  2. Synthetic Annotations: Automated processes generate style prompts at the utterance level, capturing dynamic speech attributes such as pitch, speed, and loudness. The synthesis uses statistical analyses and predefined templates, augmented by LLMs, to convert these attributes into natural-language prompts (a minimal sketch of this templating step follows the list).
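
To make the synthetic-annotation step concrete, here is a minimal sketch of template-based style prompt generation. The thresholds, category labels, and template wording are illustrative assumptions rather than the authors' actual pipeline, which also applies LLM-based paraphrasing on top of such templates.

```python
# Minimal sketch of template-based style prompt synthesis (not the authors' code).
# Thresholds, labels, and the template below are illustrative assumptions.

def bucket(value: float, low: float, high: float,
           labels=("low", "normal", "high")) -> str:
    """Map a continuous measurement to a coarse categorical label."""
    if value < low:
        return labels[0]
    if value > high:
        return labels[2]
    return labels[1]

def style_prompt(pitch_hz: float, speed_wps: float, loudness_db: float) -> str:
    """Compose an utterance-level style prompt from acoustic measurements."""
    pitch = bucket(pitch_hz, 120.0, 220.0)                       # assumed thresholds
    speed = bucket(speed_wps, 2.0, 3.5, ("slow", "normal", "fast"))
    loud = bucket(loudness_db, -30.0, -18.0, ("quiet", "normal", "loud"))
    return f"A {pitch}-pitched voice speaking at a {speed} pace with {loud} volume."

print(style_prompt(pitch_hz=250.0, speed_wps=4.0, loudness_db=-15.0))
# -> "A high-pitched voice speaking at a fast pace with loud volume."
```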

The two levels of annotation, one grounded in subjective human judgment and the other produced algorithmically, yield a dataset that captures both innate speaker identity and observable speaking style in a robust and scalable fashion.
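
As an illustration of how the two annotation levels could be consumed together, the sketch below assembles a hypothetical corpus entry into a single conditioning text for prompt-based TTS training. The field names and the concatenation scheme are assumptions made for illustration, not the corpus's actual file format.

```python
# Hypothetical record combining speaker-level and utterance-level prompts.
# Field names and the joining scheme are assumptions, not the corpus schema.

from dataclasses import dataclass

@dataclass
class PromptedUtterance:
    utterance_id: str
    text: str              # transcript to synthesize
    speaker_prompt: str    # speaker-level: identity/characteristics (manual)
    style_prompt: str      # utterance-level: speaking style (synthetic)

    def conditioning_text(self) -> str:
        """Concatenate both prompt levels into one natural-language description."""
        return f"{self.speaker_prompt} {self.style_prompt}"

sample = PromptedUtterance(
    utterance_id="103_1241_000001_000000",
    text="There was a time when meadow, grove, and stream...",
    speaker_prompt="A middle-aged woman with a calm, warm, slightly husky voice.",
    style_prompt="She speaks slowly and quietly with a low pitch.",
)
print(sample.conditioning_text())
```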

Experimental Results and Observations

Experiments with LibriTTS-P demonstrate notable improvements in prompt-based controllable TTS. Models trained on LibriTTS-P synthesize speech with greater naturalness and closer fidelity to the prompts than models trained on existing datasets such as PromptSpeech, as validated by higher mean opinion scores (MOS) for both naturalness and prompt consistency.
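
For reference, a mean opinion score is simply the average of listener ratings on a fixed scale (typically 1 to 5), usually reported with a 95% confidence interval. The snippet below shows this standard computation on made-up ratings, not data from the paper.

```python
# Standard MOS computation with a 95% confidence interval.
# The ratings are hypothetical placeholders, not results from the paper.

import math
import statistics

ratings = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]   # hypothetical 1-5 naturalness scores

mos = statistics.mean(ratings)
ci95 = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
print(f"MOS: {mos:.2f} +/- {ci95:.2f}")
```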

In style captioning, models trained on LibriTTS-P generate captions that are markedly more descriptive while remaining accurate, producing roughly 2.5 times more accurate words than a model trained on a conventional dataset, as evidenced by subjective evaluations and metrics such as BLEU and BERTScore. The wider variety of prompts in the corpus allows these models to produce more detailed, human-like descriptions of speech characteristics.
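
The captioning metrics named above can be computed with off-the-shelf packages. The sketch below scores a made-up caption against a made-up reference using the third-party sacrebleu and bert-score libraries; it is only meant to show how such evaluations are typically run, not to reproduce the paper's setup.

```python
# Scoring generated style captions with BLEU and BERTScore.
# Requires: pip install sacrebleu bert-score
# The captions are made-up examples, not corpus data.

import sacrebleu
from bert_score import score as bert_score

hypotheses = ["A young man speaking quickly in a bright, energetic voice."]
references = ["An energetic young male voice, speaking at a fast pace."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
precision, recall, f1 = bert_score(hypotheses, references, lang="en")

print(f"BLEU: {bleu.score:.1f}")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```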

Implications and Future Directions

The creation and deployment of LibriTTS-P hold several implications for the design of future TTS systems and other applications involving speech analysis. By effectively integrating human and synthetic annotations, LibriTTS-P facilitates the training of models that are better aligned with human speech perception.

Further research could explore expanding the dataset with even more diverse data, potentially incorporating cross-linguistic prompts to examine parameter transferability across different languages and cultural contexts. Additionally, exploring the use of free-form text descriptions might unlock further potential in capturing the dynamic nature of speech. This avenue, while promising, poses challenges in maintaining annotation efficiency and consistency.

In conclusion, LibriTTS-P represents a significant stride toward datasets that bridge the gap between synthetic and human-like speech production. It raises the bar for speech corpora with prompt annotations and paves the way for future work on more adaptable, perceptually aligned TTS systems and speech analysis methods.
