A Comprehensive Overview of Paralinguistic Speech Captions (ParaSpeechCaps) for Advanced Style-Prompted TTS
The paper presents Paralinguistic Speech Captions (ParaSpeechCaps), a substantial advance in style-prompted Text-to-Speech (TTS) built on datasets annotated with rich style captions. The work makes two main contributions: a large-scale dataset of speech utterances labelled with rich style tags, and scalable methods for producing those annotations automatically.
Dataset Construction and Methodology
ParaSpeechCaps covers 59 unique style tags, split into intrinsic tags tied to speaker identity and situational tags that vary per utterance. The dataset comprises a human-annotated portion and an automatically scaled portion, totalling 342 hours of high-quality human-labelled data and 2427 hours of automatically tagged data.
Intrinsic tags describe qualities bound to a speaker's identity, such as pitch and vocal texture, while situational tags capture transient characteristics of a particular utterance, such as emotion and expressiveness. This split reflects the difference between who a speaker is and how they happen to be speaking. Existing style-prompted TTS datasets typically offer only basic style tags, or rich tags only at small scale; ParaSpeechCaps addresses both gaps.
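To make the tag taxonomy concrete, the snippet below sketches what a single annotated utterance might look like, with intrinsic and situational tags combined into a free-form style caption. The field names, tag values, and file paths are illustrative placeholders, not the dataset's actual schema.

```python
# Hypothetical example of one ParaSpeechCaps-style entry; field names and
# values are illustrative only, not the dataset's actual schema.
example_entry = {
    "audio_path": "clips/spk_0042_utt_0007.wav",
    "transcript": "I can't believe we actually pulled it off!",
    # Intrinsic tags: tied to the speaker's identity, shared across their utterances.
    "intrinsic_tags": ["deep", "husky"],
    # Situational tags: specific to how this particular utterance is delivered.
    "situational_tags": ["excited", "fast"],
    # Free-form style caption composing both kinds of tags.
    "caption": "A man with a deep, husky voice speaks quickly in an excited tone.",
}
```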
The authors introduce two data-scaling methods, one per class of tag. For intrinsic tags, they exploit perceptual speaker similarity, employing VoxSim to propagate tags automatically from annotated speakers to speakers who sound alike, expanding the dataset robustly. For situational tags, a multi-step pipeline of Expressivity Filtering, Semantic Matching, and Acoustic Matching preserves annotation quality while greatly expanding the data. Together, these methods let the dataset capture realistic and varied speech styles at scale.
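A minimal sketch of how these two scaling steps could be organized is shown below. The similarity scorer, the 0.8 threshold, and the three filtering predicates are placeholders standing in for the paper's actual components (e.g., the VoxSim-based speaker-similarity model and the expressivity, semantic, and acoustic checks); only the overall propagate-then-filter structure follows the description above.

```python
from typing import Callable, Dict, List

# --- Intrinsic-tag scaling: propagate speaker-level tags to similar speakers. ---
# `similarity` stands in for a perceptual speaker-similarity scorer (e.g. one
# built on VoxSim-style judgments); the 0.8 threshold is an arbitrary placeholder.
def propagate_intrinsic_tags(
    labelled: Dict[str, List[str]],            # speaker_id -> human-annotated intrinsic tags
    unlabelled_speakers: List[str],
    similarity: Callable[[str, str], float],   # perceptual similarity between two speakers
    threshold: float = 0.8,
) -> Dict[str, List[str]]:
    propagated = {}
    for spk in unlabelled_speakers:
        # Find the annotated speaker this voice is most perceptually similar to.
        best_src, best_score = max(
            ((src, similarity(spk, src)) for src in labelled),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            propagated[spk] = labelled[best_src]  # copy intrinsic tags across
    return propagated

# --- Situational-tag scaling: expressivity filter, then semantic and acoustic matching. ---
# The three predicates are placeholders for the paper's filtering stages.
def scale_situational_tags(
    utterances: List[dict],
    is_expressive: Callable[[dict], bool],       # Expressivity Filtering
    text_predicted_tags: Callable[[dict], set],  # Semantic Matching (from the transcript)
    audio_predicted_tags: Callable[[dict], set], # Acoustic Matching (from the waveform)
) -> List[dict]:
    kept = []
    for utt in utterances:
        if not is_expressive(utt):
            continue
        # Keep only tags on which the text-based and audio-based predictions agree.
        tags = text_predicted_tags(utt) & audio_predicted_tags(utt)
        if tags:
            kept.append({**utt, "situational_tags": sorted(tags)})
    return kept
```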
Experimental Results
Models trained on ParaSpeechCaps deliver clear improvements in style-prompted TTS, particularly when built on updated architectures such as Parler-TTS. Two models trained on different portions of the dataset were evaluated: a Base model (human-annotated data only) and a Scaled model (human-annotated plus automatically scaled data). Compared with existing baselines, both models improve style consistency and speech quality, and the Scaled model achieves the strongest Consistency Mean Opinion Score (CMOS) and Naturalness Mean Opinion Score (NMOS), supporting the effectiveness of the scaling methods used to build the dataset.
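For context, the snippet below sketches how a rich style caption drives generation in a Parler-TTS-style model, following the publicly documented parler_tts interface. The checkpoint name is the public Parler-TTS Mini model, used here only as a stand-in for the paper's ParaSpeechCaps-trained Base and Scaled models; the caption and prompt are invented examples.

```python
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

# Stand-in checkpoint: the public Parler-TTS Mini model, not the paper's
# ParaSpeechCaps-trained Base/Scaled models.
repo = "parler-tts/parler-tts-mini-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Rich style caption combining intrinsic and situational descriptors.
description = "A man with a deep, husky voice speaks quickly in an excited, animated tone."
prompt = "I can't believe we actually pulled it off!"

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# The description conditions the speaking style; the prompt is the text to be spoken.
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("styled_speech.wav", audio, model.config.sampling_rate)
```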
Interestingly, the results also show that rich style rendering can affect perceived intelligibility: variation that reflects the intricacies of natural speech tends to lower intelligibility scores. This underscores an ongoing trade-off in TTS research between faithful style portrayal and the clarity of synthesized speech.
Implications and Future Directions
The implications of this research are both practical and theoretical. It underlines the importance of comprehensive datasets in advancing TTS, enabling more nuanced, human-like speech synthesis. Practically, speech interfaces built on such data can serve use cases ranging from digital assistants to entertainment and education, enhancing user experience through realistic, context-sensitive spoken interaction.
Theoretically, the scaling methodologies give future work a solid basis for growing dataset size without a proportional increase in manual annotation cost. The work also opens a path toward multilingual style-prompted synthesis, a promising direction for next-generation TTS models.
Conclusion
ParaSpeechCaps significantly advances style-prompted TTS by introducing a rich, scalable dataset and robust annotation methodologies. Through extensive experimentation and validation, the paper demonstrates improved model performance in style consistency and speech quality. The framework and dataset mark a notable step forward in the understanding and synthesis of expressive speech, offering valuable insights and tools for future work in computational linguistics and AI-driven communication.