StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations (2404.14946v1)

Published 23 Apr 2024 in cs.SD, cs.CL, and eess.AS

Abstract: While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the expressiveness inherent in text has received insufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly expressive TTS dataset that contains rich expressiveness from both acoustic and textual perspectives, derived from the recording of a Mandarin storytelling show. A systematic and comprehensive labeling framework is proposed for textual expressiveness. Drawing on linguistics, rhetoric, and related fields, we analyze and define speech-related textual expressiveness in StoryTTS along five distinct dimensions. We then employ LLMs, prompting them with a few manual annotation examples, to perform batch annotation. The resulting corpus contains 61 hours of consecutive and highly prosodic speech with accurate text transcriptions and rich textual expressiveness annotations. StoryTTS can therefore aid future ETTS research in fully mining its abundant intrinsic textual and acoustic features. Experiments validate that TTS models generate speech with improved expressiveness when integrating the annotated textual labels in StoryTTS.
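
The abstract describes prompting LLMs with a few manually annotated examples to batch-annotate textual expressiveness. Below is a minimal sketch of how such a few-shot annotation loop could be structured. The label dimensions, prompt wording, and the `call_llm` helper are illustrative placeholders under assumed conventions, not the authors' actual pipeline or label taxonomy.

```python
# Minimal sketch of few-shot LLM batch annotation in the spirit of the abstract.
# All names below are hypothetical; the real StoryTTS taxonomy defines its own
# five expressiveness dimensions and prompts.

import json

# Hypothetical expressiveness dimensions (placeholders, not the paper's exact set).
LABEL_DIMENSIONS = ["rhetoric", "scene", "character imitation", "emotion", "pace"]

# A few manually annotated examples used as few-shot demonstrations.
FEW_SHOT_EXAMPLES = [
    {"text": "<example sentence 1>", "labels": {"rhetoric": "metaphor"}},
    {"text": "<example sentence 2>", "labels": {"emotion": "excited"}},
]

def build_prompt(sentence: str) -> str:
    """Assemble an instruction plus a few manually annotated examples."""
    lines = [
        "Label the textual expressiveness of the sentence along these "
        f"dimensions: {', '.join(LABEL_DIMENSIONS)}.",
        "Return a JSON object mapping each dimension to a label.",
        "Examples:",
    ]
    for ex in FEW_SHOT_EXAMPLES:
        lines.append(f"Text: {ex['text']}")
        lines.append(f"Labels: {json.dumps(ex['labels'], ensure_ascii=False)}")
    lines.append(f"Text: {sentence}")
    lines.append("Labels:")
    return "\n".join(lines)

def annotate_batch(sentences, call_llm):
    """Annotate a batch of transcript sentences with an LLM.

    `call_llm` is any callable that takes a prompt string and returns the
    model's text response (e.g. a thin wrapper around a chat API).
    """
    annotations = []
    for sentence in sentences:
        response = call_llm(build_prompt(sentence))
        try:
            annotations.append(json.loads(response))
        except json.JSONDecodeError:
            annotations.append(None)  # flag this sentence for manual review
    return annotations
```

In practice, the few-shot examples and any malformed responses flagged for manual review would be curated by human annotators, which matches the abstract's mix of manual annotation examples and LLM-driven batch labeling.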
