
PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions (2309.08140v2)

Published 15 Sep 2023 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) designed to be approximately independent of speaking style. Since there is no large-scale dataset containing speaker prompts, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method can better control speaker characteristics than the methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.
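The abstract states that diverse speaker factors are modeled with a diffusion-based acoustic model combined with mixture density networks (MDNs). The paper gives no implementation details here, so the following is only a minimal sketch of the general MDN idea: a network head that maps a prompt embedding to the parameters of a Gaussian mixture over a speaker representation, from which a speaker vector can be sampled. All layer sizes, class names, and the use of NumPy are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MDNHead:
    """Hypothetical mixture density network head (sketch only).

    Maps an input embedding to the parameters of a K-component
    diagonal-Gaussian mixture over a D-dimensional speaker
    representation, then samples one speaker vector from it.
    """

    def __init__(self, in_dim, out_dim, n_components):
        self.K, self.D = n_components, out_dim
        # One linear layer producing, per component: a mixture logit,
        # D means, and D log standard deviations.
        n_out = n_components * (1 + 2 * out_dim)
        self.W = rng.standard_normal((in_dim, n_out)) * 0.01
        self.b = np.zeros(n_out)

    def forward(self, h):
        z = h @ self.W + self.b
        logits = z[: self.K]                                   # mixture logits
        mu = z[self.K : self.K + self.K * self.D].reshape(self.K, self.D)
        log_sigma = z[self.K + self.K * self.D :].reshape(self.K, self.D)
        return softmax(logits), mu, np.exp(log_sigma)

    def sample(self, h):
        pi, mu, sigma = self.forward(h)
        k = rng.choice(self.K, p=pi)                           # pick a component
        return mu[k] + sigma[k] * rng.standard_normal(self.D)  # sample from it

# Stand-in for an encoded speaker prompt (dimensions are arbitrary).
head = MDNHead(in_dim=16, out_dim=8, n_components=4)
prompt_embedding = rng.standard_normal(16)
speaker_vector = head.sample(prompt_embedding)
print(speaker_vector.shape)  # (8,)
```

In an actual prompt-based TTS system the input would come from a text encoder over the speaker prompt, and the sampled vector would condition the acoustic model; the mixture structure is what lets one prompt map to a distribution of plausible voices rather than a single point estimate.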

Authors (7)
  1. Reo Shimizu (2 papers)
  2. Ryuichi Yamamoto (34 papers)
  3. Masaya Kawamura (14 papers)
  4. Yuma Shirahata (10 papers)
  5. Hironori Doi (1 paper)
  6. Tatsuya Komatsu (29 papers)
  7. Kentaro Tachibana (17 papers)
Citations (17)
