PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions (2309.08140v2)
Abstract: We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that allows control over speaker identity using natural language descriptions. To control speaker identity within the prompt-based TTS framework, we introduce the concept of a speaker prompt, which describes voice characteristics (e.g., gender-neutral, young, old, and muffled) and is designed to be approximately independent of speaking style. Since no large-scale dataset containing speaker prompts exists, we first construct a dataset based on the LibriTTS-R corpus with manually annotated speaker prompts. We then employ a diffusion-based acoustic model with mixture density networks to model the diverse speaker factors in the training data. Unlike previous studies that rely on style prompts describing only a limited aspect of speaker individuality, such as pitch, speaking speed, and energy, our method utilizes an additional speaker prompt to effectively learn the mapping from natural language descriptions to the acoustic features of diverse speakers. Our subjective evaluation results show that the proposed method controls speaker characteristics better than methods without the speaker prompt. Audio samples are available at https://reppy4620.github.io/demo.promptttspp/.
- “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions” In Proc. ICASSP, 2018, pp. 4779–4783
- “NaturalSpeech: End-to-end text to speech synthesis with human-level quality” In arXiv preprint arXiv:2205.04421, 2022
- “A survey on neural speech synthesis” In arXiv preprint arXiv:2106.15561, 2021
- “PromptTTS: Controllable text-to-speech with text descriptions” In Proc. ICASSP, 2023, pp. 1–5
- “InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt” In arXiv preprint arXiv:2301.13662, 2023
- “PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions” In arXiv preprint arXiv:2305.19522, 2023
- “Language models are few-shot learners” In Proc. NeurIPS, 2020, pp. 1877–1901
- “Training language models to follow instructions with human feedback” In Proc. NeurIPS, 2022, pp. 27730–27744
- “Llama 2: Open foundation and fine-tuned chat models” In arXiv preprint arXiv:2307.09288, 2023
- Christopher M. Bishop “Mixture density networks”, Technical Report, Aston University, Birmingham, UK, 1994
- “Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis” In Proc. ICML, 2018, pp. 5180–5189
- “Speaker generation” In Proc. ICASSP, 2022, pp. 7897–7901
- “LibriTTS-R: A restored multi-speaker text-to-speech corpus” In Proc. Interspeech, 2023, pp. 5496–5500 DOI: 10.21437/Interspeech.2023-1584
- “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” In Proc. NAACL-HLT, 2019, pp. 4171–4186
- “Conformer: Convolution-augmented transformer for speech recognition” In Proc. Interspeech, 2020, pp. 5036–5040 DOI: 10.21437/Interspeech.2020-3015
- “FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech” In Proc. ICLR, 2021
- “DiffSinger: Singing voice synthesis via shallow diffusion mechanism” In Proc. AAAI, 2022, pp. 11020–11028
- Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Proc. NeurIPS, 2020, pp. 6840–6851
- “Phone-level prosody modelling with GMM-based MDN for diverse and controllable speech synthesis” In IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 30, 2021, pp. 190–201
- “Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations” In arXiv preprint arXiv:2303.01664, 2023
- “LibriTTS: A corpus derived from LibriSpeech for text-to-speech” In Proc. Interspeech, 2019, pp. 1526–1530 DOI: 10.21437/Interspeech.2019-2441
- “Extraction of everyday expression associated with voice quality of normal utterance” In Journal of the Acoustical Society of Japan (in Japanese) 55.6, 1999, pp. 405–411
- “Text-to-speech technology to control speaker individuality with intuitive expressions” In Toshiba Review (in Japanese) 71, 2016, pp. 80–83
- “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi” In Proc. Interspeech, 2017, pp. 498–502
- Masanori Morise, Fumiya Yokomori and Kenji Ozawa “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications” In IEICE Trans. on Information and Systems 99.7, 2016, pp. 1877–1884
- “Continuous F0 modeling for HMM based statistical parametric speech synthesis” In IEEE Trans. on Audio, Speech, and Lang. Process. 19.5, 2010, pp. 1071–1079
- Vinod Nair and Geoffrey E Hinton “Rectified linear units improve restricted Boltzmann machines” In Proc. ICML, 2010, pp. 807–814
- “Decoupled weight decay regularization” In Proc. ICLR, 2019
- “Attention is all you need” In Proc. NIPS, 2017, pp. 5998–6008
- “BigVGAN: A Universal Neural Vocoder with Large-Scale Training” In Proc. ICLR, 2023
- Xin Wang, Shinji Takaki and Junichi Yamagishi “Neural source-filter waveform models for statistical parametric speech synthesis” In IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 28, 2019, pp. 402–415
- Reo Shimizu
- Ryuichi Yamamoto
- Masaya Kawamura
- Yuma Shirahata
- Hironori Doi
- Tatsuya Komatsu
- Kentaro Tachibana