Controllable Speaking Styles Using a Large Language Model (2305.10321v2)
Abstract: Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial. Large language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the prompt), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. However, existing methods typically require a prompt-labelled speech corpus to jointly train a prompt-conditioned encoder. In contrast, we employ an LLM to directly suggest prosodic modifications for a controllable TTS model, using contextual information provided in the prompt. The prompt can be designed for a multitude of tasks. Here, we give two demonstrations: control of speaking style, and generation of prosody appropriate for a given dialogue context. The proposed method is rated most appropriate in 50% of cases, versus 31% for a baseline model.
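The pipeline the abstract describes is simple enough to sketch in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `query_llm` and `ControllableTTS` are hypothetical stand-ins for a chat-style LLM API call and a FastSpeech 2-style acoustic model with explicit prosody controls, and the JSON control format is invented here for clarity.

```python
import json

# Hypothetical stand-ins (not from the paper): query_llm wraps any
# chat-style LLM API; ControllableTTS is assumed to be a FastSpeech 2-style
# model exposing utterance-level pitch / energy / duration scaling factors.
from my_llm_client import query_llm          # hypothetical module
from my_tts import ControllableTTS           # hypothetical module

PROMPT_TEMPLATE = """You will control a text-to-speech system.
Given the target text and the requested speaking style, return JSON with
three scaling factors relative to a neutral reading, each in [0.5, 2.0]:
{{"pitch": ..., "energy": ..., "duration": ...}}

Style: {style}
Text: {text}
JSON:"""

def synthesise_in_style(text: str, style: str, tts: ControllableTTS):
    # 1) Ask the LLM for prosodic modifications appropriate to the style;
    #    the prompt carries all the contextual information.
    reply = query_llm(PROMPT_TEMPLATE.format(style=style, text=text))
    controls = json.loads(reply)  # e.g. {"pitch": 1.3, "energy": 1.2, "duration": 0.9}

    # 2) Condition the controllable TTS model on the suggested values.
    #    No reference utterance and no prompt-labelled corpus are needed.
    return tts.synthesise(
        text,
        pitch_scale=controls["pitch"],
        energy_scale=controls["energy"],
        duration_scale=controls["duration"],
    )

# Usage: the same target text rendered in two different styles.
# tts = ControllableTTS.load("fastspeech2_ljspeech")
# excited = synthesise_in_style("I can't believe it!", "excited", tts)
# bored   = synthesise_in_style("I can't believe it!", "bored", tts)
```

The same template generalises to the paper's second demonstration: replacing the `Style:` field with a `Dialogue context:` field asks the LLM for prosody appropriate to the surrounding conversation rather than to a named style.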
- Atli Thor Sigurgeirsson
- Simon King