Controllable Speaking Styles Using a Large Language Model (2305.10321v2)

Published 17 May 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial. Large generative language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the prompt), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. But existing methods typically require a prompt-labelled speech corpus for jointly training a prompt-conditioned encoder. In contrast, we employ an LLM to directly suggest prosodic modifications for a controllable TTS model, using contextual information provided in the prompt. The prompt can be designed for a multitude of tasks. Here, we give two demonstrations: control of speaking style, and prosody appropriate for a given dialogue context. The proposed method is rated most appropriate in 50% of cases, vs. 31% for a baseline model.
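The page carries no code, but the pipeline the abstract describes is simple to sketch: prompt an LLM with the utterance text plus a target style (or dialogue context) and ask it to return prosodic modifications, which a controllable acoustic model (such as a FastSpeech 2-style model exposing pitch, energy and duration controls) then consumes. Below is a minimal, hypothetical Python sketch of that idea; every identifier (`query_llm`, `suggest_prosody`, `synthesize`) is an illustrative placeholder, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): ask an LLM for
# per-utterance prosody scale factors, then hand them to a
# controllable TTS acoustic model.

import json

PROMPT_TEMPLATE = """You are controlling a text-to-speech system.
Target speaking style: {style}
Utterance: "{text}"

Reply with only a JSON object giving scale factors (1.0 = neutral)
for the keys "pitch", "energy" and "duration".
"""


def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call. Returns a canned reply so the
    sketch runs end to end; replace with your client of choice."""
    return '{"pitch": 1.15, "energy": 1.3, "duration": 0.9}'


def suggest_prosody(text: str, style: str) -> dict:
    """Query the LLM and parse its reply, falling back to neutral
    (1.0) factors and clamping to a sane range on bad output."""
    reply = query_llm(PROMPT_TEMPLATE.format(style=style, text=text))
    try:
        raw = json.loads(reply)
    except json.JSONDecodeError:
        raw = {}
    return {
        key: min(max(float(raw.get(key, 1.0)), 0.5), 2.0)
        for key in ("pitch", "energy", "duration")
    }


def synthesize(text: str, style: str) -> dict:
    """Where a controllable acoustic model (e.g. a FastSpeech 2-style
    model exposing pitch/energy/duration controls) would consume the
    factors; the actual synthesis call is elided here."""
    factors = suggest_prosody(text, style)
    # e.g. tts_model.infer(text, pitch_scale=factors["pitch"], ...)
    return factors


if __name__ == "__main__":
    print(synthesize("I can't believe you did that.", "angry"))
```

The clamp step guards synthesis against a malformed LLM reply; swapping the style string for a description of the preceding dialogue turns would correspond to the paper's second demonstration, context-appropriate prosody.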

Authors (2)
  1. Atli Thor Sigurgeirsson (2 papers)
  2. Simon King (28 papers)
Citations (3)
