Uncovering Latent Style Factors for Expressive Speech Synthesis (1711.00520v1)

Published 1 Nov 2017 in cs.CL and cs.SD

Abstract: Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual input containing only phonetic information. In this preliminary study, we introduce the concept of "style tokens" in Tacotron, a recently proposed end-to-end neural speech synthesis model. Using style tokens, we aim to extract independent prosodic styles from training data. We show that without annotation data or an explicit supervision signal, our approach can automatically learn a variety of prosodic variations in a purely data-driven way. Importantly, each style token corresponds to a fixed style factor regardless of the given text sequence. As a result, we can control the prosodic style of synthetic speech in a somewhat predictable and globally consistent way.
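
Below is a minimal, illustrative sketch of the kind of mechanism the abstract describes: a bank of learned "style token" embeddings that is trained without style annotations and whose (soft) selection conditions a Tacotron-style text encoder. This is not the authors' code; the module name, dimensions, and the attention-style combination are assumptions made for illustration only.

```python
# Illustrative sketch (not the paper's implementation): a learned bank of
# style-token embeddings combined by soft attention weights. All names and
# dimensions here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    def __init__(self, num_tokens: int = 10, token_dim: int = 256, query_dim: int = 256):
        super().__init__()
        # Fixed bank of style embeddings, learned jointly with the synthesizer
        # and without any explicit style labels.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.02)
        self.query_proj = nn.Linear(query_dim, token_dim)

    def forward(self, query: torch.Tensor):
        # query: (batch, query_dim), e.g. a summary of the text encoder state.
        q = self.query_proj(query)                            # (batch, token_dim)
        scores = q @ self.tokens.t() / self.tokens.size(1) ** 0.5
        weights = F.softmax(scores, dim=-1)                   # soft choice over styles
        style_embedding = weights @ self.tokens               # (batch, token_dim)
        return style_embedding, weights

if __name__ == "__main__":
    layer = StyleTokenLayer()
    encoder_summary = torch.randn(4, 256)
    style, w = layer(encoder_summary)
    print(style.shape, w.shape)  # torch.Size([4, 256]) torch.Size([4, 10])
```

In such a setup, the resulting style embedding would be broadcast-added to the text encoder outputs before decoding, so that fixing the token weights at inference time steers prosody in a globally consistent way, as the abstract suggests.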

Authors (8)
  1. Yuxuan Wang (239 papers)
  2. RJ Skerry-Ryan (21 papers)
  3. Ying Xiao (29 papers)
  4. Daisy Stanton (12 papers)
  5. Joel Shor (20 papers)
  6. Eric Battenberg (14 papers)
  7. Rob Clark (10 papers)
  8. Rif A. Saurous (32 papers)
Citations (53)
