Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS (2207.06000v1)

Published 13 Jul 2022 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Expressive text-to-speech has shown improved performance in recent years. However, style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still want to control speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that models the semantic relationship between the text description embedding and the speech style embedding with a pretrained LLM. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen styles.

Authors (5)
  1. Yookyung Shin (2 papers)
  2. Younggun Lee (10 papers)
  3. Suhee Jo (2 papers)
  4. Yeongtae Hwang (3 papers)
  5. Taesu Kim (23 papers)
Citations (14)