Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning (2310.17101v2)

Published 26 Oct 2023 in eess.AS and cs.SD

Abstract: This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Xinfa Zhu (29 papers)
  2. Yuke Li (76 papers)
  3. Yi Lei (40 papers)
  4. Ning Jiang (177 papers)
  5. Guoqing Zhao (20 papers)
  6. Lei Xie (337 papers)