Exploring synthetic data for cross-speaker style transfer in style representation based TTS (2409.17364v1)

Published 25 Sep 2024 in eess.AS and cs.SD

Abstract: Incorporating cross-speaker style transfer in text-to-speech (TTS) models is challenging due to the need to disentangle speaker and style information in audio. In low-resource expressive data scenarios, voice conversion (VC) can generate expressive speech for target speakers, which can then be used to train the TTS model. However, the quality and style transfer ability of the VC model are crucial for the overall TTS model quality. In this work, we explore the use of synthetic data generated by a VC model to assist the TTS model in cross-speaker style transfer tasks. Additionally, we employ pre-training of the style encoder using timbre perturbation and prototypical angular loss to mitigate speaker leakage. Our results show that using VC synthetic data can improve the naturalness and speaker similarity of TTS in cross-speaker scenarios. Furthermore, we extend this approach to a cross-language scenario, enhancing accent transfer.

Authors (7)
  1. Lucas H. Ueda
  2. Leonardo B. de M. M. Marques
  3. Flávio O. Simões
  4. Mário U. Neto
  5. Fernando Runstein
  6. Bianca Dal Bó
  7. Paula D. P. Costa