Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Adapting TTS models For New Speakers using Transfer Learning (2110.05798v2)

Published 12 Oct 2021 in cs.SD, cs.CL, and eess.AS

Abstract: Training neural text-to-speech (TTS) models for a new speaker typically requires several hours of high quality speech data. Prior works on voice cloning attempt to address this challenge by adapting pre-trained multi-speaker TTS models for a new voice, using a few minutes of speech data of the new speaker. However, publicly available large multi-speaker datasets are often noisy, thereby resulting in TTS models that are not suitable for use in products. We address this challenge by proposing transfer-learning guidelines for adapting high quality single-speaker TTS models for a new speaker, using only a few minutes of speech data. We conduct an extensive study using different amounts of data for a new speaker and evaluate the synthesized speech in terms of naturalness and voice/style similarity to the target speaker. We find that fine-tuning a single-speaker TTS model on just 30 minutes of data, can yield comparable performance to a model trained from scratch on more than 27 hours of data for both male and female target speakers.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Paarth Neekhara (21 papers)
  2. Jason Li (91 papers)
  3. Boris Ginsburg (111 papers)
Citations (13)