BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization (2002.01953v1)

Published 4 Feb 2020 in eess.AS, cs.LG, cs.SD, and stat.ML

Abstract: We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
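The abstract does not spell out the optimization loop. The sketch below shows a generic Bayesian-optimization loop of the kind it describes: propose fine-tuning hyper-parameters, score the adapted model, update a surrogate, and repeat. It rests on several assumptions that are ours, not the paper's:

- a Gaussian-process surrogate with an RBF kernel;
- an expected-improvement acquisition, maximized over random candidates;
- exactly two hyper-parameters (a scaled learning rate and a scaled step count in [0, 1]);
- a hypothetical `toy_similarity` function standing in for the expensive "fine-tune, then measure speaker similarity" step.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two hyper-parameter vectors."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)

def cholesky(K):
    """Lower-triangular L with L @ L.T == K (K symmetric positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L.T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(X, y, noise=1e-4):
    """Fit a GP (constant mean = sample mean) and cache what prediction needs."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    mean_y = sum(y) / len(y)
    alpha = solve_upper_t(L, solve_lower(L, [v - mean_y for v in y]))
    return X, L, alpha, mean_y

def gp_predict(model, x):
    """Posterior mean and standard deviation of the GP at point x."""
    X, L, alpha, mean_y = model
    k = [rbf(x, a) for a in X]
    mu = mean_y + sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    var = max(rbf(x, x) - sum(vi * vi for vi in v), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement (maximization) under a Gaussian prediction."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def toy_similarity(x):
    """HYPOTHETICAL stand-in for 'speaker similarity after fine-tuning with x'.
    x = (scaled learning rate, scaled number of fine-tuning steps), both in [0, 1]."""
    lr, steps = x
    return math.exp(-((lr - 0.3) ** 2 + (steps - 0.7) ** 2) / 0.05)

def bayes_opt(objective, n_init=4, n_iter=16, n_candidates=300, seed=0):
    """Maximize objective over [0, 1]^2 with a GP surrogate and EI acquisition."""
    rng = random.Random(seed)
    X = [[rng.random(), rng.random()] for _ in range(n_init)]
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        model = gp_fit(X, y)
        best = max(y)
        # Maximize EI over random candidates (a cheap stand-in for a real optimizer).
        candidates = [[rng.random(), rng.random()] for _ in range(n_candidates)]
        x_next = max(candidates,
                     key=lambda c: expected_improvement(*gp_predict(model, c), best))
        X.append(x_next)
        y.append(objective(x_next))  # in practice: one fine-tuning run + evaluation
    i_best = max(range(len(y)), key=y.__getitem__)
    return X[i_best], y[i_best]

if __name__ == "__main__":
    x_best, y_best = bayes_opt(toy_similarity)
    print("best hyper-parameters:", x_best, "similarity:", round(y_best, 3))
```

Each call to the objective would be a full fine-tuning run in a real setting, which is why a sample-efficient surrogate-driven search is attractive compared with grid or random search. The number of iterations, the kernel length-scale, and the candidate count here are illustrative choices only.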

Authors (5)
  1. Henry B. Moss (17 papers)
  2. Vatsal Aggarwal (5 papers)
  3. Nishant Prateek (3 papers)
  4. Javier González (44 papers)
  5. Roberto Barra-Chicote (24 papers)
Citations (56)