
Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech (2106.12896v2)

Published 24 Jun 2021 in cs.SD, cs.AI, and cs.LG

Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (~10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more, we significantly outperform it. The following improvements are proposed: 1) changing from an autoregressive, attention-based TTS model to a non-autoregressive model replacing attention with an external duration model and 2) an additional Conditional Generative Adversarial Network (cGAN) based fine-tuning step.
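The first proposed improvement swaps attention-based alignment for an external duration model. The core mechanism in such non-autoregressive TTS systems is a length regulator: each phoneme encoding is repeated for its predicted number of acoustic frames, so the decoder input already has the target length and no attention is needed. The sketch below illustrates that idea only; the function and variable names are illustrative, not the paper's actual implementation.

```python
def upsample_by_duration(phoneme_states, durations):
    """Repeat each phoneme encoding durations[i] times so the
    upsampled sequence length matches the number of acoustic
    frames, removing the need for learned attention alignment."""
    frames = []
    for state, dur in zip(phoneme_states, durations):
        frames.extend([state] * dur)  # one copy per predicted frame
    return frames

# Example: 3 phoneme encodings with predicted frame counts 2, 1, 3
states = ["p0", "p1", "p2"]
frames = upsample_by_duration(states, [2, 1, 3])
# frames == ["p0", "p0", "p1", "p2", "p2", "p2"]
```

In a full system the states would be encoder hidden vectors and the durations would come from the external duration predictor; repetition makes the mel-spectrogram decoding fully parallel across frames.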

Authors (8)
  1. Raahil Shah (4 papers)
  2. Kamil Pokora (8 papers)
  3. Abdelhamid Ezzerg (8 papers)
  4. Viacheslav Klimkov (10 papers)
  5. Goeric Huybrechts (15 papers)
  6. Bartosz Putrycz (8 papers)
  7. Daniel Korzekwa (21 papers)
  8. Thomas Merritt (16 papers)
Citations (23)