Enhancing audio quality for expressive Neural Text-to-Speech (2108.06270v1)

Published 13 Aug 2021 in eess.AS and cs.AI

Abstract: Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even for recent TTS architectures, since there seems to be a trade-off between the expressiveness of the generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques close the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.
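
The abstract does not spell out how the reduction in the gap is computed. A common reading, assumed here, is the fraction of the baseline-to-recordings MUSHRA difference that the improved system recovers. The sketch below illustrates that arithmetic only; the function name `gap_closed` and the scores in the usage line are hypothetical and are not taken from the paper.

```python
def gap_closed(baseline: float, proposed: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap recovered by the proposed system.

    Assumed definition (not confirmed by the abstract):
        (proposed - baseline) / (reference - baseline)
    All arguments are mean MUSHRA scores on the usual 0-100 scale.
    """
    gap = reference - baseline
    if gap <= 0:
        # No gap to close if the baseline already matches or beats the reference.
        raise ValueError("reference score must exceed baseline score")
    return (proposed - baseline) / gap


# Illustrative, made-up scores: baseline 60, proposed 70, recordings 85.
example = gap_closed(60.0, 70.0, 85.0)
print(f"{example:.0%} of the gap closed")
```

Under this reading, a value of 0 means no improvement over the baseline and a value of 1 means the system is rated on par with recordings.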

Authors (10)
  1. Abdelhamid Ezzerg (8 papers)
  2. Adam Gabrys (8 papers)
  3. Bartosz Putrycz (8 papers)
  4. Daniel Korzekwa (21 papers)
  5. David McHardy (2 papers)
  6. Kamil Pokora (8 papers)
  7. Jakub Lachowicz (2 papers)
  8. Jaime Lorenzo-Trueba (33 papers)
  9. Viacheslav Klimkov (10 papers)
  10. Daniel Saez-Trigueros (1 paper)
Citations (6)