Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech (2204.02172v2)

Published 5 Apr 2022 in cs.SD and eess.AS

Abstract: To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrogram). However, their generation quality is unsatisfactory as these representations lack speech variances. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms, and during inference, we estimate these embeddings from text using generative adversarial networks (GANs). Using GANs, we quickly and reliably estimate the prosody embeddings, which have complex distributions due to the dynamic nature of speech. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models with fewer parameters and lower computational complexity.

Authors (4)
  1. Hyungchan Yoon (3 papers)
  2. Seyun Um (4 papers)
  3. Changwhan Kim (1 paper)
  4. Hong-Goo Kang (36 papers)
