
End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue (2206.12040v1)

Published 24 Jun 2022 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: Recent text-to-speech (TTS) has achieved quality comparable to that of humans; however, its application to spoken dialogue has not been widely studied. This study aims to realize a TTS system that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages. In the first stage, variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained; this model introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with the TTS model. In the second stage, a style predictor is trained to predict, from the dialogue history, the speaking style to be synthesized. During inference, the speaking style representation predicted by the style predictor is passed to VAE/GMVAE-VITS, so that speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
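
The inference flow described in the abstract (dialogue history, then a predicted style, then style-conditioned synthesis) can be illustrated with a toy sketch. This is not the authors' implementation: `Turn`, `predict_style`, and `synthesize` are hypothetical names, the two-dimensional style vector is an arbitrary choice, and averaging past styles is a placeholder for the trained style predictor. The sketch only shows how the pieces connect.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Utterance-level latent style vector. The dimension (2) is an assumption
# made for illustration, not a value taken from the paper.
Style = List[float]


@dataclass
class Turn:
    """One past utterance in the dialogue, with its latent style.

    In the described system the style would come from the style encoder;
    here it is supplied by hand.
    """
    text: str
    style: Style


def predict_style(history: Sequence[Turn], dim: int = 2) -> Style:
    """Hypothetical stand-in for the stage-2 style predictor.

    A real predictor would be a trained model over the dialogue history;
    this placeholder returns the mean of the past style vectors.
    """
    if not history:
        return [0.0] * dim
    return [sum(t.style[i] for t in history) / len(history) for i in range(dim)]


def synthesize(text: str, style: Style) -> dict:
    """Hypothetical stand-in for VAE/GMVAE-VITS conditioned on a style.

    A real model would return a waveform; this returns the inputs so the
    data flow can be inspected.
    """
    return {"text": text, "style": style}


history = [Turn("Hi!", [1.0, 0.0]), Turn("How are you?", [0.0, 1.0])]
style = predict_style(history)
out = synthesize("I'm fine, thanks.", style)
```

The point of the separation is that the synthesizer never sees the dialogue history directly: context influences the output only through the predicted style vector.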

Authors (6)
  1. Kentaro Mitsui (14 papers)
  2. Tianyu Zhao (73 papers)
  3. Kei Sawada (16 papers)
  4. Yukiya Hono (13 papers)
  5. Yoshihiko Nankaku (11 papers)
  6. Keiichi Tokuda (11 papers)
Citations (14)