
Expressive paragraph text-to-speech synthesis with multi-step variational autoencoder (2308.13365v4)

Published 25 Aug 2023 in cs.SD and eess.AS

Abstract: Neural networks have been able to generate high-quality single-sentence speech. However, audiobook speech synthesis remains a challenge because of the intra-paragraph correlation of semantic and acoustic features as well as variable styles. In this paper, we propose a highly expressive paragraph speech synthesis system with a multi-step variational autoencoder, called EP-MSTTS. EP-MSTTS is the first VITS-based paragraph speech synthesis model and models the variable style of paragraph speech at five levels: frame, phoneme, word, sentence, and paragraph. We also propose a series of improvements to enhance the performance of this hierarchical model. In addition, we directly train EP-MSTTS on speech sliced by paragraph rather than by sentence. Experimental results on the single-speaker French audiobook corpus released at Blizzard Challenge 2023 show that EP-MSTTS obtains better performance than baseline models.

Authors (6)
  1. Xuyuan Li (5 papers)
  2. Zengqiang Shang (7 papers)
  3. Hua Hua (8 papers)
  4. Peiyang Shi (5 papers)
  5. Pengyuan Zhang (57 papers)
  6. Ta Li (6 papers)
