s-Transformer: Segment-Transformer for Robust Neural Speech Synthesis (2011.08480v1)

Published 17 Nov 2020 in eess.AS and cs.SD

Abstract: Neural end-to-end text-to-speech (TTS), which adopts either a recurrent model, e.g. Tacotron, or an attention-based one, e.g. Transformer, to characterize a speech utterance, has achieved significant improvements in speech synthesis. However, it remains challenging to handle varying sentence lengths, particularly long sentences, where the sequence model is limited by its effective context length. We propose a novel segment-Transformer (s-Transformer), which models speech at the segment level, reusing recurrence via cached memories for both the encoder and decoder. Long-range contexts can be captured by the extended memory, while the encoder-decoder attention operates on segments, which are much easier to handle. In addition, we employ a modified relative positional self-attention to generalize to sequence lengths possibly unseen in the training data. Compared with the standard Transformer, the proposed s-Transformer achieves the same MOS of 4.29 on short sentences, close to the 4.32 of the recordings; a similar score of 4.22 vs. 4.20 on long sentences; and a significantly better result on extra-long sentences, with a gain of 0.2 MOS. Since the cached memory is updated over time, the s-Transformer generates natural and coherent speech over long durations.
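The abstract describes segment-level processing with a cached memory that carries context across segments, in the spirit of Transformer-XL-style recurrence. Below is a minimal sketch of that idea in PyTorch; the class name, dimensions, and segment/memory lengths are illustrative assumptions, not the authors' implementation, and the modified relative positional self-attention is omitted for brevity.

```python
# Minimal sketch: self-attention over the current segment plus a cached memory
# of past segments (assumed mechanism, not the paper's released code).
import torch
import torch.nn as nn


class SegmentAttentionLayer(nn.Module):
    """Attend over [cached memory; current segment], then refresh the cache."""

    def __init__(self, d_model=256, n_heads=4, mem_len=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.mem_len = mem_len

    def forward(self, segment, memory=None):
        # segment: (batch, seg_len, d_model); memory: (batch, <=mem_len, d_model) or None
        if memory is None:
            context = segment
        else:
            # Keys/values cover the cached memory plus the current segment;
            # the memory is detached so gradients do not flow into past segments.
            context = torch.cat([memory.detach(), segment], dim=1)
        out, _ = self.attn(query=segment, key=context, value=context)
        out = self.norm(segment + out)
        # Keep only the most recent hidden states as the new cache.
        new_memory = context[:, -self.mem_len:, :].detach()
        return out, new_memory


# Usage: process a long utterance segment by segment, carrying the memory forward,
# so each segment still sees long-range context through the cache.
layer = SegmentAttentionLayer()
x = torch.randn(2, 512, 256)          # (batch, long_sequence, d_model)
memory = None
outputs = []
for seg in x.split(64, dim=1):        # illustrative 64-frame segments
    y, memory = layer(seg, memory)
    outputs.append(y)
y_full = torch.cat(outputs, dim=1)    # same length as the input sequence
```

Because each attention call only spans one segment plus a bounded memory, the per-step cost stays fixed regardless of total utterance length, which is consistent with the robustness to extra-long sentences reported in the abstract.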

Authors (4)
  1. Xi Wang (275 papers)
  2. Huaiping Ming (2 papers)
  3. Lei He (121 papers)
  4. Frank K. Soong (17 papers)
Citations (5)
