Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages (2303.15669v1)

Published 28 Mar 2023 in eess.AS, cs.AI, and cs.LG

Abstract: Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn the proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, achieving outstanding performance compared to competing methods. The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping
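The abstract's core idea, pre-training the model to reconstruct an original (de-warped) mel-spectrogram from a temporally warped one, can be illustrated with a toy warping routine. The sketch below is an assumption for illustration only: it cuts a mel-spectrogram into random-length segments and linearly resamples each segment by a random stretch factor. The paper's actual segmentation and warping scheme is in the linked SpeechDewarping repository, not reproduced here.

```python
# Illustrative sketch of the "speech de-warping" pre-training setup: produce a
# warped mel-spectrogram as the model input, with the original spectrogram as the
# reconstruction target (no transcript needed). The warping scheme below is a
# simple assumption, not the authors' exact implementation.
import numpy as np

def random_time_warp(mel, min_seg=5, max_seg=20, stretch_range=(0.5, 2.0), rng=None):
    """Warp a mel-spectrogram of shape (T, n_mels) along the time axis.

    The spectrogram is cut into random-length segments, and each segment is
    stretched or compressed by a random factor via linear interpolation.
    """
    rng = rng or np.random.default_rng()
    T, n_mels = mel.shape
    warped_segments = []
    start = 0
    while start < T:
        seg_len = int(rng.integers(min_seg, max_seg + 1))
        seg = mel[start:start + seg_len]                  # (L, n_mels), L may be short at the end
        factor = rng.uniform(*stretch_range)
        new_len = max(1, int(round(len(seg) * factor)))
        # Resample along time, independently per mel bin.
        old_idx = np.linspace(0.0, 1.0, num=len(seg))
        new_idx = np.linspace(0.0, 1.0, num=new_len)
        seg_warped = np.stack(
            [np.interp(new_idx, old_idx, seg[:, m]) for m in range(n_mels)], axis=1)
        warped_segments.append(seg_warped)
        start += seg_len
    return np.concatenate(warped_segments, axis=0)

# Pre-training pair: `warped` is the encoder input, `mel` is the reconstruction target.
mel = np.random.rand(200, 80).astype(np.float32)          # stand-in for a real mel-spectrogram
warped = random_time_warp(mel)
print(mel.shape, warped.shape)
```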

Authors (4)
  1. Seongyeon Park (5 papers)
  2. Myungseo Song (4 papers)
  3. Bohyung Kim (2 papers)
  4. Tae-Hyun Oh (75 papers)
Citations (1)
