
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining (2301.12596v3)

Published 30 Jan 2023 in eess.AS and cs.CL

Abstract: While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
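The two-stage recipe in the abstract can be sketched roughly as follows. This is a minimal illustrative PyTorch toy, not the paper's implementation: the module sizes, the GRU encoder, the linear mel head, and the random "data" are all assumptions made only to show the training flow (masked-LM pretraining on multilingual text, then supervised TTS training with the language-aware embedding layer frozen).

```python
import torch
import torch.nn as nn

VOCAB, EMB, LANGS, MASK_ID = 100, 32, 3, 0  # toy sizes, not the paper's

class TextEncoder(nn.Module):
    """Encoder with a language-aware embedding: token + language embeddings."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, EMB)
        self.lang_emb = nn.Embedding(LANGS, EMB)
        self.encoder = nn.GRU(EMB, EMB, batch_first=True)

    def forward(self, tokens, lang_id):
        x = self.tok_emb(tokens) + self.lang_emb(lang_id).unsqueeze(1)
        out, _ = self.encoder(x)
        return out

# Stage 1: masked language model pretraining on multilingual text-only data.
enc = TextEncoder()
mlm_head = nn.Linear(EMB, VOCAB)
opt = torch.optim.Adam(list(enc.parameters()) + list(mlm_head.parameters()), lr=1e-3)

tokens = torch.randint(1, VOCAB, (4, 10))   # stand-in for multilingual text
lang_id = torch.randint(0, LANGS, (4,))
mask = torch.zeros(4, 10, dtype=torch.bool)
mask[:, ::3] = True                          # deterministic toy masking pattern
masked = tokens.clone()
masked[mask] = MASK_ID

logits = mlm_head(enc(masked, lang_id))
mlm_loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
mlm_loss.backward()
opt.step()

# Stage 2: supervised TTS training on paired data. The language-aware
# embedding layer is frozen, so languages seen only in the text-only
# pretraining data remain usable at inference time (zero-shot TTS).
for p in [*enc.tok_emb.parameters(), *enc.lang_emb.parameters()]:
    p.requires_grad = False

decoder = nn.Linear(EMB, 80)                 # toy mel-spectrogram head
tts_opt = torch.optim.Adam(
    [p for p in enc.parameters() if p.requires_grad] + list(decoder.parameters()),
    lr=1e-3)

mel_target = torch.randn(4, 10, 80)          # stand-in for paired audio features
mel_pred = decoder(enc(tokens, lang_id))
tts_loss = nn.functional.mse_loss(mel_pred, mel_target)
tts_loss.backward()
tts_opt.step()
```

The key design point mirrored here is that only the embedding layer is language-specific; since it is pretrained on text from all languages and then frozen, the supervised stage cannot overwrite the representations of languages that lack paired audio.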

Authors (6)
  1. Takaaki Saeki (22 papers)
  2. Soumi Maiti (26 papers)
  3. Xinjian Li (26 papers)
  4. Shinji Watanabe (416 papers)
  5. Shinnosuke Takamichi (70 papers)
  6. Hiroshi Saruwatari (100 papers)
Citations (12)
