Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech (2210.15447v2)

Published 27 Oct 2022 in cs.SD, cs.CL, and eess.AS

Abstract: This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS models typically support tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality paired speech-text data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
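
The abstract describes mixing supervised (paired TTS/ASR) and unsupervised (untranscribed speech, unspoken text) data through different training schemes. The minimal PyTorch sketch below only illustrates that dispatch pattern over batch types; the module names, loss choices, and batch fields (JointSpeechTextModel, training_step, batch["kind"], and so on) are illustrative assumptions, not the actual Virtuoso architecture or objectives described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSpeechTextModel(nn.Module):
    """Toy stand-in for a shared speech-text model; not the Virtuoso architecture."""
    def __init__(self, vocab_size=256, feat_dim=80, hidden=256):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, hidden)     # text input path
        self.speech_encoder = nn.Linear(feat_dim, hidden)        # speech input path
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)  # shared decoder
        self.to_feats = nn.Linear(hidden, feat_dim)

    def forward_text(self, tokens):
        h, _ = self.decoder(self.text_encoder(tokens))
        return self.to_feats(h)

    def forward_speech(self, feats):
        h, _ = self.decoder(self.speech_encoder(feats))
        return self.to_feats(h)

def training_step(model, batch, opt):
    """Dispatch on batch type, as the abstract's mixed training schemes suggest."""
    opt.zero_grad()
    if batch["kind"] == "paired":          # transcribed speech: supervised text-to-feature loss
        pred = model.forward_text(batch["tokens"])
        loss = F.l1_loss(pred, batch["mels"])
    elif batch["kind"] == "speech_only":   # untranscribed speech: feature reconstruction
        pred = model.forward_speech(batch["mels"])
        loss = F.l1_loss(pred, batch["mels"])
    else:                                  # unspoken text: placeholder regularizer, not the paper's loss
        pred = model.forward_text(batch["tokens"])
        loss = pred.pow(2).mean()
    loss.backward()
    opt.step()
    return loss.item()

# Tiny smoke test with synthetic, length-matched data (a simplifying assumption).
model = JointSpeechTextModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
paired = {"kind": "paired",
          "tokens": torch.randint(0, 256, (2, 50)),
          "mels": torch.randn(2, 50, 80)}
print(training_step(model, paired, opt))
```

In practice, each data type would carry its own alignment and objective; the point of the sketch is only that a single shared model consumes all three batch kinds within one training loop.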

Authors (9)
  1. Takaaki Saeki (22 papers)
  2. Heiga Zen (36 papers)
  3. Zhehuai Chen (39 papers)
  4. Nobuyuki Morioka (8 papers)
  5. Gary Wang (19 papers)
  6. Yu Zhang (1400 papers)
  7. Ankur Bapna (53 papers)
  8. Andrew Rosenberg (32 papers)
  9. Bhuvana Ramabhadran (47 papers)
Citations (19)
