Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data (2402.18932v2)

Published 29 Feb 2024 in eess.AS and cs.SD

Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% to ground truth). With just 15 minutes of transcribed, found data, we can reduce the intelligibility difference to 1% or less from the ground-truth, and achieve naturalness scores that match the ground-truth in several languages.

Extending Multilingual Speech Synthesis to Languages Beyond Transcribed Data

Introduction

Text-to-speech (TTS) development typically favors languages with abundant high-quality transcribed audio. This is a serious limitation given the nearly 6,000 languages spoken worldwide, most of which are low-resource precisely because such data is scarce. This paper introduces a framework that expands TTS coverage to over 100 languages by applying unsupervised learning to untranscribed found data. The approach builds on a pretrained self-supervised multilingual speech foundation model for joint speech-text representation learning, and can generate intelligible speech in languages for which no transcribed speech is available.

Related Work

Prior multilingual TTS work has been constrained by the need for high-quality paired speech-text data, limiting how far such systems scale across the world's languages. Approaches that relax the data requirement with unpaired or synthetic training material have typically yielded models with limited language coverage or degraded quality. This paper instead builds on advances in self-supervised speech pretraining and joint speech-text pretraining, using unsupervised learning to push TTS toward much broader language coverage.

Proposed Framework

At the heart of the proposed solution is a joint multilingual speech-text model comprising several components designed to facilitate both supervised and unsupervised learning across languages. The framework employs a pretrained speech-to-feature (S2F) block and feature-to-speech (F2S) components, alongside novel training objectives suited for TTS language expansion. Crucially, this model introduces methods for leveraging found data—comprising speech-text paired data, untranscribed speech data, and unspoken text data—bypassing the need for curated datasets.
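The description above can be made concrete with a structural sketch. The module names, layer choices, and dimensions below are illustrative assumptions rather than the paper's actual code; the sketch only shows how a text path and a speech (S2F) path could share an encoder whose output feeds a duration predictor and a feature-to-speech (F2S) head.

```python
import torch
import torch.nn as nn

class JointSpeechTextModel(nn.Module):
    """Illustrative skeleton: a shared encoder bridging text and speech inputs."""
    def __init__(self, vocab_size=256, feat_dim=128, hidden=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)           # paired / unspoken text
        self.speech_encoder = nn.GRU(feat_dim, hidden,                # S2F path for (un)transcribed speech
                                     batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.duration_predictor = nn.Linear(hidden, 1)                # per-token durations for synthesis
        self.f2s_head = nn.Linear(hidden, feat_dim)                   # F2S: shared features -> acoustic features

    def synthesize(self, token_ids):
        """Text -> shared features -> predicted acoustic features and durations."""
        h = self.shared_encoder(self.text_embed(token_ids))
        return self.f2s_head(h), self.duration_predictor(h).squeeze(-1)

    def encode_speech(self, speech_feats):
        """Untranscribed speech -> the same shared feature space (used by unsupervised losses)."""
        h, _ = self.speech_encoder(speech_feats)
        return self.shared_encoder(h)
```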

Training Objectives

The model is trained using a mixture of transcribed (paired) speech, untranscribed speech, and unspoken text data, enabling it to learn from a diversity of inputs. The training leverages RNN-T decoder alignments, feature loss, and duration prediction to optimize the model's performance across languages. A key innovation lies in the use of pseudo-labeling for untranscribed speech and aligned text MLM for unspoken text, enabling effective learning even in the absence of transcribed speech data.
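A schematic rendering of that mixture might look like the following. The placeholder callables stand in for the paper's actual RNN-T alignment, feature, duration, and aligned text MLM objectives, and the data-type keys and function names are assumptions made only for illustration.

```python
def mixed_objective(model, batch, asr_decode, feature_loss, duration_loss, text_mlm_loss):
    """Assemble one scalar loss from a batch mixing the three data types."""
    total = 0.0
    if batch.get("paired"):                  # transcribed (paired) speech: fully supervised
        text, speech = batch["paired"]["text"], batch["paired"]["speech"]
        total += feature_loss(model, text, speech) + duration_loss(model, text, speech)
    if batch.get("speech_only"):             # untranscribed speech: pseudo-label the text side
        speech = batch["speech_only"]["speech"]
        pseudo_text = asr_decode(model, speech)
        total += feature_loss(model, pseudo_text, speech)
    if batch.get("text_only"):               # unspoken text: aligned text MLM objective
        total += text_mlm_loss(model, batch["text_only"]["text"])
    return total
```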

Curriculum Training Procedures

The model employs a stage-wise training approach, beginning with the pretraining of speech and shared encoders, followed by targeted training of the shared encoder and the RNN-T decoder. The final stage involves joint training that integrates the supervised and unsupervised learning derived from various data types, refining the model's ability to generalize across languages.
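One way to picture the curriculum is as an ordered list of stages, each specifying which sub-modules are trainable and which data types are consumed. The stage names, module names, and data mixes below are illustrative assumptions, not the paper's exact schedule.

```python
# Hypothetical curriculum: each stage unfreezes certain components and selects data types.
CURRICULUM = [
    {"stage": "encoder_pretraining", "trainable": ["speech_encoder", "shared_encoder"],
     "data": ["speech_only", "text_only"]},
    {"stage": "alignment_training",  "trainable": ["shared_encoder", "rnnt_decoder"],
     "data": ["paired"]},
    {"stage": "joint_finetuning",    "trainable": ["all"],
     "data": ["paired", "speech_only", "text_only"]},
]

def run_curriculum(model, curriculum, train_one_stage):
    for cfg in curriculum:
        # Freeze everything except the listed sub-modules, then train on the listed data types.
        train_one_stage(model, trainable=cfg["trainable"], data_types=cfg["data"])
```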

Experimental Setting

Experiments train the model on diverse datasets spanning 100+ languages, drawing on both public corpora and proprietary found data. The evaluations cover languages with and without transcribed training data, probing the model's robustness and versatility across resource conditions.

Results

The model generates intelligible speech in more than 30 languages for which it saw no transcribed data, with a character error rate (CER) within 10% of ground-truth recordings. Adding as little as 15 minutes of transcribed found data per language narrows the intelligibility gap to 1% or less and yields naturalness scores matching ground truth in several languages, substantially reducing the divide between high-resource and low-resource languages in TTS.
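For reference, the intelligibility figures above are character error rates of an ASR transcript of the synthesized audio, compared against the CER obtained on ground-truth recordings. The snippet below is a minimal illustration of that metric, not the paper's evaluation pipeline.

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate via Levenshtein edit distance, normalized by reference length."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)

# Example: a synthesized utterance whose CER gap to ground truth is under 10%.
gap = cer("hello world", "hallo world") - cer("hello world", "hello world")
print(f"CER difference: {gap:.1%}")   # ~9.1%
```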

Conclusion

This paper presents an approach to multilingual TTS that substantially increases language coverage without relying on extensively curated datasets. By combining unsupervised learning with a joint speech-text model, the framework enables high-quality speech synthesis across a wide range of languages, including many with little or no transcribed speech. The result is a practical path toward more inclusive speech technology, and a foundation that future work can extend with further optimizations and applications.

Authors (11)
  1. Takaaki Saeki
  2. Gary Wang
  3. Nobuyuki Morioka
  4. Isaac Elias
  5. Kyle Kastner
  6. Andrew Rosenberg
  7. Bhuvana Ramabhadran
  8. Heiga Zen
  9. Françoise Beaufays
  10. Hadar Shemtov
  11. Fadi Biadsy