Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data (2402.18932v2)
Abstract: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training on untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of <10% from the ground truth). With just 15 minutes of transcribed found data, we can reduce the intelligibility difference to 1% or less from the ground truth, and achieve naturalness scores that match the ground truth in several languages.
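The recipe summarized in the abstract, a shared speech-text encoder trained on a mix of paired data, untranscribed speech, and unspoken text, can be illustrated with a minimal sketch. The toy GRU encoders, the reconstruction and token-prediction heads, and the specific loss terms below are illustrative assumptions chosen for brevity; they are not the paper's Conformer-based architecture or its actual training objectives.

```python
# Minimal conceptual sketch (not the paper's implementation) of joint
# speech-text encoder training over three data streams:
#   (1) paired speech + text, (2) untranscribed speech, (3) unspoken text.
# All modules, dimensions, and losses are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Toy stand-in for a speech encoder (the paper uses a Conformer stack)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, feats):                 # feats: (B, T, feat_dim)
        out, _ = self.rnn(self.proj(feats))
        return out                            # (B, T, hidden)

class TextEncoder(nn.Module):
    """Toy stand-in for a grapheme/phoneme text encoder."""
    def __init__(self, vocab=256, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens):                # tokens: (B, L)
        out, _ = self.rnn(self.emb(tokens))
        return out                            # (B, L, hidden)

def paired_loss(speech_repr, text_repr):
    # Pull utterance-level speech and text embeddings together:
    # a crude proxy for a modality-matching objective on paired data.
    return F.mse_loss(speech_repr.mean(dim=1), text_repr.mean(dim=1))

def speech_only_loss(speech_repr, feats, decoder):
    # Reconstruct input features from the representation: a stand-in for
    # self-supervised speech objectives on untranscribed audio.
    return F.l1_loss(decoder(speech_repr), feats)

def text_only_loss(text_repr, tokens, lm_head):
    # Predict the input tokens back from the representation: a stand-in
    # for a BERT/MLM-style objective on unspoken text.
    logits = lm_head(text_repr)               # (B, L, vocab)
    return F.cross_entropy(logits.transpose(1, 2), tokens)

if __name__ == "__main__":
    speech_enc, text_enc = SpeechEncoder(), TextEncoder()
    feat_decoder = nn.Linear(256, 80)         # hypothetical reconstruction head
    lm_head = nn.Linear(256, 256)             # hypothetical token-prediction head
    params = (list(speech_enc.parameters()) + list(text_enc.parameters())
              + list(feat_decoder.parameters()) + list(lm_head.parameters()))
    opt = torch.optim.Adam(params, lr=1e-4)

    # Dummy batches for the three data streams.
    paired_feats = torch.randn(2, 50, 80)
    paired_tokens = torch.randint(0, 256, (2, 12))
    untranscribed_feats = torch.randn(2, 50, 80)
    unspoken_tokens = torch.randint(0, 256, (2, 12))

    loss = (paired_loss(speech_enc(paired_feats), text_enc(paired_tokens))
            + speech_only_loss(speech_enc(untranscribed_feats),
                               untranscribed_feats, feat_decoder)
            + text_only_loss(text_enc(unspoken_tokens), unspoken_tokens, lm_head))
    loss.backward()
    opt.step()
    print(f"combined loss: {loss.item():.3f}")
```

The only point of the sketch is structural: paired, speech-only, and text-only batches each contribute a separate term to one combined loss over shared encoders, which is what lets untranscribed speech and unspoken text improve the joint representation without transcriptions.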
Authors: Takaaki Saeki, Gary Wang, Nobuyuki Morioka, Isaac Elias, Kyle Kastner, Andrew Rosenberg, Bhuvana Ramabhadran, Heiga Zen, Françoise Beaufays, Hadar Shemtov, Fadi Biadsy