ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations (2312.14398v2)
Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. However, multilingual TTS systems remain limited to resource-rich languages due to the scarcity of large paired corpora of text and studio-quality audio. TTS systems are also typically built from a single speaker's voice, but there is growing interest in synthesizing voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework that uses quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our approach combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. The proposed model generalizes zero-shot not only to unseen speakers but also to unseen languages. We conducted comprehensive subjective and objective evaluations across a series of experiments. Our model proves effective in terms of speech naturalness and speaker similarity for both seen and unseen speakers in six high-resource languages. We also tested our method on two hypothetically low-resource languages; the results are promising, indicating that the proposed approach can synthesize intelligible audio with a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
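To make the core idea of "quantized latent speech representations from a pre-trained self-supervised model" concrete, below is a minimal sketch, not the authors' exact pipeline: it extracts frame-level features from a multilingual wav2vec 2.0 (XLSR) encoder via Hugging Face `transformers` and discretizes them with k-means. The model name, layer index, codebook size, and file names are illustrative assumptions.

```python
# Sketch: turn raw audio into discrete self-supervised speech units.
# Assumptions (not from the paper): XLS-R 300M encoder, layer 12, 512-way codebook.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-xls-r-300m"  # assumed multilingual SSL encoder
LAYER = 12                                   # assumed intermediate layer
N_UNITS = 512                                # assumed codebook size

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def ssl_features(wav_path: str) -> torch.Tensor:
    """Return frame-level hidden states (frames, dim) from one SSL layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs, output_hidden_states=True).hidden_states[LAYER]
    return hidden.squeeze(0)

# Fit a codebook on features pooled over a small, illustrative set of files,
# then map each frame of an utterance to its nearest centroid (unit ID).
feats = torch.cat([ssl_features(p) for p in ["a.wav", "b.wav"]]).numpy()
kmeans = KMeans(n_clusters=N_UNITS, n_init=4, random_state=0).fit(feats)
units = kmeans.predict(ssl_features("a.wav").numpy())
print(units[:20])  # discrete unit sequence used as an intermediate TTS target
```

In a ZMM-TTS-style system, a text encoder would predict such unit sequences, and a separate decoder/vocoder would map them, conditioned on a speaker embedding, back to a waveform; the sketch only covers unit extraction.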
Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi