
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations (2312.14398v2)

Published 22 Dec 2023 in cs.SD and eess.AS

Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voice, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our approach combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. The proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the effectiveness of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
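
The core idea the abstract describes is conditioning synthesis on discrete units obtained by quantizing features from a large pretrained self-supervised speech model. The sketch below illustrates that general technique only, not the authors' exact pipeline: it assumes the Hugging Face `transformers` XLS-R checkpoint `facebook/wav2vec2-xls-r-300m`, an arbitrarily chosen intermediate layer, a hypothetical k-means codebook size of 512, and placeholder audio file paths.

```python
# Minimal sketch (assumption-laden): turn self-supervised speech features into
# discrete unit IDs that a downstream TTS acoustic model could predict from text.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-xls-r-300m"   # illustrative SSL model choice
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
ssl_model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

def extract_features(wav_path: str, layer: int = 9) -> torch.Tensor:
    """Return frame-level hidden states from one intermediate SSL layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = ssl_model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)  # shape: (frames, feature_dim)

# Fit a codebook on features pooled from a few (placeholder) training files,
# then map any utterance, e.g. from an unseen speaker, to discrete unit IDs.
train_feats = torch.cat([extract_features(p) for p in ["train_a.wav", "train_b.wav"]])
kmeans = KMeans(n_clusters=512, n_init=10).fit(train_feats.numpy())
unit_ids = kmeans.predict(extract_features("unseen_speaker.wav").numpy())
print(unit_ids[:20])  # sequence of discrete speech units
```

Because the units come from a model pretrained on many languages, the same codebook can in principle represent speech in languages never seen during TTS training, which is what enables the zero-shot language setting evaluated in the paper.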

Authors (8)
  1. Cheng Gong (51 papers)
  2. Xin Wang (1306 papers)
  3. Erica Cooper (45 papers)
  4. Dan Wells (3 papers)
  5. Longbiao Wang (46 papers)
  6. Jianwu Dang (41 papers)
  7. Korin Richmond (23 papers)
  8. Junichi Yamagishi (178 papers)
Citations (17)