
Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization (2402.01692v1)

Published 23 Jan 2024 in cs.CL and cs.LG

Abstract: This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems, with a focus on achieving language adaptation using minimal labeled and unlabeled data. While many works focus on reducing the usage of labeled data, very few consider minimizing the usage of unlabeled data. By utilizing self-supervised features in the pretraining stage, replacing the noisy portion of pseudo labels with these features during fine-tuning, and incorporating an embedding initialization trick, our method leverages more information from unlabeled data compared to conventional approaches. Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data. Our methodology continues to surpass conventional techniques, even when a greater volume of data is accessible. These findings highlight the potential of our data-efficient language adaptation framework.
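The core fine-tuning idea in the abstract — replacing the noisy portion of pseudo labels with self-supervised features — can be sketched as a frame-level mixing operation. The sketch below is an illustrative assumption, not the paper's exact implementation: the function name, the per-frame confidence scores, and the fixed threshold are all hypothetical stand-ins for however the paper identifies the "noisy portion" of the pseudo labels.

```python
import numpy as np

def mix_representations(ssl_feats, pseudo_embs, confidences, threshold=0.5):
    """Frame-level representation mixing (illustrative sketch).

    ssl_feats   : (T, D) self-supervised features (e.g. from a wav2vec 2.0
                  or HuBERT-style encoder) for T frames.
    pseudo_embs : (T, D) embeddings of the pseudo-label phonemes.
    confidences : (T,) per-frame confidence in the pseudo label.

    Keeps the pseudo-label embedding where the pseudo label is trusted,
    and falls back to the self-supervised feature where it is not.
    """
    trusted = confidences >= threshold            # (T,) boolean mask
    return np.where(trusted[:, None], pseudo_embs, ssl_feats)

# Toy usage: frames 1 and 3 are low-confidence, so their pseudo-label
# embeddings are swapped out for the self-supervised features.
ssl = np.full((4, 3), 2.0)
pseudo = np.full((4, 3), 1.0)
conf = np.array([0.9, 0.1, 0.8, 0.2])
mixed = mix_representations(ssl, pseudo, conf)
```

Under this reading, the TTS model consumes a sequence that is partly symbolic (trusted pseudo labels) and partly continuous (self-supervised features), which is what lets the method extract more signal from unlabeled audio than pseudo-labeling alone.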
