
CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis (2302.14370v2)

Published 28 Feb 2023 in cs.SD, cs.AI, eess.AS, and eess.SP

Abstract: While recent text-to-speech (TTS) systems have made remarkable strides toward human-level quality, the performance of cross-lingual TTS lags behind that of intra-lingual TTS. This gap is mainly rooted in the speaker-language entanglement problem in cross-lingual TTS. In this paper, we propose CrossSpeech, which improves the quality of cross-lingual speech by effectively disentangling speaker and language information at the level of the acoustic feature space. Specifically, CrossSpeech decomposes the speech generation pipeline into a speaker-independent generator (SIG) and a speaker-dependent generator (SDG). The SIG produces a speaker-independent acoustic representation that is not biased toward specific speaker distributions. The SDG, in turn, models the speaker-dependent speech variation that characterizes speaker attributes. By handling each type of information separately, CrossSpeech obtains disentangled speaker and language representations. Our experiments verify that CrossSpeech achieves significant improvements in cross-lingual TTS, especially in terms of speaker similarity to the target speaker.
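To make the decomposition described in the abstract concrete, the sketch below shows the SIG/SDG data flow as two separate functions whose outputs are combined into one acoustic representation. This is a minimal illustration, not the paper's implementation: the real generators are neural networks, and the dimensions, the linear maps, and the additive combination are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
D_TEXT, D_ACOUSTIC, D_SPK = 16, 8, 4

# Stand-ins for trained networks: random linear maps.
W_sig = rng.standard_normal((D_TEXT, D_ACOUSTIC))  # speaker-independent generator (SIG)
W_sdg = rng.standard_normal((D_SPK, D_ACOUSTIC))   # speaker-dependent generator (SDG)

def speaker_independent_generator(text_feat):
    """SIG: map linguistic features to a speaker-independent acoustic representation,
    with no speaker information in its input."""
    return text_feat @ W_sig

def speaker_dependent_generator(spk_embed):
    """SDG: model speaker-dependent variation from a speaker embedding alone."""
    return spk_embed @ W_sdg

# Hypothetical inputs: linguistic features and a speaker embedding.
text_feat = rng.standard_normal(D_TEXT)
spk_embed = rng.standard_normal(D_SPK)

# The two representations are produced independently and then combined,
# so speaker and language information never mix inside either generator.
acoustic = speaker_independent_generator(text_feat) + speaker_dependent_generator(spk_embed)
print(acoustic.shape)
```

The point of the structure is that the SIG never sees the speaker embedding and the SDG never sees the text, which is what allows the two factors to stay disentangled.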

Authors (5)
  1. Ji-Hoon Kim (65 papers)
  2. Hong-Sun Yang (3 papers)
  3. Yoon-Cheol Ju (3 papers)
  4. Il-Hwan Kim (3 papers)
  5. Byeong-Yeol Kim (11 papers)
Citations (8)