Zero-Shot Emotion Transfer For Cross-Lingual Speech Synthesis (2310.03963v1)
Abstract: Zero-shot emotion transfer in cross-lingual speech synthesis aims to transfer the emotion of an arbitrary speech reference in the source language to synthetic speech in the target language. Building such a system faces two challenges: unnatural foreign accents, and the difficulty of modeling emotional expressions shared across languages. Building on the DelightfulTTS neural architecture, this paper addresses these challenges with specifically designed modules that model language-specific prosody features and language-shared emotional expressions separately. Specifically, the language-specific speech prosody is learned by a non-autoregressive predictive coding (NPC) module to improve the naturalness of the synthetic cross-lingual speech, while the emotional expression shared across languages is extracted from HuBERT, a pre-trained self-supervised model with strong generalization capability. We further use hierarchical emotion modeling to capture more comprehensive emotion variation across languages. Experimental results demonstrate the proposed framework's effectiveness in synthesizing bilingual emotional speech for a monolingual target speaker without emotional training data.
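The hierarchical emotion modeling mentioned above can be sketched in miniature. The toy below is an illustrative NumPy sketch, not the paper's implementation: the function name, the mean-pooling scheme, and the fixed segment length are all assumptions. In the actual framework the emotion representation comes from a pre-trained HuBERT model and the multi-scale combination is learned, but the idea of fusing an utterance-level (global) embedding with segment-level (local) embeddings is the same.

```python
import numpy as np

def hierarchical_emotion_embedding(ssl_feats, segment_len=50):
    """Toy hierarchical emotion conditioning from frame-level SSL features.

    ssl_feats: (T, D) array of frame features, e.g. from a HuBERT layer.
    Returns a (T, D) frame-aligned conditioning signal that sums an
    utterance-level (global) embedding and segment-level (local) embeddings.
    """
    T, D = ssl_feats.shape
    # Global scale: mean-pool the whole utterance into one (D,) vector.
    global_emb = ssl_feats.mean(axis=0)
    # Local scale: mean-pool fixed-length segments, repeated back to frames.
    n_seg = int(np.ceil(T / segment_len))
    local = np.zeros((T, D))
    for i in range(n_seg):
        s, e = i * segment_len, min((i + 1) * segment_len, T)
        local[s:e] = ssl_feats[s:e].mean(axis=0)
    # Broadcast the global embedding over frames and fuse both scales.
    return local + global_emb

feats = np.random.randn(120, 8)  # toy stand-in for HuBERT frame features
cond = hierarchical_emotion_embedding(feats)
print(cond.shape)  # (120, 8)
```

In a real system the fusion would typically be learned (e.g. attention or projection layers) rather than a plain sum, and the segments would align with linguistic units rather than fixed-length chunks.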
- “DelightfulTTS: The Microsoft speech synthesis system for Blizzard Challenge 2021,” CoRR, vol. abs/2110.12612, 2021.
- “Non-autoregressive predictive coding for learning speech representations from local dependencies,” in Proc. Interspeech, 2021.
- Paul Taylor, Text-to-Speech Synthesis, Cambridge University Press, 2009.
- “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017, pp. 4006–4010.
- “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proc. ICLR, 2021.
- “AdaSpeech: Adaptive text to speech for custom voice,” in Proc. ICLR, 2021.
- “GenerSpeech: Towards style transfer for generalizable out-of-domain text-to-speech synthesis,” CoRR, vol. abs/2205.07211, 2022.
- “Cross-lingual speaker adaptation for HMM-based speech synthesis,” in Proc. ISCSLP, 2008, pp. 1–4.
- “Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding,” in Proc. Interspeech, 2019, pp. 2105–2109.
- “Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings,” in Proc. ICASSP, 2020, pp. 6184–6188.
- “Accented text-to-speech synthesis with limited data,” arXiv preprint arXiv:2305.04816, 2023.
- James E. Flege, “Second language speech learning: Theory, findings, and problems,” in Speech Perception and Linguistic Experience: Issues in Cross-Language Research, vol. 92, pp. 233–277, 1995.
- Catherine T. Best et al., “The emergence of native-language phonological influences in infants: A perceptual assimilation model,” in The Development of Speech Perception: The Transition from Speech Sounds to Spoken Words, vol. 167, no. 224, pp. 233–277, 1994.
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451–3460, 2021.
- “Layer-wise analysis of a self-supervised speech representation model,” in Proc. ASRU, 2021, pp. 914–921.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
- “An exploration of self-supervised pretrained representations for end-to-end speech recognition,” in Proc. ASRU, 2021, pp. 228–235.
- “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” in Proc. Interspeech, 2019, pp. 2080–2084.
- “One model, many languages: Meta-learning for multilingual text-to-speech,” in Proc. Interspeech, 2020, pp. 2972–2976.
- “Disentangled speaker and language representations using mutual information minimization and domain adaptation for cross-lingual TTS,” in Proc. ICASSP, 2021, pp. 6608–6612.
- “CrossSpeech: Speaker-independent acoustic representation for cross-lingual speech synthesis,” in Proc. ICASSP, 2023, pp. 1–5.
- “Phonological features for 0-shot multilingual speech synthesis,” in Proc. Interspeech, 2020, pp. 2942–2946.
- “Learning latent representations for style control and transfer in end-to-end speech synthesis,” in Proc. ICASSP, 2019, pp. 6945–6949.
- “Learning hierarchical representations for expressive speaking style in end-to-end speech synthesis,” in Proc. ASRU, 2019, pp. 184–191.
- “iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre,” IEEE/ACM Trans. Audio, Speech, Lang. Process., pp. 1693–1705, 2023.
- “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in Proc. ICML, 2018, pp. 5180–5189.
- “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.
- “MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 853–864, 2022.
- “Joint multi-scale cross-lingual speaking style transfer with bidirectional attention mechanism for automatic dubbing,” CoRR, vol. abs/2305.05203, 2023.
- “Towards multi-scale style control for expressive speech synthesis,” in Proc. Interspeech, 2021, pp. 4673–4677.
- “Speaker conditional WaveRNN: Towards universal neural vocoder for unseen speaker and recording conditions,” in Proc. Interspeech, 2020, pp. 235–239.
- “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
- “Emotion recognition from speech using wav2vec 2.0 embeddings,” in Proc. Interspeech, 2021, pp. 3400–3404.
- “On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition,” in Proc. IEEE SLT, 2021, pp. 373–380.
- “Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation,” in Proc. ICML, 2021, pp. 7748–7759.
- “PyThaiNLP: Thai natural language processing in Python,” June 2016.
- “Cross-speaker emotion transfer based on speaker condition layer normalization and semi-supervised training in text-to-speech,” CoRR, vol. abs/2110.04153, 2021.
- “Incorporating cross-speaker style transfer for multi-language text-to-speech,” in Proc. Interspeech, 2021, pp. 1619–1623.
- “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. Interspeech, 2020.
- “Scaling speech technology to 1,000+ languages,” CoRR, vol. abs/2305.13516, 2023.
- “Robust speech recognition via large-scale weak supervision,” 2022.
Authors: Yuke Li, Xinfa Zhu, Yi Lei, Hai Li, Junhui Liu, Danming Xie, Lei Xie