ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech (2211.03545v2)
Abstract: Speech representation learning has improved both speech understanding and speech synthesis tasks for a single language. However, its ability in cross-lingual scenarios has not been explored. In this paper, we extend the pretraining method to cross-lingual multi-speaker speech synthesis tasks, including cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing. We propose a speech-text joint pretraining framework in which, given a speech example and its transcription, we randomly mask spans of the spectrogram and the phoneme sequence. By learning to reconstruct the masked parts of the input in different languages, our model shows clear improvements over speaker-embedding-based multi-speaker TTS methods. Moreover, our framework is end-to-end for both training and inference, requiring no fine-tuning. Experiments on cross-lingual multi-speaker voice cloning and cross-lingual multi-speaker speech editing confirm that our model outperforms speaker-embedding-based multi-speaker TTS methods.
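To make the masked-reconstruction setup concrete, below is a minimal Python sketch of the data preparation the abstract describes: randomly masking spans of spectrogram frames and phoneme tokens from a paired example so a model can be trained to reconstruct them. The span lengths, mask ratios, and mask values here are illustrative assumptions, not hyperparameters reported in the paper.

```python
import numpy as np

def mask_spans(length, mask_ratio, span_len, rng):
    """Mark random contiguous spans until ~mask_ratio of positions are covered."""
    mask = np.zeros(length, dtype=bool)
    target = int(mask_ratio * length)
    while mask.sum() < target:
        start = rng.integers(0, max(1, length - span_len))
        mask[start:start + span_len] = True
    return mask

def prepare_masked_pair(spectrogram, phonemes, rng,
                        spec_mask_ratio=0.8,   # assumed ratio, for illustration
                        phone_mask_ratio=0.15, # assumed ratio, for illustration
                        spec_span=10, phone_span=3,
                        mask_value=0.0, mask_token=-1):
    """Mask spans of both modalities of one (spectrogram, phoneme) pair.

    Returns the masked inputs plus the boolean masks, which serve as the
    reconstruction targets' positions during pretraining.
    """
    # Mask contiguous spans of spectrogram frames.
    spec_mask = mask_spans(spectrogram.shape[0], spec_mask_ratio, spec_span, rng)
    masked_spec = spectrogram.copy()
    masked_spec[spec_mask] = mask_value

    # Mask contiguous spans of phoneme tokens.
    phone_mask = mask_spans(len(phonemes), phone_mask_ratio, phone_span, rng)
    masked_phones = phonemes.copy()
    masked_phones[phone_mask] = mask_token

    return masked_spec, spec_mask, masked_phones, phone_mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.standard_normal((200, 80))   # 200 frames x 80 mel bins (toy data)
    phones = rng.integers(0, 70, size=40)   # 40 phoneme ids (toy data)
    ms, sm, mp, pm = prepare_masked_pair(spec, phones, rng)
    print(f"spectrogram masked: {sm.mean():.2f}, phonemes masked: {pm.mean():.2f}")
```

A pretraining loss would then be computed only at the masked positions (e.g., an L1 or L2 reconstruction loss on masked frames and a cross-entropy loss on masked phonemes); the specific objectives and architecture are detailed in the paper itself.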
- Xiaoran Fan
- Chao Pang
- Tian Yuan
- He Bai
- Renjie Zheng
- Pengfei Zhu
- Shuohuan Wang
- Junkun Chen
- Zeyu Chen
- Liang Huang
- Yu Sun
- Hua Wu