USAT: A Universal Speaker-Adaptive Text-to-Speech Approach (2404.18094v1)
Abstract: Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot and few-shot speaker-adaptive TTS approaches have been explored, both have notable limitations. Zero-shot approaches often generalize poorly and fail to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varied accents, but they incur a significant storage burden and risk overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation, but not both, constraining their utility across real-world scenarios with differing demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting the large population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation, respectively, based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce the storage overhead of few-shot speaker adaptation, we designed two adapters and a dedicated adaptation procedure.
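The abstract's few-shot ("fine-grained") path relies on adapters so that only a small set of speaker-specific weights is trained and stored while the pretrained backbone stays frozen. The sketch below is a generic residual bottleneck adapter in PyTorch, offered purely as an illustration of that general idea; the module name, dimensions, and the `mark_adapter_only_trainable` helper are hypothetical and do not reproduce USAT's actual adapter design or adaptation procedure.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter (illustrative only; not the
    exact adapter architecture proposed in the USAT paper)."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()
        # Zero-initialize the up-projection so the adapter starts as an
        # identity mapping and leaves the pretrained model unchanged
        # before adaptation begins.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: backbone features plus a small learned correction.
        return x + self.up(self.act(self.down(x)))


def mark_adapter_only_trainable(model: nn.Module) -> None:
    """Freeze the pretrained backbone and train only adapter parameters,
    so per-speaker storage is limited to the small adapter weights."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

Because only the adapter parameters receive gradients, per-speaker storage is limited to the adapter weights and the frozen backbone cannot drift, which is the usual rationale for adapter-based fine-tuning as a guard against catastrophic forgetting.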
Authors: Wenbin Wang, Yang Song, Sanjay Jha