USAT: A Universal Speaker-Adaptive Text-to-Speech Approach (2404.18094v1)

Published 28 Apr 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. While zero-shot and few-shot speaker-adaptive TTS approaches have been explored, both have notable limitations. Zero-shot approaches tend to suffer from insufficient generalization and struggle to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they incur a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation, but not both, constraining their utility across varied real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, inadvertently neglecting the vast population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce the storage footprint of few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.
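
The "fine-grained" (few-shot) branch described in the abstract hinges on training and storing only small, speaker-specific adapters while the pre-trained model stays frozen, which is the usual way adapter methods curb storage cost and catastrophic forgetting. The following is a minimal PyTorch-style sketch of that general idea; the BottleneckAdapter module, its dimensions, and prepare_for_few_shot_adaptation are illustrative assumptions, not the paper's actual adapter design or adaptation procedure.

```python
# Minimal sketch of adapter-based few-shot speaker adaptation.
# Names and dimensions are hypothetical, not taken from the USAT paper.
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted into a frozen decoder layer.

    Only these weights are trained and stored per speaker, so the base
    model is untouched (no catastrophic forgetting) and the per-speaker
    storage cost is limited to the adapter parameters.
    """

    def __init__(self, hidden_dim: int = 256, bottleneck_dim: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


def prepare_for_few_shot_adaptation(model: nn.Module) -> list:
    """Freeze the pre-trained TTS model and collect only adapter parameters."""
    for p in model.parameters():
        p.requires_grad = False
    trainable = []
    for module in model.modules():
        if isinstance(module, BottleneckAdapter):
            for p in module.parameters():
                p.requires_grad = True
                trainable.append(p)
    return trainable
```

In this sketch, the optimizer for the few-shot stage would be built only over the parameters returned by prepare_for_few_shot_adaptation, and only those parameters would be saved per target speaker.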

Authors (3)
  1. Wenbin Wang (44 papers)
  2. Yang Song (298 papers)
  3. Sanjay Jha (39 papers)