
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning (2310.04004v1)

Published 6 Oct 2023 in cs.SD and eess.AS

Abstract: Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system training, given only a single speech reference of that speaker. Although more practical for real applications, current zero-shot methods still produce speech with unsatisfactory naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered, because the unique challenge of zero-shot speaker and style cloning is to learn disentangled speaker and style representations from only short references of an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone and cascades a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Leveraging signal perturbation, U-Style is explicitly decomposed into speaker-specific and style-specific modeling parts, achieving better speaker and style disentanglement. To improve the modeling of unseen speakers and styles, these two encoders perform multi-level speaker and style modeling with skip-connected U-nets, combining representation extraction and information reconstruction. In addition, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style-adaptive layer normalization in these encoders for representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses state-of-the-art methods in unseen speaker cloning in terms of naturalness and speaker similarity. Notably, U-Style can transfer the style of an unseen source speaker to another unseen target speaker, enabling flexible combinations of the desired speaker timbre and style in zero-shot voice cloning.
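The abstract describes the architecture only at a high level; the sketch below illustrates one way such a cascaded speaker/style encoder pipeline could be wired up in PyTorch. The module names, channel sizes, 1-D U-net layout, and the exact forms of mean-based instance normalization and style-adaptive layer normalization (SALN) are assumptions made for illustration, not the authors' implementation; the Grad-TTS text encoder and diffusion decoder are represented only by dummy tensors.

```python
# Illustrative sketch of a cascaded speaker/style encoder (not the authors' code).
# Module names, dimensions, and the 1-D U-net layout are assumptions.
import torch
import torch.nn as nn


def mean_instance_norm(x):
    """Mean-based instance normalization: subtract the per-channel mean over time."""
    return x - x.mean(dim=-1, keepdim=True)


class SALN(nn.Module):
    """Style-adaptive layer normalization: scale/shift predicted from a global vector."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)
        self.affine = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):              # x: (B, T, C), cond: (B, cond_dim)
        gamma, beta = self.affine(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * self.norm(x) + beta.unsqueeze(1)


class UNetEncoder(nn.Module):
    """Skip-connected 1-D U-net: downsample to extract a global vector,
    then reconstruct the sequence while adapting it to that vector via SALN."""
    def __init__(self, channels=192):
        super().__init__()
        self.down1 = nn.Conv1d(channels, channels, 4, stride=2, padding=1)
        self.down2 = nn.Conv1d(channels, channels, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose1d(channels, channels, 4, stride=2, padding=1)
        self.up1 = nn.ConvTranspose1d(channels, channels, 4, stride=2, padding=1)
        self.saln = SALN(channels, channels)

    def forward(self, x):                     # x: (B, C, T)
        h1 = torch.relu(self.down1(mean_instance_norm(x)))   # representation extraction
        h2 = torch.relu(self.down2(h1))
        embed = h2.mean(dim=-1)               # global speaker- or style-level vector
        u2 = torch.relu(self.up2(h2)) + h1    # skip connection
        u1 = torch.relu(self.up1(u2)) + x     # skip connection back to input resolution
        out = self.saln(u1.transpose(1, 2), embed).transpose(1, 2)  # condition adaptation
        return out, embed


class CascadedEncoders(nn.Module):
    """Speaker-specific encoder followed by a style-specific encoder, sitting
    between the text encoder and the diffusion decoder."""
    def __init__(self, channels=192):
        super().__init__()
        self.speaker_enc = UNetEncoder(channels)
        self.style_enc = UNetEncoder(channels)

    def forward(self, text_hidden):           # (B, C, T) from the text encoder
        h, spk_embed = self.speaker_enc(text_hidden)
        h, style_embed = self.style_enc(h)
        return h, spk_embed, style_embed      # h would condition the diffusion decoder


x = torch.randn(2, 192, 120)                  # dummy text-encoder output
hidden, spk, sty = CascadedEncoders()(x)
print(hidden.shape, spk.shape, sty.shape)
```

The point mirrored from the abstract is the ordering: speaker-specific modeling first, then style-specific modeling, each pairing representation extraction (downsampling to a global vector) with information reconstruction (upsampling with skip connections) before the result conditions the diffusion decoder.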

Authors (7)
  1. Tao Li (440 papers)
  2. Zhichao Wang (83 papers)
  3. Xinfa Zhu (29 papers)
  4. Jian Cong (16 papers)
  5. Qiao Tian (27 papers)
  6. Yuping Wang (56 papers)
  7. Lei Xie (337 papers)
Citations (1)