
Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion (2305.07204v3)

Published 12 May 2023 in eess.AS and cs.SD

Abstract: Zero-shot voice conversion (VC) converts source speech into the voice of any desired speaker using only one utterance of that speaker, without requiring additional model updates. Typical methods use a speaker representation from a pre-trained speaker verification (SV) model or learn speaker representations during VC training to achieve zero-shot VC. However, existing speaker modeling methods overlook the variation in speaker information richness across the temporal and frequency-channel dimensions of speech. This insufficient speaker modeling hampers the VC model's ability to accurately represent unseen speakers not in the training dataset. In this study, we present a robust zero-shot VC model with multi-level temporal-channel retrieval, referred to as MTCR-VC. Specifically, to flexibly adapt to dynamically varying speaker characteristics along the temporal and channel axes of speech, we propose a novel fine-grained speaker modeling method, called temporal-channel retrieval (TCR), to find out when and where speaker information appears in speech. It retrieves variable-length speaker representations from both the temporal and channel dimensions under the guidance of a pre-trained SV model. In addition, inspired by the hierarchical process of human speech production, the MTCR speaker module stacks several TCR blocks to extract speaker representations at multiple levels of granularity. Furthermore, to achieve better speech disentanglement and reconstruction, we introduce a cycle-based training strategy that recurrently simulates zero-shot inference. We adopt perceptual constraints on three aspects, including content, style, and speaker, to drive this process. Experiments demonstrate that MTCR-VC is superior to previous zero-shot VC methods in modeling speaker timbre while maintaining good speech naturalness.
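To make the retrieval idea concrete, the following is a minimal, hypothetical PyTorch sketch of a TCR-style block: learnable queries attend over frames to capture "when" speaker information appears, and a squeeze-and-excitation-style gate marks "where" it lives across channels. All names, shapes, and layer choices (TemporalChannelRetrieval, the gating MLP, the query count) are illustrative assumptions, not the authors' MTCR-VC implementation.

```python
# Hypothetical sketch of a temporal-channel retrieval (TCR) style block.
# Assumes frame-level features shaped (batch, time, channels); not the paper's code.
import torch
import torch.nn as nn


class TemporalChannelRetrieval(nn.Module):
    def __init__(self, dim: int, num_queries: int = 4, heads: int = 4):
        super().__init__()
        # Learnable queries retrieve a multi-slot speaker representation over time ("when").
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Squeeze-and-excitation style gate highlights speaker-bearing channels ("where").
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, dim), nn.Sigmoid()
        )
        self.out = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C) frame-level acoustic features.
        gate = self.channel_gate(feats.mean(dim=1))         # (B, C) channel relevance
        gated = feats * gate.unsqueeze(1)                   # emphasize relevant channels
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        retrieved, _ = self.temporal_attn(q, gated, gated)  # (B, num_queries, C)
        return self.out(retrieved)                          # speaker tokens


if __name__ == "__main__":
    x = torch.randn(2, 120, 256)            # 2 utterances, 120 frames, 256-dim features
    tcr = TemporalChannelRetrieval(dim=256)
    print(tcr(x).shape)                      # torch.Size([2, 4, 256])
```

Stacking several such blocks at different feature resolutions would mimic the paper's multi-level design, and the pooled output of each level could be supervised by a pre-trained SV embedding (for example, via a cosine-similarity loss); these specifics are assumptions made for illustration.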

Authors (7)
  1. Zhichao Wang (83 papers)
  2. Liumeng Xue (24 papers)
  3. Qiuqiang Kong (86 papers)
  4. Lei Xie (337 papers)
  5. Yuanzhe Chen (19 papers)
  6. Qiao Tian (27 papers)
  7. Yuping Wang (56 papers)
Citations (3)