
MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion (2405.00930v2)

Published 2 May 2024 in cs.SD and eess.AS

Abstract: One-shot voice conversion aims to change the timbre of any source speech to match that of an unseen target speaker given only one speech sample. Existing methods struggle to disentangle speech representations satisfactorily and often rely on sizable networks built from numerous complex modules. In this paper, we propose MAIN-VC, a model that disentangles effectively with a concise neural network. The proposed model uses Siamese encoders to learn clean representations, further enhanced by the proposed mutual information estimator. The Siamese structure and a newly designed convolution module keep the model lightweight while preserving performance across diverse voice conversion tasks. Experimental results show that the proposed model achieves comparable subjective scores and improved objective metrics compared to existing methods in the one-shot voice conversion scenario.
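The abstract describes an architecture built from Siamese encoders, a mutual information estimator, and a compact convolution module. As a rough illustration only, the PyTorch sketch below shows how such a disentanglement setup can be wired together: a content encoder with instance normalization shared across two augmented views of one utterance (the Siamese consistency term), a speaker encoder with temporal pooling, and a CLUB-style mutual information upper bound minimized between the content and speaker embeddings. Every layer size, the use of instance normalization, the augmentation scheme, and the choice of the CLUB estimator are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the disentanglement idea in the abstract:
# Siamese content encoders matched across two augmented views of one utterance,
# a speaker encoder, and a CLUB-style mutual information (MI) upper bound used to
# push content and speaker embeddings apart. All module names and sizes are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContentEncoder(nn.Module):
    """1-D conv encoder; instance norm strips utterance-level (speaker) statistics."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2),
            nn.InstanceNorm1d(dim),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            nn.InstanceNorm1d(dim),
            nn.ReLU(),
        )

    def forward(self, mel):                  # mel: (B, n_mels, T)
        return self.net(mel)                 # content features: (B, dim, T)


class SpeakerEncoder(nn.Module):
    """Conv encoder with temporal average pooling -> one vector per utterance."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, mel):
        return self.net(mel).mean(dim=-1)    # speaker embedding: (B, dim)


class CLUBEstimator(nn.Module):
    """Variational CLUB-style upper bound on MI(content, speaker); one common choice,
    not necessarily the estimator designed in the paper."""
    def __init__(self, dim=128):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.logvar = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, content_vec, speaker_vec):
        # Positive pairs share a batch index; negatives come from shifting the
        # speaker vectors within the batch.
        mu, logvar = self.mu(content_vec), self.logvar(content_vec)
        pos = -((speaker_vec - mu) ** 2 / logvar.exp()).sum(dim=1)
        neg = -((speaker_vec.roll(1, dims=0) - mu) ** 2 / logvar.exp()).sum(dim=1)
        return (pos - neg).mean() / 2.0      # minimize this as an MI upper bound


def siamese_content_loss(encoder, view_a, view_b):
    """Siamese consistency: two views of one utterance should yield the same content."""
    return F.l1_loss(encoder(view_a), encoder(view_b))


if __name__ == "__main__":
    mel_a = torch.randn(4, 80, 100)          # two augmented views of the same
    mel_b = torch.randn(4, 80, 100)          # 4 utterances (dummy data)
    c_enc, s_enc, club = ContentEncoder(), SpeakerEncoder(), CLUBEstimator()
    content = c_enc(mel_a).mean(dim=-1)      # pool content to a vector for the MI term
    speaker = s_enc(mel_a)
    loss = siamese_content_loss(c_enc, mel_a, mel_b) + club(content, speaker)
    loss.backward()
    print(float(loss))
```

In a full system, the content and speaker embeddings would feed a decoder (and a vocoder) to reconstruct speech; the sketch stops at the disentanglement losses because those are the components the abstract emphasizes.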

Authors (6)
  1. Pengcheng Li (60 papers)
  2. Jianzong Wang (144 papers)
  3. Xulong Zhang (60 papers)
  4. Yong Zhang (660 papers)
  5. Jing Xiao (267 papers)
  6. Ning Cheng (96 papers)
Citations (1)