
EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning (2404.19212v1)

Published 30 Apr 2024 in cs.SD and eess.AS

Abstract: Using unsupervised learning to disentangle speech into content, rhythm, pitch, and timbre for voice conversion has become a hot research topic. Existing works generally disentangle speech components through human-crafted bottleneck features, which cannot achieve sufficient disentanglement: pitch and rhythm may still be mixed together, and the risk of information overlap during disentanglement reduces speech naturalness. To overcome these limits, we propose a two-stage model that disentangles speech representations in a self-supervised manner without a human-crafted bottleneck design, using Mutual Information (MI) with a designed upper-bound estimator (IFUB) to separate overlapping information between speech components. Moreover, we design a Joint Text-Guided Consistent (TGC) module to guide the extraction of speech content and eliminate timbre leakage. Experiments show that our model outperforms the baseline in disentanglement effectiveness, speech naturalness, and similarity. Audio samples can be found at https://largeaudiomodel.com/eadvc.
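
The abstract's central idea is to penalise residual information overlap between disentangled speech factors (content, rhythm, pitch, timbre) with a mutual-information (MI) upper bound. As a rough illustration only, the sketch below shows a CLUB-style variational MI upper bound applied to two speech embeddings; the class name, network sizes, and usage are assumptions for illustration and do not reproduce the paper's IFUB estimator.

# Minimal sketch (PyTorch), assuming a CLUB-style variational upper bound on
# mutual information; the paper's IFUB estimator is related in spirit but its
# exact formulation is not reproduced here. All names below are illustrative.
import torch
import torch.nn as nn


class MIUpperBound(nn.Module):
    """Estimates an upper bound on I(X; Y) via a learned Gaussian q(y|x)."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar_net = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def log_likelihood(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # log q(y|x) up to an additive constant, per sample
        mu, logvar = self.mu_net(x), self.logvar_net(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(dim=-1) / 2

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Upper bound: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)],
        # the second term approximated by shuffling y within the batch.
        positive = self.log_likelihood(x, y).mean()
        y_shuffled = y[torch.randperm(y.size(0))]
        negative = self.log_likelihood(x, y_shuffled).mean()
        return positive - negative


# Hypothetical usage: penalise residual overlap between, e.g., pitch and
# rhythm embeddings produced by the disentangling encoders.
pitch_emb = torch.randn(8, 64)   # placeholder (batch, dim) embeddings
rhythm_emb = torch.randn(8, 64)
mi_estimator = MIUpperBound(x_dim=64, y_dim=64)
mi_penalty = mi_estimator(pitch_emb, rhythm_emb)  # add to the training loss

In standard CLUB-style training, the q(y|x) networks are fit to maximise log_likelihood on matched pairs while the encoders minimise the returned bound in alternation; this is common practice for such estimators and is not necessarily the paper's exact procedure.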

Authors (6)
  1. Ziqi Liang (10 papers)
  2. Jianzong Wang (144 papers)
  3. Xulong Zhang (60 papers)
  4. Yong Zhang (660 papers)
  5. Ning Cheng (96 papers)
  6. Jing Xiao (267 papers)