
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion (2208.08757v1)

Published 18 Aug 2022 in eess.AS, cs.LG, and cs.SD

Abstract: One-shot voice conversion (VC), which uses only one speech sample from the target speaker as reference, has become an active research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content remains mixed together. To perform one-shot VC effectively while further disentangling these speech components, we employ random resampling for the pitch and content encoders, and we use the variational contrastive log-ratio upper bound (vCLUB) of mutual information together with gradient reversal layer based adversarial mutual information learning, so that during training each part of the latent space contains only its intended disentangled representation. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, speech representation disentanglement allows the timbre, pitch and rhythm characteristics to be transferred separately in one-shot VC. Our code, pre-trained models and demo are available at https://im1eon.github.io/IS2022-SRDVC/.
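The mutual information bound named in the abstract can be illustrated with a small sample-based estimate. The sketch below is not the authors' implementation. It assumes a diagonal Gaussian variational approximation q(y | x), and it assumes the means and log-variances have already been predicted by some network (hypothetical here). The function names `gaussian_log_prob` and `vclub_estimate` are illustrative only.

```python
import math


def gaussian_log_prob(y, mu, logvar):
    """Log-density of vector y under a diagonal Gaussian N(mu, exp(logvar))."""
    return sum(
        -0.5 * (math.log(2 * math.pi) + lv + (yi - m) ** 2 / math.exp(lv))
        for yi, m, lv in zip(y, mu, logvar)
    )


def vclub_estimate(mus, logvars, ys):
    """Sample-based vCLUB estimate of the mutual information I(X; Y).

    mus[i], logvars[i]: parameters of q(y | x_i), assumed to come from a
        variational network (hypothetical, not shown here).
    ys[i]: the sample y_i that is paired with x_i.
    """
    n = len(ys)
    total = 0.0
    for i in range(n):
        # Positive term: log-likelihood of the matched pair (x_i, y_i).
        positive = gaussian_log_prob(ys[i], mus[i], logvars[i])
        # Negative term: average log-likelihood of every y_j under q(. | x_i).
        negative = sum(
            gaussian_log_prob(ys[j], mus[i], logvars[i]) for j in range(n)
        ) / n
        total += positive - negative
    return total / n


# Toy usage: q ignores x in the first case and tracks y exactly in the second.
independent = vclub_estimate(
    mus=[[0.0], [0.0]], logvars=[[0.0], [0.0]], ys=[[1.0], [-2.0]]
)
dependent = vclub_estimate(
    mus=[[0.0], [5.0]], logvars=[[0.0], [0.0]], ys=[[0.0], [5.0]]
)
```

When q(y | x) does not depend on x, the positive and negative terms cancel and the estimate is zero. When the predicted means track the paired samples, the estimate is positive. In an adversarial setup such as the one the abstract describes, a quantity of this kind would be minimized with respect to the encoders so that the paired representations carry less information about each other.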

Authors (11)
  1. Methawee Tantrawenith
  2. Haolin Zhuang
  3. Zhiyong Wu
  4. Aolan Sun
  5. Jianzong Wang
  6. Ning Cheng
  7. Huaizhen Tang
  8. Xintao Zhao
  9. Jie Wang
  10. Helen Meng
  11. Sicheng Yang
Citations (34)