
Learning Disentangled Speech Representations with Contrastive Learning and Time-Invariant Retrieval (2401.08096v2)

Published 16 Jan 2024 in cs.SD and eess.AS

Abstract: Voice conversion refers to transferring speaker identity while preserving the linguistic content. Better disentanglement of speech representations leads to better voice conversion. Recent studies have found that phonetic information extracted from the input audio can represent the content well. In addition, speaker-style modeling with pre-trained models makes the conversion process more complex. To tackle these issues, we introduce a new method named "CTVC", which learns disentangled speech representations with contrastive learning and time-invariant retrieval. Specifically, a similarity-based compression module is used to establish a closer connection between the frame-level hidden features and phoneme-level linguistic information. Additionally, time-invariant retrieval is proposed for timbre extraction based on multiple segmentations and mutual information. Experimental results demonstrate that "CTVC" outperforms previous studies and improves both the sound quality and the speaker similarity of the converted results.
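The abstract names two mechanisms without giving implementation details: a contrastive objective tying frame-level features to phoneme-level linguistic information, and a time-invariant retrieval that extracts timbre from multiple segmentations. The sketch below is a minimal illustration of both ideas, not the paper's actual architecture: it assumes an InfoNCE-style contrastive loss over frame features already pooled to phoneme spans, and approximates time-invariant timbre extraction by averaging a speaker encoder's embeddings over random segments of one utterance (the paper's mutual-information term is omitted). All function names and shapes here are hypothetical.

```python
# Hypothetical sketch only; the paper's actual modules are not specified here.
import torch
import torch.nn.functional as F

def contrastive_frame_phoneme_loss(frame_feats, phoneme_embs, temperature=0.1):
    """InfoNCE-style loss pulling frame-level features toward their
    phoneme-level linguistic embeddings.

    frame_feats:  (N, D) frame features, assumed already compressed/pooled to
                  phoneme spans (standing in for the paper's similarity-based
                  compression module).
    phoneme_embs: (N, D) matching phoneme-level embeddings.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    phoneme_embs = F.normalize(phoneme_embs, dim=-1)
    logits = frame_feats @ phoneme_embs.t() / temperature  # (N, N) similarities
    targets = torch.arange(frame_feats.size(0))            # positives on diagonal
    return F.cross_entropy(logits, targets)

def time_invariant_timbre(speaker_encoder, mel, num_segments=4, seg_len=64):
    """Average speaker embeddings over several random segments of one
    utterance: content varies across segments while timbre stays fixed, so
    the mean should keep speaker identity and wash out content.

    mel: (batch, n_mels, T) mel-spectrogram; speaker_encoder maps a segment
    to a (batch, D) embedding.
    """
    T = mel.size(-1)
    embs = []
    for _ in range(num_segments):
        start = torch.randint(0, max(T - seg_len, 1), (1,)).item()
        embs.append(speaker_encoder(mel[..., start:start + seg_len]))
    return torch.stack(embs).mean(dim=0)

if __name__ == "__main__":
    feats, phones = torch.randn(8, 256), torch.randn(8, 256)
    print(contrastive_frame_phoneme_loss(feats, phones))
    mel = torch.randn(1, 80, 200)
    enc = torch.nn.Sequential(torch.nn.Flatten(1), torch.nn.LazyLinear(64))
    print(time_invariant_timbre(enc, mel).shape)  # torch.Size([1, 64])
```

On its own, the segment-mean is a weak disentangler; the mutual-information criterion mentioned in the abstract would additionally penalize content information leaking into the timbre embedding.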

