vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders (2409.01995v4)

Published 3 Sep 2024 in eess.AS, cs.AI, and cs.SD

Abstract: We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice conversion (VC). We use discrete tokens from speech self-supervised models as the content features of source speech and treat VC as a prompted vocoding task. To compensate for the loss of speaker timbre in the content tokens, vec2wav 2.0 utilizes WavLM features to provide strong timbre-dependent information. A novel adaptive Snake activation function is proposed to better incorporate timbre into the waveform reconstruction process. In this way, vec2wav 2.0 learns to alter the speaker timbre appropriately given different reference prompts. Moreover, no supervised data is required to train vec2wav 2.0 effectively. Experimental results demonstrate that vec2wav 2.0 outperforms all other baselines by a considerable margin in terms of audio quality and speaker similarity in any-to-any VC. Ablation studies verify the effectiveness of the proposed techniques. vec2wav 2.0 also achieves competitive cross-lingual VC even when trained only on a monolingual corpus. Thus, vec2wav 2.0 shows that timbre can potentially be manipulated solely by speech token vocoders, pushing the frontiers of VC and speech synthesis.
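The adaptive Snake activation builds on the standard Snake function f(x) = x + (1/α)·sin²(αx). Below is a minimal PyTorch sketch of one plausible way to make α timbre-dependent, assuming α is predicted per channel from a reference (timbre) embedding; the module name, conditioning network, and tensor shapes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn


class AdaptiveSnake(nn.Module):
    """Snake activation f(x) = x + (1/alpha) * sin^2(alpha * x),
    with alpha predicted per channel from a timbre embedding.
    Hypothetical sketch; the conditioning in vec2wav 2.0 may differ."""

    def __init__(self, channels: int, timbre_dim: int):
        super().__init__()
        # Map the timbre (reference prompt) embedding to per-channel alphas.
        self.to_alpha = nn.Linear(timbre_dim, channels)

    def forward(self, x: torch.Tensor, timbre: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); timbre: (batch, timbre_dim)
        # Softplus keeps alpha positive; a small floor avoids division by zero.
        alpha = torch.nn.functional.softplus(self.to_alpha(timbre)) + 1e-4
        alpha = alpha.unsqueeze(-1)  # (batch, channels, 1), broadcast over time
        return x + (1.0 / alpha) * torch.sin(alpha * x) ** 2
```

Conditioning the frequency parameter α on the reference prompt is what lets the vocoder shift the periodic components of the output toward the target speaker's timbre, rather than keeping α as a fixed learned constant per channel as in speaker-independent Snake-based vocoders.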
