
CoMoSVC: Consistency Model-based Singing Voice Conversion (2401.01792v1)

Published 3 Jan 2024 in eess.AS, cs.AI, cs.LG, and cs.SD

Abstract: Diffusion-based singing voice conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, their iterative sampling process makes inference slow, so acceleration is crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method that aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first designed specifically for SVC, and a student model is then distilled under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU show that although CoMoSVC has significantly faster inference than the state-of-the-art (SOTA) diffusion-based SVC system, it still achieves comparable or superior conversion performance on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
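
To make the idea of one-step sampling under a self-consistency property more concrete, here is a minimal sketch. It is not the CoMoSVC code. It assumes the skip/output parameterization commonly used in the consistency-model literature, and the constants `SIGMA_DATA`, `EPS`, and `T_MAX`, as well as `toy_network`, are illustrative assumptions. The conditioning on content, pitch, and target singer that an SVC model would need is omitted.

```python
import math
import random

# Assumed constants (illustrative defaults, not taken from the paper).
SIGMA_DATA = 0.5   # assumed standard deviation of the data
EPS = 0.002        # assumed smallest noise level
T_MAX = 80.0       # assumed largest noise level


def c_skip(t):
    """Weight on the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    """Weight on the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_fn(network, x, t):
    """Map a noisy sample x at noise level t to an estimate of clean data.

    The skip/output weights enforce the boundary condition f(x, EPS) = x
    regardless of what the network predicts.
    """
    raw = network(x, t)
    return [c_skip(t) * xi + c_out(t) * ri for xi, ri in zip(x, raw)]


def one_step_sample(network, n, seed=0):
    """Draw pure noise at level T_MAX and denoise it in a single call."""
    rng = random.Random(seed)
    x_T = [T_MAX * rng.gauss(0.0, 1.0) for _ in range(n)]
    return consistency_fn(network, x_T, T_MAX)


def toy_network(x, t):
    """Hypothetical stand-in for a trained student network."""
    return [math.tanh(xi) for xi in x]


if __name__ == "__main__":
    x = [0.3, -1.2, 2.0]
    # At the smallest noise level the function returns its input unchanged.
    print(consistency_fn(toy_network, x, EPS))
    # One network evaluation yields a full sample.
    print(one_step_sample(toy_network, 5))
```

In a distillation setup of the kind the abstract describes, the student would be trained so that its outputs agree for points on the same teacher sampling trajectory. The sketch covers only the inference side, where a single function evaluation replaces the iterative sampling loop.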
