The Singing Voice Conversion Challenge 2023 (2306.14422v2)

Published 26 Jun 2023 in cs.SD, cs.CL, and eess.AS

Abstract: We present the latest iteration of the Voice Conversion Challenge (VCC) series, a biennial scientific event that compares and analyzes different voice conversion (VC) systems on a common dataset. This year we shifted our focus to singing voice conversion (SVC), and thus named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely in-domain and cross-domain SVC. The challenge ran for two months, and in total we received 26 submissions, including 2 baselines. Through a large-scale crowd-sourced listening test, we observed that for both tasks, although the top system achieved human-level naturalness, no team obtained a similarity score as high as that of the target speakers themselves. Also, as expected, cross-domain SVC is harder than in-domain SVC, especially in the similarity aspect. We further investigated whether existing objective measurements could predict perceptual performance, and found that only a few of them reached a significant correlation.
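The correlation analysis mentioned at the end of the abstract can be sketched as follows: each objective metric yields one score per submitted system, which is then correlated against that system's mean opinion score (MOS) from the listening test. This is a minimal illustration, not the authors' evaluation code; all numbers below are made up, and Pearson correlation is used here purely as an example statistic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical system-level scores: one objective metric value and one
# subjective MOS per submitted system (illustrative numbers only).
objective = [0.82, 0.74, 0.91, 0.65, 0.70]
mos       = [3.9, 3.4, 4.2, 3.0, 3.3]

r = pearson(objective, mos)
print(f"system-level correlation: {r:.3f}")
```

A metric is only a useful proxy for perceptual quality when this system-level correlation is both strong and statistically significant; the paper reports that few existing objective measures met that bar.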

