The Singing Voice Conversion Challenge 2023 (2306.14422v2)
Abstract: We present the latest iteration of the voice conversion challenge (VCC) series, a biennial scientific event that compares and analyzes different voice conversion (VC) systems on a common dataset. This year we shifted our focus to singing voice conversion (SVC) and accordingly named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks: in-domain and cross-domain SVC. The challenge ran for two months, and in total we received 26 submissions, including 2 baselines. Through a large-scale crowd-sourced listening test, we observed that for both tasks, although the top system achieved human-level naturalness, no team obtained a similarity score as high as that of the target speakers. Also, as expected, cross-domain SVC is harder than in-domain SVC, especially in terms of similarity. We also investigated whether existing objective measurements could predict perceptual performance and found that only a few of them reached a significant correlation.