
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing (2401.17619v3)

Published 31 Jan 2024 in cs.SD and eess.AS

Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural voice synthesis. This innovative method has led to the creation of two expansive singing voice datasets, ACE-Opencpop and ACE-KiSing, which are instrumental for large-scale, multi-singer voice synthesis. Through thorough experimentation, we establish that these datasets not only serve as new benchmarks for SVS but also enhance SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (https://github.com/espnet/espnet).

Overview of "Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing"

This paper addresses the persistent challenge of data scarcity in singing voice synthesis (SVS). Unlike text-to-speech systems, SVS often contends with limited data due to copyright restrictions and the need for specialized recording environments. To overcome these constraints, the authors present two datasets, ACE-Opencpop and ACE-KiSing, which aim to enhance SVS capabilities through data augmentation and careful manual tuning.

The authors utilize an existing singing voice synthesizer for data augmentation, reducing artifacts of unnatural voice synthesis through meticulous manual adjustments. The resulting corpora are substantial in scale: ACE-Opencpop contains approximately 130 hours of recordings from 30 singers, and ACE-KiSing about 32.5 hours from 34 singers. Both datasets support large-scale, multi-singer voice synthesis and are publicly available for further research.

Pre-trained models developed from these corpora demonstrated marked improvements in voice quality in both in-domain and out-of-domain contexts, establishing the potential for ACE-Opencpop and ACE-KiSing to contribute substantially to SVS research. Reported gains cover mel cepstral distortion (MCD), semitone accuracy, and fundamental frequency (F0) RMSE, alongside subjective listening measures.
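As a concrete reference, here is a minimal Python sketch of how these objective metrics are commonly computed. It assumes the mel-cepstral sequences are already time-aligned (e.g., via dynamic time warping), that the 0th cepstral coefficient has been excluded beforehand, and a 440 Hz semitone reference; the paper's evaluation pipeline may handle these conventions differently.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """MCD in dB between time-aligned mel-cepstral sequences (frames x dims)."""
    diff = mc_ref - mc_syn
    return (10.0 / np.log(10.0)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def f0_rmse(f0_ref, f0_syn):
    """Root-mean-square F0 error over frames where both sequences are voiced."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return np.sqrt(np.mean((f0_ref[voiced] - f0_syn[voiced]) ** 2))

def semitone_accuracy(f0_ref, f0_syn, ref_hz=440.0):
    """Fraction of voiced frames whose pitch falls in the reference semitone bin."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    st_ref = np.round(12.0 * np.log2(f0_ref[voiced] / ref_hz))
    st_syn = np.round(12.0 * np.log2(f0_syn[voiced] / ref_hz))
    return np.mean(st_ref == st_syn)
```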

Methodology

Data Curation Process: The authors detail a rigorous data curation process spanning several stages: data preparation, information verification, voice-match tuning, and singer adaptation. They emphasize the importance of manual tuning when generating the corpora with ACE Studio, refining the synthesized voices with techniques such as vibrato and syllable-duration adjustments; a sketch of the resulting resynthesis loop follows.
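The following is a minimal sketch of the resynthesis loop behind this style of augmentation: every manually tuned score is rendered with every available voice bank to produce a multi-singer corpus. The `synthesize(score, singer)` callable and the file layout are hypothetical stand-ins, not the paper's actual tooling; ACE Studio is a commercial product driven with per-song manual tuning rather than this programmatic API.

```python
from pathlib import Path

def augment_corpus(scores, singer_ids, synthesize, out_dir="augmented"):
    """Render every (tuned score, singer) pair into a multi-singer corpus."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for song_id, score in scores.items():         # manually tuned musical scores
        for singer in singer_ids:                 # synthesizer voice banks
            wav_bytes = synthesize(score, singer) # hypothetical rendering call
            (out / f"{song_id}_{singer}.wav").write_bytes(wav_bytes)
```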

Corpora Description: ACE-Opencpop, built on the Opencpop benchmark, retains its song list and statistical characteristics while scaling up the data size. ACE-KiSing extends the earlier KiSing database, offering bilingual content and emphasizing the melisma singing technique, an intriguing dimension for analysis.

Comparative Analysis: Compared against existing benchmarks, ACE-Opencpop and ACE-KiSing offer significantly larger data volumes and greater singer diversity. ACE-Opencpop in particular, with over 100 hours of data, surpasses previous datasets and serves as a robust resource for large-scale SVS research.

Experimental Evaluation

The paper provides extensive experimental evaluations to validate the datasets. Four different SVS models are employed: an RNN-based model, XiaoiceSing, VISinger, and VISinger2. Results indicate that pre-trained models derived from ACE-Opencpop improve both objective and subjective metrics in SVS tasks, particularly in transfer-learning scenarios; a minimal sketch of this warm-start setup follows.
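Below is a minimal PyTorch-style sketch of such a warm-start fine-tuning setup: initialize an acoustic model from a multi-singer checkpoint, then fine-tune on a smaller target corpus. The tiny model, checkpoint filename, and dummy data are placeholders for illustration only; the actual ESPnet-Muskits recipes are script-driven and use their own model definitions.

```python
import torch
import torch.nn as nn

class TinySVSModel(nn.Module):
    """Stand-in acoustic model: score features in, mel-spectrogram frames out."""
    def __init__(self, in_dim=64, mel_dim=80):
        super().__init__()
        self.encoder = nn.GRU(in_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, mel_dim)

    def forward(self, feats):
        hidden, _ = self.encoder(feats)
        return self.proj(hidden)

model = TinySVSModel()

# Warm-start from the multi-singer checkpoint (filename is a placeholder);
# strict=False tolerates mismatched parts such as singer-embedding tables.
# state = torch.load("ace_opencpop_pretrained.pth")
# model.load_state_dict(state, strict=False)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

# Dummy batch standing in for a small in-domain singing corpus.
target_loader = [(torch.randn(4, 100, 64), torch.randn(4, 100, 80))]

for feats, mels in target_loader:
    optimizer.zero_grad()
    loss = criterion(model(feats), mels)
    loss.backward()
    optimizer.step()
```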

Key experimental findings show improved semitone accuracy and mean opinion scores (MOS) when these models are applied to other datasets, affirming their effectiveness in enhancing SVS applications. However, the paper notes a remaining gap between synthesized and real human singing, indicating areas for continued research.

Implications and Future Directions

The implications of this work lie in its contribution towards mitigating the data scarcity problem in SVS. By sharing the datasets and models openly, the authors enable further research and potential improvements in multi-singer and cross-domain SVS systems. This work paves the way for more nuanced synthesis techniques that incorporate diverse singing styles and languages.

Future developments may involve refining models to further close the performance gap between synthesized and human singing. Additionally, advanced models and pre-training techniques could open new frontiers in SVS, such as real-time synthesis and broader style adaptation.

The paper represents a significant step towards scalable, high-quality singing voice synthesis, providing valuable resources and insights for ongoing advancements in artificial intelligence-driven content generation.

Authors (9)
  1. Jiatong Shi
  2. Yueqian Lin
  3. Xinyi Bai
  4. Keyi Zhang
  5. Yuning Wu
  6. Yuxun Tang
  7. Yifeng Yu
  8. Qin Jin
  9. Shinji Watanabe