DeepSinger: Singing Voice Synthesis with Data Mined From the Web
The paper introduces DeepSinger, a multi-lingual, multi-singer Singing Voice Synthesis (SVS) system that stands out for its novel approach of training on singing data mined directly from music websites. DeepSinger is built through a pipeline of data mining, singing-accompaniment separation, lyrics-to-singing alignment, and singing modeling. This approach democratizes access to singing data by circumventing the high costs traditionally associated with manually curated and labeled datasets.
Methodological Contributions
DeepSinger's methodological contributions address several challenges inherent in SVS:
- Data Acquisition: Unlike traditional SVS systems that rely on studio-quality recordings, DeepSinger accumulates its training data from the web. The mined dataset consists of approximately 92 hours of singing data from 89 singers across three languages: Chinese, Cantonese, and English.
- Lyrics-to-Singing Alignment: A crucial innovation is an alignment model that automates the extraction of phoneme durations from lyrics. To minimize human effort, this model adapts automatic speech recognition techniques to align singing audio with lyrics without manual annotation.
- Singing Model Architecture: Built on a feed-forward Transformer framework, the singing model synthesizes linear spectrograms directly, simplifying traditionally complicated acoustic modeling. It introduces a reference encoder that captures singer timbre from noisy reference audio, making the system robust to suboptimal input quality (see the architectural sketch after this list).
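To make the architecture concrete, below is a minimal PyTorch sketch of a feed-forward-Transformer-style singing model: a lyrics encoder, a FastSpeech-style length regulator driven by the aligned phoneme durations, frame-level pitch embeddings, a mean-pooling reference encoder for timbre, and a non-autoregressive spectrogram decoder. The layer sizes, depths, pooling scheme, and the omission of positional encodings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level hidden states to frame level by repeating each
    phoneme's vector `duration` times (FastSpeech-style length regulator)."""
    expanded = [h.repeat_interleave(d, dim=0)
                for h, d in zip(phoneme_hidden, durations)]
    return nn.utils.rnn.pad_sequence(expanded, batch_first=True)


class ReferenceEncoder(nn.Module):
    """Condenses a (possibly noisy) reference spectrogram into a single
    timbre vector by projecting each frame and mean-pooling over time."""
    def __init__(self, n_bins: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(n_bins, d_model)

    def forward(self, ref_spec: torch.Tensor) -> torch.Tensor:
        # ref_spec: (batch, ref_frames, n_bins) -> (batch, 1, d_model)
        return self.proj(ref_spec).mean(dim=1, keepdim=True)


class SingingModel(nn.Module):
    """Non-autoregressive sketch: lyrics, pitch, and timbre go in; a linear
    spectrogram comes out in one feed-forward pass. Positional encodings
    are omitted for brevity."""
    def __init__(self, n_phonemes=80, n_pitches=128, n_bins=513, d_model=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pitch_emb = nn.Embedding(n_pitches, d_model)
        self.ref_encoder = ReferenceEncoder(n_bins, d_model)

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, dim_feedforward=1024, batch_first=True)

        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.out = nn.Linear(d_model, n_bins)

    def forward(self, phonemes, durations, pitches, ref_spec):
        # phonemes/durations: (batch, n_phon); pitches: (batch, n_frames);
        # durations must sum to n_frames for each utterance.
        h = self.encoder(self.phoneme_emb(phonemes))   # phoneme-level encoding
        h = length_regulate(h, durations)              # expand to frame level
        h = h + self.pitch_emb(pitches) + self.ref_encoder(ref_spec)
        return self.out(self.decoder(h))               # (batch, n_frames, n_bins)


# Smoke test with random inputs: 10 phonemes x 5 frames each = 50 frames.
model = SingingModel()
spec = model(torch.randint(0, 80, (2, 10)),
             torch.full((2, 10), 5),
             torch.randint(0, 128, (2, 50)),
             torch.randn(2, 200, 513))
print(spec.shape)  # torch.Size([2, 50, 513])
```

Keeping the decoder non-autoregressive is what lets the model emit all spectrogram frames in parallel; the explicit durations from the alignment model remove the need for learned attention between text and audio.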
Numerical Results and System Evaluation
The performance evaluation of DeepSinger spans qualitative and quantitative analyses. Quantitatively, it achieves pitch accuracy exceeding 85% across all three languages, a noteworthy result given that human singers are reported to reach only around 80% on this metric. Qualitatively, DeepSinger earns competitive mean opinion scores (MOS) for voice naturalness, with only a small gap relative to the upper bound of synthesizing waveforms from ground-truth linear spectrograms with Griffin-Lim.
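The paper does not prescribe an exact implementation of the pitch-accuracy metric, so the following is only a plausible reconstruction: extract F0 from the synthesized and reference audio, quantize both to semitones, and count matching frames among those voiced in both signals. The choice of librosa.pyin and the semitone-level match criterion are assumptions for illustration.

```python
import numpy as np
import librosa


def pitch_accuracy(synth_wav, ref_wav, sr=22050):
    """Fraction of commonly-voiced frames whose F0 rounds to the same
    MIDI semitone in synthesized vs. reference audio. A plausible
    metric sketch, not the paper's exact procedure."""
    fmin, fmax = librosa.note_to_hz("C2"), librosa.note_to_hz("C7")
    f0_s, voiced_s, _ = librosa.pyin(synth_wav, fmin=fmin, fmax=fmax, sr=sr)
    f0_r, voiced_r, _ = librosa.pyin(ref_wav, fmin=fmin, fmax=fmax, sr=sr)
    n = min(len(f0_s), len(f0_r))
    both = voiced_s[:n] & voiced_r[:n]      # frames voiced in both signals
    if not both.any():
        return 0.0
    midi_s = np.round(librosa.hz_to_midi(f0_s[:n][both]))
    midi_r = np.round(librosa.hz_to_midi(f0_r[:n][both]))
    return float(np.mean(midi_s == midi_r))
```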
Implications and Future Directions
The implications of DeepSinger's approach to SVS extend beyond reduced cost and improved data accessibility. It represents a shift toward more open and scalable data-sourcing methodologies for AI systems, potentially serving as a blueprint for other domains that rely on large annotated datasets. The ability to capture timbre from reference audio also suggests practical applications in personalized content creation, where timbre control is crucial.
Theoretically, this paper contributes to the dialogue on unsupervised and lightly-supervised learning techniques, showing promise in marrying deep learning methodologies with real-world, noisy datasets.
Looking forward, this paper sets a precedent for further research into improving the robustness of SVS systems against noise and into more sophisticated synthesis back-ends, such as neural vocoders that outperform Griffin-Lim. A logical next step would be to integrate cross-lingual synthesis capabilities, a direction the paper itself identifies as promising.
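For context on the vocoding step, Griffin-Lim reconstructs a waveform from a magnitude-only spectrogram by iteratively re-estimating the discarded phase. The snippet below is a generic round-trip demonstration with librosa; the 440 Hz test tone and FFT settings are illustrative values, not the paper's configuration.

```python
import numpy as np
import librosa

# Griffin-Lim round trip: waveform -> linear magnitude spectrogram -> waveform.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # 1-second 440 Hz test tone

n_fft, hop = 1024, 256
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # phase discarded
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=hop, n_fft=n_fft)

# y_hat approximates y; the residual phase artifacts are precisely the
# quality gap that neural vocoders aim to close.
```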
Conclusion
In sum, the DeepSinger paper presents a comprehensive SVS system that leverages data mining, alignment optimization, and robust modeling techniques to generate high-quality multilingual singing synthesis. Its pioneering use of web-mined data paves the way for more accessible and cost-effective AI models across various research domains.