DeepSinger: Singing Voice Synthesis with Data Mined From the Web
The paper introduces DeepSinger, a multi-lingual, multi-singer Singing Voice Synthesis (SVS) system that stands out for its novel approach of training on singing data mined directly from music websites. DeepSinger is built through a pipeline of data mining, singing-accompaniment separation, lyrics-to-singing alignment, and singing modeling. This approach democratizes access to singing data by circumventing the high costs traditionally associated with manually curated and labeled datasets.
Methodological Contributions
DeepSinger's methodological contributions address several challenges inherent in SVS:
- Data Acquisition: Unlike traditional SVS systems that rely on studio-quality recordings, DeepSinger accumulates its training data from the web. The mined dataset consists of approximately 92 hours of singing data from 89 singers across three languages: Chinese, Cantonese, and English.
- Lyrics-to-Singing Alignment: A crucial innovation is an alignment model that automates the extraction of phoneme durations from lyrics. To minimize human effort, this model adapts automatic speech recognition techniques to align singing audio with lyrics without manual annotation.
- Singing Model Architecture: Built on a feed-forward Transformer framework, the singing model synthesizes linear spectrograms directly, simplifying traditionally complicated acoustic modeling. It introduces a reference encoder that captures singer timbre from noisy reference audio, making the system robust to suboptimal input quality (see the architectural sketch after this list).
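To make the architecture concrete, below is a minimal PyTorch sketch of a feed-forward-Transformer-style singing model: a lyrics encoder, a FastSpeech-style length regulator driven by the aligned phoneme durations, frame-level pitch embeddings, a mean-pooling reference encoder for timbre, and a non-autoregressive spectrogram decoder. The layer sizes, depths, pooling scheme, and the omission of positional encodings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level hidden states to frame level by repeating each
    phoneme's vector `duration` times (FastSpeech-style length regulator)."""
    expanded = [h.repeat_interleave(d, dim=0)
                for h, d in zip(phoneme_hidden, durations)]
    return nn.utils.rnn.pad_sequence(expanded, batch_first=True)


class ReferenceEncoder(nn.Module):
    """Condenses a (possibly noisy) reference spectrogram into a single
    timbre vector by projecting each frame and mean-pooling over time."""
    def __init__(self, n_bins: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(n_bins, d_model)

    def forward(self, ref_spec: torch.Tensor) -> torch.Tensor:
        # ref_spec: (batch, ref_frames, n_bins) -> (batch, 1, d_model)
        return self.proj(ref_spec).mean(dim=1, keepdim=True)


class SingingModel(nn.Module):
    """Non-autoregressive sketch: lyrics, pitch, and timbre go in; a linear
    spectrogram comes out in one feed-forward pass. Positional encodings
    are omitted for brevity."""
    def __init__(self, n_phonemes=80, n_pitches=128, n_bins=513, d_model=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pitch_emb = nn.Embedding(n_pitches, d_model)
        self.ref_encoder = ReferenceEncoder(n_bins, d_model)

        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, dim_feedforward=1024, batch_first=True)

        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=4)
        self.out = nn.Linear(d_model, n_bins)

    def forward(self, phonemes, durations, pitches, ref_spec):
        # phonemes/durations: (batch, n_phon); pitches: (batch, n_frames);
        # durations must sum to n_frames for each utterance.
        h = self.encoder(self.phoneme_emb(phonemes))   # phoneme-level encoding
        h = length_regulate(h, durations)              # expand to frame level
        h = h + self.pitch_emb(pitches) + self.ref_encoder(ref_spec)
        return self.out(self.decoder(h))               # (batch, n_frames, n_bins)


# Smoke test with random inputs: 10 phonemes x 5 frames each = 50 frames.
model = SingingModel()
spec = model(torch.randint(0, 80, (2, 10)),
             torch.full((2, 10), 5),
             torch.randint(0, 128, (2, 50)),
             torch.randn(2, 200, 513))
print(spec.shape)  # torch.Size([2, 50, 513])
```

Keeping the decoder non-autoregressive is what lets the model emit all spectrogram frames in parallel; the explicit durations from the alignment model remove the need for learned attention between text and audio.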
Numerical Results and System Evaluation
The performance evaluation of DeepSinger spans qualitative and quantitative analyses. Quantitatively, it achieves pitch accuracy exceeding 85% across all three languages, a noteworthy result given that human singers are reported to reach only around 80% on this metric. Qualitatively, DeepSinger earns competitive mean opinion scores (MOS) for voice naturalness, with only a small gap relative to the upper bound of synthesizing waveforms from ground-truth linear spectrograms with Griffin-Lim.
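The paper does not prescribe an exact implementation of the pitch-accuracy metric, so the following is only a plausible reconstruction: extract F0 from the synthesized and reference audio, quantize both to semitones, and count matching frames among those voiced in both signals. The choice of librosa.pyin and the semitone-level match criterion are assumptions for illustration.

```python
import numpy as np
import librosa


def pitch_accuracy(synth_wav, ref_wav, sr=22050):
    """Fraction of commonly-voiced frames whose F0 rounds to the same
    MIDI semitone in synthesized vs. reference audio. A plausible
    metric sketch, not the paper's exact procedure."""
    fmin, fmax = librosa.note_to_hz("C2"), librosa.note_to_hz("C7")
    f0_s, voiced_s, _ = librosa.pyin(synth_wav, fmin=fmin, fmax=fmax, sr=sr)
    f0_r, voiced_r, _ = librosa.pyin(ref_wav, fmin=fmin, fmax=fmax, sr=sr)
    n = min(len(f0_s), len(f0_r))
    both = voiced_s[:n] & voiced_r[:n]      # frames voiced in both signals
    if not both.any():
        return 0.0
    midi_s = np.round(librosa.hz_to_midi(f0_s[:n][both]))
    midi_r = np.round(librosa.hz_to_midi(f0_r[:n][both]))
    return float(np.mean(midi_s == midi_r))
```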
Implications and Future Directions
The implications of DeepSinger's approach to SVS extend beyond reduced cost and improved data accessibility. It represents a shift toward more open and scalable data-sourcing methodologies for AI systems, potentially serving as a blueprint for other domains that rely on large annotated datasets. The ability to capture timbre from reference audio also suggests practical applications in personalized content creation, where timbre control is crucial.
Theoretically, this paper contributes to the dialogue on unsupervised and lightly-supervised learning techniques, showing promise in marrying deep learning methodologies with real-world, noisy datasets.
Looking forward, this paper sets a precedent for further research into improving the robustness of SVS systems against noise and into more sophisticated synthesis back-ends, such as neural vocoders that outperform Griffin-Lim. A logical next step would be to integrate cross-lingual synthesis capabilities, a direction the paper itself identifies as promising.
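For context on the vocoding step, Griffin-Lim reconstructs a waveform from a magnitude-only spectrogram by iteratively re-estimating the discarded phase. The snippet below is a generic round-trip demonstration with librosa; the 440 Hz test tone and FFT settings are illustrative values, not the paper's configuration.

```python
import numpy as np
import librosa

# Griffin-Lim round trip: waveform -> linear magnitude spectrogram -> waveform.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # 1-second 440 Hz test tone

n_fft, hop = 1024, 256
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))   # phase discarded
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=hop, n_fft=n_fft)

# y_hat approximates y; the residual phase artifacts are precisely the
# quality gap that neural vocoders aim to close.
```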
Conclusion
In sum, the DeepSinger paper presents a comprehensive SVS system that leverages data mining, alignment optimization, and robust modeling techniques to generate high-quality multilingual singing synthesis. Its pioneering use of web-mined data paves the way for more accessible and cost-effective AI models across various research domains.