
Learning the Beauty in Songs: Neural Singing Voice Beautifier (2202.13277v2)

Published 27 Feb 2022 in eess.AS, cs.CL, cs.LG, cs.MM, and cs.SD

Abstract: We are interested in a novel task, singing voice beautifying (SVB). Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre. Current automatic pitch correction techniques are immature, and most of them are restricted to intonation but ignore the overall aesthetic quality. Hence, we introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task, which adopts a conditional variational autoencoder as the backbone and learns the latent representations of vocal tone. In NSVB, we propose a novel time-warping approach for pitch correction: Shape-Aware Dynamic Time Warping (SADTW), which ameliorates the robustness of existing time-warping approaches, to synchronize the amateur recording with the template pitch curve. Furthermore, we propose a latent-mapping algorithm in the latent space to convert the amateur vocal tone to the professional one. To achieve this, we also propose a new dataset containing parallel singing recordings of both amateur and professional versions. Extensive experiments on both Chinese and English songs demonstrate the effectiveness of our methods in terms of both objective and subjective metrics. Audio samples are available at https://neuralsvb.github.io. Code: https://github.com/MoonInTheRiver/NeuralSVB.

Citations (11)

Summary

  • The paper presents NSVB, a novel generative model that enhances singing by simultaneously improving intonation and vocal tone while preserving timbre.
  • It employs a CVAE framework combined with Shape-Aware Dynamic Time Warping to achieve precise pitch alignment beyond traditional methods.
  • Extensive tests on Chinese and English songs demonstrate significant reductions in F0 RMSE and improved pitch alignment compared to baseline models.

Neural Singing Voice Beautifier: Advancements in Singing Voice Enhancement

The paper "Learning the Beauty in Songs: Neural Singing Voice Beautifier" introduces an approach to Singing Voice Beautifying (SVB): improving both the intonation and the vocal tone of an amateur singer while preserving vocal timbre and lyrical content. It targets a limitation of current automatic pitch correction techniques, which mostly correct intonation and ignore the overall aesthetic quality of the singing voice.

This research presents the Neural Singing Voice Beautifier (NSVB), a generative model that employs a Conditional Variational Autoencoder (CVAE) framework to facilitate SVB. The model integrates a new approach to time-warping with the Shape-Aware Dynamic Time Warping (SADTW) algorithm, enhancing the robustness and alignment accuracy of pitch correction by considering the shape of the pitch curve instead of relying on low-level features.
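To make the time-warping idea concrete, here is a minimal sketch of classic dynamic time warping (DTW) between an amateur pitch curve and a template pitch curve. It is not the paper's SADTW: the frame-wise absolute pitch difference used as the local cost is exactly the kind of low-level comparison SADTW is said to move away from, and the function name dtw_path is hypothetical. A shape-oriented variant would replace the local cost with one computed from the local shape of each curve.

```python
import numpy as np

def dtw_path(a, b):
    """Classic DTW between two 1-D sequences.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching frame i of `a` to frame j of `b`.
    """
    n, m = len(a), len(b)
    # cost[i, j] = best accumulated cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # low-level, frame-wise cost
            cost[i, j] = local + min(cost[i - 1, j],      # repeat b[j-1]
                                     cost[i, j - 1],      # repeat a[i-1]
                                     cost[i - 1, j - 1])  # advance both
    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: cost[ij])
        path.append((i - 1, j - 1))
    return float(cost[n, m]), path[::-1]

# Toy pitch curves (arbitrary units): same melody, different timing.
amateur = np.array([1.0, 1.0, 2.0, 3.0])
template = np.array([1.0, 2.0, 2.0, 3.0])
total, path = dtw_path(amateur, template)
```

The returned path says which template frame corresponds to each amateur frame, which is the information a pitch-correction step needs in order to impose the template pitch on the amateur timing.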

Methodology and Model Structure

NSVB is built on a CVAE, which supports semi-supervised learning: unpaired, unlabeled singing data can be used to learn latent representations of vocal tone. SADTW is used to align the amateur recording with the template (reference) pitch curve.
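As background for the CVAE backbone, the sketch below shows the two generic ingredients of any variational autoencoder objective: the reparameterization trick and the KL divergence between a diagonal Gaussian posterior and a standard normal prior. The summary does not give the paper's exact loss, so the L1 reconstruction term, the beta weight, and all function names here are assumptions for illustration only.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Generic per-example VAE loss: reconstruction + beta * KL.

    The L1 reconstruction term and `beta` are illustrative assumptions,
    not the loss used by NSVB.
    """
    recon = np.mean(np.abs(x - x_recon), axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))
z = reparameterize(mu, log_var, rng)      # one latent sample, shape (1, 2)
kl = kl_to_standard_normal(mu, log_var)   # KL value per example
```

In a conditional VAE, the decoder would additionally receive conditioning inputs alongside z; in the SVB setting those would carry the information to be preserved or controlled, such as content, pitch, and timbre.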

Moreover, a latent-mapping algorithm is proposed to transform the amateur vocal tone latent variables into those representing a professional tone, while retaining the vocal timbre. For training the latent-mapping function, a new dataset, PopBuTFy, is introduced. This dataset consists of parallel singing recordings featuring both amateur and professional vocal qualities.
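The role of parallel data in latent mapping can be illustrated with a deliberately simple stand-in: fitting an affine map from amateur latents to professional latents by least squares. The summary does not specify the form of the paper's mapping function, so the affine form, the synthetic latents, and the function names below are all assumptions.

```python
import numpy as np

def fit_affine_latent_map(z_amateur, z_professional):
    """Least-squares affine map W such that [z_amateur, 1] @ W ~ z_professional.

    Both inputs have shape (num_pairs, latent_dim); row k of each array is
    assumed to come from the same song segment (parallel data).
    """
    ones = np.ones((z_amateur.shape[0], 1))
    design = np.hstack([z_amateur, ones])
    W, *_ = np.linalg.lstsq(design, z_professional, rcond=None)
    return W  # shape (latent_dim + 1, latent_dim)

def apply_latent_map(z, W):
    """Map amateur vocal-tone latents toward the professional region."""
    ones = np.ones((z.shape[0], 1))
    return np.hstack([z, ones]) @ W

# Synthetic paired latents standing in for encoder outputs.
rng = np.random.default_rng(0)
z_am = rng.standard_normal((50, 4))
true_M = rng.standard_normal((4, 4))
true_b = rng.standard_normal(4)
z_pro = z_am @ true_M + true_b

W = fit_affine_latent_map(z_am, z_pro)
z_mapped = apply_latent_map(z_am, W)
```

The point of the sketch is the data requirement: the map can only be fitted because each amateur latent has a professional counterpart from the same material, which is what a parallel dataset such as PopBuTFy provides.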

Experimental Results

Extensive experiments on both Chinese and English songs demonstrate the effectiveness of NSVB. The model is evaluated with objective metrics, namely Mel-Cepstral Distortion (MCD) and F0 Root Mean Square Error (F0 RMSE), and with subjective Mean Opinion Scores (MOS) for audio quality and vocal tone quality. Results indicate that NSVB significantly reduces F0 RMSE, that is, it brings the pitch of amateur recordings closer to the reference.
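For reference, the two objective metrics can be sketched as follows, using their commonly cited definitions. Several details are assumptions rather than the paper's evaluation protocol: F0 is taken in Hz with 0 marking unvoiced frames, only frames voiced in both signals are scored, and the mel-cepstral inputs are assumed to be time-aligned with the energy (0th) coefficient already removed.

```python
import numpy as np

def f0_rmse(f0_ref, f0_pred):
    """RMSE between two F0 tracks over frames voiced in both (F0 > 0)."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_pred = np.asarray(f0_pred, dtype=float)
    voiced = (f0_ref > 0) & (f0_pred > 0)
    err = f0_ref[voiced] - f0_pred[voiced]
    return float(np.sqrt(np.mean(err**2)))

def mel_cepstral_distortion(mc_ref, mc_pred):
    """Mean MCD in dB over frames.

    Inputs have shape (num_frames, num_coeffs) and are assumed to be
    time-aligned, with the 0th coefficient excluded by the caller.
    """
    diff = np.asarray(mc_ref, dtype=float) - np.asarray(mc_pred, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

# Third frame is unvoiced in the reference, so it is ignored.
rmse = f0_rmse([100.0, 200.0, 0.0], [110.0, 190.0, 150.0])

mc_a = np.zeros((3, 4))
mc_b = np.zeros((3, 4))
mc_b[:, 0] = 1.0  # each frame differs by 1.0 in one coefficient
mcd = mel_cepstral_distortion(mc_a, mc_b)
```

Lower is better for both metrics; MOS, by contrast, comes from human listeners and cannot be computed from the signals alone.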

Compared with baselines that use conventional time-warping algorithms, NSVB achieves higher Pitch Alignment Accuracy (PAA) and better vocal tone quality in the synthesized output. SADTW also improves alignment accuracy over existing methods such as Canonical Time Warping (CTW).

Implications and Future Directions

NSVB represents a substantial step towards automated singing voice enhancement, providing practical applications in the music and entertainment industries where high-quality vocal performance is essential. The ability to process and beautify amateur singing with minimal manual intervention could democratize access to professional-grade singing outputs, potentially transforming home studio production dynamics.

In terms of theoretical implications, this work opens avenues for more advanced models focusing on fine-grained aspects of vocal tone improvement and adaptive learning with unpaired data. Future developments in AI could leverage this framework to refine the separation and enhancement of various elements of singing, such as emotion, style, and articulation.

The algorithms and dataset introduced in this research are a useful resource for researchers applying machine learning to audio processing and synthesis. Models like NSVB may encourage further work on personalized and context-aware audio beautification.
