Robust Zero-Shot Singing Voice Conversion in the kNN-SVC Framework
This paper introduces a robust methodology for zero-shot singing voice conversion (SVC) by extending the kNN-VC framework. The work centers on two techniques: additive synthesis to enrich harmonic content, and concatenation smoothness optimization to improve perceptual quality. Integrated into the resulting kNN-SVC framework, these methods address the limitations of the predecessor kNN-VC and improve upon existing SVC solutions.
The paper identifies two critical issues in conventional singing voice conversion models. First, kNN-VC's representation lacks the harmonic emphasis that SVC requires: its WavLM features are not well attuned to the nuances of pitch and timbre that dominate singing. Second, concatenation smoothness, a perceptual necessity in SVC, is often overlooked. The authors propose remedies that preserve and extend the robustness and applicability of the kNN-VC framework to broader contexts of neural synthesis.
Additive Synthesis for Enhanced Harmonic Representation
The first major contribution introduces additive synthesis (AS) to address the harmonic shortcomings. Exploiting the close correspondence between WavLM features, pitch contours, and spectrograms, the method injects harmonic information directly: harmonic-rich synthesized waveforms are embedded into the model via convolutional layers. Experimental evaluations verify that this improves the naturalness, clarity, and overall fidelity of the converted singing voice.
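To make the idea concrete, here is a minimal sketch of additive synthesis: a harmonic-rich waveform is rendered from a frame-level pitch contour by summing sinusoids at integer multiples of f0. The hop size, harmonic count, and 1/k amplitude roll-off are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def additive_synthesis(f0_frames, sr=16000, hop=320, n_harmonics=8):
    """Render a harmonic-rich waveform from a frame-level f0 contour.

    Illustrative sketch: each voiced frame contributes sinusoids at
    integer multiples of f0 with 1/k roll-off; unvoiced frames (f0 == 0)
    stay silent. The paper's actual synthesis recipe may differ.
    """
    # Upsample the frame-level contour to sample rate by repetition.
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)
    # Instantaneous phase is the running integral of frequency.
    phase = 2.0 * np.pi * np.cumsum(f0) / sr
    out = np.zeros_like(f0)
    for k in range(1, n_harmonics + 1):
        # Drop harmonics that would alias above the Nyquist frequency.
        alias_free = (k * f0) < (sr / 2.0)
        out += (1.0 / k) * np.sin(k * phase) * alias_free
    # Silence unvoiced regions and normalise the peak amplitude.
    out *= f0 > 0
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out
```

For example, `additive_synthesis([220.0] * 50)` yields one second of a 220 Hz harmonic tone whose spectrum peaks at the fundamental, the kind of pitch-locked signal the convolutional layers can absorb.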
Concatenation Smoothness Optimization
The second technique targets concatenation smoothness, which is pivotal to seamless voice conversion. The authors present a novel distance metric that incorporates temporal concatenation costs, refining the selection of nearest-neighbor candidates. The approach, named Concatenation Smoothness Optimization (CSO), combines an autoregressive reselection process with weight-optimization strategies: autoregressive candidate selection ensures temporal coherence, while optimized weighted summing aligns the selected candidates, minimizing perceptual artifacts such as slurring or trembling.
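The autoregressive selection step can be sketched as follows. This is a hypothetical, simplified stand-in for the paper's concatenation-cost formulation: each query frame's score combines its feature distance with a transition cost toward the previously selected frame, weighted by an assumed coefficient `lam`.

```python
import numpy as np

def smooth_knn_select(queries, pool, k=4, lam=0.5):
    """Autoregressively pick one matching-pool frame per query frame.

    Score = feature distance to the query + lam * distance to the frame
    chosen at the previous step, so consecutive selections stay close in
    feature space. Simplified sketch, not the paper's exact metric.
    """
    # Pairwise squared Euclidean distances, shape (queries, pool).
    d = ((queries[:, None, :] - pool[None, :, :]) ** 2).sum(-1)
    selected, prev = [], None
    for t in range(len(queries)):
        # Restrict to the k nearest candidates for this query frame.
        cand = np.argsort(d[t])[:k]
        if prev is None:
            choice = cand[0]
        else:
            # Penalise candidates far from the previous selection.
            trans = ((pool[cand] - pool[prev]) ** 2).sum(-1)
            choice = cand[np.argmin(d[t, cand] + lam * trans)]
        selected.append(choice)
        prev = choice
    return np.array(selected)
```

With `lam = 0` this reduces to plain nearest-neighbor lookup; raising `lam` trades per-frame accuracy for smoother trajectories, which is precisely the tension CSO manages.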
Experimental Evaluation and Results
The authors conduct rigorous experiments across multiple datasets, including LibriSpeech for speech conversion and OpenSinger and NUS48E for singing voice tasks. Objective measures such as EER and subjective metrics such as MOS and SIM show that the augmented kNN-SVC model outperforms its predecessors. Particularly noteworthy is the reported gain in speaker similarity, illustrating the model's ability to maintain audio quality and integrity even under zero-shot conditions.
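For readers unfamiliar with the objective metric, EER is the operating point where the false-accept and false-reject rates of a speaker-verification system coincide (lower is better, i.e. converted audio is harder to distinguish from the target speaker). A minimal threshold-sweep sketch, not the evaluation code used in the paper:

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER from verification scores: find the threshold where the
    false-accept rate (FAR) and false-reject rate (FRR) meet.

    genuine: scores for same-speaker pairs (higher means accept);
    impostor: scores for different-speaker pairs. Production toolkits
    interpolate between thresholds; this sweep suffices for intuition.
    """
    genuine = np.asarray(genuine, dtype=np.float64)
    impostor = np.asarray(impostor, dtype=np.float64)
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= thr)  # impostors wrongly accepted
        frr = np.mean(genuine < thr)    # genuine pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

Perfectly separated score distributions give an EER of 0; overlapping distributions push it toward 0.5, so a drop in EER directly reflects the speaker-similarity gains the paper reports.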
Practical and Theoretical Implications
The proposed advancements carry significant implications for both practical applications and theoretical developments in AI-driven audio processing. Practically, the improvements serve a growing industry need for high-fidelity voice conversion tools that require no extensive training data for new speakers. Theoretically, the model's reliance on non-parametric matching (kNN) combined with classical signal techniques such as additive synthesis challenges paradigms that focus heavily on parametric approaches, and could set a precedent for future research into non-parametric, self-supervised techniques in audio synthesis.
Future Directions
This research opens avenues for further exploration of non-parametric models and their integration with advanced machine learning systems. Future work could adapt these methods to broader linguistic datasets and investigate other aspects of voice conversion, such as emotion and expressiveness, further enriching the listener's experience. More sophisticated distance metrics and optimization strategies could also refine the candidate-selection process, enhancing the coherence and naturalness of the synthesized output.
Overall, the paper's contributions represent a considered and methodical advance in zero-shot singing voice conversion: they address core issues and propose solutions with measurable improvements in synthesized voice quality and robustness, setting a new standard in the field.