- The paper introduces a decomposition method that encodes lyrics, rhythm, pitch, and speaker identity into interpretable formats.
- It employs Cotatron for alignment and RAPT for fundamental frequency estimation, supporting high synthesis fidelity.
- Empirical results show that the system can modify individual vocal attributes with minimal training data, enabling versatile music production.
Essay on "Controllable and Interpretable Singing Voice Decomposition via Assem-VC"
The paper "Controllable and Interpretable Singing Voice Decomposition via Assem-VC" introduces a novel approach to singing voice synthesis that gives users control over key vocal attributes such as lyrics, rhythm, pitch, and timbre. The work marks a significant advance by addressing a limitation of prior models, which required fine-grained, time-aligned inputs such as MIDI and thereby restricted their use to people with musical expertise.
Methodological Framework
The crux of the paper is its methodology: Assem-VC, a many-to-many voice conversion system, is adapted to decompose and resynthesize singing voices. The decomposition encodes four core attributes (linguistic content, rhythm, pitch, and speaker identity) into interpretable formats, allowing manipulation at the phoneme level without time-aligned inputs and thereby opening music synthesis to users who cannot read musical scores. A schematic of such a representation is sketched below.
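To make the decomposition concrete, here is a minimal sketch of what an interpretable four-way representation might look like; the class, field names, and the transpose helper are illustrative assumptions, not the authors' actual interface.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DecomposedSong:
    """Hypothetical container for the four attributes the paper disentangles."""
    phonemes: list          # linguistic content: phoneme symbols, editable individually
    durations: np.ndarray   # rhythm: number of frames assigned to each phoneme
    f0: np.ndarray          # pitch: absolute F0 contour in Hz, one value per frame
    speaker_id: int         # timbre: index into a learned speaker embedding table

    def transpose(self, semitones: float) -> "DecomposedSong":
        """Return a copy with the melody shifted; unvoiced frames (F0 == 0) stay 0."""
        shifted = self.f0.copy()
        voiced = shifted > 0
        shifted[voiced] *= 2.0 ** (semitones / 12.0)
        return DecomposedSong(self.phonemes, self.durations, shifted, self.speaker_id)
```

Because each attribute is stored in a human-readable form, edits such as swapping a phoneme, stretching a duration, or transposing the F0 contour remain independent of one another.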
The architecture leverages several key components: Cotatron serves as the alignment encoder, and RAPT estimates the fundamental frequency, with the absolute F0 sequence used directly so that pitch stays interpretable (a minimal extraction sketch follows below). A further deliberate choice is a speaker embedding table rather than a speaker encoder network, which mitigates overfitting in data-limited scenarios and underscores the care taken in the system design.
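As a concrete illustration, here is a minimal sketch of absolute-F0 extraction with RAPT via the pysptk library; the hop size, pitch bounds, and file name are assumed values for illustration, not the authors' exact settings.

```python
import numpy as np
import librosa
import pysptk

# Load a monophonic vocal track; 22.05 kHz is a common choice in this line of work.
wav, sr = librosa.load("vocal.wav", sr=22050)

hop_length = 256  # analysis hop in samples (an assumed, typical value)

# RAPT expects float32 samples; scaling to the 16-bit range is common practice.
# otype="f0" returns one F0 value in Hz per frame, with 0.0 for unvoiced frames.
f0 = pysptk.rapt(
    (wav * 32768).astype(np.float32),
    fs=sr,
    hopsize=hop_length,
    min=60.0,    # assumed lower pitch bound (Hz)
    max=800.0,   # assumed upper bound, generous for singing voices
    otype="f0",
)
```

The resulting contour can be edited directly, for example transposed or flattened, before resynthesis, which is what makes the absolute F0 sequence interpretable.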
Numerical Results and Observations
The paper presents empirical results demonstrating that the system can modify individual song attributes in isolation. Mel spectrograms illustrate the transformations achieved by controlling lyrics, rhythm, pitch, and timbre (a simple recipe for such visualizations appears below). Notably, the authors create a synchronized duet from a mere two minutes of training data of one author's voice, which exemplifies the system's potential in practical applications.
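For readers who want to inspect such edits themselves, the following is a minimal sketch of how before-and-after mel spectrograms can be computed and plotted with librosa; the file name and analysis parameters (80 mel bands, 1024-sample FFT, 256-sample hop) are common defaults assumed here, not settings taken from the paper.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a synthesized (or original) vocal and compute an 80-band mel spectrogram.
wav, sr = librosa.load("converted_vocal.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels

# Plot with time on the x-axis and mel frequency on the y-axis.
librosa.display.specshow(mel_db, sr=sr, hop_length=256, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram of a converted vocal")
plt.tight_layout()
plt.show()
```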
However, the paper also notes challenges, most notably artifacts produced during vocal synthesis, which warrant further refinement. Artifacts aside, the system demonstrates impressive fidelity under flexible voice control, a substantial advance over prior approaches that could not disentangle individual vocal attributes.
Theoretical and Practical Implications
This work carries considerable theoretical implications: by decomposing the singing voice into manipulable attributes, it opens new avenues for understanding and modeling audio signals holistically. Practically, the system is poised to reshape areas such as music production, content creation, and entertainment by enabling non-expert users to engage creatively with vocal synthesis.
The proposed approach can also be seen as a foundational step towards more complex and user-friendly musical AI systems. By making intricate elements accessible, it could serve as a basis for educational tools or creative platforms that further bridge the gap between technology and art.
Future Directions
Looking ahead, addressing the spectral artifacts in the synthesized singing voices, as identified in the paper, will be crucial for improving output quality. In addition, broadening dataset diversity across musical genres may improve the model's adaptability and robustness, and more advanced neural network techniques may further enhance its capacity to handle complex voice modulations, paving the way for more nuanced and expressive vocal synthesis.
In summary, the research presents a structured and capable system that advances the current understanding and application of singing voice synthesis. By enabling detailed yet approachable control over vocal attributes, this paper contributes meaningfully to the evolving intersection of artificial intelligence and musical creativity.