- The paper introduces SingSong, an AI system that combines musical source separation with an adapted AudioLM to generate instrumental accompaniments directly from sung vocal input.
- SingSong adapts an unconditional generative model for conditional audio generation and produces accompaniments that listeners prefer over retrieval-based baselines.
- Its implications include making music creation more accessible and opening avenues for synthesizing other instrumental parts and for adaptive source separation applications.
SingSong: Generating Musical Accompaniments from Singing
The paper presents SingSong, a system that generates instrumental accompaniments for singing input, giving both musicians and non-musicians an intuitive way to create music centered on their own vocal performances. Building on advances in musical source separation and audio generation, the authors develop a system that produces coherent instrumental backing tracks aligned with the input vocals.
Methodology
SingSong rests on two technologies: musical source separation and generative audio modeling. A state-of-the-art source separation algorithm splits a large dataset of music tracks into paired vocal and instrumental stems, producing the aligned training data the model needs (a data-preparation sketch follows below). The authors then adapt AudioLM, a modern generative audio model, for conditional generation: given vocals as input, the model learns to produce instrumental audio that can be mixed back with the user's vocals.
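To make the data-preparation step concrete, here is a minimal sketch of how (vocal, instrumental) training pairs could be assembled. The `separate_vocals` stub stands in for a pretrained source-separation model (e.g., a Demucs-style network); it is a placeholder so the example stays self-contained, not a reproduction of SingSong's actual pipeline.

```python
import numpy as np

def separate_vocals(mix: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Placeholder for a pretrained source-separation model.
    Here it simply returns a silent vocal stem so the sketch runs
    without external dependencies; in practice a real separator
    would estimate the vocal stem from the mixture."""
    vocals = np.zeros_like(mix)
    instrumental = mix - vocals
    return vocals, instrumental

def make_training_pairs(mixes: list[np.ndarray]) -> list[tuple[np.ndarray, np.ndarray]]:
    """Turn full mixes into (vocal, instrumental) pairs: the vocal stem
    becomes the conditioning input, the instrumental stem the target."""
    pairs = []
    for mix in mixes:
        vocals, instrumental = separate_vocals(mix)
        pairs.append((vocals, instrumental))
    return pairs

# Example: two fake 3-second mono clips at 16 kHz.
sr = 16_000
mixes = [np.random.randn(3 * sr).astype(np.float32) for _ in range(2)]
pairs = make_training_pairs(mixes)
print(len(pairs), pairs[0][0].shape)
```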
A significant challenge was getting the model to generalize from training data built from source-separated vocals to the truly isolated vocal inputs expected from real users. Early models tended to exploit faint instrumental artifacts left in the separated vocals, reconstructing the original accompaniment rather than generating a meaningful one for clean, isolated vocals. To mitigate this, the paper describes changes to the input featurization, including adding noise to the conditioning vocals and adjusting which representation levels are used during conditioning (illustrated in the sketch below).
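As a rough illustration of the noise-augmentation idea, the sketch below adds white noise to the conditioning vocals at a fixed signal-to-noise ratio, with the intent of masking residual instrumental artifacts left by source separation. The noise type and the SNR value are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def noisy_vocal_features(vocal: np.ndarray, snr_db: float = 20.0,
                         rng: np.random.Generator | None = None) -> np.ndarray:
    """Add white noise to the conditioning vocals at a target SNR (dB),
    so the model cannot rely on faint leftover instrumental content."""
    rng = rng or np.random.default_rng(0)
    signal_power = float(np.mean(vocal ** 2)) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=vocal.shape)
    return (vocal + noise).astype(vocal.dtype)

vocal = np.random.randn(16_000).astype(np.float32)  # 1 s of fake vocals at 16 kHz
print(noisy_vocal_features(vocal, snr_db=20.0).shape)
```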
Results
Quantitatively, the system was evaluated with Fréchet Audio Distance (FAD); the improved featurization yielded markedly better FAD when conditioning on isolated vocals (the standard FAD computation is sketched below). In listening tests, participants preferred instrumentals generated by SingSong over those from retrieval-based baselines, suggesting the system produces musically compatible accompaniments. Scaling experiments from base to XL configurations also showed improved qualitative results and higher listener preference with larger models.
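For reference, FAD fits a Gaussian to embeddings of reference audio and of generated audio and measures the Fréchet distance between the two distributions. The sketch below shows that standard computation on placeholder embeddings; the embedding model (typically VGGish) and the paper's exact evaluation protocol are not reproduced here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two sets of audio embeddings
    (rows = clips, columns = embedding dimensions)."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical error can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy example with random "embeddings" in place of real model outputs.
rng = np.random.default_rng(0)
real = rng.normal(size=(64, 128))
fake = rng.normal(loc=0.1, size=(64, 128))
print(round(frechet_audio_distance(real, fake), 3))
```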
Implications
The implications of SingSong are both practical and theoretical. Practically, the system offers an accessible route to music creation, letting users bypass the barrier of instrumental proficiency. Theoretically, SingSong's successful adaptation of an unconditional generative model suggests that such models can be applied more broadly to conditional audio generation tasks. The coupling of source separation with generative modeling also opens prospects for synthesizing diverse instrumental accompaniments and for adaptive source separation applications.
Future Directions
Future work could explore alternative featurization strategies for the vocal input to improve harmonic correspondence in the generated accompaniments. Raising the sampling rate beyond the current 16 kHz would improve fidelity and make the output more usable in professional contexts. Another avenue is cross-instrument generation, training systems to accompany inputs other than vocals by leveraging existing source separation networks.
Conclusion
As a music generation system, SingSong's adaptation of AudioLM to a conditional task marks an important step in applying AI models to creative audio work. Beyond its insights into generative modeling, the paper lays out paths for further refinement: higher audio fidelity, better conditioning strategies, and broader generative capabilities. This line of research helps reshape how music can be created and represented as AI-driven creative tools continue to evolve.