Audio Conditioning for Music Generation via Discrete Bottleneck Features
Overview
The paper "Audio Conditioning for Music Generation via Discrete Bottleneck Features" presents a novel approach to music generation that employs audio inputs to condition a LLM, diverging from the more traditional textual or parametric conditioning methods. The authors outline two primary strategies for this novel input: textual inversion and a jointly trained style conditioner.
Key Contributions
- Adaptation of Textual Inversion: The authors adapt the textual inversion method from the domain of image generation to a pre-trained text-to-music model. By optimizing a textual embedding through backpropagation, they establish a mechanism for audio conditioning without retraining the model from scratch (a sketch follows this list).
- Style Conditioner Design: A style conditioner trained jointly with the text-to-music model is introduced. It passes the reference audio through a frozen audio feature extractor, a transformer encoder, a Residual Vector Quantizer (RVQ), and temporal downsampling, allowing the model to leverage audio waveforms and textual descriptions simultaneously (see the architectural sketch below).
- Double Classifier-Free Guidance: Because audio carries far more information than text, the authors develop a double classifier-free guidance method that balances textual and audio conditioning at inference time (one possible formulation is sketched below).
- Novel Objective Metrics: To validate their approach, the authors introduce objective metrics based on nearest-neighbor searches in latent spaces, which they corroborate through human evaluations (an illustrative metric is sketched after this list).
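The sketch below illustrates the textual-inversion idea in PyTorch: a single pseudo-token embedding is optimized by backpropagation while the music language model stays frozen. The `music_lm` and `audio_tokenizer` interfaces (and their argument names) are hypothetical stand-ins, not the paper's actual API.

```python
# Minimal sketch of textual inversion against a frozen text-to-music language model.
# `music_lm` and `audio_tokenizer` (and their argument names) are hypothetical
# stand-ins for the pre-trained components; only the pseudo-token embedding is trained.
import torch
import torch.nn.functional as F

def textual_inversion(music_lm, audio_tokenizer, reference_audio,
                      dim=1024, steps=500, lr=1e-2):
    # Tokenize the reference audio once with the frozen codec/tokenizer.
    with torch.no_grad():
        target_tokens = audio_tokenizer(reference_audio)      # (T, n_codebooks), long

    # The only trainable parameter: a single pseudo-token embedding.
    pseudo_embedding = torch.nn.Parameter(0.01 * torch.randn(1, dim))
    optimizer = torch.optim.Adam([pseudo_embedding], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # The frozen LM is conditioned on the learned embedding and scores the target tokens.
        logits = music_lm(condition_embeddings=pseudo_embedding,
                          audio_tokens=target_tokens)          # (T, n_codebooks, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                               target_tokens.reshape(-1))
        loss.backward()   # gradients reach only the pseudo-embedding
        optimizer.step()

    return pseudo_embedding.detach()
```

The resulting embedding can then be used in place of, or alongside, ordinary text embeddings at generation time.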
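Next, a minimal sketch of the style-conditioner pipeline as described: frozen feature extractor, transformer encoder, residual vector quantization, and temporal downsampling. Dimensions, layer counts, and the `feature_extractor` interface are illustrative assumptions, and training details such as the straight-through estimator and commitment losses are omitted.

```python
# Sketch of the style-conditioner pipeline: frozen feature extractor, transformer
# encoder, residual vector quantization, and temporal downsampling. Sizes and the
# `feature_extractor` interface are illustrative assumptions; quantizer training
# details (straight-through estimator, commitment losses) are omitted.
import torch
import torch.nn as nn

class StyleConditioner(nn.Module):
    def __init__(self, feature_extractor, feat_dim=768, model_dim=512,
                 n_quantizers=4, codebook_size=1024, downsample=8):
        super().__init__()
        self.feature_extractor = feature_extractor   # frozen audio feature extractor
        for p in self.feature_extractor.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(feat_dim, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # One codebook per RVQ stage; each stage quantizes the previous stage's residual.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, model_dim) for _ in range(n_quantizers)])
        self.downsample = downsample

    def quantize(self, x):
        residual, quantized = x, torch.zeros_like(x)
        for codebook in self.codebooks:
            table = codebook.weight.expand(x.size(0), -1, -1)            # (B, K, D)
            codes = codebook(torch.cdist(residual, table).argmin(dim=-1))  # nearest entries
            quantized = quantized + codes
            residual = residual - codes
        return quantized

    def forward(self, waveform):
        with torch.no_grad():
            feats = self.feature_extractor(waveform)    # assumed to return (B, T, feat_dim)
        h = self.encoder(self.proj(feats))              # (B, T, model_dim)
        h = self.quantize(h)                            # discrete bottleneck
        return h[:, ::self.downsample]                  # temporal downsampling
```

The RVQ stage is the discrete bottleneck of the title: it restricts how much information from the reference audio reaches the language model, which is central to trading off stylistic adherence against variety.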
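One plausible formulation of a double classifier-free guidance step is sketched below: it blends unconditional, audio-only, and text-plus-audio logits so that the text signal is not drowned out by the richer audio conditioning. The exact weighting used in the paper may differ, and the `model` call signature and default weights are assumed for illustration.

```python
# Illustrative double classifier-free guidance: three forward passes are blended
# at sampling time. The call signature of `model` and the default weights are
# assumptions for the sake of the example.
def double_cfg_logits(model, tokens, text_cond, audio_cond, alpha=3.0, beta=1.5):
    l_uncond = model(tokens, text=None, audio=None)            # unconditional
    l_audio = model(tokens, text=None, audio=audio_cond)       # audio conditioning only
    l_full = model(tokens, text=text_cond, audio=audio_cond)   # text + audio
    # Move from unconditional toward the audio condition, then add a separate
    # text-direction term so both modalities can be weighted independently.
    return l_uncond + alpha * (l_audio - l_uncond) + beta * (l_full - l_audio)
```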
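Finally, a hedged illustration of a nearest-neighbor check in an embedding space. Given paired embeddings of conditioning excerpts and generations, produced by any audio encoder (for example a CLAP-like model), it reports how often a generation's nearest reference is its own conditioning clip; this conveys the general idea rather than the paper's exact metric.

```python
# Toy nearest-neighbor match rate between paired generation and reference embeddings.
# Row i of `gen_embeddings` is assumed to be the clip generated from row i of
# `ref_embeddings`; the embedding model itself is outside the scope of this sketch.
import torch
import torch.nn.functional as F

def nn_match_rate(gen_embeddings: torch.Tensor, ref_embeddings: torch.Tensor) -> float:
    gen = F.normalize(gen_embeddings, dim=-1)   # (N, D)
    ref = F.normalize(ref_embeddings, dim=-1)   # (N, D)
    sims = gen @ ref.T                          # cosine similarities, (N, N)
    nearest = sims.argmax(dim=-1)               # index of each generation's closest reference
    matches = nearest == torch.arange(len(gen), device=nearest.device)
    return matches.float().mean().item()
```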
Numerical Results
The implementation and evaluation of these methods reveal several significant results:
- The jointly trained style conditioner achieves a Fréchet Audio Distance (FAD, lower is better) of 0.85, outperforming both the baseline continuation method (1.22) and a model using CLAP embeddings (0.96).
- On subjective evaluations, the model also scores well on "Overall Quality" (OVL), "Similarity" (SIM), and "Variation" (VAR), indicating a good balance between close stylistic adherence to the reference and variety in the generated music.
Implications and Future Directions
Practical Implications
The models and methods described in the paper offer a versatile tool for music creators, providing a way to generate music that remains coherent in style while incorporating both textual and audio inputs. This flexibility could significantly enhance content creation platforms and music production workflows by allowing fine-grained control over generated outputs.
Theoretical Implications
From a theoretical perspective, the introduction of audio conditioning through a discrete bottleneck provides a promising avenue for future research. The double classifier free guidance method, in particular, offers a new approach to balancing multiple forms of conditioning, potentially applicable to other generative models beyond music.
Speculation on Future Developments
Looking forward, the integration of more sophisticated audio feature extractors and further refinements to the RVQ and temporal downsampling techniques could enhance the fidelity and creative potential of these models. Additionally, expanding the scope of conditioning inputs to include other contextual data, such as user interaction patterns or even visual cues, could create even richer generative frameworks.
Conclusion
This paper marks a significant advancement in the field of music generation by demonstrating the feasibility and benefits of using discrete bottleneck features for audio conditioning. Through comprehensive experimentation and evaluation, the authors have provided a robust framework that sets the stage for future innovations in AI-driven music creation. The balanced interplay between textual and audio inputs, facilitated by innovative guidance methods and bottleneck designs, offers a compelling case for the broader adoption of these techniques in various AI-driven creative applications.