Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation
The paper "Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation" presents Jasco, a model that integrates symbolic and audio-based conditioning to generate temporally controlled, high-quality music. The model builds on a Flow Matching paradigm to exploit both global textual descriptions and fine-grained local controls such as chords and melody.
Jasco’s primary contribution lies in its ability to incorporate multiple forms of conditioning, both symbolic and audio-based, at different temporal resolutions. The model achieves this through information-bottleneck layers and temporal blurring, which restrict each condition to the information most relevant for generating high-fidelity music samples. Prior models, which focused largely on global textual descriptions alone, did not address this combination of controls comprehensively.
Methodology
The methodology employs a Continuous Normalizing Flow, specifically a Conditional Flow Matching (CFM) approach. Training optimizes a regression loss that predicts the vector field of the continuous latent audio variable, given the flow timestep and the conditioning signals.
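To make the training objective concrete, the following is the standard conditional flow-matching regression loss that this setup instantiates; the linear (optimal-transport) interpolation path shown is an assumption based on common CFM practice rather than a detail stated in this summary:

```latex
% Linear (optimal-transport) path between noise x_0 and data latent x_1:
%   x_t = (1 - t) x_0 + t x_1, with target vector field u_t = x_1 - x_0.
\mathcal{L}_{\mathrm{CFM}}(\theta) =
  \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim q(x_1)}
  \big\| v_\theta\big((1 - t)\,x_0 + t\,x_1,\; t,\; c\big) - (x_1 - x_0) \big\|^2
```

Here $v_\theta$ is the learned vector field and $c$ collects the text, symbolic, and audio conditions.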
The model’s architecture includes temporal controls for symbolic elements (chord progression and melody) and audio elements (general audio and drum stems). Chord progressions and melodies are extracted using pre-trained models and projected into low-dimensional embeddings for conditioning. These conditions are injected by concatenation along the channel dimension, which the authors show adheres to temporal controls better than additive methods such as cross-attention or zero-convolutions.
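A minimal PyTorch sketch of this concatenation-based injection, assuming frame-aligned chord indices and a melody chromagram; the class name, vocabulary size, and dimensions are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConcatConditioner(nn.Module):
    """Sketch: symbolic controls are projected to low-dimensional embeddings
    and concatenated with the noisy latent along the channel axis, then
    projected back to the model width. All dimensions are illustrative."""

    def __init__(self, latent_dim=128, n_chords=194, chroma_dim=12, cond_dim=16):
        super().__init__()
        self.chord_proj = nn.Embedding(n_chords, cond_dim)   # per-frame chord ids
        self.melody_proj = nn.Linear(chroma_dim, cond_dim)   # per-frame chroma
        self.in_proj = nn.Linear(latent_dim + 2 * cond_dim, latent_dim)

    def forward(self, z_t, chord_ids, melody_chroma):
        # z_t:           (B, T, latent_dim)  noisy latent at flow time t
        # chord_ids:     (B, T)              frame-aligned chord indices
        # melody_chroma: (B, T, chroma_dim)  frame-aligned melody chromagram
        cond = torch.cat([self.chord_proj(chord_ids),
                          self.melody_proj(melody_chroma)], dim=-1)
        return self.in_proj(torch.cat([z_t, cond], dim=-1))
```

Because the conditions are concatenated frame by frame, each latent timestep sees exactly the control active at that moment, which is what gives concatenation its tighter temporal adherence compared with additive injection.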
For the audio elements, the model uses pre-trained source-separation networks, such as Hybrid Demucs, to extract drum stems. The latent representations of these audio signals are then temporally blurred and filtered to form conditioning signals.
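A short sketch of temporal blurring as described above: latent frames are averaged within fixed windows and the mean is broadcast back to the original frame rate, discarding fine temporal detail while keeping coarse content. The window size is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def temporal_blur(latents: torch.Tensor, window: int = 5) -> torch.Tensor:
    """Average latent frames over non-overlapping windows and repeat each
    window mean per frame. `window` is an illustrative hyperparameter."""
    b, t, d = latents.shape
    pad = (-t) % window                                      # pad time to a multiple of window
    x = F.pad(latents, (0, 0, 0, pad))                       # (B, T', D)
    x = x.view(b, -1, window, d).mean(dim=2, keepdim=True)   # per-window means
    x = x.expand(-1, -1, window, -1).reshape(b, -1, d)       # broadcast means back
    return x[:, :t]                                          # trim the padding
```

This acts as an information bottleneck: the conditioning signal tells the model roughly what is playing in each window without dictating the exact waveform.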
Experimental Setup and Results
The paper reports extensive experiments on widely used benchmarks such as the MusicCaps dataset. The model is evaluated on several fronts, including Fréchet Audio Distance (FAD), Kullback-Leibler divergence (KL), CLAP score (audio-text alignment), melody similarity, and rhythmic alignment with the conditions.
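As a rough illustration of what a condition-adherence metric measures, here is a simplified chroma-based similarity sketch. It is a stand-in assuming frame-wise cosine similarity over chromagrams, not the paper's exact melody-similarity definition, and the function name and parameters are hypothetical:

```python
import numpy as np
import librosa

def chroma_cosine_similarity(ref_path: str, gen_path: str, sr: int = 32000) -> float:
    """Proxy for melody similarity: mean frame-wise cosine similarity
    between chromagrams of reference and generated audio."""
    ref, _ = librosa.load(ref_path, sr=sr, mono=True)
    gen, _ = librosa.load(gen_path, sr=sr, mono=True)
    n = min(len(ref), len(gen))                              # align lengths
    c_ref = librosa.feature.chroma_stft(y=ref[:n], sr=sr)    # (12, frames)
    c_gen = librosa.feature.chroma_stft(y=gen[:n], sr=sr)
    num = (c_ref * c_gen).sum(axis=0)
    den = np.linalg.norm(c_ref, axis=0) * np.linalg.norm(c_gen, axis=0) + 1e-8
    return float((num / den).mean())
```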
The results demonstrate superior performance on condition-adherence metrics such as melody similarity and chord-progression accuracy. Notably, Jasco achieves better or comparable FAD and KL scores, indicating that audio quality is maintained. Human evaluations corroborate these findings, showing that Jasco generates music that aligns well with both the textual description and the local temporal controls.
Implications and Future Directions
Jasco provides a robust framework for music generation, accommodating a spectrum of controls without compromising quality. This capability opens new avenues for content creators and musicians, enabling them to use AI for complex musical compositions with a higher degree of control over the generated output.
Future work may extend the model to support additional controls such as musical dynamics and broader structural elements of music (e.g., intro, verse, and chorus sections). Another promising direction is supporting longer and more intricate compositions, addressing current limits on sample length and generation time. Enhancements may also include finer disentanglement to better manage overlapping audio and symbolic conditions.
Conclusion
The paper marks a significant advance in AI-driven music generation, showing that combining symbolic and audio-based conditioning can yield highly controlled, high-quality musical outputs. By leveraging the Flow Matching paradigm and introducing effective temporal-control techniques, Jasco sets a new benchmark for conditional text-to-music models. These contributions enhance creative processes for artists and set the stage for more sophisticated and versatile applications of AI-generated music.
This work offers a valuable resource for further research in AI-driven music composition, highlighting the potential of more dynamic, user-controlled music generation tools in the future.