
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework (2501.08809v1)

Published 15 Jan 2025 in cs.SD, cs.AI, and eess.AS

Abstract: In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is https://xmusic-project.github.io.

Summary

  • The paper introduces XMusic, a framework for controllable symbolic music generation that accepts diverse inputs (text, image, video, etc.) to align music with specific emotions and genres.
  • XMusic utilizes XProjector for multi-modal input parsing and XComposer (Generator, Selector) with an improved symbolic representation to generate and select high-quality music outputs.
  • The framework introduces the large-scale XMIDI dataset and demonstrates significant improvements in quality and emotional expressiveness compared to existing state-of-the-art methods.

Overview of XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework

This paper presents XMusic, a framework for generalized and controllable symbolic music generation that addresses current limitations in the quality and emotional control of AI-generated music. XMusic is notable for its flexibility in supporting diverse prompts, including images, videos, texts, tags, and humming, and for generating music that aligns with specified emotion and genre attributes.

Core Components of XMusic

XMusic is structured around two main components: XProjector and XComposer. XProjector serves as a multi-modal parser that decodes input prompts, irrespective of their source, into symbolic music elements such as emotion, genre, rhythm, and notes. This functionality is pivotal for adjusting the emotional texture of the generated music based on the input's implicit or explicit cues.
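The parsing step above can be sketched as a dispatch into a shared projection space. This is a minimal illustrative sketch: the class and function names, field types, and branch logic are assumptions for exposition, not the authors' actual API; in the real system each branch would invoke a trained modality-specific model (emotion classifier, pitch tracker, and so on).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MusicElements:
    """Hypothetical container for the shared projection space."""
    emotion: str = "auto"                               # e.g. "happy", "sad"
    genre: str = "auto"                                 # e.g. "pop", "folk"
    rhythm: List[float] = field(default_factory=list)   # beat onsets in seconds
    notes: List[int] = field(default_factory=list)      # MIDI pitch numbers

def project_prompt(prompt, modality: str) -> MusicElements:
    """Map a prompt of any supported modality into music elements.

    Placeholder dispatch: real branches would call trained models.
    """
    if modality == "tags":
        # Tags specify emotion/genre directly.
        return MusicElements(emotion=prompt.get("emotion", "auto"),
                             genre=prompt.get("genre", "auto"))
    if modality == "humming":
        # A pitch tracker would extract rhythm and notes from the audio.
        return MusicElements(rhythm=prompt["onsets"], notes=prompt["pitches"])
    if modality in ("image", "video", "text"):
        # An emotion-recognition model would infer the label here;
        # a fixed value stands in for that model's output.
        return MusicElements(emotion="happy")
    raise ValueError(f"unsupported modality: {modality}")
```

The key design point the paper describes is that all modalities converge on the same element set (emotions, genres, rhythms, notes), so the downstream Generator never needs to know which modality produced them.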

XComposer comprises two subcomponents: a Generator and a Selector. The Generator is a Transformer-based model that turns the music elements parsed by XProjector into coherent music sequences. Its symbolic music representation, adapted from the Compound Word framework, adds tag-related and instrument-related tokens that tighten control over the generated music. The Selector assesses candidate outputs and retains only the high-quality ones, using a multi-task learning scheme that combines quality assessment, emotion recognition, and genre recognition.
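A compound-word-style token groups several sub-tokens per timestep so the Transformer predicts them jointly. The sketch below paraphrases that idea with the tag-related and instrument-related extensions the paper describes; the exact field set and value ranges are assumptions, not the paper's actual vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompoundToken:
    """One timestep of a compound-word-style sequence (illustrative)."""
    family: str                       # "metric" (bar/beat marker) or "note" event
    bar: Optional[int] = None
    beat: Optional[int] = None
    pitch: Optional[int] = None       # MIDI pitch for note events
    duration: Optional[int] = None    # in ticks
    velocity: Optional[int] = None
    instrument: Optional[str] = None  # instrument-related sub-token (XMusic extension)
    emotion: Optional[str] = None     # tag-related sub-token (XMusic extension)
    genre: Optional[str] = None       # tag-related sub-token (XMusic extension)

def encode_note(bar, beat, pitch, duration, instrument, emotion, genre):
    """Pack one note event into a compound token (default velocity assumed)."""
    return CompoundToken(family="note", bar=bar, beat=beat, pitch=pitch,
                         duration=duration, velocity=64,
                         instrument=instrument, emotion=emotion, genre=genre)
```

Grouping sub-tokens this way shortens sequences relative to flat event streams, and carrying the tag sub-tokens alongside every event is what lets the model condition generation on emotion and genre throughout the piece.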

Contributions and Results

The paper introduces XMIDI, a comprehensive dataset comprising 108,023 MIDI files augmented with precise emotion and genre labels. This resource is crucial for training models like XMusic, addressing the scarcity of large-scale, well-annotated symbolic music datasets.
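An annotation scheme like XMIDI's (MIDI files paired with emotion and genre labels) might be consumed as sketched below. The on-disk layout, a JSON index keyed by filename, and all field names are hypothetical; the released dataset may use a different format.

```python
import json

def load_annotations(index_path):
    """Read a JSON index into {midi_filename: {"emotion": ..., "genre": ...}}.

    Assumes records of the form {"file": ..., "emotion": ..., "genre": ...};
    this layout is an illustration, not the dataset's documented schema.
    """
    with open(index_path) as f:
        records = json.load(f)
    return {r["file"]: {"emotion": r["emotion"], "genre": r["genre"]}
            for r in records}

def filter_by_label(annotations, emotion=None, genre=None):
    """Select filenames matching the requested emotion and/or genre."""
    return [name for name, lab in annotations.items()
            if (emotion is None or lab["emotion"] == emotion)
            and (genre is None or lab["genre"] == genre)]
```

Per-file emotion and genre labels are what make supervised training of both the conditioned Generator and the Selector's recognition heads possible.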

XMusic is empirically validated against state-of-the-art methods through both objective and subjective experiments, showing significant improvements in quality and emotional expressiveness. The reported implementation details point to a robust training process, built on large-scale data and refined against rigorous evaluation criteria.

Implications and Future Directions

This work advances the field of AI-generated music substantially by providing a scalable solution that integrates various input modalities. The introduction of the unified projection space for symbolic music elements offers a promising avenue for further research and development in multi-modal AI systems. Practically, the flexibility of XMusic suggests broad applications ranging from adaptive soundtracks to user-driven music composition tools.

Future research could explore additional input modalities and refine the symbolic representation to cover further musical dimensions such as time signatures and key transitions. The authors also emphasize expanding the dataset to include more diverse emotion and genre labels, which would bolster the model's reliability and applicability across different musical contexts.

In summary, the XMusic framework represents a significant advancement in controllable music generation, offering precise emotional alignment and high-quality outputs—a valuable tool for researchers and industry professionals in the pursuit of AI-driven creativity in music.
