MuseCoco: A Framework for Symbolic Music Generation from Text
MuseCoco presents an approach for generating symbolic music directly from textual descriptions. Unlike previous efforts that generate musical audio, MuseCoco capitalizes on the editability of symbolic formats, offering a more adaptable and controllable system. The framework is structured into two distinct stages: text-to-attribute understanding and attribute-to-music generation, which together streamline the conversion of text descriptions into coherent symbolic compositions.
The dual-stage process begins with the text-to-attribute understanding phase, which extracts musical attributes from the input text. Qualitative descriptions are mapped onto discrete musical attributes by an LLM fine-tuned on paired text-attribute data. Notably, the framework combines supervised learning on synthesized data with template refinement via ChatGPT, which lets it sidestep the inherent scarcity of paired text-music datasets.
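To make the text-to-attribute stage concrete, the sketch below casts it as per-attribute classification: a prompt is mapped onto a value for each attribute in a fixed schema. The attribute names and values are invented for illustration, and a toy keyword matcher stands in for the fine-tuned LLM described above.

```python
# Hypothetical attribute schema; MuseCoco's actual attribute set differs.
ATTRIBUTE_VALUES = {
    "tempo": ["slow", "moderate", "fast"],
    "mode": ["major", "minor"],
    "instrument": ["piano", "guitar", "strings"],
}

def extract_attributes(text: str) -> dict:
    """Toy keyword-based stand-in for the fine-tuned LLM classifier."""
    text = text.lower()
    attrs = {}
    for attr, values in ATTRIBUTE_VALUES.items():
        for value in values:
            if value in text:
                attrs[attr] = value
                break  # assign at most one value per attribute
    return attrs

print(extract_attributes("A fast piano piece in a minor key"))
# {'tempo': 'fast', 'mode': 'minor', 'instrument': 'piano'}
```

In the actual system, the classifier's output is a structured attribute set, which is exactly the interface the second stage consumes; this separation is what allows each stage to be trained on different data.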
In the attribute-to-music generation stage, the extracted attributes guide the generation of music sequences. The use of prefix tokens for controlling the generative model aligns the output with specific musical attributes, enabling fine-grained control over the composition. The model is trained in a self-supervised manner, utilizing a vast corpus of symbolic music data without the necessity of extensive annotated datasets. This self-supervised approach significantly enhances data efficiency and expands the potential for broader musical attribute manipulation.
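The prefix-token mechanism described above can be sketched as follows: each attribute value becomes a special control token, and these tokens are prepended to the music token sequence that a standard autoregressive model is trained on. The token names here are invented for illustration; MuseCoco's real vocabulary and event encoding differ.

```python
def build_input_sequence(attributes: dict, music_tokens: list) -> list:
    """Prepend one control token per attribute, then a separator.

    During self-supervised training, the attributes are extracted
    automatically from the music itself, so no manual labels are needed.
    """
    prefix = [f"<{attr}:{value}>" for attr, value in sorted(attributes.items())]
    return prefix + ["<sep>"] + music_tokens

seq = build_input_sequence(
    {"tempo": "fast", "mode": "minor"},
    ["bar", "pos_0", "pitch_60", "dur_4"],  # illustrative music events
)
print(seq)
# ['<mode:minor>', '<tempo:fast>', '<sep>', 'bar', 'pos_0', 'pitch_60', 'dur_4']
```

Because the attributes can be computed directly from unlabeled scores (tempo, mode, instrumentation, and so on), the generator can be trained on a large symbolic corpus without any text annotations, which is the source of the data efficiency noted above.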
MuseCoco's performance is evidenced by its superior musicality, controllability, and overall quality scores compared to baseline models such as GPT-4 and BART-base. The system's architecture includes a large-scale model with 1.2 billion parameters, which further improves controllability and musicality. A notable improvement in average sample-wise accuracy (ASA), which measures how many of the requested attributes each generated sample actually satisfies, underscores its ability to translate textual prompts into attribute-driven compositions.
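One plausible reading of the ASA metric can be sketched as the per-sample fraction of requested attributes that the generated output realizes, averaged over all samples. This is a simplified illustration, not the paper's exact evaluation code.

```python
def average_sample_wise_accuracy(requested: list, realized: list) -> float:
    """requested/realized: parallel lists of attribute dicts, one per sample."""
    per_sample = []
    for want, got in zip(requested, realized):
        matches = sum(1 for k, v in want.items() if got.get(k) == v)
        per_sample.append(matches / len(want))
    return sum(per_sample) / len(per_sample)

asa = average_sample_wise_accuracy(
    [{"tempo": "fast", "mode": "minor"}, {"tempo": "slow"}],  # requested
    [{"tempo": "fast", "mode": "major"}, {"tempo": "slow"}],  # realized
)
print(round(asa, 2))  # (0.5 + 1.0) / 2 = 0.75
```

Averaging per sample (rather than pooling all attribute checks globally) means every prompt counts equally, regardless of how many attributes it specifies.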
From a practical standpoint, MuseCoco's implications are multifaceted. For musicians and content creators, it offers a novel tool for inspiration and composition, streamlining the creative process without imposing the constraints of a traditional composition workflow. Because control is expressed through natural language rather than formal notation, the system is accessible to users with varying levels of music-theory expertise, enabling both professionals and amateur creators to harness its capabilities.
Theoretically, MuseCoco contributes to the field of AI-driven creative tools by demonstrating the potential of multi-stage frameworks for complex generative tasks. The modularity of the approach lends itself to future enhancements, such as integrating additional musical attributes or exploring more nuanced aspects of musical creativity. Moreover, the work invites further investigation into similar applications across other creative domains, suggesting a conceptual blueprint for leveraging AI in diverse artistic processes.
Overall, MuseCoco stands as a significant contribution to the domain of AI in music, offering a structured, controllable, and efficient framework for symbolic music generation from text. As the field progresses, such systems promise to redefine creative boundaries, blending advanced AI techniques with artistic expression to foster new forms of musical innovation.