- The paper introduces PDMX, a large-scale public domain dataset containing over 250,000 MusicXML files to address copyright limitations in symbolic music generation research.
- The authors extend the MusPy library with MusicRender to parse expressive MusicXML notation, and they filter the dataset on metadata such as user ratings and deduplication to improve data quality.
- Experiments show that models trained on filtered PDMX subsets generate higher-quality music, indicating the dataset's value for advancing symbolic music AI.
PDMX: A Comprehensive MusicXML Dataset
The paper "PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing" introduces PDMX, an open-source dataset comprising over 250,000 MusicXML files. Harvested from MuseScore, a widely used score-sharing platform, the dataset addresses a significant gap in the availability of large-scale, public domain symbolic music data, alleviating the copyright concerns that complicate training generative models.
Context and Motivation
Generative AI for music is advancing rapidly, most visibly in audio-domain music generation. This progress, however, raises critical issues around data copyright and the ethical implications of potentially supplanting human artists. Symbolic music generation, by contrast, lends itself to collaborative, artist-in-the-loop systems. Yet public domain symbolic music data is scarce, impeding research and development in this domain, and the prevalent MIDI-format datasets lack comprehensive musical annotations, imposing further constraints.
Dataset Description
PDMX consists exclusively of MusicXML files that are in the public domain or released under a CC-0 license. Each file carries detailed metadata, including genre, user ratings, and interaction statistics, which supports quality-based filtering and analysis. MusicXML is preferred over MIDI because it encodes a wider array of musical notation, such as articulations, dynamics, and other performance directives critical for nuanced music generation tasks.
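To make that difference concrete, the sketch below parses a MusicXML score with plain MusPy (not the authors' pipeline; the file path is hypothetical) and counts the notation-level objects that a typical note-event MIDI pipeline discards. Vanilla MusPy captures only some of this notation, which is precisely the gap MusicRender addresses.

```python
import muspy

# Parse a MusicXML score into MusPy's Music object (path is hypothetical).
music = muspy.read_musicxml("example_score.xml")

# Score-level notation that note-event-only MIDI pipelines typically drop.
print("Key signatures: ", len(music.key_signatures))
print("Time signatures:", len(music.time_signatures))
print("Tempo marks:    ", len(music.tempos))

# Track-level text objects (dynamics, directives, lyrics) are kept as
# Annotation/Lyric entries rather than baked into note velocities.
for track in music.tracks:
    print(f"{track.name}: {len(track.notes)} notes, "
          f"{len(track.annotations)} annotations, {len(track.lyrics)} lyrics")
```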
Methodological Advances
To parse the MusicXML files effectively, the authors extend the capabilities of MusPy with MusicRender. The tool improves MusicXML parsing and processing, extracting performance qualities not typically represented in MIDI datasets: it supports the realization of musical notation and performance directives, yielding a more expressive dataset for modeling tasks.
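The summary does not reproduce MusicRender's API, but the core idea of "realizing" a notation into performance data can be sketched over plain MusPy structures. The snippet below is an illustrative approximation, not the authors' implementation; the dynamic-to-velocity table and the helper name are assumptions. It bakes dynamic markings into the MIDI velocities of the notes that follow them.

```python
import muspy

# Hypothetical mapping from dynamic markings to MIDI velocities.
DYNAMICS = {"pp": 33, "p": 49, "mp": 64, "mf": 80, "f": 96, "ff": 112}

def realize_dynamics(music: muspy.Music) -> muspy.Music:
    """Bake dynamic annotations into note velocities (illustrative only)."""
    for track in music.tracks:
        # Collect dynamic markings as (time, velocity), sorted by onset.
        marks = sorted(
            (a.time, DYNAMICS[a.annotation])
            for a in track.annotations
            if isinstance(a.annotation, str) and a.annotation in DYNAMICS
        )
        for note in track.notes:
            # Apply the most recent marking at or before the note's onset.
            for time, velocity in marks:
                if time <= note.time:
                    note.velocity = velocity
    return music
```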
Beyond parsing, the dataset is filtered on metadata such as user ratings and put through deduplication, yielding subsets that improve data quality and training efficiency. These subsets ("Rated", "Deduplicated", and their intersection) stratify the dataset by inferred data quality, enabling more targeted and reliable model training.
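The summary names the subsets but not the metadata schema, so the following pandas sketch uses hypothetical column names (`path`, `rating`, `hash`) purely to show the shape of such stratification:

```python
import pandas as pd

# Load the per-file metadata (filename and column names are hypothetical).
meta = pd.read_csv("pdmx_metadata.csv")

# "Rated": keep only scores that received at least one user rating.
rated = meta[meta["rating"] > 0]

# "Deduplicated": keep one representative per content hash.
deduplicated = meta.drop_duplicates(subset="hash", keep="first")

# Intersection: rated AND deduplicated, the highest-quality training pool.
rated_dedup = rated.merge(deduplicated[["path"]], on="path")

print(f"All: {len(meta)}  Rated: {len(rated)}  "
      f"Deduplicated: {len(deduplicated)}  Both: {len(rated_dedup)}")
```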
Experimental Insights
The authors evaluate the impact of PDMX on multitrack symbolic music generation. Comparing models trained on the different subsets, they find that filtering and deduplication pay off: models trained on the higher-quality, user-rated subsets generate compositions with greater richness and musical interest, and fine-tuning on these subsets amplifies the benefit across models.
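The summary does not list the evaluation metrics used, but MusPy ships objective metrics commonly used for exactly this kind of subset comparison. A minimal sketch, assuming hypothetical paths to generated samples per subset, might look like:

```python
import statistics
import muspy

def score_samples(paths):
    """Average two MusPy objective metrics over a set of generated pieces."""
    entropies, consistencies = [], []
    for path in paths:
        music = muspy.read(path)  # infers the format from the extension
        entropies.append(muspy.pitch_class_entropy(music))
        consistencies.append(muspy.scale_consistency(music))
    return statistics.mean(entropies), statistics.mean(consistencies)

# Compare models trained on different PDMX subsets (paths are hypothetical).
for subset in ("all", "rated", "deduplicated", "rated_deduplicated"):
    entropy, consistency = score_samples(
        [f"samples/{subset}/{i}.mid" for i in range(8)]
    )
    print(f"{subset:>20}: pitch-class entropy={entropy:.3f}, "
          f"scale consistency={consistency:.3f}")
```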
Implications and Future Directions
The introduction of PDMX holds substantial implications for the field of AI music generation, especially within symbolic music domains. It provides a reliable, copyright-respecting dataset for model development and supports diverse musical applications. The extensive metadata and multitrack coverage open avenues for refined AI tasks, such as expressive performance rendering and symbolic music recommendation systems.
In future developments, leveraging the extensive metadata in PDMX could foster novel applications in music information retrieval and performance analysis. Furthermore, the dataset could serve as a pretraining corpus for advanced symbolic and audio-domain generation models, promoting more contextual and semantically rich AI-generated music.
In conclusion, PDMX represents a noteworthy step forward in enhancing the accessibility and quality of symbolic music datasets, addressing key challenges in AI music synthesis, and enabling refined research opportunities in this evolving field.