- The paper introduces MusicBERT, a pre-trained model that applies innovative OctupleMIDI encoding and bar-level masking to enhance symbolic music representation.
- It demonstrates significant performance improvements in melody completion, accompaniment suggestion, and classification tasks over baseline models.
- Empirical evaluations on the Million MIDI Dataset validate MusicBERT's effectiveness and promise for advancing music recommendation and composition.
Symbolic Music Understanding with MusicBERT
The paper presents MusicBERT, a pre-trained model designed to advance symbolic music understanding by leveraging large-scale pre-training techniques inspired by NLP. MusicBERT is developed to address challenges inherent in symbolic music data, such as its structural complexity and diversity compared to natural text data. The authors aim to improve music representation learning, which is crucial for applications like genre classification, emotion classification, and music piece matching.
Key Innovations in MusicBERT
Several novel contributions distinguish MusicBERT from existing methods in symbolic music understanding:
- OctupleMIDI Encoding: The authors introduce OctupleMIDI, an encoding method that substantially shortens symbolic music sequences by representing each note as a single octuple of eight elements: time signature, tempo, bar, position, instrument, pitch, duration, and velocity. This yields a compact, efficient, and universal representation of symbolic music, which is crucial for handling large datasets and for efficient processing within a Transformer model (a minimal encoding sketch follows this list).
- Bar-Level Masking Strategy: In contrast to the token-level masking traditionally used in NLP pre-training, MusicBERT masks all tokens of the same element type within a bar simultaneously. This prevents information leakage from identical or highly correlated elements of neighboring notes in the same bar, forcing the model to learn genuinely contextual representations (see the masking sketch after this list).
- Million MIDI Dataset: The authors curate a large-scale symbolic music corpus, the Million MIDI Dataset (MMD), containing more than one million songs, which addresses the common limitation of small datasets in symbolic music understanding. This substantial corpus supports effective pre-training, exposing MusicBERT to diverse musical genres and styles.
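To make the octuple layout concrete, here is a minimal Python sketch of the encoding described above. The class name, field order details, and quantization choices are illustrative assumptions for this summary; the paper specifies the eight elements but not this particular interface:

```python
from typing import NamedTuple

class OctupleToken(NamedTuple):
    """One note as an 8-element tuple (fields follow the paper's description)."""
    time_signature: int  # index into a table of time signatures, e.g. 4/4 -> 0
    tempo: int           # quantized BPM bucket
    bar: int             # bar index within the song
    position: int        # onset position inside the bar (assumed fine-grained grid)
    instrument: int      # MIDI program number (0-127)
    pitch: int           # MIDI pitch (0-127)
    duration: int        # quantized note length
    velocity: int        # quantized MIDI velocity

def encode_note(bar: int, position: int, pitch: int, duration: int,
                velocity: int, instrument: int = 0,
                time_signature: int = 0, tempo: int = 120) -> OctupleToken:
    """Pack a single note into one OctupleMIDI token.

    Because every note is exactly one token, a song with N notes becomes a
    sequence of length N, far shorter than event-based encodings that spend
    several tokens per note.
    """
    return OctupleToken(time_signature, tempo, bar, position,
                        instrument, pitch, duration, velocity)

# A C-major triad on the first beat of bar 0: three notes -> three tokens.
chord = [encode_note(bar=0, position=0, pitch=p, duration=8, velocity=80)
         for p in (60, 64, 67)]
print(chord[0])
```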
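The bar-level masking strategy can likewise be sketched in a few lines. The following toy function is an illustration under stated assumptions, not the authors' implementation: it treats tokens as plain 8-element lists and uses a single placeholder mask id, whereas a real setup would use a dedicated mask symbol per element vocabulary:

```python
import random

MASK = -1  # stand-in mask id (assumption; real vocabularies use [MASK] tokens)

def bar_level_mask(tokens, mask_prob=0.15, n_elements=8, seed=0):
    """Illustrative bar-level masking over OctupleMIDI-style tokens.

    `tokens` is a list of 8-element lists (index 2 holds the bar number,
    matching the octuple order above). Rather than masking single elements
    independently (token-level masking), we sample (bar, element-type) pairs
    and mask that element in *every* note of the chosen bar, so the model
    cannot copy the answer from an unmasked neighbour in the same bar.
    """
    rng = random.Random(seed)
    masked = [list(t) for t in tokens]
    for bar in sorted({t[2] for t in tokens}):
        for elem in range(n_elements):
            if rng.random() < mask_prob:
                for i, tok in enumerate(tokens):
                    if tok[2] == bar:
                        masked[i][elem] = MASK
    return masked

# Two bars of two notes each; fields: (time_sig, tempo, bar, pos, instr, pitch, dur, vel)
notes = [[0, 120, 0, 0, 0, 60, 8, 80], [0, 120, 0, 8, 0, 64, 8, 80],
         [0, 120, 1, 0, 0, 67, 8, 80], [0, 120, 1, 8, 0, 72, 8, 80]]
print(bar_level_mask(notes))
```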
Empirical Evaluation
To assess the efficacy of MusicBERT, the authors fine-tune it on four downstream music understanding tasks: melody completion, accompaniment suggestion, genre classification, and style classification. MusicBERT consistently outperforms baseline models, demonstrating significant gains across the reported metrics.
- On the melody completion and accompaniment suggestion tasks, MusicBERT achieves higher mean average precision (MAP) and HITS@k scores, indicating an improved ability to select appropriate melodic and harmonic continuations (see the metric sketch after this list).
- In the genre and style classification tasks, MusicBERT achieves higher F1 scores, highlighting its capability to capture song-level attributes more effectively than prior approaches.
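For readers unfamiliar with the retrieval metrics above, the following sketch computes MAP and HITS@k under the simplifying assumption (made here for illustration) that each query has exactly one correct candidate, in which case average precision reduces to the reciprocal rank:

```python
def map_and_hits(ranks, k=5):
    """MAP and HITS@k from the 1-based rank of the correct candidate per query.

    Assumes a single ground-truth item per query (an assumption of this
    sketch), so each query's average precision is 1 / rank.
    """
    ap = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return ap, hits

# Correct candidates ranked 1st, 3rd, and 2nd across three queries:
print(map_and_hits([1, 3, 2], k=1))  # -> (0.611..., 0.333...)
```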
Ablation studies further confirm that both OctupleMIDI encoding and the bar-level masking strategy contribute individually to these gains.
Implications and Future Directions
The implications of MusicBERT are multifaceted. Practically, it enhances various music-related applications, potentially benefiting music recommendation, composition, and analysis. Theoretically, MusicBERT demonstrates how NLP pre-training methodologies transfer to non-text domains, pushing forward cross-disciplinary applications of technology in the arts.
Future work may extend MusicBERT's architecture to broader music understanding tasks such as chord recognition and structure analysis, or adapt its methodologies to other symbolic data formats in different creative domains. Moreover, given the scalability demonstrated by large-scale pre-training, ongoing efforts might explore even larger corpora to further improve accuracy and generalization.
In conclusion, MusicBERT represents a substantial advance in symbolic music understanding, embracing large-scale pre-training and innovative encoding techniques to enhance music applications in both academia and industry.