BERT-like Pre-training for Symbolic Piano Music Classification Tasks (2107.05223v2)

Published 12 Jul 2021 in cs.SD, cs.LG, cs.MM, and eess.AS

Abstract: This article presents a benchmark study of symbolic piano music classification using the masked language modelling approach of the Bidirectional Encoder Representations from Transformers (BERT). Specifically, we consider two types of MIDI data: MIDI scores, which are musical scores rendered directly into MIDI with no dynamics and precisely aligned with the metrical grid notated by its composer, and MIDI performances, which are MIDI encodings of human performances of musical scoresheets. With five public-domain datasets of single-track piano MIDI files, we pre-train two 12-layer Transformer models using the BERT approach, one for MIDI scores and the other for MIDI performances, and fine-tune them for four downstream classification tasks. These include two note-level classification tasks (melody extraction and velocity prediction) and two sequence-level classification tasks (style classification and emotion classification). Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.

Citations (26)

Summary

  • The paper introduces MidiBERT-Piano, a Transformer model pre-trained on over 4,000 polyphonic piano pieces, which outperforms RNN baselines on classification tasks.
  • It adapts MIDI token representations, such as REMI and compound words, to efficiently process symbolic music and enhance sequence coherence.
  • The model’s success in melody extraction, velocity prediction, and style and emotion classification sets a new benchmark for symbolic music understanding and future research.

An Expert's Analysis of "MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding"

The paper applies large-scale pre-training with a BERT-like Transformer model to symbolic music understanding. The resulting model, MidiBERT-Piano, demonstrates how pre-trained Transformer networks can effectively tackle a range of discriminative tasks on symbolic music, specifically polyphonic piano MIDI files.

Core Contributions

MidiBERT-Piano is pre-trained on 4,166 pieces of polyphonic piano music and showcases its application on four classification tasks: melody extraction, velocity prediction, composer classification, and emotion classification. A notable finding across these tasks is that MidiBERT-Piano, leveraging the Transformer architecture, consistently outperformed recurrent neural network (RNN) based baselines, exhibiting superior performance with minimal fine-tuning epochs.

The researchers employed a self-supervised learning strategy, masked language modeling (MLM), during pre-training, a technique well established in NLP. The strategy was adapted from BERT to accommodate the nuances of symbolic music, treating MIDI data akin to language sequences.
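As a rough illustration of what MLM-style corruption of MIDI token sequences could look like, the sketch below masks a fraction of integer-encoded tokens following BERT's standard 80/10/10 rule. The constants (MASK_ID, vocabulary size, masking ratio) are illustrative assumptions, not the paper's exact settings.

```python
import random

MASK_ID = 0          # hypothetical id reserved for the [MASK] token
VOCAB_SIZE = 512     # hypothetical size of the MIDI event vocabulary
MASK_PROB = 0.15     # BERT's standard masking ratio

def mask_midi_tokens(tokens):
    """Return (corrupted_tokens, labels) for MLM-style pre-training.

    Follows BERT's 80/10/10 rule: of the selected positions,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) at unselected positions.
    """
    corrupted, labels = list(tokens), [-100] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < MASK_PROB:
            labels[i] = tok
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = MASK_ID
            elif roll < 0.9:
                corrupted[i] = random.randrange(1, VOCAB_SIZE)
            # else: keep the original token unchanged
    return corrupted, labels

# Example: a short sequence of (already integer-encoded) MIDI events
seq = [17, 203, 45, 371, 45, 88, 12, 290]
print(mask_midi_tokens(seq))
```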

Methodology

The research explores two token representations for MIDI data, namely REMI and a compound word (CP) approach, enhancing sequence processing efficiency. The CP representation groups multiple tokens into a "super token," reducing sequence lengths and purportedly improving musical coherence in self-attention mechanisms of the Transformer.
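To make the contrast between REMI and CP concrete, here is a minimal sketch of compound-word grouping: attributes that REMI would emit as separate tokens are bundled into one "super token" per note. The field names and value ranges are illustrative assumptions and may not match the paper's tokenizer exactly.

```python
from dataclasses import dataclass

@dataclass
class CPToken:
    """One 'super token' bundling the attributes of a single note event.

    Field names follow the general idea of compound-word tokenization;
    the exact fields and value ranges in MidiBERT-Piano may differ.
    """
    bar: int        # 1 if this note starts a new bar, else 0
    position: int   # onset position within the bar (e.g., 0..15 in 16ths)
    pitch: int      # MIDI pitch number
    duration: int   # quantized duration bin

def to_compound(note_dicts):
    """Collapse a list of per-note attribute dicts into CP tokens.

    With REMI, each attribute would be a separate token in the sequence;
    grouping them shortens the sequence by roughly the number of
    attributes per note.
    """
    return [CPToken(n["bar"], n["position"], n["pitch"], n["duration"])
            for n in note_dicts]

notes = [
    {"bar": 1, "position": 0, "pitch": 60, "duration": 4},
    {"bar": 0, "position": 4, "pitch": 64, "duration": 4},
    {"bar": 0, "position": 8, "pitch": 67, "duration": 8},
]
print(to_compound(notes))   # 3 CP tokens instead of ~12 REMI-style tokens
```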

Pre-training involves a large corpus dominated by piano MIDI data, accompanied by a comprehensive evaluation on melody, velocity, and two sequence-level classification tasks. MidiBERT-Piano is then fine-tuned separately for each task, highlighting the flexibility and generalization capability of the pre-trained model.
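A minimal fine-tuning sketch for a sequence-level task (e.g., emotion classification) is shown below, assuming a PyTorch setup. The small, randomly initialized encoder stands in for the pre-trained 12-layer model, and the pooling strategy and hyperparameters are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Sketch of a fine-tuning setup for a sequence-level task.

    The encoder here is a small stand-in; in practice the pre-trained
    MidiBERT-Piano weights would be loaded in its place.
    """

    def __init__(self, vocab_size=512, hidden_dim=256, num_classes=4,
                 num_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.pos = nn.Embedding(max_len, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded MIDI events
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        hidden = self.encoder(x)              # (batch, seq_len, hidden_dim)
        pooled = hidden.mean(dim=1)           # average-pool over the sequence
        return self.classifier(pooled)        # (batch, num_classes) logits

model = SequenceClassifier()
logits = model(torch.randint(0, 512, (2, 128)))
print(logits.shape)   # torch.Size([2, 4])
```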

Strong Numerical Results

The experimental results validate the effectiveness of MidiBERT-Piano, with the CP representation yielding noteworthy improvements over the baseline models on all tasks. In particular, the model achieved 96.37% accuracy in melody extraction and showed clear gains on tasks where traditional methods such as the skyline algorithm underperform.

Evaluation and Implications

The model's superior performance in sequence-level tasks points towards its potential in broader applications, suggesting that BERT-like models can effectively encode and exploit complex patterns in symbolic music data. This has theoretical implications for transfer learning in domains where labeled data is scarce, offering a new avenue for research in symbolic music representation learning.

Practically, MidiBERT-Piano and the accompanying dataset provide a robust benchmark for future developments in symbolic music understanding, serving as a baseline for subsequent research. The release of code and data further promotes reproducibility and encourages collaboration within the research community.

Future Prospects

The research outlines several prospective directions, including the exploration of alternate pre-training strategies and the expansion to multi-track MIDI datasets. These potential lines of inquiry could enhance the representational capacity of such systems, making them applicable to a broader range of music tasks beyond those assessed.

In summary, MidiBERT-Piano exemplifies a significant stride in utilizing Transformers for symbolic music understanding, setting a foundational stage for the integration of deep learning models in processing and interpreting musical data at scale. The research provides substantial evidence of the utility of pre-trained models in music theory and computational musicology, fostering innovation in AI-driven music analysis.