MT3: Multi-Task Multitrack Music Transcription
The paper "MT3: Multi-Task Multitrack Music Transcription" introduces a novel approach to Automatic Music Transcription (AMT) through a multi-task, multitrack framework that leverages the capabilities of a generalized Transformer model. The research tackles the intrinsic challenges of AMT, which are notably more complex than tasks like Automatic Speech Recognition (ASR) due to factors like multiple concurrent instruments and intricate pitch and timing details.
Methodology and Contributions
The authors propose a sequence-to-sequence approach using a Transformer architecture, mirroring how general-purpose models in NLP are trained jointly across low-resource tasks. This contrasts sharply with previous models, which often required instrument-specific architectures. Key contributions include:
- Unified Framework: A tokenization scheme and flexible vocabulary that let general-purpose Transformer models convert audio spectrogram sequences into multitrack MIDI outputs, accommodating diverse datasets (a minimal sketch of such a vocabulary follows this list).
- Dataset Aggregation: Compilation of six multitrack AMT datasets of varied sizes and styles, together forming the largest resource assembled to date for AMT training.
- Consistent Evaluation: Standardized evaluation metrics across datasets, including a novel instrument-sensitive transcription metric under which a note counts as correct only if both its pitch/timing and its instrument label match.
- State-of-the-Art Performance: Demonstration that a T5-based architecture trained within the proposed framework achieves state-of-the-art results on several datasets, exceeding both professional-grade DSP tools and specialized dataset-specific models.
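To make the tokenization idea concrete, the sketch below builds a simplified MIDI-like event vocabulary and encodes a short multitrack note list as a flat token sequence. It is a minimal illustration, not the paper's exact vocabulary: the event types (absolute time tokens, instrument/program tokens, note on/off tokens) follow the paper's description, but the bin sizes, helper names (`Vocabulary` layout, `encode_notes`), and the example notes are assumptions.

```python
from dataclasses import dataclass

# Minimal sketch of a MIDI-like token vocabulary, loosely following MT3's
# description: absolute time tokens, instrument (program) tokens, and
# note-on/note-off tokens, all flattened into one integer ID space.
# Bin sizes and helper names here are illustrative assumptions.

TIME_BINS = 600        # e.g. 10 ms steps over a ~6 s audio segment
NUM_PROGRAMS = 128     # General MIDI program numbers
NUM_PITCHES = 128      # MIDI pitch values

@dataclass
class Note:
    start: float   # seconds, relative to segment start
    end: float
    pitch: int     # MIDI pitch number
    program: int   # General MIDI program (instrument)

def time_token(t: float) -> int:
    return min(int(t * 100), TIME_BINS - 1)            # first ID block

def program_token(p: int) -> int:
    return TIME_BINS + p                               # next ID block

def note_token(pitch: int, on: bool) -> int:
    base = TIME_BINS + NUM_PROGRAMS
    return base + pitch + (0 if on else NUM_PITCHES)   # on/off sub-blocks

def encode_notes(notes: list[Note]) -> list[int]:
    """Flatten notes into a time-ordered token sequence."""
    events = []
    for n in notes:
        events.append((n.start, n.program, n.pitch, True))
        events.append((n.end, n.program, n.pitch, False))
    events.sort()
    tokens = []
    for t, program, pitch, on in events:
        tokens += [time_token(t), program_token(program),
                   note_token(pitch, on)]
    return tokens

# A piano note and an overlapping violin note in one segment.
print(encode_notes([Note(0.00, 0.50, 60, 0),
                    Note(0.25, 0.75, 67, 40)]))
```

Because every instrument shares one output vocabulary, a decoder trained over such tokens can emit any combination of instruments without per-instrument output heads, which is what lets a single model cover all six datasets.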
Numerical Results
The paper presents considerable improvements in transcription quality across a variety of datasets:
- The model achieved significant gains in Frame, Onset, and Onset-Offset F1 scores, with relative improvements of up to 260% on low-resource datasets while maintaining high performance on data-rich ones.
- The multi-instrument F1 score introduced in the paper demonstrates the model's ability to associate transcribed notes with the correct instruments; a simplified sketch of such a metric follows this list.
- Additionally, training on a mixture of all datasets simultaneously markedly improved results on the data-scarce datasets.
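To illustrate what an instrument-sensitive metric measures, here is a simplified onset-level F1 in pure Python: a predicted note counts as a hit only if a reference note with the same pitch and the same program lies within an onset tolerance, matched one-to-one. The paper builds its metric on standard note-matching machinery (as in mir_eval) rather than this exact greedy routine; the 50 ms tolerance and the function name are assumptions.

```python
# Simplified, illustrative instrument-sensitive onset F1.
# Each note is (onset_seconds, pitch, program); a prediction matches a
# reference note only if pitch AND program agree and the onsets are
# within `tol` seconds. Greedy one-to-one matching for brevity.

def multi_instrument_f1(ref, est, tol=0.05):
    unmatched = list(ref)
    hits = 0
    for onset, pitch, program in sorted(est):
        for r in unmatched:
            if (r[1] == pitch and r[2] == program
                    and abs(r[0] - onset) <= tol):
                unmatched.remove(r)
                hits += 1
                break
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [(0.00, 60, 0), (0.25, 67, 40)]   # piano C4, violin G4
est = [(0.01, 60, 0), (0.26, 67, 0)]    # second note: wrong program
print(multi_instrument_f1(ref, est))    # 0.5: only the piano note counts
```

An instrument-agnostic variant simply drops the program check, which makes it possible to quantify the gap between note accuracy and note-plus-instrument accuracy.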
Implications and Future Directions
The implications of this work are significant both practically and theoretically. Practically, it provides a robust baseline and ushers in a direction for AMT research that prioritizes a unified modeling approach over task-specific solutions. Theoretically, it shows that sequence-to-sequence models can be applied to the musical domain much as they have been in NLP and other fields.
For future work, the paper suggests exploring semi-supervised learning techniques to leverage unlabeled audio, potentially increasing the diversity and richness of the training data. Another promising avenue is data augmentation, such as generating synthetic training examples by mixing segments from different recordings (a sketch of this idea follows below). The outcomes from MT3 also open pathways for integrating AMT models with generative models for tasks like music composition, potentially catalyzing developments in creative AI applications.
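As a sketch of the augmentation idea mentioned above, the snippet below creates synthetic training mixtures by summing audio segments drawn from different recordings and merging their note labels; a transcription model trained on such mixtures sees instrument combinations that never co-occur in the source data. The sampling scheme, gain range, and function names are illustrative assumptions, not the paper's recipe.

```python
import random
import numpy as np

# Illustrative augmentation: mix audio segments from different tracks and
# merge their labels, yielding synthetic multitrack training examples.
# Source count, gain range, and names are assumptions for this sketch.

def mix_segments(dataset, num_sources=2, seed=None):
    """dataset: list of (audio: np.ndarray, notes: list) pairs,
    all audio the same length and sample rate."""
    rng = random.Random(seed)
    picks = rng.sample(dataset, num_sources)
    mix = np.zeros_like(picks[0][0])
    labels = []
    for audio, notes in picks:
        gain = rng.uniform(0.5, 1.0)   # vary relative levels
        mix += gain * audio
        labels.extend(notes)
    mix /= np.max(np.abs(mix)) + 1e-8  # normalize to avoid clipping
    return mix, sorted(labels)

# Usage with two synthetic one-second mono segments at 16 kHz;
# labels are (start, end, pitch, program) tuples.
sr = 16000
seg_a = (np.sin(2 * np.pi * 440 * np.arange(sr) / sr), [(0.0, 1.0, 69, 0)])
seg_b = (np.sin(2 * np.pi * 330 * np.arange(sr) / sr), [(0.0, 1.0, 64, 40)])
audio, notes = mix_segments([seg_a, seg_b], seed=0)
print(audio.shape, notes)
```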
In summary, this work not only sets a new benchmark for AMT but also broadens the scope of multi-instrument transcription in music information retrieval, offering a scalable framework adaptable to the evolving demands of music technology research.