- The paper proposes a lightweight, instrument-agnostic model that uses harmonic stacking and a shallow architecture with 16,782 parameters for efficient polyphonic note transcription.
- The model achieves strong multipitch estimation and note accuracy across benchmarks such as MAESTRO and Slakh, and state-of-the-art guitar transcription on GuitarSet, while running on low-resource devices.
- Ablation studies confirm that harmonic stacking is critical to performance and that a supervised bottleneck layer helps refine frame-level pitch estimates.
Essay on "A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation"
This paper presents a neural network-based approach to Automatic Music Transcription (AMT), focusing on a lightweight, instrument-agnostic method that supports polyphonic outputs and multipitch estimation (MPE). The authors address several constraints typically associated with AMT systems, such as high memory usage and the need for separate specialized models for different instruments.
Model Overview
The proposed model uses a shallow architecture with only 16,782 parameters, enabling it to run efficiently on low-resource devices. It processes audio through a Constant-Q Transform (CQT) followed by harmonic stacking and produces frame-level onset, multipitch, and note posteriorgrams. Harmonic stacking aligns harmonically related frequencies so that small convolutional kernels can capture the relevant musical information, which is what lets the model maintain high accuracy with a reduced computational footprint.
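To make the harmonic stacking step concrete, here is a minimal, hedged sketch of the idea: a CQT is computed and frequency-shifted copies of it are stacked so that each harmonic of a fundamental lines up at the same bin. The librosa-based implementation, the harmonic set, and the CQT resolution below are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch of harmonic stacking (not the paper's implementation).
import librosa
import numpy as np

HARMONICS = [0.5, 1, 2, 3, 4, 5]   # assumed set: one sub-harmonic + first five harmonics
BINS_PER_SEMITONE = 3              # assumed CQT resolution

def harmonic_stack(audio, sr=22050):
    """Compute a CQT and stack frequency-shifted copies so harmonics align."""
    bins_per_octave = 12 * BINS_PER_SEMITONE
    cqt = np.abs(librosa.cqt(audio, sr=sr, n_bins=7 * bins_per_octave,
                             bins_per_octave=bins_per_octave))  # shape: (freq, time)
    channels = []
    for h in HARMONICS:
        # The h-th harmonic lies log2(h) octaves above the fundamental,
        # i.e. round(bins_per_octave * log2(h)) CQT bins higher.
        shift = int(round(bins_per_octave * np.log2(h)))
        shifted = np.roll(cqt, -shift, axis=0)
        # Zero out bins that wrapped around during the roll.
        if shift > 0:
            shifted[-shift:, :] = 0
        elif shift < 0:
            shifted[:-shift, :] = 0
        channels.append(shifted)
    return np.stack(channels, axis=0)  # shape: (n_harmonics, freq, time)
```

After stacking, a small 2D kernel that spans the harmonic axis sees a note's fundamental and its overtones at a single frequency position, which is why such a shallow network can remain effective.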
Experimentation and Results
The authors conducted extensive evaluations on a diverse set of benchmark datasets, including MAESTRO, Slakh, and GuitarSet, among others. The model outperformed MI-AMT, an instrument-agnostic baseline, in F-measure and frame-level note accuracy, and notably achieved state-of-the-art results for guitar transcription on GuitarSet. Although results on vocal datasets such as Molina indicate room for improvement in onset detection, the MPE performance remained competitive with established methods such as the Deep Salience model.
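For readers unfamiliar with these metrics, the sketch below shows how note-level precision, recall, and F-measure are commonly computed with the mir_eval library; the note intervals and pitches are placeholder values and the tolerances are standard defaults, not figures taken from the paper.

```python
# Illustrative computation of note-level F-measure with mir_eval (placeholder data).
import numpy as np
import mir_eval

# Reference and estimated notes: (onset, offset) intervals in seconds and pitches in Hz.
ref_intervals = np.array([[0.00, 0.50], [0.50, 1.00]])
ref_pitches = np.array([440.00, 523.25])      # A4, C5
est_intervals = np.array([[0.02, 0.48], [0.51, 1.03]])
est_pitches = np.array([440.00, 523.25])

# With offset_ratio=None the score ignores offsets (the common "note" F-measure);
# keeping the default offset_ratio gives the stricter note-with-offset variant.
precision, recall, f_measure, _ = mir_eval.transcription.precision_recall_f1_overlap(
    ref_intervals, ref_pitches, est_intervals, est_pitches,
    onset_tolerance=0.05, offset_ratio=None)
print(f"Note-level F-measure: {f_measure:.3f}")
```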
Ablation Studies
Ablation experiments revealed the importance of harmonic stacking: removing it consistently degraded performance across datasets. Similarly, supervising the bottleneck layer Yp improved frame-level accuracy, indicating that it helps refine pitch estimates, although its impact on onset/offset detection varied across datasets.
Implications and Future Work
The paper highlights that the proposed lightweight architecture enables practical deployment in settings where computational resources are constrained. This could change how AMT systems are used across applications, particularly where broad instrument coverage is needed without sacrificing performance.
Future research directions could include tackling low-resource transcription of audio mixtures with multiple instruments and exploring more sophisticated note-event creation models. Techniques such as model pruning or compression may further shrink the architecture while maintaining performance and reducing resource requirements; a minimal sketch of this idea follows below. Another intriguing avenue is investigating the interplay between note and multipitch outputs to improve the estimation of note-level pitch bends.
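As one concrete illustration of the pruning direction mentioned above, here is a hedged sketch using PyTorch's built-in magnitude pruning; the tiny convolutional stack is a stand-in model, not the paper's architecture, and the 30% sparsity level is an arbitrary assumption.

```python
# Illustrative post-training magnitude pruning with PyTorch (stand-in model, not the paper's).
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(                       # placeholder shallow CNN
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=3, padding=1),
)

# Zero out the 30% smallest-magnitude weights in every conv layer.
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")       # bake the mask into the weights

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"Overall weight sparsity: {zeros / total:.1%}")
```

Whether such sparsity translates into real speedups depends on the runtime, so any gains would need to be validated on the kinds of low-resource targets the paper has in mind.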
In summary, this paper represents a significant step toward making polyphonic AMT more accessible and efficient, fostering a more inclusive approach to music transcription technology. The proposed model's ability to generalize across instruments without retraining directly addresses a key limitation of current AMT systems.