- The paper presents a dual-objective neural network that leverages onset detection to improve frame prediction, achieving over a 100% relative improvement in note-with-offset F1 score.
- It combines CNN and RNN layers in two connected stacks that predict onsets and frames jointly, gating frame activations so that new notes begin only at detected onsets.
- The findings open avenues for enhanced music transcription and broader MIR applications, including dynamic expressive performance analysis and digital music production.
Onsets and Frames: Dual-Objective Piano Transcription
The paper "Onsets and Frames: Dual-Objective Piano Transcription" presents a novel deep neural network approach to polyphonic piano music transcription. The research addresses one of the enduring challenges in Music Information Retrieval (MIR): generating symbolic music representations (such as MIDI) from raw audio, specifically for polyphonic piano recordings. Traditional methods, such as Nonnegative Matrix Factorization (NMF), have been supplanted by neural network models, which have shown promising results in similar audio tasks.
In this paper, the authors propose a dual-objective network composed of CNN and RNN layers that jointly optimizes for both onset and frame predictions. The pivotal innovation is to detect note onset events explicitly and use that information to condition the framewise predictions of note activations. During inference, the onset predictions act as a gating mechanism: the framewise detector may start a new note only where the onset detector agrees, which aligns more closely with how listeners perceive note beginnings.
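The inference-time gating described above can be sketched in a few lines. This is a simplified illustration, not the authors' code: `gated_frame_decode`, its threshold, and the array shapes are assumptions for the sketch, and it decodes a binary piano-roll from per-frame onset and frame probabilities.

```python
import numpy as np

def gated_frame_decode(onset_probs, frame_probs, threshold=0.5):
    """Decode a binary piano-roll from framewise probabilities,
    allowing a note to start only where the onset detector fires.

    onset_probs, frame_probs: (num_frames, num_pitches) arrays in [0, 1].
    Returns a boolean (num_frames, num_pitches) activation matrix.
    """
    onsets = onset_probs >= threshold
    frames = frame_probs >= threshold
    active = np.zeros_like(frames, dtype=bool)
    for t in range(frames.shape[0]):
        for p in range(frames.shape[1]):
            if frames[t, p]:
                # A note may begin only at a detected onset; otherwise it
                # must continue an already-active note from the prior frame.
                if onsets[t, p] or (t > 0 and active[t - 1, p]):
                    active[t, p] = True
    return active

# Frame detector fires for all four frames of one pitch, but the onset
# detector only fires at frame 1, so frame 0 is suppressed by the gate.
onset = np.array([[0.0], [0.9], [0.0], [0.0]])
frame = np.array([[0.9], [0.9], [0.9], [0.9]])
roll = gated_frame_decode(onset, frame)
# roll.tolist() -> [[False], [True], [True], [True]]
```

The gate is what suppresses the spurious frame-0 activation: without it, the frame detector alone would report the note as starting one frame too early.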
The empirical results on the MAPS dataset reveal a noteworthy advancement, with the proposed model achieving over a 100% relative improvement in note-with-offset F1 score compared to prior methods. These results emphasize the effectiveness of modeling onsets and frames jointly, a departure from earlier approaches that treated them in isolation. Additionally, a new metric is introduced that incorporates velocity estimation, thus capturing the expressive dynamics of piano performances more naturally.
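To make the "note F1 with offsets" metric concrete, the following is a minimal pure-Python sketch of note-level matching in the spirit of standard transcription evaluation (e.g. `mir_eval.transcription`). The function name, greedy matching strategy, and tolerance values here are illustrative assumptions, not the exact evaluation code used in the paper.

```python
def note_f1(ref_notes, est_notes, onset_tol=0.05, offset_ratio=None):
    """Note-level precision/recall/F1 sketch (greedy one-to-one matching).

    Notes are (onset_sec, offset_sec, midi_pitch) tuples. A reference note
    matches an estimate if pitches are equal, onsets lie within onset_tol,
    and, when offset_ratio is given, offsets lie within
    offset_ratio * reference duration (with a 50 ms floor).
    """
    matched = set()
    tp = 0
    for r_on, r_off, r_pitch in ref_notes:
        for i, (e_on, e_off, e_pitch) in enumerate(est_notes):
            if i in matched or e_pitch != r_pitch:
                continue
            if abs(e_on - r_on) > onset_tol:
                continue
            if offset_ratio is not None:
                tol = max(offset_ratio * (r_off - r_on), 0.05)
                if abs(e_off - r_off) > tol:
                    continue
            matched.add(i)
            tp += 1
            break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One estimate matches the reference onset within 50 ms; the other
# note's onset is off by 0.5 s, so it counts as both a miss and a
# false positive: precision = recall = F1 = 0.5.
ref = [(0.0, 1.0, 60), (1.0, 2.0, 64)]
est = [(0.02, 1.0, 60), (1.5, 2.0, 64)]
p, r, f = note_f1(ref, est, offset_ratio=0.2)
```

Requiring offsets to match as well (`offset_ratio` set) is what makes this metric stricter than onset-only note F1, which is why the reported gains on it are notable.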
Implications and Future Directions
This research offers significant implications for both theoretical and practical applications in the domain of music transcription. The dual-objective architecture could be adapted for transcription tasks beyond the piano, broadening the scope of MIR applications and aiding in the development of automated systems that require intricate auditory analysis, such as in digital music production and interactive music systems.
From a theoretical perspective, the success of this model supports the hypothesis that integrating onset information significantly enhances transcription accuracy. The findings suggest avenues for further research into onset-based conditioning of framewise inference in other polyphonic transcription tasks. Furthermore, the authors point out the need for more extensive datasets that include various recording environments and music styles, highlighting a gap that future research could address.
Despite its successes, the paper acknowledges certain constraints, such as the limited size of available datasets and the challenges of training models on recordings from diverse audio sources. Addressing these limitations could bolster the robustness and generalizability of such transcription models. Additionally, integrating these acoustic models with music language models remains a promising direction that could yield synergistic improvements in broader music processing tasks.
Overall, the paper presents a notable contribution to the field of automatic music transcription, providing a robust framework that not only progresses the state of the art in piano transcription but also lays the groundwork for future exploration in MIR and beyond.