
Melody transcription via generative pre-training (2212.01884v1)

Published 4 Dec 2022 in cs.SD, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: Despite the central role that melody plays in music perception, it remains an open challenge in music information retrieval to reliably detect the notes of the melody present in an arbitrary music recording. A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles - existing strategies work well for some melody instruments or styles but not all. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio, thereby improving performance on melody transcription by 20% relative to conventional spectrogram features. Another obstacle in melody transcription is a lack of training data - we derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music. The combination of generative pre-training and a new dataset for this task results in 77% stronger performance on melody transcription relative to the strongest available baseline. By pairing our new melody transcription approach with solutions for beat detection, key estimation, and chord recognition, we build Sheet Sage, a system capable of transcribing human-readable lead sheets directly from music audio. Audio examples can be found at https://chrisdonahue.com/sheetsage and code at https://github.com/chrisdonahue/sheetsage .

Citations (15)

Summary

  • The paper leverages generative pre-training with Jukebox representations to boost melody transcription accuracy by 20% relative to conventional spectrogram-based features.
  • It introduces a novel 50-hour crowdsourced dataset that enables a 77% improvement over previous baselines in melody transcription.
  • The method employs Transformer models with beat detection to accurately capture note onsets, offering detailed transcriptions for practical music applications.

Melody Transcription via Generative Pre-Training

The paper, "Melody Transcription via Generative Pre-Training," presents a method to improve melody transcription within the domain of Music Information Retrieval (MIR). The authors tackle two primary challenges: the diversity of audio sources and a lack of annotated data for training.

The primary contribution is the use of representations derived from Jukebox, a generative model pre-trained on a large corpus of music audio. These representations improve melody transcription performance by 20% relative to conventional spectrogram features, with gains holding across a broad spectrum of instrument ensembles and musical styles. The improvement underscores the potential of large-scale generative models for time-varying MIR tasks, in contrast to prior transfer approaches built around song-level objectives.

Additionally, the paper introduces a newly curated dataset consisting of 50 hours of melody transcriptions acquired through crowdsourcing. This dataset plays a critical role in the presented methodology, enabling the authors to achieve a 77% improvement over the previous strongest baseline.

The method involves extracting audio representations from Jukebox and averaging them over time intervals aligned to beats detected with the madmom library. A Transformer model then predicts note onsets and pitches from these beat-aligned features, allowing conversion to MIDI or music notation. This notably diverges from existing melody extraction techniques by offering a more complete transcription solution that outputs discrete notes with specific onsets and pitches.
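
To make the pipeline concrete, below is a minimal sketch of the beat-aligned feature pooling described above. Only the madmom beat-tracking calls correspond to a real library API; the feature extractor, its frame rate, and the number of subdivisions per beat are assumptions for illustration and may differ from the paper's exact configuration.

```python
import numpy as np
from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor

def beat_aligned_features(audio_path, feature_fn, feature_fps, subdivisions=4):
    """Pool frame-level features into beat-subdivision slots.

    feature_fn(audio_path) -> (num_frames, dim) array of representations,
    e.g. activations from a Jukebox layer (supplied by the caller);
    feature_fps is that representation's frame rate. Both are assumptions
    standing in for the paper's actual feature extraction.
    """
    # 1. Beat detection with madmom's RNN activation + DBN tracker.
    beat_act = RNNBeatProcessor()(audio_path)
    beat_times = DBNBeatTrackingProcessor(fps=100)(beat_act)

    # 2. Frame-level representations from the pre-trained model.
    feats = feature_fn(audio_path)

    # 3. Average the frames falling inside each beat subdivision,
    #    yielding one feature vector per (beat, subdivision) slot.
    pooled = []
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        edges = np.linspace(start, end, subdivisions + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            i0 = int(round(lo * feature_fps))
            i1 = max(int(round(hi * feature_fps)), i0 + 1)
            pooled.append(feats[i0:i1].mean(axis=0))

    # The (num_slots, dim) sequence is what a Transformer would consume
    # to predict a note onset/pitch (or rest) for each slot.
    return np.stack(pooled)
```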

For evaluation, the paper uses an onset-only note-wise F1 metric, emphasizing precision in note onset detection while allowing octave shifts in the predicted pitches. This reflects real-world applicability, where such invariance matters for many downstream music applications.
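
As a rough illustration of the metric, here is a small sketch of an onset-only, octave-invariant note F1. The 50 ms onset tolerance and the greedy matching strategy are assumptions for clarity, not claimed to match the paper's exact evaluation code.

```python
def onset_f1(ref_notes, est_notes, tol=0.05):
    """ref_notes / est_notes: lists of (onset_seconds, midi_pitch) pairs."""
    matched = set()
    tp = 0
    for r_on, r_pitch in ref_notes:
        for j, (e_on, e_pitch) in enumerate(est_notes):
            if j in matched:
                continue
            # Onsets must agree within the tolerance; pitch is compared
            # modulo 12 so octave errors are not penalised.
            if abs(r_on - e_on) <= tol and (r_pitch - e_pitch) % 12 == 0:
                matched.add(j)
                tp += 1
                break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```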

Empirically, Jukebox features yielded superior performance compared to conventional spectrogram features or representations derived from MT3, another pre-trained model. The findings imply that pre-training on generative tasks can greatly benefit MIR applications that require nuanced temporal audio understanding.

Beyond the methodological contributions, the authors demonstrate Sheet Sage, a system that transcribes music audio into human-readable lead sheets by integrating melody transcription with beat detection, key estimation, and chord recognition. This application exemplifies practical utility and may inspire further developments in automatic music transcription systems.
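
For intuition about how such a system's outputs fit together, here is a loose sketch of grouping beat-quantized melody notes and chord symbols into bars for a lead sheet. The data layout and the fixed four-beats-per-bar assumption are illustrative and not taken from the Sheet Sage codebase.

```python
from dataclasses import dataclass

@dataclass
class Note:
    beat: float   # onset position in beats from the start of the song
    pitch: int    # MIDI pitch

def group_by_bar(notes, chords, beats_per_bar=4):
    """notes: list[Note]; chords: dict mapping beat position -> chord symbol.

    Returns a bar-indexed dict, the rough shape a lead-sheet renderer
    would consume (one staff line of melody plus chord symbols per bar).
    """
    bars = {}
    for note in notes:
        bar_idx = int(note.beat // beats_per_bar)
        bars.setdefault(bar_idx, {"notes": [], "chords": []})["notes"].append(note)
    for beat, symbol in chords.items():
        bar_idx = int(beat // beats_per_bar)
        bars.setdefault(bar_idx, {"notes": [], "chords": []})["chords"].append((beat, symbol))
    return dict(sorted(bars.items()))
```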

The implications of this research are significant. Improvement in melody transcription can enhance interactive music applications, educational tools, source separation technologies, and more. Future work may explore scaling the pre-training processes, diversifying the training datasets, or applying similar strategies to other polyphonic or MIR-related tasks.

Overall, the paper emphasizes the potential of leveraging pre-trained generative models for improving MIR tasks, opening avenues for innovation in both practical applications and theoretical advancements within AI-driven music analysis.
