MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling (2408.05024v1)

Published 9 Aug 2024 in cs.SD, cs.CL, and cs.IR

Abstract: Guitar tablatures enrich the structure of traditional music notation by assigning each note to a string and fret of a guitar in a particular tuning, indicating precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the guitar, multiple string-fret assignments are possible for most pitches, which leads to a large combinatorial space that prevents exhaustive search approaches. Most modern methods use constraint-based dynamic programming to minimize some cost function (e.g. hand position movement). In this work, we introduce a novel deep learning solution to symbolic guitar tablature estimation. We train an encoder-decoder Transformer model in a masked language modeling paradigm to assign notes to strings. The model is first pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on a curated set of professionally transcribed guitar performances. Given the subjective nature of assessing tablature quality, we conduct a user study amongst guitarists, wherein we ask participants to rate the playability of multiple versions of tablature for the same four-bar excerpt. The results indicate our system significantly outperforms competing algorithms.

Summary

The paper introduces a Transformer-based approach that leverages masked language modeling to tackle the challenge of converting MIDI to guitar tablature.
It employs a two-stage training process with extensive tablature datasets and beam search to enhance string-assignment accuracy.
A user study and quantitative metrics demonstrate superior playability and ground truth agreement over traditional tab-generation software.

MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling

The paper "MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling" by Drew Edwards, Xavier Riley, Pedro Sarmento, and Simon Dixon presents a novel approach to the problem of generating guitar tablature from symbolic music notation by employing a deep learning technique. The researchers propose the use of an encoder-decoder Transformer model, leveraging masked language modeling to predict string assignments for given musical notes.
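To make the masked language modeling framing concrete, here is a minimal sketch of how a tablature excerpt might be turned into an input/target pair. The token names (`pitch_64`, `string_1`, `<mask>`) and the function `mask_string_tokens` are hypothetical and chosen for illustration; the paper's actual tokenization and masking ratio are not described in this summary. The sketch masks every string token, which corresponds to the inference-time situation where only the pitches are known.

```python
MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def mask_string_tokens(notes):
    """Build an (input, target) token pair from (pitch, string) notes.

    Every string token is replaced by MASK in the input, so a model
    trained on such pairs has to infer the string from pitch context.
    """
    source, target = [], []
    for pitch, string in notes:
        source += [f"pitch_{pitch}", MASK]
        target += [f"pitch_{pitch}", f"string_{string}"]
    return source, target


if __name__ == "__main__":
    # Two notes: E4 on string 1 and B3 on string 2.
    src, tgt = mask_string_tokens([(64, 1), (59, 2)])
    print(src)
    print(tgt)
```

An encoder-decoder model would read the masked sequence and be trained to emit the unmasked one; only the positions holding `<mask>` carry new information.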

Summary

The problem addressed centers on converting a symbolic musical performance, such as MIDI data, into guitar tablature. Because most pitches can be played at several string-fret positions on the guitar, the number of possible tablatures grows combinatorially with the length of the piece. Traditional methods use constraint-based dynamic programming to minimize a cost function such as hand stretch or movement. In contrast, the paper replaces the hand-designed cost function with a Transformer model that learns string assignments from existing tablatures.
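The combinatorial nature of the problem, and the kind of dynamic programming baseline the paper contrasts itself with, can be sketched as follows. This is an illustrative toy, not the algorithm of any specific tool: it assumes standard tuning, a 22-fret instrument, and a cost equal to the absolute fret distance between consecutive notes. The function names are hypothetical.

```python
from math import prod

# Open-string MIDI pitches in standard tuning, string 6 (low E) to string 1 (high E).
OPEN_STRINGS = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}
NUM_FRETS = 22  # assumption: a 22-fret guitar


def candidates(pitch):
    """Return every (string, fret) pair that sounds the given MIDI pitch."""
    return [
        (string, pitch - open_pitch)
        for string, open_pitch in OPEN_STRINGS.items()
        if 0 <= pitch - open_pitch <= NUM_FRETS
    ]


def search_space_size(pitches):
    """Number of distinct tablatures for a monophonic pitch sequence."""
    return prod(len(candidates(p)) for p in pitches)


def min_movement_tab(pitches):
    """Viterbi-style DP: choose one position per note so that the summed
    fret distance between consecutive notes is minimal (toy cost)."""
    if not pitches:
        return []
    options = [candidates(p) for p in pitches]
    if any(not opts for opts in options):
        raise ValueError("a pitch is outside the instrument's range")
    cost = [0] * len(options[0])  # cheapest path cost ending at each option
    back = []                     # backpointers, one list per transition
    for prev, cur in zip(options, options[1:]):
        new_cost, pointers = [], []
        for _, fret in cur:
            steps = [c + abs(fret - pf) for c, (_, pf) in zip(cost, prev)]
            best = min(range(len(steps)), key=steps.__getitem__)
            new_cost.append(steps[best])
            pointers.append(best)
        cost = new_cost
        back.append(pointers)
    k = min(range(len(cost)), key=cost.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [opts[i] for opts, i in zip(options, path)]


if __name__ == "__main__":
    print(candidates(64))                 # all positions for E4
    print(search_space_size([64] * 16))   # grows exponentially with length
    print(min_movement_tab([40, 45, 50]))
```

Under these assumptions E4 (MIDI 64) has five positions, so sixteen such notes already admit 5^16 tablatures. The dynamic programme avoids enumerating them, but its output is only as good as the hand-written cost, which is the limitation the learned approach aims to sidestep.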

The proposed methodology involves several key components:

Model Architecture and Training:
- An encoder-decoder Transformer model based on the BART architecture is utilized.
- The model is trained using a two-stage approach. Initially, pre-training is performed on the DadaGP dataset, consisting of over 25,000 guitar tablatures.
- Fine-tuning follows on a curated set of professionally transcribed guitar performances to refine the model's predictions.
Inference and Post-processing:
- Inference is conducted through a quintile-based prediction mechanism, combined with beam search to improve string-assignment accuracy.
- A post-processing heuristic is used to ensure notes are assigned to playable string-fret combinations, addressing rare but significant prediction errors.
Evaluation:
- Performance is evaluated with a user study, alongside quantitative metrics such as agreement with ground truth and comparison against existing software like Guitar Pro 8, MuseScore, and TuxGuitar.
- Results indicate the proposed system outperforms existing methods, achieving strong preference among guitarists.
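The post-processing step mentioned above can be pictured with a small sketch. The exact heuristic used in the paper is not given in this summary, so the rule below is an assumption: keep the predicted string when it can play the pitch, and otherwise fall back to the nearest string that can. Standard tuning and a 22-fret neck are also assumed, and `repair_assignment` is a hypothetical name.

```python
# Open-string MIDI pitches in standard tuning, string 6 (low E) to string 1 (high E).
OPEN_STRINGS = {6: 40, 5: 45, 4: 50, 3: 55, 2: 59, 1: 64}
NUM_FRETS = 22  # assumption: a 22-fret guitar


def repair_assignment(pitch, predicted_string):
    """Return a playable (string, fret) for a pitch.

    The model's string is kept when it yields a fret on the neck;
    otherwise the closest string (by string number) that can sound
    the pitch is used instead. This fallback rule is illustrative.
    """
    fret = pitch - OPEN_STRINGS[predicted_string]
    if 0 <= fret <= NUM_FRETS:
        return predicted_string, fret
    playable = [
        s for s, open_pitch in OPEN_STRINGS.items()
        if 0 <= pitch - open_pitch <= NUM_FRETS
    ]
    if not playable:
        raise ValueError(f"pitch {pitch} is outside the instrument's range")
    nearest = min(playable, key=lambda s: abs(s - predicted_string))
    return nearest, pitch - OPEN_STRINGS[nearest]


if __name__ == "__main__":
    print(repair_assignment(64, 1))  # valid prediction is kept
    print(repair_assignment(40, 1))  # low E cannot be played on string 1
```

A check of this kind leaves valid predictions untouched and changes the output only in the rare cases where the model proposes an impossible string.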

Quantitative and Qualitative Analysis

Quantitative Results:

The system's string assignments were compared with professional transcriptions, and its chord shapes were checked for playability:

Agreement with Ground Truth: 73.58% of predicted string assignments matched the ground truth.
Chords and Playability: The model displayed a moderate tendency towards larger fret stretches in some predictions, though the majority of chords fell within acceptable playability limits.
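Both quantities can be computed with a few lines of code. The sketch below assumes that agreement is the per-note fraction of matching string assignments and that a chord's stretch is the distance between its lowest and highest fretted notes, ignoring open strings; the paper's precise definitions are not stated in this summary, and the function names are hypothetical.

```python
def string_agreement(predicted, reference):
    """Fraction of notes whose predicted string equals the reference string."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 0.0
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def fret_stretch(chord):
    """Fret span of a chord given as (string, fret) pairs.

    Open strings (fret 0) need no finger, so they are ignored.
    """
    fretted = [fret for _, fret in chord if fret > 0]
    return max(fretted) - min(fretted) if fretted else 0


if __name__ == "__main__":
    print(string_agreement([1, 2, 3, 3], [1, 2, 3, 4]))
    # Open C major shape: string 5 fret 3, string 4 fret 2, string 3 open, string 2 fret 1.
    print(fret_stretch([(5, 3), (4, 2), (3, 0), (2, 1)]))
```

A stretch threshold (for example four frets) could then be used to flag chords that are likely to be uncomfortable, though any such threshold is a judgment call rather than something stated here.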

Qualitative Results:

A user study involving 15 guitarists provided substantial support for the practical utility of the model:

Playability Ratings: Participants rated the model's outputs higher than those from commercial software solutions, with the system achieving a mean playability score of 6.04, compared with 3.32 to 4.69 for the other systems.
Subjectivity in Tablature: Variability in individual preferences for fingerings and positions was noted, emphasizing the subjective nature of tablature assessment.

Implications and Future Directions

The findings from this paper indicate several practical and theoretical implications for the field of guitar tablature generation and music transcription broadly:

Modeling Playability Enhancement: The results suggest a promising direction for enhancing the model to incorporate playability more effectively, potentially through loss functions that better represent physical constraints or alternative model architectures.
Generative Capabilities: The method has the potential to be adapted for automatic guitar arrangement generation, supporting diverse tuning systems and enriched musical expressions through machine learning.
Integration Across Modalities: Future work integrating audio and video data could lead to better alignment with the authentic performances and more accurate transcriptions.

Conclusion

The research presented in this paper marks a significant advance in automated guitar tablature generation. By pre-training a Transformer model on a large tablature corpus and fine-tuning it on professionally transcribed performances, the method improves on traditional optimization-based methods and existing software. The evaluation, which combines quantitative metrics with a user study among guitarists, supports the model's practicality and effectiveness. Future work may refine the model to better account for physical playability and extend it to other tunings and performance contexts.