- The paper proposes a new tuple-encoded representation that reduces memory requirements and supports generating longer multitrack compositions.
- The model employs a decoder-only architecture with autoregressive prediction, significantly boosting inference speed over baselines built on conventional multitrack representations.
- Evaluation shows that despite slightly lower subjective scores, the model can produce up to 3.5 times longer samples, enabling real-time music applications.
Multitrack Music Transformer: Enhancements in Multitrack Music Generation
The rapid advancement of transformer models has strongly influenced symbolic music generation, particularly for multitrack compositions. The paper "Multitrack Music Transformer" presents a novel approach to the remaining challenges of generating multitrack music with transformers, focusing on the number of supported instruments, segment length, and inference speed. Here, we explore the methodology, results, and implications of the research, providing an overview aimed at experienced researchers in the domain.
Methodology and Model Design
The central innovation of the paper is a new data representation for multitrack music that reduces memory requirements and enables the generation of longer, more intricate compositions. The Multitrack Music Transformer (MMT) employs a decoder-only architecture with multi-dimensional inputs and outputs, diverging from standard one-dimensional transformer implementations. Each musical piece is represented as a sequence of tuple-encoded events, where each tuple carries six fields: type, beat, position, pitch, duration, and instrument. This compact representation lets the model accommodate considerably longer musical sequences than previous representations under the same GPU memory budget.
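To make the representation concrete, here is a minimal Python sketch of tuple-encoded events. The six field names follow the paper, but the event-type numbering and the toy values are illustrative assumptions rather than the authors' exact vocabulary.

```python
# A minimal sketch of the tuple-encoded event representation described above.
# The six field names follow the paper; the event-type numbering and the toy
# values below are illustrative assumptions, not the authors' exact scheme.

FIELDS = ("type", "beat", "position", "pitch", "duration", "instrument")

# Assumed event-type codes for illustration:
# 0 = start-of-song, 1 = instrument declaration, 2 = start-of-notes,
# 3 = note, 4 = end-of-song.
NOTE = 3

def encode_note(beat, position, pitch, duration, instrument):
    """Pack one note into a fixed-length tuple, so a whole piece becomes a
    (sequence_length, 6) integer array instead of a long 1-D token stream."""
    return (NOTE, beat, position, pitch, duration, instrument)

# A two-note fragment: each row is one event, so sequence length grows with
# the number of notes rather than with the number of fields per note.
song = [
    encode_note(beat=0, position=0, pitch=60, duration=4, instrument=0),   # piano C4
    encode_note(beat=0, position=2, pitch=64, duration=2, instrument=25),  # guitar E4
]
```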
The MMT is trained to predict these tuples autoregressively, supporting tasks such as unconditioned generation, instrument-informed generation, and N-beat continuation. Training uses a learnable positional embedding and a cross-entropy loss computed over each field of the tuple. Because an entire note is emitted in a single decoding step rather than as several consecutive tokens, the model produces music more efficiently than prior methods.
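The PyTorch sketch below illustrates this multi-dimensional setup under stated assumptions: one embedding table per field summed into a single vector per event, a learnable positional embedding, a causally masked decoder-only stack, one output head per field, and a cross-entropy loss summed over the fields. The vocabulary sizes, model width, and layer counts are placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Assumed field vocabulary sizes (type, beat, position, pitch, duration, instrument).
FIELD_SIZES = [4, 256, 12, 128, 32, 64]
D_MODEL, MAX_LEN = 512, 1024

class TupleDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table per field; embeddings are summed into one vector per event.
        self.field_embeds = nn.ModuleList(nn.Embedding(v, D_MODEL) for v in FIELD_SIZES)
        self.pos_embed = nn.Embedding(MAX_LEN, D_MODEL)  # learnable positional embedding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=6)
        # One output head per field predicts that field of the next event.
        self.heads = nn.ModuleList(nn.Linear(D_MODEL, v) for v in FIELD_SIZES)

    def forward(self, x):                      # x: (batch, seq_len, 6) integer codes
        seq_len = x.size(1)
        h = sum(emb(x[..., i]) for i, emb in enumerate(self.field_embeds))
        h = h + self.pos_embed(torch.arange(seq_len, device=x.device))
        # Causal mask so each event attends only to earlier events.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1)
        h = self.decoder(h, mask=mask)
        return [head(h) for head in self.heads]  # one logit tensor per field

def loss_fn(logits, targets):
    """Cross-entropy summed over the six fields (targets: (batch, seq_len, 6))."""
    return sum(nn.functional.cross_entropy(l.transpose(1, 2), targets[..., i])
               for i, l in enumerate(logits))

# Toy usage: values stay within every field's vocabulary for this random batch.
model = TupleDecoder()
x = torch.randint(0, 4, (2, 16, 6))
logits = model(x)
```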
Results and Evaluation
The MMT was evaluated through subjective listening tests and objective metrics. In the subjective tests, it was compared with two baseline models, MMM and REMI+, both of which rely on conventional one-dimensional multitrack representations. Although the MMT's scores on coherence, richness, and arrangement were slightly lower than REMI+'s (a mean opinion score of 3.33 versus 3.77), it generated samples 2.6 to 3.5 times longer than its counterparts and ran noticeably faster at inference (11.79 notes per second, versus 5.66 for MMM and 3.58 for REMI+).
Objective evaluation used metrics such as pitch class entropy and scale consistency, which serve as proxies for perceived musical quality. While REMI+ scored closer to the ground truth on these measures, the MMT remained competitive, underscoring its ability to generate musically coherent output at a much lower computational cost.
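For reference, the two metrics can be computed from their standard definitions (as implemented, for example, in the MusPy library): pitch class entropy is the Shannon entropy of the 12-bin pitch-class histogram, and scale consistency is the largest fraction of notes that fits any major or minor scale. The sketch below follows these definitions; the paper's exact evaluation pipeline may differ.

```python
import numpy as np

def pitch_class_entropy(pitches):
    """Shannon entropy (bits) of the 12-bin pitch-class histogram."""
    counts = np.bincount(np.asarray(pitches) % 12, minlength=12)
    probs = counts / counts.sum()
    probs = probs[probs > 0]                      # drop empty bins to avoid log(0)
    return float(-(probs * np.log2(probs)).sum())

MAJOR = [0, 2, 4, 5, 7, 9, 11]
MINOR = [0, 2, 3, 5, 7, 8, 10]

def scale_consistency(pitches):
    """Largest fraction of notes that fits any major or minor scale."""
    pcs = np.asarray(pitches) % 12
    best = 0.0
    for root in range(12):
        for scale in (MAJOR, MINOR):
            in_scale = np.isin((pcs - root) % 12, scale).mean()
            best = max(best, float(in_scale))
    return best

print(pitch_class_entropy([60, 62, 64, 65, 67]))  # diatonic fragment: well below the 3.58-bit maximum
print(scale_consistency([60, 62, 64, 65, 67]))     # 1.0: every note fits C major
```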
Implications and Future Directions
The most striking implication of this research is the MMT's enhanced capability for real-time music applications, making it suitable for scenarios requiring quick improvisation or interactive music co-creation. The new representation and model structure signify a meaningful step towards more efficient transformer-based music generation, promising broader application across diverse musical styles and settings.
Furthermore, the paper's systematic analysis of musical self-attention points to ways of optimizing attention mechanisms for symbolic music generation. By showing that the MMT attends to musically meaningful structure, such as beat alignment and pitch harmony, it lays the groundwork for future work on refining these mechanisms.
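One way to make such an analysis concrete is to aggregate attention weights by the relative beat offset between an event and the earlier events it attends to, then look for peaks at musically salient offsets. The sketch below is an assumed probing procedure in this spirit, not the authors' exact method; the input shapes and averaging scheme are illustrative.

```python
import numpy as np

def mean_attention_by_offset(attn, beats, max_offset=16):
    """attn: (seq_len, seq_len) attention weights from one head, rows attend to columns;
    beats: (seq_len,) beat index of each event. Returns mean weight per beat offset."""
    attn, beats = np.asarray(attn), np.asarray(beats)
    totals = np.zeros(max_offset + 1)
    counts = np.zeros(max_offset + 1)
    for i in range(len(beats)):
        for j in range(i + 1):                    # causal: only look backwards
            offset = beats[i] - beats[j]
            if 0 <= offset <= max_offset:
                totals[offset] += attn[i, j]
                counts[offset] += 1
    # Peaks at offsets such as 4 or 8 beats would suggest attention to metrical structure.
    return totals / np.maximum(counts, 1)
```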
The research presents a compelling case for integrating transformers in multitrack music generation, balancing the trade-off between inference speed and musical quality. It invites continued exploration into longer-form and real-time capable music generation, promising exciting developments in the evolution of AI's role in music composition.