Music Transformer (1809.04281v3)

Published 12 Sep 2018 in cs.LG, cs.SD, eess.AS, and stat.ML

Abstract: Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.

Citations (450)

Summary

  • The paper introduces an enhanced relative self-attention mechanism that reduces the intermediate memory requirement from O(L²D) to O(LD), enabling efficient long-term music modeling.
  • Experiments on the JSB Chorales and Piano-e-Competition datasets show improved perplexity, with state-of-the-art results on the latter, and better musical coherence.
  • Results demonstrate the model's ability to generate expressive, 60-second piano performances, advancing AI-driven music composition.

Music Transformer: Generating Music with Long-Term Structure

The paper "Music Transformer: Generating Music with Long-Term Structure" proposes an innovative adaptation of the Transformer model to the domain of music generation. Recognizing that music requires the modeling of intricate long-term dependencies, the authors address the Transformer’s limitations with respect to sequence length and positional encoding.

Technical Contributions

The core contribution of this work lies in enhancing the relative self-attention mechanism to handle longer sequences efficiently. The original formulation (Shaw et al., 2018) incurs a significant memory overhead: its intermediate relative-position tensor requires O(L²D) memory, where L is the sequence length and D the hidden size. The authors propose an algorithm that reduces this requirement to O(LD) through an efficient "skewing" procedure. This improvement is critical for practical applications, particularly in music, where sequences can be extremely long.
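
To make the "skewing" idea concrete, here is a minimal NumPy sketch, assuming a single causal attention head; the function and variable names are illustrative, not taken from the authors' code. The queries are multiplied against an (L, D) table of relative-position embeddings, and the resulting (L, L) matrix is padded, reshaped, and sliced so that each entry lands at its correct relative distance. This avoids ever materializing the (L, L, D) intermediate tensor of the original formulation: at, say, L = 2048 and D = 512, that tensor would hold about 2.1 billion values (roughly 8.6 GB in float32), whereas the skewed computation only needs (L, L) matrices of the kind standard attention already produces.

```python
import numpy as np

def relative_logits(q, rel_emb):
    """Memory-efficient relative attention logits via "skewing".

    A sketch of the idea, not the authors' implementation.
    q:       (L, D) query vectors for one attention head
    rel_emb: (L, D) embeddings for relative distances -(L-1) .. 0
    returns: (L, L) matrix S_rel with
             S_rel[i, j] = q[i] @ rel_emb[j - i + L - 1] for j <= i;
             entries above the diagonal are junk and are assumed to be
             removed by the causal mask downstream.
    """
    L = q.shape[0]
    # qe[i, r] = q[i] . rel_emb[r] -- an (L, L) matrix, replacing the
    # (L, L, D) intermediate tensor of the original formulation.
    qe = q @ rel_emb.T
    # Skew: prepend a dummy zero column, reshape (L, L+1) -> (L+1, L),
    # and drop the first row so every row shifts into alignment.
    padded = np.pad(qe, ((0, 0), (1, 0)))
    return padded.reshape(L + 1, L)[1:]
```

These skewed logits are then added to the usual content logits before the softmax, i.e. softmax((QKᵀ + S_rel) / √D)V, matching the relative-attention formulation the paper builds on.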

Experiments and Results

The authors validate their model on two datasets: JSB Chorales and the Piano-e-Competition dataset. The Transformer with relative attention outperforms existing architectures such as LSTMs, achieving state-of-the-art perplexity on the Piano-e-Competition dataset. For instance, they demonstrate the capability to model expressive piano performances spanning 60 seconds, four times the 15-second segments modeled in prior work (Oore et al., 2018).
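
For context, perplexity here is the standard exponentiated average negative log-likelihood per predicted token; a minimal illustration (the values below are made up, not results from the paper):

```python
import numpy as np

# Per-token negative log-likelihoods (natural log) on a held-out set;
# purely illustrative numbers, not figures from the paper.
token_nlls = np.array([2.1, 1.7, 2.4])
perplexity = np.exp(token_nlls.mean())  # exp of the mean NLL
```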

Moreover, the paper highlights the qualitative advantages of relative attention in generating music. The enhanced self-attention mechanism captures periodicity and intricate structure, producing samples that human evaluators judged more coherent than those from baseline models.

Applications and Implications

This research signifies a substantial step forward in using deep learning for music generation, showcasing the Transformer’s ability to handle complex sequences with relative ease. The implications extend beyond music, suggesting potential improvements in other domains requiring long-term dependency modeling, such as time-series analysis.

Future Directions

Looking ahead, the authors speculate that their model could inspire further exploration into relative attention mechanisms for various forms of data beyond symbolic music. The ability to generalize across different sequence lengths without training on each specific length suggests broader applications, possibly in real-time music accompaniment systems or interactive composition tools.

In conclusion, the Music Transformer represents a significant advance in the application of AI to music generation, effectively addressing key challenges in sequence modeling and providing a robust framework for further innovation in AI-driven creativity.
