Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs (2101.02402v1)

Published 7 Jan 2021 in cs.SD, cs.AI, and eess.AS

Abstract: To apply neural sequence models such as the Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. And, we propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs. And, we employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiment shows that, compared to state-of-the-art models, the proposed model converges 5--10 times faster at training (i.e., within a day on a single GPU with 11 GB memory), and with comparable quality in the generated music.

Citations (161)

Summary

  • The paper introduces the Compound Word Transformer, a novel architecture that groups music tokens into "compound words" and uses dynamic directed hypergraphs to efficiently compose full songs.
  • This model achieves 5--10 times faster training convergence on a single GPU, with output quality comparable to existing state-of-the-art music generation methods.
  • The research demonstrates the value of tailoring neural architectures to data domain specifics, offering potential for efficient, complex music composition and applications in other attribute-rich data domains.

Compound Word Transformer: Advancements in Full-Song Music Composition

The paper "Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs" presents a method of music generation built on a specialized Transformer architecture. The work addresses a gap in how neural sequence models are applied to music by recognizing, and exploiting, the ways music tokens differ from natural-language tokens.

Core Approach and Methodology

The authors introduce a new music token representation that groups related tokens into "compound words": musical attributes such as pitch, duration, and velocity are represented together, letting the model process much shorter sequences. This departs from traditional models that treat every token identically, as in text processing. The distinction matters because a musical note, unlike a word, bundles several attributes (pitch, duration, dynamics, onset time) that carry different kinds of information.
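As a rough illustration of this grouping (a sketch under assumptions, not the paper's exact vocabulary or tokenization), the snippet below collapses consecutive per-attribute tokens into compound-word events; the `CompoundWord` dataclass, the token names, and `group_into_compound_words` are hypothetical.

```python
# Minimal sketch: grouping neighboring attribute tokens into "compound words".
# The token stream format and field names are illustrative, not the paper's
# exact vocabulary.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CompoundWord:
    """One grouped event: a metric event (beat/tempo/chord) or a note event."""
    family: str                   # "metric" or "note"
    beat: Optional[int] = None    # metric attributes
    tempo: Optional[int] = None
    chord: Optional[str] = None
    pitch: Optional[int] = None   # note attributes
    duration: Optional[int] = None
    velocity: Optional[int] = None

def group_into_compound_words(tokens: List[tuple]) -> List[CompoundWord]:
    """Convert a flat (type, value) token sequence into compound words.

    Each note is assumed to arrive as consecutive pitch/duration/velocity
    tokens, and each grid position as beat/tempo/chord tokens, so grouping
    neighbors shrinks the sequence by roughly the group size.
    """
    words, current = [], {}
    for token_type, value in tokens:
        current[token_type] = value
        if token_type == "velocity":      # last attribute of a note group
            words.append(CompoundWord(family="note", **current))
            current = {}
        elif token_type == "chord":       # last attribute of a metric group
            words.append(CompoundWord(family="metric", **current))
            current = {}
    return words

flat = [("beat", 0), ("tempo", 110), ("chord", "Cmaj"),
        ("pitch", 60), ("duration", 4), ("velocity", 80),
        ("pitch", 64), ("duration", 4), ("velocity", 72)]
print(group_into_compound_words(flat))   # 3 compound words instead of 9 tokens
```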

To manage this heterogeneity, each token type is modeled by its own feed-forward head in a Transformer decoder, so the model can account for the distinct distribution of each type. The expansion-compression trick, which groups neighboring tokens into compound words, shortens the sequences substantially and makes computational modeling of full-song compositions tractable.
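The sketch below illustrates the per-type-head idea: one embedding table per token type, a shared causal core, and a separate linear output head per type. The vocabulary sizes, dimensions, and the standard `nn.TransformerEncoder` core are placeholder assumptions, not the authors' exact architecture or training setup.

```python
# Sketch of a decoder with a shared causal core and separate output heads per
# token type. Vocabulary sizes and model dimensions are illustrative only.
import torch
import torch.nn as nn

TOKEN_TYPES = {"family": 4, "beat": 18, "tempo": 64, "chord": 135,
               "pitch": 88, "duration": 32, "velocity": 32}

class CompoundWordDecoder(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # One embedding table per token type; per-type embeddings of a
        # compound word are concatenated and projected to the model dimension.
        self.embeds = nn.ModuleDict(
            {t: nn.Embedding(v, d_model) for t, v in TOKEN_TYPES.items()})
        self.merge = nn.Linear(d_model * len(TOKEN_TYPES), d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.core = nn.TransformerEncoder(layer, n_layers)
        # Separate feed-forward head per token type.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(d_model, v) for t, v in TOKEN_TYPES.items()})

    def forward(self, tokens: dict) -> dict:
        # tokens[t]: LongTensor of shape (batch, seq_len) for each type t
        x = torch.cat([self.embeds[t](tokens[t]) for t in TOKEN_TYPES], dim=-1)
        x = self.merge(x)
        seq_len = x.size(1)
        # Additive causal mask so each position attends only to its past.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                            diagonal=1)
        h = self.core(x, mask=causal)
        return {t: self.heads[t](h) for t in TOKEN_TYPES}

model = CompoundWordDecoder()
batch = {t: torch.randint(0, v, (2, 16)) for t, v in TOKEN_TYPES.items()}
logits = model(batch)   # per-type logits, e.g. logits["pitch"]
```

Because every head predicts over its own, much smaller vocabulary, each type can be sampled with its own strategy, which is the flexibility the per-type design is meant to provide.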

Numerical Results and Model Efficiency

One of the striking results is the model's training efficiency: the Compound Word Transformer converges 5--10 times faster than existing state-of-the-art models on a single GPU with 11 GB of memory. This is notable given the scale of the compositions involved, which can include up to 10,000 individual tokens per song. Despite the reduced training cost, the quality of the generated music is comparable to that of existing methods.

Implications and Future Directions

Practically, this research holds promise for more efficient music generation processes, making the composition of lengthy and complex pieces feasible with limited computational resources. This can be particularly transformative in settings where resources are constrained. Theoretically, the use of dynamic directed hypergraphs introduces a novel perspective on sequence modeling, which might find applications beyond music, in domains where entities or events are represented by multiple interconnected attributes.

The paper also opens avenues for further exploration in adaptive representation and prediction. Future work could extend the compound-word framework toward even more expressive representations without compromising computational feasibility, and could investigate how the model's grounding in dynamic directed hypergraphs and graph neural networks might inspire new architectures or learning strategies elsewhere in AI.

Conclusion

The "Compound Word Transformer" establishes a distinct method for modeling music sequences by explicitly accounting for token-type heterogeneity. While ensuring model efficiency and maintaining output quality, this research emphasizes the importance of tailoring neural architectures to suit the specific attributes of the data domain. As AI continues to innovate and tackle complex tasks, methodologies like this exemplify how domain-specific adaptations can yield significant advancements in performance and applicability.