- The paper introduces a dual-layer MIDI encoding framework that separates composition and performance tokens for enhanced expressivity.
- It demonstrates that PerTok reduces sequence length by up to 59% and vocabulary size by up to 95% compared to existing tokenizers.
- Listener studies confirm that the Cadenza framework generates expressive, human-like musical variations.
Overview of "PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations"
The paper, authored by Julian Lenz and Anirudh Mani, introduces Cadenza, a generative framework for modeling expressive variations of symbolic musical ideas. Focusing on the 'development' phase of music creation, the authors propose a novel MIDI encoding method termed PerTok (Performance Tokenizer) that efficiently encapsulates expressive musical details while significantly reducing sequence and vocabulary sizes. The framework comprises a Composer model and a Performer model, each playing a distinct role in the generative process.
PerTok: MIDI Encoding Method
PerTok is a key contribution of this paper, designed to address inefficiencies in existing MIDI tokenizers. Compared to established methodologies like REMI and TSD, PerTok reduces sequence length by up to 59% and vocabulary size by up to 95%. It separates composition and performance tokens, enabling a model to distinguish between quantized, score-level events and expressive, performance-level nuances such as micro-timings and velocities. This dual-layer tokenization is more aligned with typical music production scenarios and eliminates the need to rely on tempo tokens for expressive variation modeling.
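To make the dual-layer idea concrete, here is a minimal sketch of splitting one MIDI note into score-level and performance-level tokens. The token names, grid resolution, and bin counts are illustrative assumptions, not the paper's actual vocabulary or implementation.

```python
# Hypothetical sketch of dual-layer tokenization. Token names, grid sizes,
# and bin counts are illustrative assumptions, not PerTok's actual scheme.

def tokenize_note(pitch, onset_s, duration_s, velocity,
                  beat_s=0.5, grid=4, micro_bins=8, vel_bins=4):
    """Split one MIDI note into composition tokens (quantized score events)
    and performance tokens (micro-timing and velocity nuances)."""
    step = beat_s / grid                      # quantization step in seconds
    q_onset = round(onset_s / step)           # nearest grid position
    q_dur = max(1, round(duration_s / step))  # duration in grid steps

    # Composition layer: quantized, score-level events
    composition = [f"Position_{q_onset}", f"Pitch_{pitch}", f"Duration_{q_dur}"]

    # Performance layer: expressive deviations from the quantized score
    micro = onset_s - q_onset * step          # signed timing offset in seconds
    micro_tok = max(0, min(micro_bins - 1,
                           int((micro / step + 0.5) * micro_bins)))
    vel_tok = min(vel_bins - 1, velocity * vel_bins // 128)
    performance = [f"MicroTiming_{micro_tok}", f"Velocity_{vel_tok}"]

    return composition, performance
```

Because expressive detail lives in a handful of small bins rather than in tempo tokens or a fine-grained time grid, the combined vocabulary stays compact, which is the intuition behind the reported sequence- and vocabulary-size reductions.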
Cadenza Framework
The framework consists of two primary models:
- Composer Model: A VAE with transformer architecture leveraging RoPE for advanced positional embedding. It encodes input musical ideas into latent representations that guide the generation of stylistically coherent, yet novel musical variations. Free bits regularization is utilized to manage the trade-off between preserving original input features and introducing novelty.
- Performer Model: A bidirectional transformer encoder that predicts performance-specific tokens, refining score-level MIDI data with human-like expressivity. It adopts a BERT-style masked-token objective to capture subtle performance characteristics.
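The free bits technique mentioned for the Composer model can be sketched as a per-dimension floor on the KL term of a diagonal-Gaussian VAE. The function name, threshold value, and NumPy formulation below are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of free-bits KL regularization for a diagonal-Gaussian
# VAE latent. The threshold `lam` and all names are assumptions, not the
# paper's implementation.
import numpy as np

def free_bits_kl(mu, logvar, lam=0.5):
    """KL(q(z|x) || N(0, I)) per latent dimension, clamped below at `lam` nats.

    Dimensions whose KL already falls under the threshold contribute a
    constant, so the optimizer stops pushing them toward the prior. This
    preserves enough latent information about the input idea (fidelity)
    while still leaving room for novelty in the decoded variation."""
    kl_per_dim = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)
    return np.maximum(kl_per_dim, lam).sum()
```

With a larger `lam`, more information about the input survives in the latent code, trading novelty for similarity, which mirrors the fidelity/novelty trade-off the Composer ablations explore.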
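The Performer's BERT-style objective can be sketched as masking a fraction of performance tokens and training the model to reconstruct them from context. The `[MASK]` convention, mask rate, and function names below are assumptions, not the paper's exact recipe.

```python
# Minimal sketch of BERT-style masked training targets for performance
# tokens. The [MASK] symbol and 15% rate are assumptions borrowed from
# BERT, not necessarily the paper's settings.
import random

MASK = "[MASK]"

def mask_performance_tokens(tokens, rate=0.15, rng=None):
    """Replace a fraction of performance tokens with [MASK]; the model is
    trained to reconstruct the originals, learning to infer expressive
    nuances (velocity, micro-timing) from the surrounding score context."""
    rng = rng or random.Random(0)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            inputs.append(MASK)
            targets.append(tok)      # loss is computed on this position
        else:
            inputs.append(tok)
            targets.append(None)     # no loss at unmasked positions
    return inputs, targets
```

At inference time, all performance slots of a quantized score can be masked, so the model fills in expressive values for an entirely unperformed composition.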
Experimental Evaluation
The paper presents detailed evaluations. Composer ablations demonstrate how different KL regularization strategies affect musical similarity to input data. Balanced KL regularization achieves an optimal trade-off, providing meaningful variations. Performer fidelity is validated through comparative analysis of predicted versus original dataset expressive characteristics, with results indicating effective modeling of unique expressive patterns from small datasets.
Human Evaluation
In user studies, Cadenza compares favorably against models such as AMT and Figaro, receiving notably high ratings for human-like expressivity. While its overall musical appeal matches that of existing models, it stands out in generating human-like performance characteristics.
Implications and Future Directions
The paper's contributions include setting a new standard for MIDI tokenization and expressive modeling in symbolic music generation. The research underlines the importance of small, expressive token vocabularies for efficient generative modeling, paving the way for AI-assisted composition tools that respect artistic workflows.
Future work could explore expanded control over expressive parameters and improved long-form sequence generation, enhancing its utility for diverse music genres. Collaborations with artists are encouraged to refine model capabilities according to real-world musical requirements, emphasizing ethical considerations in generative AI's role within creative industries.