- The paper introduces a generative neural network model that produces expressive MIDI piano performances, addressing critiques of symbolic music’s rigidity.
- It details innovative data processing techniques such as micro-timing, polyphonic reinterpretation, and perceptually-informed quantization to capture musical nuances.
- The model employs a convolved multi-argument LSTM with an integrated attention mechanism to handle multiple musical features, paving the way for adaptive symbolic music generation.
The paper entitled "Expressive MIDI-format Piano Performance Generation" offers a detailed exploration of the construction of a generative neural network specifically designed to produce expressive musical performances in MIDI format. The focus on MIDI serves a dual purpose: to address the critique that symbolic music lacks expressivity, and to provide an alternative to audio-based music generation systems. The paper highlights novel approaches in data processing and neural network design, setting the stage for significant contributions to the field of symbolic music generation.
Symbolic vs. Audio-Based Music Generation
The paper highlights the prevailing trend towards audio-based music generation, facilitated by tools within digital audio workstations (DAWs), and addresses the resulting decline in symbolic music generation, often attributed to its perceived rigidity and lack of flexibility. However, the author presents a compelling case for symbolic music through several arguments:
- Absence of Mature Audio Models: Models that generate music effectively from raw audio remain scarce, largely because of the difficulty of managing enormous numbers of samples and of defining higher-order musical representations.
- Untapped Potential of Symbolic Music: While symbolic music is criticized for its perceived rigidity, the paper argues that, with improved data processing, MIDI can capture the full expressive spectrum of a performance.
- Optimal High-Level Representation: Unlike audio samples, which require feature learning, MIDI already provides a high-level representation of music, making it a well-suited input for generative models.
- Objective Assessment of Quality: Due to its formal nature, symbolic music aligns well with classical standards, allowing for a more objective evaluation of musical quality.
Innovations in Data Processing
The paper introduces a set of strategies aimed at enhancing the generation of realistic auditory experiences from symbolic input, focusing primarily on the nuances of music performance that go beyond mere note sequences:
- Micro-Timing without Fixed Grids: Eschewing traditional fixed grid systems, the model uses precise millisecond timings to overcome the mechanical sound characteristic of quantized systems.
- Redefinition of Polyphony and Monophony: The research proposes treating polyphonic music as a sequence of monophonic events, capturing the subtle temporal offsets between nominally simultaneous notes and thereby simplifying generation with sequential models.
- Inclusion of Control Events: Recognizing the role of the sustain pedal in piano music, the model incorporates this vital control event, enriching the expressive potential of the generated sequences.
- Perceptually-Informed Quantization: Borrowing from Weber's law, perceptually informed quantization is applied so that note features such as velocity and duration are distributed evenly with respect to human auditory perception; the sketch after this list illustrates these processing steps.
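To make these steps concrete, the following is a minimal Python sketch (using the `pretty_midi` library) of how a piano MIDI file could be flattened into this kind of event stream: absolute onset times in seconds rather than a fixed grid, polyphony serialized by onset order, sustain-pedal control changes retained, and velocity and duration quantized on a logarithmic scale in the spirit of Weber's law. The bin counts, duration range, and function names are illustrative assumptions, not the paper's exact choices.

```python
# Illustrative sketch only; parameters below are assumptions, not the paper's values.
import math
import pretty_midi

NUM_BINS = 32                   # assumed number of quantization bins
MIN_DUR, MAX_DUR = 0.01, 8.0    # assumed note-duration range in seconds

def log_quantize(value, lo, hi, bins=NUM_BINS):
    """Weber-law-inspired quantization: bins are spaced in equal *ratio*
    steps, so small values get finer resolution than large ones."""
    value = min(max(value, lo), hi)
    return round((bins - 1) * math.log(value / lo) / math.log(hi / lo))

def midi_to_events(path):
    """Flatten a (possibly polyphonic) piano MIDI file into one
    time-ordered stream of note and pedal events, keeping onsets
    in raw seconds instead of snapping them to a metrical grid."""
    pm = pretty_midi.PrettyMIDI(path)
    events = []
    for inst in pm.instruments:
        for note in inst.notes:
            events.append({
                "type": "note",
                "onset": note.start,                         # micro-timing: no grid
                "pitch": note.pitch,
                "velocity_bin": log_quantize(note.velocity, 1, 127),
                "duration_bin": log_quantize(note.end - note.start,
                                             MIN_DUR, MAX_DUR),
            })
        for cc in inst.control_changes:
            if cc.number == 64:                              # sustain pedal (CC 64)
                events.append({"type": "pedal",
                               "onset": cc.time,
                               "value": cc.value})
    # Polyphony as a sequence: order every event by its precise onset time,
    # so chords become runs of near-simultaneous monophonic events.
    return sorted(events, key=lambda e: e["onset"])
```

The logarithmic spacing means that each step between bins corresponds to roughly the same perceived change in loudness or length, which is the practical content of the Weber's-law argument.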
Model Architecture: Convolved Multi-Argument LSTM
To tackle the intricacies of expressive music generation, the paper outlines the deployment of an LSTM network, designed to capture long-term dependencies inherent in musical sequences. Notably, the model accommodates multiple input and output features through an integrated attention mechanism, ensuring that interrelated musical elements, such as note value and duration, are not treated independently. This interconnected approach allows the model to produce more coherent and musically intuitive sequences.
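As a rough illustration of such a multi-feature recurrent model, the PyTorch sketch below embeds pitch, duration, and velocity jointly, passes the concatenation through an LSTM, and applies self-attention over the outputs before separate prediction heads. The dimensions, layer counts, and exact placement of the attention are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch of a multi-feature LSTM with attention; hyperparameters are assumed.
import torch
import torch.nn as nn

class MultiFeatureLSTM(nn.Module):
    """Reads and predicts several interrelated note features (pitch,
    duration bin, velocity bin) jointly rather than independently."""

    def __init__(self, n_pitch=128, n_dur=32, n_vel=32, emb=64, hidden=512):
        super().__init__()
        # One embedding per feature; their concatenation feeds the LSTM,
        # so the recurrence always sees the features together.
        self.pitch_emb = nn.Embedding(n_pitch, emb)
        self.dur_emb = nn.Embedding(n_dur, emb)
        self.vel_emb = nn.Embedding(n_vel, emb)
        self.lstm = nn.LSTM(3 * emb, hidden, num_layers=2, batch_first=True)
        # Self-attention over the LSTM outputs lets each step consult
        # earlier musical context beyond the recurrent state.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Separate output heads, all conditioned on the same attended state.
        self.pitch_head = nn.Linear(hidden, n_pitch)
        self.dur_head = nn.Linear(hidden, n_dur)
        self.vel_head = nn.Linear(hidden, n_vel)

    def forward(self, pitch, dur, vel):
        x = torch.cat([self.pitch_emb(pitch),
                       self.dur_emb(dur),
                       self.vel_emb(vel)], dim=-1)
        h, _ = self.lstm(x)
        # Causal mask so each step attends only to past events,
        # keeping the model usable for autoregressive generation.
        T = h.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"), device=h.device),
                          diagonal=1)
        h, _ = self.attn(h, h, h, attn_mask=mask)
        return self.pitch_head(h), self.dur_head(h), self.vel_head(h)
```

Predicting all three features from a shared attended state is one simple way to keep interrelated elements, such as a note's value and its duration, from being modeled independently.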
Reflections on Symbolic Music Generation
The paper also critically evaluates the limits of symbolic music generation as a long-term goal, acknowledging challenges such as:
- Instrument Categorization: With the rise of digital music production, physical instrument identity has become increasingly ambiguous, complicating MIDI's role in faithfully encoding performance nuances.
- Control Event Complexity: The growing complexity of performance controls challenges MIDI's capacity to accurately capture and replay these nuances.
- Diversity in Instrumentation: The stylistic diversity of modern music sits awkwardly with the predefined instrument and event sets that symbolic generation relies on, often demanding dynamic adaptation beyond what traditional models provide.
Future Directions
This research serves as a foundation for further exploration into achieving enhanced realism and expressivity in symbolic music generation. It suggests future developments may include more adaptive systems capable of dynamically interpreting and generating music across genres and production styles, potentially integrating advances in real-time machine learning with exploratory approaches to feature representation.
The presented model, though its results are preliminary due to limited training, lays the groundwork for more refined and comprehensive systems, directly addressing the criticisms of symbolic music generation and moving toward a unified view of symbolic and audio-based methodologies in the AIM field.