MIDI-VALLE: Expressive Piano Synthesis
- MIDI-VALLE is a neural codec language model that transforms expressive MIDI performance sequences into realistic piano audio, mapping performance attributes to audio codec tokens with paired autoregressive (AR) and non-autoregressive (NAR) transformer decoders.
- It employs discrete tokenization schemes for both MIDI (via Octuple encoding) and audio (via Piano-Encodec with residual vector quantization) to capture fine expressive details.
- Robust training on the diverse ATEPP dataset and superior evaluation metrics underscore its advancement over traditional MIDI-to-audio synthesis approaches.
MIDI-VALLE is a neural codec language model for expressive piano performance synthesis that transforms MIDI performance sequences into highly realistic audio by leveraging discrete token-based representations and neural codec modeling. Framed as an advancement over traditional MIDI-to-audio synthesis approaches, MIDI-VALLE adapts and extends the VALLE framework (originally designed for zero-shot personalized text-to-speech synthesis) to the music domain, conditioning synthesis on both reference audio and its aligned MIDI performance. The model achieves a robust mapping between musical intent and rendered acoustic output, addressing longstanding challenges in generalizing across diverse MIDI sources, performance styles, and acoustic environments (Tang et al., 11 Jul 2025).
1. Architectural Design and Conditioning Paradigms
MIDI-VALLE modifies the standard two-stage performance synthesis paradigm by directly modeling the mapping from expressive MIDI tokens to audio codec tokens, conditioned on reference audio and associated MIDI. This design enables the model to capture not only the symbolic content of the MIDI performance but also interpretative and acoustic nuances present in a reference excerpt.
The architecture features two primary decoders:
- Autoregressive (AR) Transformer Decoder: Processes MIDI input along with an acoustic prompt (a three-second segment from the target performance with corresponding MIDI), predicting primary audio codec tokens causally, token-by-token.
- Non-Autoregressive (NAR) Transformer Decoder: Conditions on the acoustic prompt and completes the remaining three quantizer token streams for finer acoustic details.
Both decoders share a backbone of 12 attention layers, each with 16 heads and a hidden dimension of 1024. During training, the objective maximizes the conditional likelihood P(A | M, Ã), where A represents the full 2D codec token array, M is the MIDI token sequence, and Ã is the prompt audio’s codec matrix. The AR decoder establishes the coarse structure of the output, while the NAR decoder refines timbral and temporal detail.
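Read as a VALLE-style factorization, this likelihood splits across the two decoders. The sketch below assumes the AR decoder models the first quantizer level causally and the NAR decoder models the remaining three levels, consistent with the description above; the paper's exact factorization may differ in detail:

```latex
% VALLE-style factorization assumed from the AR/NAR split described above.
% A = full codec token array, M = MIDI tokens, \tilde{A} = prompt codec matrix,
% a_t^{(j)} = codec token at frame t, quantizer level j.
P(A \mid M, \tilde{A}) \;=\;
\underbrace{\prod_{t} P\big(a_t^{(1)} \,\big|\, a_{<t}^{(1)},\, M,\, \tilde{A}\big)}_{\text{AR decoder, level 1}}
\cdot
\underbrace{\prod_{j=2}^{4} P\big(a^{(j)} \,\big|\, a^{(<j)},\, M,\, \tilde{A}\big)}_{\text{NAR decoder, levels 2--4}}
```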
A high-level process is as follows:
- Concatenate the reference audio codec tokens and MIDI tokens as model input.
- Use the AR decoder to predict the main quantizer tokens, generating a coarse but expressive audio reconstruction.
- Invoke the NAR decoder to fill in higher-resolution audio details, guided by information from the acoustic prompt.
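A minimal sketch of this two-stage generation loop is shown below. The model handles (`ar_decoder`, `nar_decoder`, `piano_encodec`) and the end-of-sequence convention are illustrative placeholders, not the released MIDI-VALLE API:

```python
import torch

def synthesise(ar_decoder, nar_decoder, piano_encodec,
               midi_tokens, prompt_audio_codes, prompt_midi_tokens,
               max_frames=1500):
    """Two-stage MIDI-to-audio generation following the process above.

    prompt_audio_codes is assumed to be a (4, T_prompt) token array from the
    3-second acoustic prompt; all model handles are illustrative stand-ins.
    """
    # 1. Concatenate prompt MIDI and target MIDI as the symbolic condition,
    #    and use the prompt's first-quantizer tokens as the acoustic prefix.
    midi_condition = torch.cat([prompt_midi_tokens, midi_tokens], dim=-1)
    level1 = prompt_audio_codes[0].tolist()

    # 2. AR decoder: predict level-1 codec tokens causally, token by token.
    for _ in range(max_frames):
        next_tok = ar_decoder(midi_condition, torch.tensor(level1))
        if int(next_tok) == ar_decoder.eos_id:   # assumed end-of-sequence id
            break
        level1.append(int(next_tok))

    # 3. NAR decoder: fill in quantizer levels 2-4, conditioned on the
    #    acoustic prompt and the coarse level-1 structure.
    coarse = torch.tensor(level1).unsqueeze(0)                    # (1, T)
    fine = nar_decoder(midi_condition, prompt_audio_codes, coarse)  # (3, T)

    # 4. Decode the full 4-level token array back to waveform audio.
    codes = torch.cat([coarse, fine], dim=0)                      # (4, T)
    return piano_encodec.decode(codes)
```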
This design facilitates robust transfer and adaptation to various input and recording conditions, capturing intricate expressive and stylistic features.
2. Data Representation: Discrete Tokenisation Methods
MIDI-VALLE employs discrete tokenisation schemes for both input (MIDI) and output (audio), enhancing alignment and expressivity relative to previous piano roll-based systems.
MIDI Encoding: Octuple Method
The Octuple MIDI encoding tokenizes each note of the performance into a tuple of features, including pitch, velocity, note duration, inter-onset interval (IOI), position, and bar markers. Each feature draws on a separate vocabulary, producing a compact per-note representation. By explicitly encoding timing and dynamics, this method preserves microtiming and expressive nuances that are lost in grid-based (piano roll) discretizations.
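As an illustration, a single note under an Octuple-style encoding can be viewed as a tuple of per-feature token indices. The field names, value ranges, and example values below are assumptions for illustration, not the exact vocabularies used by MIDI-VALLE:

```python
from dataclasses import dataclass

@dataclass
class OctupleNoteToken:
    """One note as a tuple of per-feature token indices (illustrative fields)."""
    pitch: int      # MIDI pitch, e.g. 60 for middle C
    velocity: int   # quantised key velocity (dynamics)
    duration: int   # quantised note duration
    ioi: int        # inter-onset interval to the previous note (microtiming)
    position: int   # position within the current bar
    bar: int        # bar index

# Example: a mezzo-forte middle C near the start of bar 4.
note = OctupleNoteToken(pitch=60, velocity=72, duration=24, ioi=6, position=2, bar=4)
# Each field is looked up in its own embedding table, and the embeddings are
# combined into a single input vector for the decoders.
```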
Audio Encoding: Piano-Encodec via Residual Vector Quantisation (RVQ)
For audio, MIDI-VALLE adopts the Piano-Encodec codec, a four-level residual vector quantiser. Each RVQ level uses a distinct codebook; tokens from higher levels progressively add detail to the approximation established by lower levels. This hierarchical tokenization captures both the coarse structure and fine timbral subtleties of piano performances, enabling accurate and flexible synthesis while maintaining manageable vocabulary sizes and facilitating alignment with the symbolic representation.
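The residual structure can be sketched as follows; the random codebooks and nearest-neighbour search are schematic stand-ins, not the trained Piano-Encodec quantisers:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantisation of one latent frame.

    frame:      (d,) latent vector from the codec encoder.
    codebooks:  list of 4 arrays, each (codebook_size, d), standing in for the
                four Piano-Encodec quantiser levels.
    Each level quantises the residual left by the levels below it, so later
    levels add progressively finer detail. Returns the 4 token indices.
    """
    residual = frame.copy()
    tokens = []
    for codebook in codebooks:
        # Nearest codeword to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]
    return tokens

# Toy usage: 4 random codebooks of 1024 entries over a 128-dim latent space.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 128)) for _ in range(4)]
tokens = rvq_encode(rng.standard_normal(128), codebooks)
```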
This discrete paradigm enables joint and consistent modeling of MIDI and audio, increasing robustness to input variations and improving synthesis fidelity.
3. Training Corpus: Dataset Scope and Diversity
The model is trained on the ATEPP dataset, which provides the following:
- Scale: Approximately 700 hours of audio from 1,099 albums, encompassing over 1,500 classical compositions and 46 pianists.
- Diversity: A broad range of styles, performance interpretations, and recording conditions to ensure generalization.
- Data Preparation: All recordings are segmented into 15–20 second clips, with precise temporal alignment between audio and MIDI performance transcriptions.
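A minimal sketch of this preparation step is given below, assuming note-level MIDI-audio alignment is already available. Only the 15–20 second window comes from the description above; the onset-based splitting heuristic and field layout are illustrative assumptions:

```python
def segment_performance(notes, min_len=15.0):
    """Split one aligned performance into clips of roughly 15-20 seconds.

    notes: list of (onset_sec, offset_sec, pitch, velocity) tuples already
           time-aligned to the audio. A new clip starts at the first note
           onset after the current clip reaches min_len seconds, so the same
           boundaries can be used to cut both the MIDI and the audio.
    """
    clips, clip_start, clip_notes = [], None, []
    for note in sorted(notes, key=lambda n: n[0]):
        onset = note[0]
        if clip_start is None:
            clip_start = onset
        if clip_notes and onset - clip_start >= min_len:
            clips.append((clip_start, onset, clip_notes))
            clip_start, clip_notes = onset, []
        clip_notes.append(note)
    if clip_notes:
        clips.append((clip_start, max(n[1] for n in clip_notes), clip_notes))
    return clips
```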
This varied dataset enables MIDI-VALLE to learn explicit mappings between complex performance gestures in MIDI and their acoustical realization, contributing to its robust cross-domain generalization.
4. Evaluation: Metrics and Comparative Performance
Evaluation of MIDI-VALLE combines objective computational metrics and large-scale perceptual testing:
- Objective Metric: Fréchet Audio Distance (FAD), a recognized quantitative measure of overall audio similarity and distributional alignment, is computed on the ATEPP and Maestro datasets. MIDI-VALLE achieves over 75% lower FAD than a state-of-the-art baseline (M2A), indicating a substantial advance in modeling realistic audio output from MIDI.
- Additional Metrics: Spectrogram and chroma distortions, providing measures of fidelity in spectral and harmonic content.
- Listening Test: In subjective evaluations, participants strongly preferred MIDI-VALLE over the baseline (202 votes vs. 58), citing greater realism and expressivity in synthesized performance.
These results demonstrate both statistically and perceptually that the system matches ground truth performance properties, including timbre, phrasing, and expressive variation, more closely than previous models.
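For reference, FAD is the Fréchet distance between Gaussians fitted to embedding statistics of real and generated audio. The sketch below computes that distance; which pretrained embedding model is used (e.g. VGGish) is not restated here and is left as an input:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(real_emb, gen_emb):
    """FAD between two sets of audio embeddings, each of shape (n_clips, dim).

    Embeddings are assumed to come from a pretrained audio model applied to
    the real and synthesized clips.
    """
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Matrix square root of the covariance product (may carry tiny imaginary parts).
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```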
5. Generalization Properties and Robustness
MIDI-VALLE exhibits improved robustness across diverse musical and acoustic conditions:
- The use of discrete tokens for both domains facilitates a high-resolution mapping that captures subtle timing and articulation, rather than imposing fixed quantization or losing fine temporal information.
- Conditioning on a short audio prompt (reference performance) enables the system to adapt zero-shot to unseen recording environments, timbres, or expressive styles.
- Training on a diverse corpus leads to effective generalization not only across transcribed classical performance MIDI but also to other styles and conditions; however, limitations remain for non-classical genres, such as jazz, which may require further data.
This suggests that MIDI-VALLE’s architecture transfers well across performers and acoustic conditions, whereas earlier systems based on piano rolls or single-codec modeling fail to account for such nuances.
6. Applications and Future Directions
MIDI-VALLE supports a spectrum of practical and research applications:
- Music Production: Synthesis of realistic, expressive piano audio directly from MIDI allows composers, arrangers, and producers to render performance-quality audio from symbolic scores.
- Expressive Performance Synthesis: The explicit capture of timing, dynamics, and articulation enables research into human expressiveness as well as novel interactive performance tools.
- Integration into Synthesis Pipelines: MIDI-VALLE functions as the synthesis stage in multi-model music generation pipelines, converting performance MIDI produced by expressive performance rendering (EPR) models into rendered audio.
- Interactive Systems and Sound Design: The adaptability to variable style and acoustic context suggests potential use in live simulation environments or for creative sound design.
- Research into Neural Audio Synthesis: The joint modeling of discrete MIDI and audio tokens offers a blueprint for future systems targeting domains requiring alignment of symbolic and audio information.
A plausible implication is that MIDI-VALLE’s discrete, codec-driven approach will stimulate the development of more general neural audio synthesis models capable of handling a broader set of instruments, styles, and expressive gestures with fine granularity.
7. Comparative and Contextual Analysis
Relative to prior work—including TTS-based synthesis using piano rolls and dense audio codes—MIDI-VALLE advances the field by:
- Maintaining consistency and high fidelity through joint discrete tokenization.
- Leveraging acoustic prompts for robust style and timbral adaptation.
- Demonstrating superior performance in both automated and human evaluation.
Whereas earlier systems struggled to maintain nuance or generalize beyond training data, MIDI-VALLE achieves both, establishing a new state of the art in expressive piano performance synthesis from MIDI. This positions it as a critical component in next-generation music generation, understanding, and production systems.