
ATEPP Dataset: Expressive Piano Synthesis

Updated 14 September 2025
  • ATEPP is an extensive dataset of aligned score MIDI, performance MIDI, and audio recordings from classical piano works that supports expressive synthesis research.
  • It employs a refined tokenisation scheme capturing six musical dimensions to facilitate Transformer-based modeling and accurate performance rendering.
  • The dataset enhances model generalization through rigorous objective and subjective evaluations, influencing state-of-the-art systems like MIDI-VALLE.

The ATEPP dataset is an extensive corpus of professionally transcribed classical piano performances, designed to facilitate research and development in expressive piano performance synthesis. ATEPP enables both the learning and evaluation of expressive modeling, MIDI-to-audio synthesis, and system generalization by providing aligned score MIDI, performance MIDI, and audio recordings spanning a broad range of performers, composers, and recording conditions.

1. Dataset Structure and Composition

ATEPP comprises 8,825 recordings sourced from 1,099 albums, together representing more than 700 hours of classical piano music. The curation prioritizes fidelity and expressiveness, emphasizing professionally recorded performances rather than amateur or synthesized data.

The dataset is subdivided into distinct experimental subsets for system design and benchmarking:

Subset | Contents | Primary Usage
A | 371 performances (75 albums), diverse acoustic conditions | Fine-tuning the neural MIDI synthesizer (M2A)
B | Mainly Beethoven sonatas and one Mozart piece | Training the Expressive Performance Rendering model (M2M); baseline modeling

Each audio file is paired with precise performance MIDI and corresponding score MIDI, forming aligned triplets necessary for controlled learning and generation.

Note-wise alignment is achieved via an algorithm inspired by Nakamura et al., resulting in triplets that enable direct mapping from musical symbols (score MIDI) to expressive human performances (performance MIDI) and further to acoustically faithful audio.
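
To make these aligned triplets concrete, the sketch below shows one plausible in-memory representation of a score/performance/audio triple with note-level links; the field names and structure are illustrative assumptions rather than ATEPP's actual file schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AlignedNote:
    """One score note matched to its realisation in the performance MIDI."""
    score_pitch: int                 # MIDI pitch from the score
    score_onset_beats: float         # notated onset position, in beats
    perf_pitch: Optional[int]        # matched performed pitch (None if omitted)
    perf_onset_sec: Optional[float]  # performed onset time, in seconds
    perf_velocity: Optional[int]     # performed MIDI velocity (1-127)

@dataclass
class AlignedTriplet:
    """A score MIDI / performance MIDI / audio recording triple with note-level links."""
    score_midi_path: str
    performance_midi_path: str
    audio_path: str
    notes: List[AlignedNote]
```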

2. Data Processing, Segmentation, and Tokenisation

ATEPP leverages a modified Octuple tokenisation scheme to encode six critical dimensions of musical expressiveness: pitch, velocity, duration, inter-onset interval (IOI), bar, and position. Key details include:

  • Score MIDI velocity is fixed (value: 60) to neutralize any expressiveness present in the source encoding.
  • The IOI vocabulary is reduced by discretizing time at a resolution of 96 ticks per beat.
  • MIDI sequences are split into 256-note segments for efficient batch processing and model input, with segments concatenated in reconstruction to preserve global temporal structure.

This tokenisation makes the data directly usable by Transformer-based architectures and supports probabilistic sequence modeling of expressive deviations.
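
As an illustration of the scheme described above, the following sketch encodes notes as six-dimensional tokens and splits them into 256-note segments. The bar length, token ordering, and quantisation details are simplifying assumptions and will differ from the exact Octuple variant used with ATEPP.

```python
from dataclasses import dataclass
from typing import List, Tuple

TICKS_PER_BEAT = 96     # IOI/duration resolution (each beat divided into 96 ticks)
SCORE_VELOCITY = 60     # fixed velocity used for score MIDI tokens
SEGMENT_LENGTH = 256    # notes per training segment

@dataclass
class Note:
    pitch: int           # MIDI pitch, 0-127
    velocity: int        # MIDI velocity, 1-127
    onset_beats: float   # onset position in beats
    duration_beats: float

def tokenize(notes: List[Note], is_score: bool) -> List[Tuple[int, int, int, int, int, int]]:
    """Encode each note as a 6-tuple: (pitch, velocity, duration, IOI, bar, position).

    Times are quantised to TICKS_PER_BEAT; bars are assumed to span 4 beats,
    which is a simplification of the real metrical handling.
    """
    tokens = []
    prev_onset = 0.0
    for note in sorted(notes, key=lambda n: n.onset_beats):
        velocity = SCORE_VELOCITY if is_score else note.velocity
        duration = round(note.duration_beats * TICKS_PER_BEAT)
        ioi = round((note.onset_beats - prev_onset) * TICKS_PER_BEAT)
        bar = int(note.onset_beats // 4)
        position = round((note.onset_beats % 4) * TICKS_PER_BEAT)
        tokens.append((note.pitch, velocity, duration, ioi, bar, position))
        prev_onset = note.onset_beats
    return tokens

def segment(tokens, length=SEGMENT_LENGTH):
    """Split a token sequence into fixed-length segments for batched training."""
    return [tokens[i:i + length] for i in range(0, len(tokens), length)]
```

Fixing the score velocity at 60 ensures that dynamics enter the model only through the performance tokens.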

3. Experimental Use Cases: Training and Fine-tuning

ATEPP supports two classes of model development:

3.1 Expressive Performance Rendering (M2M Model)

The Transformer-based M2M model is trained on subset B to learn a mapping from mechanical score MIDI to performance MIDI that reflects human expressiveness. The model includes performer identity embeddings that are summed (rather than concatenated) with the hidden states, paralleling speaker-embedding strategies from text-to-speech synthesis to promote stylistic variety across performers.
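
The summation of performer embeddings with hidden states can be sketched in PyTorch as follows; the model dimension and surrounding architecture are assumptions, not the published M2M configuration.

```python
import torch
import torch.nn as nn

class PerformerConditioning(nn.Module):
    """Adds a learned performer embedding to every token's hidden state (sum, not concat)."""

    def __init__(self, num_performers: int, d_model: int = 512):
        super().__init__()
        self.performer_embedding = nn.Embedding(num_performers, d_model)

    def forward(self, hidden: torch.Tensor, performer_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); performer_id: (batch,)
        emb = self.performer_embedding(performer_id)   # (batch, d_model)
        return hidden + emb.unsqueeze(1)               # broadcast over the sequence
```

Summing rather than concatenating keeps the hidden dimension unchanged, mirroring how speaker embeddings are commonly injected in TTS systems.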

Duration prediction targets precise note lengths rather than deviations, simplifying post-generation processing. Training employs probabilistic losses, and generation uses temperature scaling and nucleus sampling to enhance output diversity. Objective evaluation uses Kullback-Leibler divergence (KLD), Pearson correlation, and Dynamic Time Warping distance (DTWD).
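
Temperature scaling and nucleus (top-p) sampling are standard decoding techniques; a minimal NumPy sketch is shown below, with the temperature and top-p values chosen purely for illustration.

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, temperature: float = 1.0, top_p: float = 0.9) -> int:
    """Sample one token id using temperature scaling and nucleus (top-p) filtering."""
    # Temperature scaling: higher temperature flattens the distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    kept = order[:cutoff]

    kept_probs = probs[kept] / probs[kept].sum()
    return int(np.random.choice(kept, p=kept_probs))
```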

3.2 Neural MIDI Synthesis (M2A Model)

Originally pretrained on Maestro, the M2A model is fine-tuned on subset A to accommodate the broader acoustic diversity of ATEPP. Performance MIDI is converted into spectrograms and then into audio waveforms using HiFi-GAN. Audio and MIDI pairs are segmented into 9.6-second blocks, and the resulting blocks are concatenated, with join points selected via cross-correlation to ensure smooth transitions.
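
One plausible way to realise the cross-correlation join selection, assuming consecutive blocks are synthesized with a small overlap, is to correlate the tail of the previous block with the head of the next and splice at the best-matching offset, as sketched below; the overlap length and splicing rule are assumptions, not the paper's exact procedure.

```python
import numpy as np

def best_join_offset(block_a: np.ndarray, block_b: np.ndarray, overlap: int) -> int:
    """Return the lag (in samples) of block_b's head that best matches block_a's tail."""
    tail = block_a[-overlap:]
    head = block_b[: 2 * overlap]
    # Full cross-correlation over the candidate region; the peak marks maximum similarity.
    corr = np.correlate(head, tail, mode="valid")
    return int(np.argmax(corr))

def concatenate_blocks(blocks, overlap: int = 4800) -> np.ndarray:
    """Concatenate audio blocks, trimming each successor at its best-matching offset."""
    out = blocks[0]
    for block in blocks[1:]:
        lag = best_join_offset(out, block, overlap)
        out = np.concatenate([out, block[lag:]])
    return out
```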

Audio fidelity is evaluated with spectrogram mean squared error and chroma-based metrics. Fine-tuning yields a significant reduction in spectrogram error (from 0.318 ± 0.013 to 0.262 ± 0.009).

4. Evaluation Methodology: Objective and Subjective Metrics

Two main evaluation paradigms are employed:

Objective Metrics:

  • For M2M: KLD, correlation, and DTWD are calculated at both segment and full-performance levels to assess the preservation of expressive features.
  • For M2A: Chroma and spectrogram errors are compared pre- and post-fine-tuning to quantify fidelity gains.
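
A minimal sketch of the M2M objective metrics applied to, for example, note-velocity sequences is given below; the exact features, binning, and alignment used in the ATEPP evaluation are assumptions here.

```python
import numpy as np
from scipy.stats import entropy, pearsonr

def kld(pred: np.ndarray, ref: np.ndarray, bins: int = 32) -> float:
    """KL divergence between histograms of predicted and reference feature sequences."""
    lo, hi = min(pred.min(), ref.min()), max(pred.max(), ref.max())
    p, _ = np.histogram(pred, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(ref, bins=bins, range=(lo, hi), density=True)
    eps = 1e-8
    return float(entropy(p + eps, q + eps))

def correlation(pred: np.ndarray, ref: np.ndarray) -> float:
    """Pearson correlation between aligned predicted and reference feature sequences."""
    return float(pearsonr(pred, ref)[0])

def dtwd(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dynamic Time Warping distance between two 1-D feature sequences."""
    n, m = len(pred), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(pred[i - 1] - ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```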

Subjective Listening Tests:

  • Expressiveness and overall audio quality are assessed using Mean Opinion Scores (MOS) across reference, baseline, and synthesized outputs.
  • The integrated system (M2M + fine-tuned M2A) restores expressiveness relative to mechanical MIDI, though certain acoustic nuances (e.g., pedalling, key sound) still lag behind sophisticated references (e.g., Pianoteq).

5. Impact on Model Generalization and System Design

The diversity and alignment precision of ATEPP are central to state-of-the-art system generalization. For example, training on ATEPP allows neural synthesis models to adapt robustly to unseen MIDI sources, performers, compositional styles, and variable recording environments (Tang et al., 11 Jul 2025). The presence of ambient acoustic detail—captured by including recordings from both concert halls and studios—facilitates realistic audio reproduction.

Models such as MIDI-VALLE leverage ATEPP to dramatically improve generalization and expressivity, attaining over 75% lower Fréchet Audio Distance (FAD) compared to prior baselines on ATEPP and Maestro. Subjective evaluations indicate strong listener preference for MIDI-VALLE outputs.

6. Contributions and Ongoing Challenges

ATEPP advances expressive piano synthesis in several ways:

  • Enables detailed learning of performance parameters—velocity, timing, dynamics—across diverse stylistic and acoustic contexts.
  • Supports fine-tuning and adaptation of neural synthesis models, bridging the gap between symbolic control and acoustic realism.
  • Facilitates reliable comparison of modeling approaches through aligned evaluation splits and standardized metrics.

Challenges remain, such as accurately capturing pedalling effects, key-specific acoustics, and full physical realism at the level of leading synthesis engines (e.g., Pianoteq). These motivate ongoing research into higher-resolution tokenisation, multimodal learning, and unsupervised expressiveness modeling.

7. Significance and Future Directions

The ATEPP dataset represents a pivotal resource in computational musicology and AI-driven expressive audio synthesis. Its alignment of score, performance, and audio data enables dissecting and reconstructing human expressiveness. As cited in recent work (Tang et al., 17 Jan 2025, Tang et al., 11 Jul 2025), expanding ATEPP with additional genres, instruments, and multimodal (score–audio–video) annotations is likely to further enhance the fidelity and versatility of automated performance rendering systems.

A plausible implication is that the architecture and methodologies established with ATEPP will inform cross-domain generative modeling—enabling expressiveness transfer, style adaptation, and invariant synthesis across broader musical and acoustic domains.
