
Piano-Encodec: Neural Codecs for Piano

Updated 14 September 2025
  • Piano-Encodec is a neural audio codec framework optimized for encoding piano recordings using RVQ and transformer-based models to capture nuanced performance details.
  • It employs a streaming convolutional encoder–decoder paired with residual vector quantization to reduce bitrates while preserving harmonic and transient audio characteristics.
  • The system supports expressive score-to-audio synthesis and accurate transcription through bidirectional modeling and integration of discrete MIDI tokens.

Piano-Encodec is a term that encompasses neural audio codec architectures and discrete token representations particularly suited for the efficient encoding and decoding of piano music in both synthesis and transcription scenarios. Originally built upon the EnCodec neural codec framework (Défossez et al., 2022), Piano-Encodec leverages residual vector quantization (RVQ) and transformer-based entropy modeling to capture the complex timbral, dynamic, and temporal characteristics unique to piano recordings. Recent advances, such as MIDI-VALLE (Tang et al., 11 Jul 2025), utilize discrete codec tokens for high-fidelity, expressive performance synthesis, facilitating robust score-to-audio translation, conditioning, and adaptation across diverse musical styles and recording environments.

1. Neural Codec Architecture for Piano Music

The foundational codec, EnCodec (Défossez et al., 2022), operates as a fully streaming convolutional encoder–decoder pipeline. Piano-Encodec inherits this structure, optimized for the demands of piano music:

  • Encoder: A sequence of strided 1D convolutions increases the temporal receptive field and channel count, followed by stacked LSTM layers for long-range sequence modeling. The final encoder maps audio $x$ to a latent sequence $z \in \mathbb{R}^{B \times D \times T}$.
  • RVQ Quantization: Latent sequences are discretized in multiple stages, yielding code indices of shape $[B, N_q, T]$, effectively capturing subtle nuances while reducing bitrate.
  • Decoder: Reconstructs the waveform from quantized tokens, operating in real time.
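The encoder–quantizer–decoder pipeline above can be sketched in miniature. The NumPy snippet below is an illustrative sketch, not the EnCodec implementation: codebooks here are random rather than learned, and only the core RVQ mechanic is shown — each stage quantizes the residual left by the previous stage, and decoding sums the selected codewords.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage against its own codebook.

    z         : (D,) latent vector
    codebooks : list of (K, D) arrays, one per quantization stage
    returns   : list of code indices, one per stage
    """
    residual = z.astype(float).copy()
    indices = []
    for cb in codebooks:
        # nearest codeword to the current residual (L2 distance)
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        residual = residual - cb[k]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected codewords across stages to reconstruct z."""
    return sum(cb[k] for k, cb in zip(indices, codebooks))

# Toy setup: 8-dim latents, 16-entry codebooks, 4 quantization stages
rng = np.random.default_rng(0)
D, K, n_stages = 8, 16, 4
codebooks = [rng.normal(size=(K, D)) for _ in range(n_stages)]
z = rng.normal(size=D)

codes = rvq_encode(z, codebooks)
z_hat = rvq_decode(codes, codebooks)
```

In a trained codec the codebooks are learned jointly with the encoder and decoder, so each successive stage meaningfully reduces reconstruction error; the random codebooks here only demonstrate the data flow.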

Objective Function

The model combines multiple signal-matching losses:

$$L_G = \lambda_t\,\ell_t(x,\hat{x}) + \lambda_f\,\ell_f(x,\hat{x}) + \lambda_g\,\ell_g(\hat{x}) + \lambda_{\text{feat}}\,\ell_{\text{feat}}(x,\hat{x}) + \lambda_w\,\ell_w$$

where the multi-scale frequency-domain loss

$$\ell_f(x,\hat{x}) = \frac{1}{|\alpha|\cdot|s|} \sum_{\alpha_i \in \alpha} \sum_{i \in e} \left[ \lVert S_i(x) - S_i(\hat{x}) \rVert_1 + \alpha_i \lVert S_i(x) - S_i(\hat{x}) \rVert_2 \right]$$

prioritizes preservation of harmonic and transient details—essential for piano sound.
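As a rough illustration of this loss, the snippet below computes a simplified multi-scale spectral distance in NumPy. It uses linear-magnitude STFTs rather than the mel spectrograms of the original paper, and the window sizes, hop, and weights are placeholder choices, not EnCodec's actual hyperparameters.

```python
import numpy as np

def stft_mag(x, win):
    """Magnitude spectrogram with a Hann window, hop = win // 4."""
    hop = win // 4
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def multiscale_spectral_loss(x, x_hat, scales=(64, 128, 256), alphas=(1.0, 1.0, 1.0)):
    """Simplified analogue of l_f: for each window scale, an L1 term plus
    an alpha-weighted L2 term between magnitude spectrograms, averaged
    over the set of scales."""
    total = 0.0
    for win, a in zip(scales, alphas):
        S, S_hat = stft_mag(x, win), stft_mag(x_hat, win)
        total += np.mean(np.abs(S - S_hat)) + a * np.sqrt(np.mean((S - S_hat) ** 2))
    return total / len(scales)

# Sanity check on synthetic signals: a perturbed copy scores worse
rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
x_noisy = x + 0.1 * rng.standard_normal(4096)
clean_loss = multiscale_spectral_loss(x, x)
noisy_loss = multiscale_spectral_loss(x, x_noisy)
```

Evaluating the distance at several window lengths is what lets the loss penalize both smeared transients (short windows) and detuned harmonics (long windows) at once.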

2. Discrete Token Representations: Audio and MIDI

MIDI-VALLE (Tang et al., 11 Jul 2025) demonstrates the efficiency of pairing Piano-Encodec audio tokens with discrete MIDI tokens via an extended Octuple MIDI tokenization. This tokenization captures pitch, velocity, duration, inter-onset interval, position, and bar information in a multi-channel array, aligned to codec audio tokens. The RVQ-based audio token matrix $C_{T \times 4}$ enables dual conditioning and allows accurate timbral and expressive score rendering.
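A minimal sketch of this kind of multi-field note tokenization follows; the field names, the 16-step bar grid, and the tuple layout are illustrative choices, not the exact MIDI-VALLE vocabulary.

```python
from dataclasses import dataclass

@dataclass
class NoteToken:
    # One multi-channel token per note event (fields mirror the
    # pitch/velocity/duration/IOI/position/bar description above).
    pitch: int     # MIDI pitch, 0-127
    velocity: int  # MIDI velocity, 0-127
    duration: int  # note length in time-grid steps
    ioi: int       # inter-onset interval to the previous note, in steps
    position: int  # onset position within its bar, in steps
    bar: int       # bar index

def tokenize(notes, steps_per_bar=16):
    """notes: list of (onset_step, pitch, velocity, duration_steps),
    sorted by onset. Returns one NoteToken per note."""
    tokens, prev_onset = [], 0
    for onset, pitch, vel, dur in notes:
        tokens.append(NoteToken(
            pitch=pitch, velocity=vel, duration=dur,
            ioi=onset - prev_onset,
            position=onset % steps_per_bar,
            bar=onset // steps_per_bar,
        ))
        prev_onset = onset
    return tokens

# Two notes: one at the start, one four steps into the second bar
toks = tokenize([(0, 60, 80, 4), (20, 64, 70, 8)])
```

Encoding position and bar explicitly, rather than relying on cumulative time alone, is what keeps the symbolic stream alignable with the fixed-rate codec token grid.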

Advantages of Discrete Tokenization

  • Temporal Consistency: Discrete time steps enable precise alignment between score and audio features.
  • Robust Generalization: Multi-level discrete representation is more resilient to style, interpretation, and recording condition variability than piano-roll approaches.
  • Bidirectional Modeling: Facilitates both score-to-audio and audio-to-symbolic (transcription) tasks.

3. Transformer-Based Entropy and Conditional Modeling

Lightweight transformer models are employed for entropy coding and conditional synthesis:

  • Compression: Transformers predict conditional probabilities of codebook indices for arithmetic coding, reducing bitrates by up to 40% while supporting real-time operation (Défossez et al., 2022).
  • Conditional Generation: In MIDI-VALLE (Tang et al., 11 Jul 2025), a two-stage transformer decoder (autoregressive and non-autoregressive) is conditioned both on MIDI tokens and an acoustic prompt (a segment from reference audio), enabling style and phrasing transfer.
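The bitrate saving from a predictive entropy model can be illustrated with the ideal arithmetic-coding cost of $-\log_2 p$ bits per symbol: the sharper the model's prediction of each codebook index, the fewer bits the index stream costs. The probabilities below are made up purely for illustration.

```python
import math

def expected_bits(probabilities):
    """Ideal arithmetic-coding cost: -log2(p) summed over the symbols
    actually emitted, where p is the model's predicted probability of
    the symbol that occurred."""
    return sum(-math.log2(p) for p in probabilities)

# 100 codebook indices from a 1024-entry codebook:
uniform = [1 / 1024] * 100  # no model: every index equally likely
learned = [0.30] * 100      # hypothetical model fairly sure of each index

bits_uniform = expected_bits(uniform)  # 10 bits per index
bits_learned = expected_bits(learned)  # ~1.74 bits per index
```

The uniform case costs exactly 10 bits per index, while the confident model pays only $-\log_2 0.30 \approx 1.74$ bits, which is the mechanism behind the reported bitrate reductions.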

Conditional modeling with acoustic prompts and discrete token sequences significantly improves synthesis quality and adapts to diverse inputs.

4. Performance Evaluation and Subjective Assessment

Empirical evaluation is provided using objective metrics and listening studies:

  • Fréchet Audio Distance (FAD): MIDI-VALLE achieves over 75% lower FAD compared to state-of-the-art baselines, demonstrating the effectiveness of Piano-Encodec's discrete token approach.
  • MUSHRA Scores: EnCodec-based models achieve scores exceeding 83 (out of 100) at low bitrates for music domains, surpassing Opus and Lyra-v2 (Défossez et al., 2022).
  • Subjective Listening Tests: In pairwise preference tests, MIDI-VALLE was preferred by a 202:58 ratio over baseline models for expressive performance synthesis.

These results validate both perceptual transparency and technical fidelity for piano-specific applications.

5. Applications in Expressive Performance Synthesis and Transcription

Piano-Encodec has catalyzed a new generation of expressive performance synthesis and robust transcription systems:

  • Synthesis: Models can generate realistic, nuanced audio performances from symbolic scores, adapting to stylistic context and recording environment.
  • Performance Rendering: Integration with EPR systems and downstream generative frameworks allows conditioning on arbitrary performance, instrument, and prompt information, providing flexibility for virtual instrument implementations.
  • Transcription: Discrete token codecs can be inverted for audio-to-score transcription, supporting robust event alignment and facilitating end-to-end symbolic extraction and editing workflows.

6. Future Research Trajectories

A plausible implication is the expansion of Piano-Encodec-inspired methods to:

  • Multi-instrument Domains: Extending discrete codec tokenization, entropy modeling, and acoustic conditioning to other instrument classes and ensembles.
  • Cross-modal Representation: Developing systems for joint modeling of symbolic, audio, and notation tokens, potentially bridging content creation, analysis, and digital editing pipelines.
  • Improved Style Transfer and Real-Time Applications: Leveraging efficient transformer architectures, RVQ, and prompt-based conditioning for interactive music synthesis, real-time score following, and performance cloning.

7. Implementation and Open-source Resources

Reference implementations and pretrained checkpoints for EnCodec and its derivatives are available at github.com/facebookresearch/encodec and WX-Wei/efficient-seq2seq-piano-trans, supporting further research, benchmarking, and application development in neural audio coding specific to piano music.


In summary, Piano-Encodec denotes the application of neural codec architectures and discrete token-based language modeling approaches tailored for high-fidelity piano music encoding, synthesis, and transcription. By leveraging streaming convolutional encoder–decoder pipelines, RVQ quantization, compact symbolic representations, and transformer-based conditional modeling, it achieves robust generalization, musicality, and real-time performance critical for modern music information retrieval and generation tasks.
