MuseTok: Discrete Music Representation

Updated 21 October 2025
  • MuseTok is a discrete representation learning framework for symbolic music that encodes bar-wise events into compact latent codes.
  • Its Transformer encoder–decoder architecture with residual vector quantization enables high-fidelity reconstruction and captures abstract musical concepts.
  • The framework supports controllable music generation and semantic tasks like melody extraction, chord recognition, and emotion analysis.

MuseTok is a discrete representation learning framework for symbolic music, employing residual vector quantization over bar-wise musical segments within a Transformer encoder–decoder architecture. This approach yields compact codes that facilitate both high-fidelity reconstruction and the capture of abstract musical concepts pertinent to generation and semantic understanding tasks.

1. Architectural Foundation and Mathematical Formulation

MuseTok processes symbolic music encoded as REMI+ sequences and partitions the input into $B$ bars, $X = \{X_1, X_2, \ldots, X_B\}$, with each $X_b$ capturing the music events within the $b$-th bar. Each bar is encoded to a latent vector $z_b$ using a Transformer encoder $P_e$: $z_b = P_e(X_b)$. These latent vectors are discretized via residual vector quantization (RQ), utilizing $D$ sequential codebooks:

  • For the first codebook: $c_b^1 = \arg\min_k \left\| z_b - e_k^1 \right\|$
  • For deeper codebooks, with the residual updated after each quantization step: $c_b^d = \arg\min_k \left\| z_b - e_k^d - \sum_{i=1}^{d-1} r_b^i \right\|$

The quantized embedding for each bar is composed additively from the selected codebook vectors $r_b^d = e^d_{c_b^d}$: $r_b = \sum_{d=1}^{D} r_b^d$.
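
For concreteness, here is a minimal sketch of the residual quantization loop described above, written in PyTorch. The codebook count, entry count, and dimensions are illustrative only and do not reflect MuseTok's actual configuration.

```python
import torch

def residual_quantize(z_b, codebooks):
    """Residually quantize one bar embedding z_b.

    codebooks: list of D tensors, each (K, dim), holding the entries e^d_k.
    Returns the code indices (c_b^1, ..., c_b^D) and the additive
    reconstruction r_b = sum_d r_b^d.
    """
    residual = z_b.clone()
    codes, r_b = [], torch.zeros_like(z_b)
    for e_d in codebooks:                          # depth d = 1..D
        dists = torch.cdist(residual[None], e_d)   # (1, K) L2 distances
        k = int(dists.argmin())                    # c_b^d: nearest entry
        r_bd = e_d[k]                              # selected vector r_b^d
        codes.append(k)
        r_b = r_b + r_bd
        residual = residual - r_bd                 # pass residual to next depth
    return codes, r_b

# Toy example: D = 2 codebooks with K = 8 entries over 4-dim latents.
torch.manual_seed(0)
cbs = [torch.randn(8, 4) for _ in range(2)]
codes, r_b = residual_quantize(torch.randn(4), cbs)
print(codes, r_b.shape)
```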

The aggregated bar embeddings are decoded autoregressively with a Transformer decoder $P_\delta$ to reconstruct the original REMI+ sequence, optimizing the negative log-likelihood:

$$L_{recon} = -\sum_t \log P_\delta\left(x_{t+1} \mid x_{\leq t};\, r_{\leq b}\right), \quad b = \mathrm{bar}(t)$$

A commitment loss $L_{commit}$, such as SimVQ with the rotation trick, further aligns the encoder representation to the codebook choices:

$$L_{commit} = \sum_{d=1}^{D} \sum_{b=1}^{B} \left\| z_b - \mathrm{sg}\!\left[\sum_{d'=1}^{d} r_b^{d'} W^d\right] \right\|_2^2$$

The final training objective is $L = L_{recon} + L_{commit}$.
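
As an illustration of how this objective is typically assembled, the following sketch computes both terms, realizing the stop-gradient $\mathrm{sg}[\cdot]$ with PyTorch's `detach()`. All shapes, names, and the handling of the SimVQ-style projection are assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def musetok_objective(logits, targets, z, partial_sums, W):
    """Sketch of L = L_recon + L_commit.

    logits:       (T, V)       next-token predictions from P_delta (teacher forcing)
    targets:      (T,)         ground-truth next REMI+ token ids x_{t+1}
    z:            (B, dim)     encoder outputs z_b
    partial_sums: (D, B, dim)  cumulative quantized sums up to each depth d
    W:            (D, dim, dim) per-depth projections (SimVQ-style assumption)
    """
    # L_recon: negative log-likelihood of the next REMI+ token.
    l_recon = F.cross_entropy(logits, targets, reduction="sum")

    # L_commit: pull z_b toward the frozen (stop-gradient) projected sums.
    l_commit = z.new_zeros(())
    for d in range(partial_sums.shape[0]):
        target = (partial_sums[d] @ W[d]).detach()  # sg[...] via detach
        l_commit = l_commit + ((z - target) ** 2).sum()
    return l_recon + l_commit

# Toy shapes: 32 tokens, vocab 100, 8 bars, 16-dim latents, D = 4.
T, V, B, dim, D = 32, 100, 8, 16, 4
loss = musetok_objective(torch.randn(T, V), torch.randint(V, (T,)),
                         torch.randn(B, dim, requires_grad=True),
                         torch.randn(D, B, dim), torch.randn(D, dim, dim))
print(loss)
```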

2. Music Generation and Semantic Understanding Workflows

Generation

MuseTok employs a two-stage pipeline:

  1. A Transformer decoder $P_\gamma$ predicts sequences of MuseTok codes (discrete tokens) from an initial primer, generating $\{c_1^1, c_1^2, \ldots, c_B^D\}$.
  2. The pretrained decoder $P_\delta$ decodes these codes to REMI+ events, producing full symbolic music. This separation allows high-level structural planning and supports long-context generation by leveraging the compactness of the discrete code representations (sketched schematically below).
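
A schematic of this two-stage loop, with hypothetical callables standing in for $P_\gamma$ and $P_\delta$ (neither name nor interface comes from the paper):

```python
import random

def generate(primer_codes, num_bars, D, sample_code, decode_codes):
    """Stage 1: extend the discrete code sequence autoregressively
    (the role of P_gamma). Stage 2: map codes to REMI+ events
    (the role of the pretrained P_delta)."""
    codes = list(primer_codes)
    while len(codes) < num_bars * D:      # D codes per bar
        codes.append(sample_code(codes))  # next code given the prefix
    return decode_codes(codes)

# Toy stand-ins so the sketch runs end to end.
events = generate(
    primer_codes=[3, 7],
    num_bars=4, D=2,
    sample_code=lambda prefix: random.randrange(8),
    decode_codes=lambda codes: [f"event_for_code_{c}" for c in codes],
)
print(events)
```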

Semantic Understanding

MuseTok codes serve as input features for downstream classifiers in:

  • Melody extraction: Classifies note/pitch events per bar as vocal melody, instrumental melody, or accompaniment, using $r_b$ as contextual input.
  • Chord recognition: Assigns chord labels per beat based on bar embeddings, enhancing harmony extraction from polyphonic music.
  • Emotion recognition: Aggregates $\{r_1, \ldots, r_B\}$ across a song to estimate high/low positiveness and activation, supporting affective analysis (a toy probe in this style is sketched after this list).
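
To make the downstream setup concrete, here is a hypothetical probe for the emotion task: it mean-pools a song's bar embeddings and maps them to four high/low positiveness-activation classes. The pooling choice, class count, and dimensions are assumptions for illustration, not the paper's classifier.

```python
import torch
import torch.nn as nn

class EmotionProbe(nn.Module):
    """Hypothetical head over MuseTok bar embeddings {r_1, ..., r_B}."""
    def __init__(self, dim, num_classes=4):  # 4 quadrants assumed
        super().__init__()
        self.head = nn.Linear(dim, num_classes)

    def forward(self, r):          # r: (num_bars, dim)
        pooled = r.mean(dim=0)     # aggregate across the whole song
        return self.head(pooled)   # emotion-class logits

probe = EmotionProbe(dim=128)
logits = probe(torch.randn(32, 128))  # a 32-bar song, 128-dim embeddings
print(logits.shape)                   # torch.Size([4])
```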

3. Performance Metrics and Comparison with Prior Work

MuseTok’s Large configuration (e.g., $D = 16$ codebooks) approaches or surpasses the upper bounds set by a 128-dimensional VAE in perplexity and reconstruction accuracy for complex polyphonic textures. On semantic understanding tasks, MuseTok outperforms prior models, such as MIDI-BERT, MusicBERT, and RNN-based baselines, in chord recognition and emotion classification. Melody extraction performance is mixed, suggesting further refinements are needed to better capture melodic detail.

4. Qualitative Analyses of MuseTok Codes

MuseTok’s discrete music codes reveal underlying musical concepts through several analyses:

  • Code usage frequency: The top-50 activated codes differ markedly among texture groups (monophonic, chorale, polyphonic). The first codebook is largely invariant across time signatures, while deeper codebooks differentiate them, indicating progressively finer granularity in the representation of musical structure.
  • Embedding similarity under transposition: Cosine similarity between code embeddings of original and pitch-shifted samples shows that the first codebook maintains >70% similarity across semitone shifts, evidencing invariance to absolute pitch alongside sensitivity to rhythmic and contour attributes. Deeper codebooks diverge, with similarity peaking at musically significant intervals (e.g., major thirds, perfect fourths), reflecting repetition and structural regularities learned without explicit supervision (a minimal version of this analysis is sketched after this list).
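
A minimal version of the transposition analysis might look like the following, where `embed_fn` is a stand-in for MuseTok's codebook-embedding lookup and the toy pitch-histogram embedding exists only so the sketch runs:

```python
import torch
import torch.nn.functional as F

def transposition_similarity(embed_fn, bars, shifts=range(1, 12)):
    """Mean cosine similarity between each bar's embedding and the
    embedding of its pitch-shifted copy, per semitone shift."""
    sims = {}
    for s in shifts:
        vals = [F.cosine_similarity(embed_fn(bar, 0), embed_fn(bar, s), dim=0).item()
                for bar in bars]
        sims[s] = sum(vals) / len(vals)
    return sims

# Toy embedding: a random projection of a pitch-class histogram.
torch.manual_seed(0)
proj = torch.randn(12, 16)
def toy_embed(bar, shift):
    hist = torch.zeros(12)
    for pitch in bar:
        hist[(pitch + shift) % 12] += 1.0
    return hist @ proj

print(transposition_similarity(toy_embed, [[60, 64, 67], [62, 65, 69]]))
```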

5. Implications for Research and Future Applications

MuseTok’s framework enables multiple advancements:

  • Adaptive tokenization: Evidence suggests different music styles may require varying quantization approaches, which could further optimize generative and analytical performance.
  • Enhanced controllability and retrieval: The compact, semantically rich codes facilitate controllable generation, efficient music retrieval, and robust symbolic music processing—key for large-scale datasets and interactive applications.
  • Cross-modal and semantic modeling: The discrete architecture is suited for integration with LLMs or cross-modal systems (e.g., text-to-music frameworks), leveraging the learned musical structure inherent to MuseTok’s quantization.
  • Potential applications: The method could underpin systems for efficient symbolic music generation, in-depth musicological analysis, and automated semantic annotation in digital music libraries.

6. Summary and Context

MuseTok advances symbolic music processing by employing residual quantization over bar segments, yielding discrete music codes that encode high-level structure, rhythm, harmony, and affect. Its quantized encoder–decoder pipeline achieves state-of-the-art performance on several music understanding tasks and reconstructs complex polyphonic textures with high fidelity. Analyses of its learned codebooks indicate substantial abstraction of musical concepts, opening future directions in adaptive tokenization, improved generative control, and integration with multimodal AI systems. These findings position MuseTok as a robust foundation for next-generation symbolic music generation and semantic analysis (Huang et al., 18 Oct 2025).
