Next-Token Prediction in Tokenized MIDI

Updated 19 September 2025
  • Next-token prediction on tokenized raw MIDI is a method that serializes musical events to enable autoregressive generation and transcription.
  • It leverages advanced tokenization, including temporal quantization and compound tokens, paired with Transformer architectures for accurate sequence prediction.
  • Training exploits teacher forcing with cross-entropy loss, while extensions like multi-token prediction and interval-based encoding address error accumulation and enhance global planning.

Next-token prediction on tokenized raw MIDI is a fundamental paradigm for symbolic music generation and modeling, directly adapting autoregressive sequence learning from NLP to the domain of symbolic music. In this approach, musical content is serialized as a sequence of discrete tokens, each encoding a MIDI event such as a note on/off, duration, velocity, or other metadata. The model, typically a neural language model such as a Transformer, is trained to predict the next token in the sequence given the preceding context, enabling applications such as music generation, continuation, and transcription. The tokenization scheme, the expressivity of the model architecture, the training objective, and the downstream application each play a critical role in the success, limitations, and practical implementation of this approach.

1. Tokenization of Raw MIDI for Next-Token Prediction

Tokenization is the essential preprocessing step for applying sequence models to raw MIDI data. Frameworks such as MidiTok offer a unified API for tokenizing MIDI files into flexible, discrete token sequences. The process involves:

  • Parsing MIDI Files: Extracting note on/off events, velocities, time signatures, tempo changes, instruments, and metadata.
  • Temporal Quantization: Transforming event timings from ticks to musically meaningful units (e.g., quantized steps), frequently via $t_q = \mathrm{round}(t_\mathrm{raw}/\Delta)\cdot\Delta$ with grid resolution $\Delta$.
  • Pitch and Velocity Discretization: Normalizing or binning these parameters to fit within a finite token vocabulary.
  • Event Linearization and Encoding: Arranging events into sequences, optionally interleaving multi-track data or encoding tracks separately. Each event is mapped to a unique token in a structured vocabulary (e.g., via REMI or MIDI-like schemes). Augmentation options include encoding additional musical factors (chords, bars, tempo changes) and data augmentation (pitch shifting, velocity scaling).
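
A minimal tokenization sketch using MidiTok's REMI tokenizer is shown below. The configuration fields and return types vary across MidiTok versions, and the quantization helper is a standalone restatement of the formula above rather than part of the MidiTok API:

```python
from pathlib import Path

from miditok import REMI, TokenizerConfig

# Configure the vocabulary: velocity bins plus optional extra token types.
config = TokenizerConfig(
    num_velocities=32,   # velocity discretization bins
    use_chords=True,     # add chord tokens to the vocabulary
    use_tempos=True,     # add tempo-change tokens
)
tokenizer = REMI(config)

# Tokenize a MIDI file; depending on the configuration this yields one
# token sequence per track or a single interleaved stream.
tokens = tokenizer(Path("example.mid"))

# Temporal quantization as in the formula above: snap a raw event time
# to the nearest multiple of the grid resolution delta.
def quantize(t_raw: float, delta: float) -> float:
    return round(t_raw / delta) * delta
```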

Advanced tokenization strategies extend beyond atomic event representations. NG-Midiformer, for example, applies unsupervised compoundation to merge frequently co-occurring event families into "word-like" compound tokens (UCW), reducing sequence lengths while capturing richer local semantics (Tian et al., 2023).

2. Model Architectures and Objectives

The next-token prediction paradigm is dominated by encoder-decoder and decoder-only transformer architectures:

  • Decoder-Only Models: Utilize autoregressive causal self-attention to fit $P(x_t \mid x_1, \ldots, x_{t-1})$, training via cross-entropy loss over the sequence (a minimal training-step sketch follows this list). This is the canonical formulation:

$$\mathcal{L} = -\sum_{t=1}^T \log P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

  • Encoder-Only Variants (ENTP): Recent work demonstrates that encoder-only next-token prediction (ENTP) architectures, which recompute full self-attention over the prefix at each step, can capture richer context interactions and complex dependencies than decoder-only models, albeit at higher computational cost (Ewer et al., 2 Oct 2024).
  • Multi-Token Prediction and Feature Injection: Proposals such as Future Token Prediction (FTP) ask the model to predict several future tokens at once, promoting smoother and more semantically coherent embeddings and improved long-context planning (Walker, 23 Oct 2024). Similarly, NG-Midiformer injects n-gram structural information via a dedicated N-gram transformer encoder and position matrix, enriching context representations for each prediction (Tian et al., 2023).
  • Interval-Based Tokenization: Instead of absolute pitches, encoding relative intervals (both horizontal and vertical) can increase generalization and interpretability. Formally, for an event $e = (p, t)$ relative to a reference $e_j = (p_j, t_j)$, interval tokens are constructed as $I(p_j, p_{j-1}) = p_j - p_{j-1}$ for reference events and $I_{\mathrm{non\text{-}ref}}(p, p_j) = p - p_j$ for other events, supporting transposition invariance and music-theoretic explainability (Le et al., 8 Jan 2025).
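
To make the decoder-only objective concrete, the sketch below implements one teacher-forced training step in PyTorch. The tiny model, vocabulary size, and hyperparameters are placeholders for illustration, not any published architecture:

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 512  # size of the MIDI token vocabulary (placeholder)

class TinyMidiLM(nn.Module):
    """A small decoder-only Transformer LM; dimensions are illustrative."""
    def __init__(self, vocab: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask so position t attends only to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.blocks(self.embed(x), mask=mask)
        return self.head(h)  # (batch, seq_len, vocab) logits

model = TinyMidiLM(VOCAB_SIZE)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, VOCAB_SIZE, (8, 128))  # a batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # teacher forcing: shift by one

optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()
optimizer.step()
```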

3. Training Regimes and Objectives

Training on tokenized MIDI for next-token prediction generally follows:

  • Teacher Forcing: Models are optimized to predict each next token given the ground truth prefix, maximizing chain rule likelihood.
  • Challenges in Training: The distinction between teacher-forced training and true autoregressive generation is critical. Certain tasks can induce "Clever Hans" phenomena, where the model exploits spurious cues from the training setup, failing to learn genuine long-range planning (Bachmann et al., 11 Mar 2024). For complex, non-local dependencies (e.g., global musical structure), alternative objectives—such as multi-token prediction, sequence reversal, or masked prediction—may mitigate these shortcomings.
  • Loss Formulations: The standard regime employs token-wise cross-entropy. Advanced strategies leverage masked language modeling (MLM), chunked loss optimization, or diffusion objectives (for continuous or embedded tokens) to enhance robustness and dependency modeling. The general next-token prediction loss is:

$$\mathcal{L}_{\text{NTP}}(\theta) = \mathbb{E}\left[ -\sum_{t=1}^T \log P_\theta(x_t \mid x_1, \ldots, x_{t-1}) \right]$$

  • Capacity and Scaling Laws: The representational capacity is bounded theoretically. For a decoder-only Transformer with $k$ parameters and vocabulary size $\omega$, the number of exactly "memorized" distinct contexts is limited by $k/(\omega - 1)$, suggesting that practical model and vocabulary scaling must be aligned with data and task complexity to reach the entropy lower bound (Madden et al., 22 May 2024).
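
To make the capacity bound concrete, here is the arithmetic for a hypothetical configuration (the numbers are chosen only to illustrate the scaling):

```python
# Capacity bound n <= k / (omega - 1): the maximal number of exactly
# memorized distinct contexts for k parameters and vocabulary size omega.
k = 20_000_000   # model parameters (hypothetical 20M model)
omega = 500      # MIDI token vocabulary size (hypothetical)

max_contexts = k / (omega - 1)
print(f"at most ~{max_contexts:,.0f} memorized contexts")  # ~40,080
```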

4. Use Cases, Performance, and Limitations

Primary applications include:

  • Autoregressive Generation: Music transformers and similar models, fed by token streams from packages like MidiTok, can generate musically plausible continuations or entirely new sequences (a minimal sampling loop is sketched after this list).
  • Conditional and Contextual Generation: By inserting chord, style, or instrument tokens, the model can be controlled to generate music satisfying user-specified conditions.
  • Inpainting and Imputation: Next-token models can "fill in" missing portions, perform harmonization, or generate contrapuntal lines, leveraging the autoregressive paradigm.
  • Symbolic Transcription: Sequence-to-sequence architectures built on next-token prediction can transcribe audio or incomplete MIDI inputs to full musical representations.
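
The generation loop itself is simple: repeatedly sample from the model's predictive distribution and append the result to the context. A minimal temperature-sampling sketch, reusing the hypothetical TinyMidiLM from Section 2:

```python
import torch

@torch.no_grad()
def generate(model, prompt: torch.Tensor, max_new: int = 256,
             temperature: float = 1.0) -> torch.Tensor:
    """Autoregressively extend a (1, T) tensor of MIDI token ids."""
    tokens = prompt
    for _ in range(max_new):
        logits = model(tokens)[:, -1, :]  # logits for the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
        # In practice the context must be truncated to the model's
        # maximum sequence length before each forward pass.
    return tokens
```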

Results from the literature demonstrate that even lightweight, task-specific models (e.g., a 20M parameter decoder-only RWKV trained via simple next-token prediction on piano MIDI) can match or outperform larger foundation models on constrained continuation tasks. This performance is attributed to the alignment between the modeling objective, tokenization, and the constrained data domain (Zhou-Zheng et al., 13 Sep 2025).

Autoregressive next-token prediction is, however, not without fundamental limitations:

| Limitation | Origin/Symptom |
| --- | --- |
| Error amplification | Compounding error grows with sequence length under model misspecification ($\Omega(H)$ scaling) (Rohatgi et al., 18 Feb 2025) |
| Lack of global planning | Teacher-forced models may fail in tasks requiring long-term structure (the "Clever Hans" effect) (Bachmann et al., 11 Mar 2024) |
| Creativity bottleneck | Models tend to "play it safe," favoring surface-level coherence over improvisational risk (Olatunji et al., 25 May 2025) |

5. Extensions, Innovations, and Alternative Paradigms

Various extensions to the next-token prediction paradigm seek to overcome its limitations:

  • Masked and Multi-Token Prediction: Training objectives that require the model to predict multiple or masked tokens (MNTP, FTP) can improve long-range semantic coherence, robustness to missing information, and context-aware planning (Walker, 23 Oct 2024, Yang et al., 14 Jul 2025); a sketch of a multi-token objective follows this list.
  • Continuous Token Modeling: For audio, token-wise diffusion models operating in continuous latent spaces have advanced the state of the art, suggesting possible inspiration for embedding or diffusion-based next-token objectives in MIDI (Yang et al., 14 Jul 2025).
  • Adversarial and Interactive Systems: To better capture spontaneity and improvisational dynamics essential for musical creativity, approaches incorporating adversarial imitation learning, feedback-driven systems, or creative beam search are advocated as alternatives to strict next-token predictors (Olatunji et al., 25 May 2025).
  • Multimodal Formulations: Unified frameworks from multimodal learning literature generalize next-token prediction to music by adopting tokenization schemes, model architectures, and objectives consistent with multimodal LLMs. This sets the stage for joint modeling of music, text, image, and other modalities under a shared sequence modeling paradigm (Chen et al., 16 Dec 2024).
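
As one way to realize the multi-token idea, the sketch below attaches several output heads to a shared hidden representation, each predicting the token $k$ steps ahead, and sums their cross-entropy losses. This illustrates the shape of the objective rather than the specific FTP architecture:

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """Predict each of the next `horizon` tokens from every position."""
    def __init__(self, d_model: int, vocab: int, horizon: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab) for _ in range(horizon)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        return [head(hidden) for head in self.heads]  # one logits tensor per offset

def multi_token_loss(logit_list: list[torch.Tensor],
                     tokens: torch.Tensor) -> torch.Tensor:
    """Sum cross-entropy over offsets 1..horizon; positions lacking a
    target that far ahead are dropped."""
    loss_fn = nn.CrossEntropyLoss()
    total = torch.zeros(())
    for k, logits in enumerate(logit_list, start=1):
        preds = logits[:, :-k, :]  # positions that still have a k-ahead target
        targets = tokens[:, k:]    # the token k steps ahead
        total = total + loss_fn(preds.reshape(-1, preds.size(-1)),
                                targets.reshape(-1))
    return total
```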

6. Theoretical Analysis: Capacity, Statistical-Computational Tradeoffs, and Practical Guidelines

A rigorous theoretical framework underpins the practical deployment of next-token prediction in tokenized MIDI:

  • Capacity Bounds: For a model with $k$ parameters and vocabulary size $\omega$, the maximal number of memorized context sequences $n$ is:

$$n \leq \frac{k}{\omega - 1}$$

To reach empirical entropy bounds on complex MIDI datasets, the model size must scale linearly with $n \cdot \omega$.

  • Statistical-Computational Tradeoffs: Efficient algorithms optimizing token-wise log-loss cannot, under misspecification, avoid error amplification scaling linearly with sequence length $H$; information-theoretic minima (constant error) are unattainable without computationally intractable procedures (Rohatgi et al., 18 Feb 2025). Algorithmic modifications, such as chunked objectives, provide partial mitigation at the cost of computation.
  • Tokenization Choice: Interval-based tokenization offers improved generalization, explainability, and sometimes reduced vocabulary size, but requires careful reconstruction of absolute pitches and management of error accumulation (Le et al., 8 Jan 2025). Compound token (UCW) schemes further compress and structure the sequence for improved modeling efficiency (Tian et al., 2023).
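
A minimal sketch of the horizontal-interval encoding and its inverse (the helper names are hypothetical): the first pitch is stored absolutely and each subsequent pitch as a difference, which makes the sequence transposition-invariant up to the anchor, while a single corrupted interval shifts every pitch after it, the error-accumulation caveat noted above:

```python
def pitches_to_intervals(pitches: list[int]) -> tuple[int, list[int]]:
    """Encode a melodic line as (anchor pitch, successive intervals)."""
    anchor = pitches[0]
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    return anchor, intervals

def intervals_to_pitches(anchor: int, intervals: list[int]) -> list[int]:
    """Reconstruct absolute pitches by cumulative summation."""
    pitches = [anchor]
    for step in intervals:
        pitches.append(pitches[-1] + step)
    return pitches

anchor, ivs = pitches_to_intervals([60, 64, 67, 72])  # C-major arpeggio
assert ivs == [4, 3, 5]  # identical for any transposition of the line
assert intervals_to_pitches(anchor, ivs) == [60, 64, 67, 72]
```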

7. Prospects and Open Challenges

Consistent evidence suggests that next-token prediction on tokenized MIDI, paired with strong tokenization and task-specific architectures, remains effective for many symbolic music tasks. However, foundational issues—error accumulation, lack of global planning, and misalignment with human creativity—motivate further research into alternative objectives (multi-token, adversarial, feedback-driven), representation learning methods, and large-scale multimodal models. These directions are crucial for advancing the generation of coherent, musically compelling, and genuinely creative symbolic music beyond the current limits of next-token predictors.
