OctupleMIDI Encoding
- OctupleMIDI encoding is a note-centric, 8-tuple representation that compresses all key musical attributes into a single token per note.
- It achieves significant efficiency gains by reducing average token sequence length by roughly 75% compared to REMI-like event encodings, benefiting Transformer and VAE models.
- The design captures polyphonic and multi-track structures, enabling superior handling of long-range dependencies and improved performance on tasks like melody completion and genre classification.
OctupleMIDI encoding is a two-dimensional (notes × attributes), note-centric symbolic music representation in which each musical note is compactly and uniquely described by a single discrete 8-tuple token. Originally introduced as a core component of MusicBERT (Zeng et al., 2021) and subsequently adopted in Multi-view MidiVAE (Lin et al., 15 Jan 2024), OctupleMIDI substantially shortens input sequences while preserving expressivity for large-scale symbolic music modeling and generation. Its design directly addresses the structural and contextual complexities of polyphonic, multi-track MIDI data by encoding all note-relevant musical parameters as a single atomic unit. This enables efficient modeling by neural architectures, notably Transformers and VAEs, across a variety of symbolic music understanding and generation tasks.
1. Motivation and Rationale
Symbolic music data in MIDI format carries rich structural information (bars, positions, time signatures) and diverse attributes (tempo, instrument, pitch) that traditional event-based encodings do not capture efficiently. Earlier representations such as pianoroll, MIDI-like, and REMI encode music as linearized event sequences, assigning multiple tokens per note via discrete events for note-on, note-off, time-shift, bar markers, and so on. For long or multi-track pieces, this results in prohibitively long token sequences (REMI-like encoding averages 15,679 tokens per song [(Zeng et al., 2021), Table 1]), exceeding practical attention window constraints in Transformer-based models and creating inefficiency in both training and inference.
OctupleMIDI was developed to eliminate token redundancy by making the note—not the musical event—the fundamental unit of representation. This approach bundles all key attributes for a note into a single octuple token, dramatically reducing input length (to 3,607 tokens per song on average) while maintaining expressivity across genres, time signatures, and styles (Zeng et al., 2021, Lin et al., 15 Jan 2024). In effect, OctupleMIDI resolves the context-length bottleneck in symbolic music modeling and captures long-range dependencies, global and local structure, and simultaneity among notes.
2. Distinctiveness from Previous MIDI Encodings
| Encoding | Avg. Tokens per Song | Structure |
|---|---|---|
| OctupleMIDI | 3,607 | Note-centric 8-tuple |
| CP-like | 6,906 | Compound event |
| REMI-like | 15,679 | 1D event |
Traditional event-based representations (MIDI-like, REMI) are inherently redundant, as each note typically requires multiple tokens for onset, velocity, duration, instrument changes, and temporal positioning. CP-like (Compound Word) encodings merge some note attributes but are neither wholly note-centric nor sufficiently compact for efficient sequence modeling. By contrast, OctupleMIDI compresses all relevant note information into a single token per note, capturing explicit relationships by jointly encoding instrument, bar, position, and the other attributes. This yields a sequence length reduction of approximately 75% compared to REMI-like encoding and a 357% increase in inference speed (Lin et al., 15 Jan 2024).
3. Structure and Formal Definition
Each note is encoded as an octuple (8-tuple) of discrete categorical variables representing its musical attributes, summarized below (a minimal encoding sketch follows the table):
Attribute Space
| Element | Description | Value Range / Token Count | Notes |
|---|---|---|---|
| Time Signature | Bar meter (fraction) | 254 possible (numerator 1–128, denominator 1–64) | Adapts up to 2 whole notes/bar |
| Tempo | BPM (discretized) | 49 (16–256 BPM, geometric sequence) | |
| Bar | Index of bar in song | 0–255 | |
| Position | Note onset position within bar | 128 (1/64-note steps) | |
| Instrument | MIDI/GM instrument ID or percussion | 0–127 (GM), 128 (drum set) | |
| Pitch | Note or percussion pitch/type | 128 (pitches/types per instrument group) | |
| Duration | Note duration (mixed resolution) | 128 (fine for short, coarse for long) | 0 for percussion |
| Velocity | MIDI velocity (quantized) | 32 (2, 6, 10, …, 126) | |
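To make the attribute space concrete, the sketch below defines the octuple as a plain Python structure together with illustrative quantization helpers for tempo, velocity, and position. The bin counts follow the table above; the function names and exact rounding rules are assumptions for illustration, not the reference implementation.

```python
import math
from typing import NamedTuple


class OctupleToken(NamedTuple):
    """One note as eight discrete indices (field order follows the attribute table)."""
    time_signature: int  # 0..253
    tempo: int           # 0..48
    bar: int             # 0..255
    position: int        # 0..127, in 1/64-note steps from the bar start
    instrument: int      # 0..127 GM programs, 128 = drum set
    pitch: int           # 0..127 pitch (or percussion type)
    duration: int        # 0..127 (0 for percussion)
    velocity: int        # 0..31


def quantize_tempo(bpm: float) -> int:
    """Map BPM onto 49 geometric bins spanning 16-256 BPM (one bin per 2**(1/12) step)."""
    bpm = min(max(bpm, 16.0), 256.0)
    return round(12 * math.log2(bpm / 16.0))


def quantize_velocity(velocity: int) -> int:
    """Map MIDI velocity (0-127) onto 32 bins centred on 2, 6, 10, ..., 126."""
    return min(31, max(0, round((velocity - 2) / 4)))


def quantize_position(onset_in_64ths: int) -> int:
    """Clamp an onset (in 1/64-note steps within the bar) to the 128 position slots."""
    return min(max(onset_in_64ths, 0), 127)
```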
In the context of generation or understanding, each octuple token uniquely defines a musical event, and the set of such tokens forms a compact representation of the MIDI content that is lossless up to the attribute quantization above. In neural architectures (e.g., MusicBERT), the octuple's elements are embedded separately and concatenated; an input vector per note is then formed via a linear projection of the concatenated embeddings. Output predictions use separate softmax heads for each of the 8 elements, enabling simultaneous multi-attribute modeling.
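The PyTorch-style sketch below illustrates this factored input/output scheme: per-element embeddings are concatenated and linearly projected to the model width, and prediction uses one linear (softmax) head per element. The vocabulary sizes mirror the attribute table (special tokens such as padding or mask are omitted), and the embedding and hidden dimensions are illustrative rather than MusicBERT's actual hyperparameters.

```python
import torch
import torch.nn as nn

# Per-element vocabulary sizes, following the attribute table above.
FIELD_SIZES = [254, 49, 256, 128, 129, 128, 128, 32]
EMB_DIM, HIDDEN = 64, 512


class OctupleEmbedding(nn.Module):
    """Embed each of the 8 elements separately, concatenate, and project to the model width."""

    def __init__(self) -> None:
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(n, EMB_DIM) for n in FIELD_SIZES])
        self.proj = nn.Linear(8 * EMB_DIM, HIDDEN)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, 8) integer indices, one octuple per note.
        parts = [emb(tokens[..., i]) for i, emb in enumerate(self.embeddings)]
        return self.proj(torch.cat(parts, dim=-1))  # (batch, seq_len, HIDDEN)


class OctupleHeads(nn.Module):
    """One linear (softmax) head per octuple element for attribute-factored prediction."""

    def __init__(self) -> None:
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(HIDDEN, n) for n in FIELD_SIZES])

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # Returns 8 logit tensors, one per element, each of shape (batch, seq_len, vocab).
        return [head(hidden) for head in self.heads]
```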
4. Application in Transformer and VAE-based Architectures
OctupleMIDI encoding is compatible with both unidirectional and bidirectional sequence models. In MusicBERT (Zeng et al., 2021), a Transformer-based model is pre-trained on over 1 million symbolic music pieces using OctupleMIDI. The model receives the compact sequence of octuple tokens, allowing substantially longer contexts (spanning entire songs) and facilitating advanced masking and predictive strategies, including bar-level masking. The octuple architecture supports bar-aware positional modeling and attribute-factored predictions.
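As a rough illustration of bar-level masking, the sketch below masks one randomly chosen octuple element for every note in a randomly selected bar, which is one plausible reading of the strategy described for MusicBERT; the `MASK_ID` constant and masking probability are hypothetical placeholders.

```python
import random

import torch

MASK_ID = 0  # hypothetical [MASK] index reserved in each element vocabulary


def bar_level_mask(tokens: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Mask one randomly chosen octuple element for all notes in randomly selected bars.

    tokens: (seq_len, 8) integer octuples; element index 2 is the bar index.
    """
    masked = tokens.clone()
    for bar in tokens[:, 2].unique().tolist():
        if random.random() < mask_prob:
            element = random.randrange(8)      # which of the 8 attributes to mask
            rows = tokens[:, 2] == bar         # all notes belonging to this bar
            masked[rows, element] = MASK_ID
    return masked
```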
In Multi-view MidiVAE (Lin et al., 15 Jan 2024), OctupleMIDI serves as the basis for dual-view representation learning. Here, the full sequence of octuple tokens per piece is reorganized into a 'track-view' (grouped by instrument) and a 'bar-view' (grouped by bar), yielding view matrices via the transformations

$$X_{\text{track}} = I_{\text{track}}\,X, \qquad X_{\text{bar}} = I_{\text{bar}}\,X,$$

where $X$ is the original octuple token sequence and $I_{\text{track}}$, $I_{\text{bar}}$ are indexing matrices for track and bar grouping. This dual-view approach enables the VAE to capture both instrumental characteristics and temporal/harmonic structure.
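Conceptually, the indexing-matrix formulation amounts to regrouping the same flat octuple sequence under two different keys. The sketch below shows that regrouping with plain Python tuples; the use of dictionaries rather than the paper's tensor layout is an assumption made for brevity.

```python
from collections import defaultdict


def split_views(octuples):
    """Group a flat list of octuple tokens into a track view and a bar view.

    Each octuple is (time_sig, tempo, bar, position, instrument, pitch, duration, velocity);
    indices 2 (bar) and 4 (instrument) are the grouping keys.
    """
    track_view, bar_view = defaultdict(list), defaultdict(list)
    for note in octuples:
        track_view[note[4]].append(note)  # group by instrument/track
        bar_view[note[2]].append(note)    # group by bar
    return dict(track_view), dict(bar_view)
```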
5. Empirical Impact and Comparative Performance
The adoption of OctupleMIDI has direct quantifiable benefits in both efficiency and downstream performance. In MusicBERT (Zeng et al., 2021), ablation studies demonstrate that OctupleMIDI yields improved accuracy and F1 on four benchmark tasks: melody completion, accompaniment suggestion, genre classification, and style classification.
| Encoding | Melody Completion (Acc.) | Accompaniment Suggestion (Acc.) | Genre (F1) | Style (F1) |
|---|---|---|---|---|
| REMI-like | 92.0 | 86.5 | 0.689 | 0.487 |
| CP-like | 95.7 | 87.2 | 0.719 | 0.510 |
| OctupleMIDI | 96.7 | 87.9 | 0.730 | 0.534 |
OctupleMIDI enables state-of-the-art performance, particularly on song-level (global context) tasks, attributed to the increased context window and more efficient parameterization. Phrase-level task improvements are more modest, reflecting shorter context dependencies, but the representation still affords significant computational cost reductions. Across both MusicBERT and Multi-view MidiVAE (Lin et al., 15 Jan 2024), OctupleMIDI demonstrates efficiency, improved modeling of harmonic and polyphonic relationships, and enhanced subjective generation quality.
6. Representational Universality, Advantages, and Limitations
OctupleMIDI is designed for extensibility and applicability across varied symbolic music datasets and genres. Its encoding natively supports variable time signatures and tempos, the full General MIDI instrument set (including percussion), arbitrary pitch and note durations, and both global and local musical context. This universality underpins its successful application in datasets ranging from monophonic to complex multi-track, multi-bar pieces.
Advantages
- Drastic reduction in sequence length, approximately 75% relative to REMI-like encoding (Lin et al., 15 Jan 2024)
- Explicit encoding of simultaneity and multi-attribute context for each note
- Suitability for long-form, multi-track symbolic music in deep generative models (Transformers, VAEs)
- Empirical gains in accuracy and computational efficiency
Limitations
- Does not encode lower-level expressive events (such as pedal, pitch bend) natively
- Fixed-attribute schema may require extension to accommodate future musical annotations (e.g., lyrics, articulation)
- Packing multiple attributes per token can increase sparsity, potentially leading to overfitting on less repetitive data
- Implementation requires careful preprocessing and possibly custom embedding/modeling strategies to handle the octuple composition and its 2D organization
7. Summary
OctupleMIDI encoding represents a critical evolution in symbolic music representation. By encoding each note as an information-rich octuple, it achieves both brevity and expressivity, permitting modern neural architectures to efficiently model long, polyphonic, multi-track works. Comparative studies (Zeng et al., 2021, Lin et al., 15 Jan 2024) establish its superiority over previous event-based encodings in terms of both computational tractability and downstream task performance. Its principal innovations enable fine-grained, context-aware understanding and generation of symbolic music, supporting contemporary developments in large-scale music modeling and generative systems.