Fundamental Music Embedding (FME)

Updated 14 March 2026

Fundamental Music Embedding is a method that vectorizes symbolic music by explicitly modeling both absolute and relative musical attributes.
Theory-driven interval encodings decompose MIDI pitch differences into interpretable primitives, enabling direct music-theoretic analysis.
Deep autoencoder and sinusoidal formulations yield compact, robust embeddings that support tasks such as retrieval, generation, and style discrimination.

A Fundamental Music Embedding (FME) is a musically meaningful, information-preserving vectorization of symbolic music that explicitly models both absolute and relative musical attributes using compositionally structured encodings or learned representations. FMEs have emerged as a key methodology for integrating structural domain knowledge—musical intervals, time relationships, and stylistic information—into computational models for analysis, retrieval, generation, and stylistic discrimination. Three principal formulations exist: (1) theory-driven interval encodings as in the MusicEmbedding toolkit (HekmatiAthar et al., 2021), (2) deep autoencoder-based embeddings optimized for symbolic sequence modeling (Bretan et al., 2017), and (3) bias-adjusted sinusoidal embeddings supporting translational invariance and compositionality for Transformer-style architectures (Guo et al., 2022).

1. Theory-Driven Interval-Based Formulation

In its theory-driven form, FME is an explicit mapping from pairs of pitch events to interpretable interval primitives (HekmatiAthar et al., 2021). Given two MIDI note pitches $\pi_i, \pi_j\in\mathbb{Z}$, the raw semitone difference $s_{ij} = \pi_j - \pi_i$ is decomposed into:

- Direction $d_{ij}=\mathrm{sign}(s_{ij})\in\{-1,+1\}$ (for a unison, $s_{ij}=0$, the sign is zero, so a separate convention is needed)
- Octave offset $c_{ij} = \lfloor|s_{ij}|/12\rfloor\in \mathbb{N}$
- Simple-interval semitone $\bar{s}_{ij}=|s_{ij}|\bmod 12$
- Interval ordinal $o_{ij}$ and quality $q_{ij}$ via a fixed Q-table: $(o_{ij},q_{ij}) = Q(\bar{s}_{ij})$

The FME vector is then $e_{ij} = [o_{ij},q_{ij},d_{ij},c_{ij}]^T\in\mathbb{Z}^4$. This representation enables embedding melodic, harmonic, and bar-relative intervals by appropriate choices of $(\pi_i,\pi_j)$. Bar-relative intervals encode, for example, a note’s distance from a bar onset, a proxy for tonal stability or off-beat tension.
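The decomposition above can be sketched in a few lines of Python. This is an illustrative sketch, not the MusicEmbedding toolkit's API: the names `Q_TABLE`, `IntervalFME`, and `interval_fme` are hypothetical, the integer quality codes are an assumed encoding, the tritone is arbitrarily spelled as an augmented fourth, and a unison is assumed to take direction +1.

```python
from typing import NamedTuple

# Assumed Q-table: simple-interval semitones -> (ordinal, quality code).
# Quality codes are hypothetical: 0 = perfect, 1 = minor, 2 = major, 3 = augmented.
# The tritone (6 semitones) is spelled here as an augmented fourth; a diminished
# fifth would be an equally valid convention.
Q_TABLE = {
    0: (1, 0), 1: (2, 1), 2: (2, 2), 3: (3, 1), 4: (3, 2), 5: (4, 0),
    6: (4, 3), 7: (5, 0), 8: (6, 1), 9: (6, 2), 10: (7, 1), 11: (7, 2),
}


class IntervalFME(NamedTuple):
    ordinal: int    # o_ij
    quality: int    # q_ij
    direction: int  # d_ij
    octaves: int    # c_ij


def interval_fme(pitch_i: int, pitch_j: int) -> IntervalFME:
    """Decompose the interval from pitch_i to pitch_j into [o, q, d, c]."""
    s = pitch_j - pitch_i
    direction = -1 if s < 0 else 1            # assumption: unison counts as +1
    octaves, simple = divmod(abs(s), 12)      # c_ij and the simple-interval semitone
    ordinal, quality = Q_TABLE[simple]
    return IntervalFME(ordinal, quality, direction, octaves)
```

Under these assumptions, `interval_fme(60, 67)` (C4 up to G4) gives an ascending perfect fifth with no octave offset, and `interval_fme(72, 57)` gives a descending minor third plus one octave.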

The MusicEmbedding toolkit operationalizes this approach through interval extraction from MIDI/pianoroll files. Typical downstream usage involves one-hot encoding each dimension, yielding a sparse binary vector of roughly 20 to 30 dimensions. Because all features are interpretable, music-theoretic constraints can be imposed directly on machine learning applications (e.g., penalizing tritones, rewarding perfect fifths).
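As a rough illustration of the one-hot step, the sketch below concatenates one block per dimension. The vocabulary sizes are assumptions chosen for the example (7 ordinals, 5 quality codes, 2 directions, 8 octave offsets, 22 dimensions in total), and `one_hot_interval` is a hypothetical helper rather than a toolkit function.

```python
import numpy as np

# Hypothetical vocabulary sizes; the toolkit's actual ranges may differ.
N_ORDINALS = 7     # ordinals 1..7
N_QUALITIES = 5    # quality codes 0..4
N_DIRECTIONS = 2   # -1 or +1
N_OCTAVES = 8      # octave offsets 0..7 (larger offsets are clipped)


def one_hot_interval(ordinal, quality, direction, octaves):
    """Concatenate one one-hot block per interval dimension."""
    segments = [
        (ordinal - 1, N_ORDINALS),
        (quality, N_QUALITIES),
        (0 if direction < 0 else 1, N_DIRECTIONS),
        (min(octaves, N_OCTAVES - 1), N_OCTAVES),
    ]
    parts = []
    for index, size in segments:
        block = np.zeros(size, dtype=np.int8)
        block[index] = 1
        parts.append(block)
    return np.concatenate(parts)
```

Each encoded interval has exactly four active entries, one per block, so a constraint such as a tritone penalty can be expressed by reading the ordinal and quality blocks directly.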

Analyses of classical corpora illustrate what the representation exposes: minor/major interval ratios characterize composers' harmonic tendencies, and melodic directionality distributions capture a characteristic bias toward ascending motion (mean descent/ascent ratio ≈ 0.4030) (HekmatiAthar et al., 2021). Because FMEs embed interval names and functions, they subsume raw semitone-difference encodings while being easier to interpret and making it easier to build explicit music theory into computational workflows.

2. Deep Autoencoder-Derived Embedding Spaces

In autoencoder-based FME, the symbolic musical input is segmented into overlapping, fixed-length windows: contiguous four-beat segments, with each beat quantized to 24 ticks, forming a binary $60 \times 96$ onset matrix (Bretan et al., 2017). Each of the 96 columns (4 beats × 24 ticks) is a time step, and each of the 60 rows marks whether a given MIDI pitch has an onset at that step.
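A minimal sketch of how one such window could be built from a note list follows. The function name `onset_matrix`, the `(midi_pitch, onset_in_beats)` input format, and the choice of MIDI pitch 36 as the lowest row are assumptions made for illustration; the source does not specify which 60 pitches are covered.

```python
import numpy as np

N_PITCHES = 60        # pitch rows, as stated in the text
TICKS_PER_BEAT = 24
BEATS_PER_WINDOW = 4
LOWEST_PITCH = 36     # hypothetical: MIDI pitch mapped to row 0


def onset_matrix(notes, window_start_beat=0.0):
    """Build a binary 60 x 96 onset matrix for one four-beat window.

    notes: iterable of (midi_pitch, onset_in_beats) pairs.
    Notes outside the pitch range or the window are ignored.
    """
    n_ticks = TICKS_PER_BEAT * BEATS_PER_WINDOW
    x = np.zeros((N_PITCHES, n_ticks), dtype=np.uint8)
    for pitch, onset in notes:
        row = pitch - LOWEST_PITCH
        tick = int(round((onset - window_start_beat) * TICKS_PER_BEAT))
        if 0 <= row < N_PITCHES and 0 <= tick < n_ticks:
            x[row, tick] = 1
    return x
```

Overlapping windows can then be produced by calling the function repeatedly with an advancing `window_start_beat`.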

A deep convolutional autoencoder forms the FME as follows:

- Encoder: Four convolutional layers (first: $12\times 6$ filters, stride $12\times 6$; subsequent: same receptive fields with decreasing stride), ELU nonlinearities, batch normalization after each layer, and three fully connected layers, reducing to a 100-dimensional embedding $h=f_e(x)$.
- Decoder: Mirrors the encoder (with tied weights) to reconstruct the original input.
- Loss functions:
  - Denoising: Inputs are masked using random note dropout, beat dropout, or octave splitting, and the reconstruction is optimized via a softmax-style contrastive loss over cosine similarities.
  - Context reconstruction: The embedding predicts the sum of the immediate predecessor and successor in the sequence; mean squared error is used as the loss.
  - Composer regularization: An auxiliary classifier from $h$ predicts composer identity, contributing an additional cross-entropy loss.
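Two of the ingredients listed above, input corruption and the context-reconstruction target, are simple enough to sketch in NumPy. The function names and the corruption details (independent per-cell dropout, zeroing a whole beat) are assumptions for illustration; the exact procedures in Bretan et al. (2017) may differ, and the network, the contrastive denoising loss, and the composer classifier are omitted.

```python
import numpy as np


def note_dropout(x, drop_prob, rng):
    """Randomly delete individual onsets from a binary onset matrix."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep


def beat_dropout(x, beat_index, ticks_per_beat=24):
    """Silence every onset that falls inside one beat of the window."""
    corrupted = x.copy()
    start = beat_index * ticks_per_beat
    corrupted[:, start:start + ticks_per_beat] = 0
    return corrupted


def context_loss(predicted, h_prev, h_next):
    """Mean squared error against the sum of the neighbouring embeddings."""
    target = h_prev + h_next
    return float(np.mean((predicted - target) ** 2))
```

In training, a corrupted window would be encoded and the decoder asked to recover the clean window, while `context_loss` would be applied to a prediction made from the current window's embedding.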

The 100-dimensional embedding is a more than 50-fold compression of the raw $5760$-dimensional input ($60 \times 96 = 5760$; $5760/100 = 57.6$) but retains salient information about local harmonic, rhythmic, and sequential patterns (Bretan et al., 2017).

Evaluation uses forward-prediction ranking (an LSTM predicts the next embedding, and the true next embedding is ranked among 1000 candidates) and composer classification (frozen embeddings with a small classifier). The context-reconstruction objective achieves a median forward-prediction rank of ≈15/1000, and pure composer discrimination yields Micro-F$_1$ ≈ 0.29 and Macro-F$_1$ ≈ 0.76.
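The ranking metric can be illustrated as follows. This sketch assumes that candidates are ordered by cosine similarity to the predicted embedding, which the source does not state explicitly for the ranking step; `prediction_rank` is a hypothetical name.

```python
import numpy as np


def prediction_rank(predicted, candidates, true_index):
    """1-based rank of the true candidate by cosine similarity to the prediction.

    predicted:  (d,) predicted next embedding
    candidates: (n, d) candidate embeddings, one of which is the true next one
    """
    pred = predicted / np.linalg.norm(predicted)
    cands = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = cands @ pred
    # Rank = 1 + number of candidates strictly more similar than the true one.
    return int(1 + np.sum(sims > sims[true_index]))
```

A rank of 1 means the true continuation was the closest candidate; the median of this rank over many test segments gives a figure comparable in form to the ≈15/1000 reported above.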

The resulting embedding is robust (survives note/beat masking), interpretable, and suitable for downstream composition, continuation prediction, retrieval, and style-transfer applications. Extensions encompass adding note durations, expanding context windows, variational/adversarial regularization, polyphonic/multi-instrument generalization, and metadata conditioning.

3. Sinusoidal Translationally-Invariant Embeddings and FME for Transformers

The sinusoidal FME is designed to align with human musical perception, which depends heavily on relative intervals and temporal relationships (Guo et al., 2022). Standard learned embeddings or one-hot vectors do not guarantee that identical intervals are separated by identical distances in embedding space, a translational invariance that is crucial for motif and pattern modeling.

FME treats pitches, durations, and onsets as “fundamental music token” (FMT) types. For each FMT type $F$, with input value $f$, the $d$-dimensional absolute embedding is:

$\text{FME}_F(f) = \left[ \sin(w_k f) + b_{\text{sin},k},\ \cos(w_k f) + b_{\text{cos},k}\right]_{k=0}^{d/2-1}$

with $w_k = B^{-2k/d}$, $B$ the base (distinct per FMT type), and $b_{\text{sin},k}, b_{\text{cos},k}$ trainable biases.

Relative “shifts” (interval or onset differences) are encoded identically, omitting the biases:

$\text{FMS}_F(\Delta f) = \left[ \sin(w_k \Delta f),\ \cos(w_k \Delta f) \right]_{k=0}^{d/2-1}$
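The two formulas translate almost directly into NumPy. In the sketch below the function names are hypothetical, the biases are passed in as plain arrays rather than trained, and the sine and cosine entries are interleaved in the order the bracket notation suggests; any base value used with it is a placeholder rather than a value taken from the paper.

```python
import numpy as np


def frequencies(d, base):
    """w_k = base^(-2k/d) for k = 0 .. d/2 - 1."""
    k = np.arange(d // 2)
    return base ** (-2.0 * k / d)


def fms(delta, d, base):
    """Bias-free shift embedding, laid out as [sin_0, cos_0, sin_1, cos_1, ...]."""
    w = frequencies(d, base)
    out = np.empty(d)
    out[0::2] = np.sin(w * delta)
    out[1::2] = np.cos(w * delta)
    return out


def fme(f, d, base, b_sin, b_cos):
    """Absolute embedding: the sinusoids of fms plus per-frequency biases."""
    out = fms(f, d, base)
    out[0::2] += b_sin
    out[1::2] += b_cos
    return out
```

With any fixed biases, the distance between the embeddings of pitches 60 and 67 equals the distance between those of 72 and 79, because the biases cancel in the difference and only the seven-semitone gap remains.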

Key properties:

- Translational invariance: The Euclidean distance between two FME encodings depends only on the interval $|f_a - f_b|$.
- Transposability: Adding a relative shift to an absolute encoding corresponds to a parameterized rotation in embedding space.
- Orthogonality: Distinct FMTs (pitch, duration, onset) remain separable owing to distinct base values $B$.
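The first two properties follow from standard trigonometric identities applied to the definitions above. For the distance, the biases cancel in the difference, and each frequency contributes $(\sin a - \sin b)^2 + (\cos a - \cos b)^2 = 2 - 2\cos(a - b)$:

$\left\|\text{FME}_F(f_a) - \text{FME}_F(f_b)\right\|^2 = \sum_{k=0}^{d/2-1} \left[ 2 - 2\cos\left(w_k (f_a - f_b)\right) \right]$

which depends on $f_a$ and $f_b$ only through $|f_a - f_b|$, since cosine is even. For transposition, the angle-addition formulas give, for each $k$:

$\begin{bmatrix} \sin(w_k (f + \Delta f)) \\ \cos(w_k (f + \Delta f)) \end{bmatrix} = \begin{bmatrix} \cos(w_k \Delta f) & \sin(w_k \Delta f) \\ -\sin(w_k \Delta f) & \cos(w_k \Delta f) \end{bmatrix} \begin{bmatrix} \sin(w_k f) \\ \cos(w_k f) \end{bmatrix}$

The entries of each $2\times 2$ rotation are exactly the components of $\text{FMS}_F(\Delta f)$. The rotation acts on the bias-free part of the embedding; the biases $b_{\text{sin},k}, b_{\text{cos},k}$ are added back afterwards.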

Implementation involves composing input embeddings (via FME) for use in transformer models (the “RIPO transformer”), replacing or augmenting standard positional encodings and input projections. In attention mechanisms, additional relative bias terms—based on relative pitch and onset embeddings—are incorporated into the logits.
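The idea of adding relative terms to the attention logits can be sketched as below. This is a simplified single-head illustration under stated assumptions, not the RIPO transformer's actual formulation: learned projections, the relative-index term, and causal masking are omitted, the bias is computed as a dot product between each query and a bias-free sinusoidal embedding of the pitch or onset difference, and all names and base values are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def shift_embedding(delta, d, base):
    """Bias-free sinusoidal embedding of pairwise differences, shape (..., d)."""
    w = base ** (-2.0 * np.arange(d // 2) / d)
    angles = delta[..., None] * w
    out = np.empty(delta.shape + (d,))
    out[..., 0::2] = np.sin(angles)
    out[..., 1::2] = np.cos(angles)
    return out


def attention_with_relative_bias(Q, K, V, pitches, onsets, pitch_base, onset_base):
    """Scaled dot-product attention with relative pitch and onset bias terms."""
    n, d = Q.shape
    rel_pitch = shift_embedding(pitches[None, :] - pitches[:, None], d, pitch_base)
    rel_onset = shift_embedding(onsets[None, :] - onsets[:, None], d, onset_base)
    logits = Q @ K.T
    # Bias terms: each query is scored against the embedding of its offset to each key.
    logits = logits + np.einsum("id,ijd->ij", Q, rel_pitch)
    logits = logits + np.einsum("id,ijd->ij", Q, rel_onset)
    weights = softmax(logits / np.sqrt(d))
    return weights @ V, weights
```

Because the bias terms see only differences, transposing every pitch or delaying every onset by the same amount leaves them unchanged, which is the behaviour the relative encodings are meant to provide.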

4. Applications and Empirical Evaluations

FMEs have been deployed across autoencoding, transformer-based sequence modeling, and theory-based analysis.

- MusicEmbedding toolkit: Used for large-scale computational musicology (e.g., composer style quantification), sequence compression, and as an analysis primitive for generative or classification models. Interval statistics (minor/major, directional, composer identity) are extracted efficiently (HekmatiAthar et al., 2021).
- Deep autoencoder FME: Facilitates predictive modeling (forward-prediction tasks), composer identification, and robust representation learning from large symbolic music corpora (4M+ unique segments) (Bretan et al., 2017).
- Sinusoidal FME with RIPO transformer: Yields state-of-the-art results on melody completion (lowest cross-entropy: FME+RIPO, 2.367, vs. Music Transformer+WE, 2.408) and produces generated music more faithful to human patterns, as reflected in both objective sequence metrics (e.g., KL divergence on pitch and duration) and listening test scores (Guo et al., 2022).
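As an illustration of the kind of objective metric mentioned in the last item, the sketch below computes a KL divergence between pitch-class histograms of two note sequences. The binning into 12 pitch classes, the additive smoothing, and the function names are assumptions made for this example; the exact metric definitions in Guo et al. (2022) may differ.

```python
import numpy as np


def pitch_class_histogram(pitches, smoothing=1e-6):
    """Normalized 12-bin pitch-class distribution with additive smoothing."""
    counts = np.bincount(np.asarray(pitches) % 12, minlength=12).astype(float)
    counts += smoothing   # avoids division by zero for unused pitch classes
    return counts / counts.sum()


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))
```

A lower divergence between the histogram of generated melodies and that of human-composed melodies would indicate a closer match in pitch usage; the same pattern applies to duration histograms.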

Ablation studies confirm the necessity of explicit relative encodings in attention: removing any core RIPO term degrades performance. Compared with standard “word” embeddings or word2vec, translationally invariant FMEs improve symbolic melody modeling and help avoid collapse into repetitive or degenerate outputs.

5. Theoretical Significance and Limitations

FMEs formalize musically meaningful invariances and compositionality in embedding space. Theory-driven FMEs (interval-based) ensure direct interpretability and immediate connection to established music analytics. Deep-learned and sinusoidal FMEs systematize the absorption of complex, high-dimensional local and global patterns, including voice leading, rhythm, and timbral regularities. Translational invariance and explicit support for compositional manipulation (e.g., transposition-as-rotation) enhance both the expressivity and the generalizability of sequence models.

Reported limitations include:

- Sinusoidal encodings’ tendency to oscillate for very large intervals, potentially complicating discrimination over broad pitch ranges (Guo et al., 2022).
- Existing FME formulations in the RIPO transformer are monophonic; polyphonic extension necessitates more complex geometric or set-based representations.
- The deep autoencoder approach implicitly ignores note durations unless multiple channels are incorporated (Bretan et al., 2017).

A plausible implication is that future FME research will include richer hierarchical and cross-modal encodings, extension to polyphonic and multi-instrument data, and combination with masked modeling or pre-training on diverse musical corpora.

6. Comparison to Conventional Embeddings and Future Directions

The contrast with conventional symbolic music representations (raw one-hots, sequence trees, learned embeddings) is substantive. FMEs embed music-theory concepts (interval category, function, bar-position) as first-class dimensions, not merely as post hoc features. This alignment enables more explicit regularization, constraint imposition, and interpretability—allowing direct reward or penalty for compositional structures during generative modeling, and facilitating direct analytical queries in musicology.

Reported future directions include integration of FME with BERT-style masked modeling, scaling to include velocity, articulation, and dynamics, and leveraging FME as a cross-modal bridge for music information retrieval, recommendation systems, and style/genre classification (Guo et al., 2022; Bretan et al., 2017). The persistence of domain knowledge in FME design suggests that model architectures incorporating these embeddings will remain competitive as the field transitions to larger and more heterogeneous musical datasets.