XMusic Framework: Controllable AI Music Generation

Updated 14 March 2026

XMusic Framework is a multi-modal, controllable system that translates diverse prompts (images, texts, videos, tags, humming) into high-quality symbolic music tokens.
It employs a two-module design with XProjector for prompt parsing and XComposer for autoregressive generation and candidate selection, ensuring fine-grained control over musical attributes.
Leveraging the extensive, annotated XMIDI dataset, XMusic achieves state-of-the-art performance on both objective and subjective evaluations, advancing AI-generated music fidelity.

XMusic is a generalized and controllable symbolic music generation framework designed to advance artificial intelligence-generated content (AIGC) in music by supporting multi-modal prompting (images, videos, texts, tags, humming) and enabling precise emotional and genre control. XMusic is composed of two principal modules: XProjector, responsible for multi-modal prompt parsing, and XComposer, which includes both an autoregressive Transformer Generator and a multi-task Selector. The framework operates on a nuanced symbolic music representation and is trained with XMIDI, a large-scale, meticulously annotated symbolic music dataset (Tian et al., 15 Jan 2025).

1. System Architecture and Design Overview

XMusic implements a unified architecture encompassing multi-modal prompt interpretation, token-based symbolic music modeling, controllable generation, and learned post-selection filtering:

XProjector parses heterogeneous user prompts into a low-dimensional “projection space.” This ensures prompts from any supported modality are consistently mapped to symbolic control signals—emotions, genres, rhythms, and note-level information.
XComposer is split into a Generator (autoregressive decoding with explicit emotional and genre control) and a Selector (multi-task classification of output candidates by quality, emotion, and genre).
Symbolic Music Representation is constructed via a Compound-Word extension, supporting 12 token fields per event including new dimensions for emotional intent, fine-grained rhythm, and instrument program.
Training Data: The 108,023-song XMIDI dataset serves as the backbone, with extensive expert annotation enabling high-fidelity multi-label training.

XMusic is benchmarked both objectively (using metrics such as Pitch-Class Histogram Entropy, Grooving-Pattern Similarity, Empty Beat Rate) and subjectively (multi-aspect human ranking), and achieves state-of-the-art results across all supported prompt modalities.

2. XProjector: Multi-Modal Prompt Parsing

XProjector acts as a universal “prompt parser,” mapping input from five modalities into control vectors within a shared projection space $\mathcal{P} = \{\mathbb{P}^E, \mathbb{P}^G, \mathbb{P}^R, \mathbb{P}^N\}$, where:

$\mathbb{P}^E \in \mathbb{R}^{11}$ is an emotion one-hot vector (11 classes).
$\mathbb{P}^G \in \mathbb{R}^6$ is a genre one-hot vector (6 classes).
$\mathbb{P}^R$ contains per-bar and per-beat rhythm descriptors: $p^{bar}_i=(bar\_position, density)$, $p^{beat}_{i;j}=(beat\_position, tempo, strength)$.
$\mathbb{P}^N$ captures sequential note descriptors: $p^n_j=(pitch, duration, velocity)$.
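
As a rough illustration of the projection space, the four components can be pictured as one record with optional fields. This is a minimal sketch: the class name `Projection`, its attribute names, and the helper `one_hot` are hypothetical and do not come from XMusic. Only the sizes (11 emotions, 6 genres) and the descriptor tuples follow the definitions above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

NUM_EMOTIONS = 11  # size of P^E, as stated in the text
NUM_GENRES = 6     # size of P^G, as stated in the text


def one_hot(index: int, size: int) -> List[int]:
    """Return a one-hot list of length `size` with a 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    return [1 if i == index else 0 for i in range(size)]


@dataclass
class Projection:
    """Hypothetical container for P = {P^E, P^G, P^R, P^N}."""
    emotion: Optional[List[int]] = None  # P^E, one-hot over 11 classes
    genre: Optional[List[int]] = None    # P^G, one-hot over 6 classes
    # P^R: per-bar (bar_position, density) descriptors
    bars: List[Tuple[int, int]] = field(default_factory=list)
    # P^R: per-beat (beat_position, tempo, strength) descriptors
    beats: List[Tuple[int, int, int]] = field(default_factory=list)
    # P^N: (pitch, duration, velocity) note descriptors
    notes: List[Tuple[int, int, int]] = field(default_factory=list)


# A prompt that only fixes emotion and genre leaves rhythm and notes empty.
prompt = Projection(emotion=one_hot(2, NUM_EMOTIONS), genre=one_hot(1, NUM_GENRES))
```

Keeping every component optional reflects that a given modality populates only part of the projection space.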

The modality-specific mapping functions $f_{XP}$ employ deep models as follows:

Modality	Parsed Components	Mechanism
Image	$\mathbb{P}^E$	Late-fused ResNet and CLIP features
Text	$\mathbb{P}^E$	Embedding similarity
Tag	$\mathbb{P}^E$, $\mathbb{P}^G$	Direct one-hot lookup
Video	$\mathbb{P}^E$, $\mathbb{P}^R$	Scene transitions $\rightarrow$ tempo/beat; image pipeline per frame
Humming	$\mathbb{P}^R$, $\mathbb{P}^N$	VOCANO model with quantization/standardization

Images and videos leverage late-fused CNN (ResNet) and cross-modal (CLIP) feature extraction, while text relies on embedding similarity. Video rhythm is derived from scene transitions and optical flow, and humming is converted to quantized symbolic events using VOCANO.
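
The two simplest mappings, tag lookup and text matching by embedding similarity, can be sketched as follows. The label lists are the XMIDI emotion and genre names; their ordering inside the one-hot vectors is an assumption. The text says only "embedding similarity", so the use of cosine similarity with a nearest-label choice, the function names, and the three-dimensional toy embeddings are all illustrative assumptions rather than the XMusic implementation.

```python
import math
from typing import Dict, List, Sequence

EMOTIONS = ["exciting", "warm", "happy", "romantic", "funny", "sad",
            "angry", "lazy", "quiet", "fear", "magnificent"]
GENRES = ["rock", "pop", "country", "jazz", "classical", "folk"]


def tag_to_one_hot(tag: str, vocabulary: Sequence[str]) -> List[int]:
    """Direct lookup: position of the tag in the vocabulary becomes the hot index."""
    idx = list(vocabulary).index(tag)  # raises ValueError for unknown tags
    return [1 if i == idx else 0 for i in range(len(vocabulary))]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(prompt_vec: Sequence[float],
                  label_vecs: Dict[str, Sequence[float]]) -> str:
    """Pick the label whose embedding is most similar to the prompt embedding."""
    return max(label_vecs, key=lambda name: cosine(prompt_vec, label_vecs[name]))


# Toy embeddings standing in for the output of a real text encoder (hypothetical).
toy_label_vecs = {"happy": [0.9, 0.1, 0.0], "sad": [0.0, 0.2, 0.9], "quiet": [0.1, 0.9, 0.2]}
toy_prompt_vec = [0.1, 0.3, 0.8]

matched = nearest_label(toy_prompt_vec, toy_label_vecs)
emotion_vec = tag_to_one_hot(matched, EMOTIONS)
genre_vec = tag_to_one_hot("jazz", GENRES)
```

In this toy setup the prompt vector lies closest to the "sad" embedding, so the text prompt ends up as the same kind of one-hot emotion vector that a tag prompt produces directly.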

3. Symbolic Music Representation

The event-based symbolic representation extends the Compound-Word tokenization paradigm with the following 12 fields per event:

1. family-type
2. emotion (11 + “ignore”)
3. genre (6 + “ignore”)
4. bar_position
5. beat_position
6. tempo
7. chord
8. density (33 + 1)
9. strength (37 + 1)
10. program (17 + 1 instruments)
11. pitch (128 melodic / 128 pseudo-pitch drums)
12. duration (32 + 1), velocity (44 + 1)

Each attribute is one-hot encoded and projected by its own linear layer (the embedding dimension varies by field). The projected embeddings are then concatenated, augmented with positional encoding, and linearly mapped into the Transformer’s input space.
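
The per-field projection and concatenation step can be sketched in plain Python. The vocabulary sizes below are taken from the field list; the embedding widths in `FIELD_DIM`, the random weight tables, and all function names are hypothetical. Positional encoding and the final linear mapping are omitted to keep the sketch short.

```python
import random
from typing import Dict, List

# Vocabulary sizes copied from the field list; only four fields are shown.
FIELD_VOCAB: Dict[str, int] = {
    "emotion": 11 + 1,   # 11 classes + "ignore"
    "genre": 6 + 1,      # 6 classes + "ignore"
    "pitch": 128 + 128,  # melodic pitches + pseudo-pitch drums
    "velocity": 44 + 1,
}
# Hypothetical embedding widths per field (not taken from the source).
FIELD_DIM: Dict[str, int] = {"emotion": 4, "genre": 4, "pitch": 8, "velocity": 4}


def make_table(vocab_size: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Randomly initialised weight table with vocab_size rows and dim columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(vec: List[float], table: List[List[float]]) -> List[float]:
    """Vector-matrix product; for a one-hot input this selects one table row."""
    return [sum(v * row[d] for v, row in zip(vec, table)) for d in range(len(table[0]))]


_rng = random.Random(0)
TABLES = {}
for _name, _size in FIELD_VOCAB.items():
    TABLES[_name] = make_table(_size, FIELD_DIM[_name], _rng)


def embed_event(event: Dict[str, int]) -> List[float]:
    """Project every field and concatenate the results in a fixed field order."""
    out: List[float] = []
    for name, size in FIELD_VOCAB.items():
        out.extend(project(one_hot(event[name], size), TABLES[name]))
    return out


event = {"emotion": 2, "genre": 1, "pitch": 60, "velocity": 20}
vector = embed_event(event)  # length 4 + 4 + 8 + 4 = 20
```

Because the input is one-hot, each projection reduces to a row lookup, which is why such layers are usually implemented as embedding tables.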

4. XComposer: Generator and Selector Modules

4.1 Generator

Architecture: A large Transformer decoder (hyperparameters such as the number of layers and attention heads are given in Tian et al., 15 Jan 2025).
Autoregressive Decoding: At each timestep, the history of previously generated events is embedded and fed to the decoder. Next-event prediction then proceeds sequentially across token fields via softmax, starting with family-type.
Sampling: Temperature-controlled top-k/nucleus sampling.
Loss: Total cross-entropy, summed across all token fields.
Explicit Control: Emotion and genre “Tag” tokens are inserted at bar-level resolution, removing the need for additional control loss terms.
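
Temperature-controlled top-k/nucleus sampling for a single token field can be sketched as below. This is a generic illustration of the technique, not the XMusic code: the function name `sample_token`, its parameters, and the choice to apply top-k before the nucleus cut are assumptions, and the values XMusic uses for temperature, k, and p are not stated here.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_token(logits: Sequence[float],
                 temperature: float = 1.0,
                 top_k: Optional[int] = None,
                 top_p: Optional[float] = None,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one index from `logits` after temperature, top-k and nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank indices from most to least probable.
    order: List[int] = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable indices.
    if top_k is not None:
        order = order[:max(1, top_k)]

    # Nucleus (top-p): keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalises the remaining weights internally.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


token = sample_token([0.0, 5.0, 1.0], temperature=0.8, top_k=2, rng=random.Random(0))
```

In a field-by-field decoder, a call like this would be made once per token field, starting with family-type, with each sampled value conditioning the fields that follow.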

4.2 Selector

Purpose: Post-hoc filtering of multiple candidate generations via a multi-task classifier.
Architecture: Transformer encoder (hyperparameters are given in Tian et al., 15 Jan 2025).
Outputs: Probabilities for quality (good/bad), emotion (11 classes), and genre (6 classes), computed from a global time-averaged feature.
Loss: Joint cross-entropy over the quality, emotion, and genre heads.
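
For orientation, the two training objectives described in 4.1 and 4.2 can be written in a generic form. The notation is an assumption made for this sketch: $t$ indexes events, $k$ indexes token fields, $\hat{y}$ denotes predicted distributions, $y$ the ground-truth tokens, and the weights $\lambda_q, \lambda_e, \lambda_g$ are hypothetical (setting them all to 1 gives a plain sum).

```latex
% Generator: cross-entropy summed over events t and token fields k (assumed notation).
\mathcal{L}_{\mathrm{gen}} = \sum_{t} \sum_{k} \mathrm{CE}\!\left(\hat{y}_{t}^{(k)},\, y_{t}^{(k)}\right)

% Selector: one cross-entropy term per head; the weights \lambda are hypothetical.
\mathcal{L}_{\mathrm{sel}} = \lambda_{q}\,\mathrm{CE}_{\mathrm{quality}}
                           + \lambda_{e}\,\mathrm{CE}_{\mathrm{emotion}}
                           + \lambda_{g}\,\mathrm{CE}_{\mathrm{genre}}
```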

At inference, the candidate with the highest Selector score above a fixed threshold is selected.
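
The selection rule can be sketched as a small function. The name `select_candidate` and the behavior when no candidate passes the threshold (returning `None`, so the caller can generate more candidates) are assumptions; the text specifies only that the top-scoring candidate above a fixed threshold is chosen.

```python
from typing import Optional, Sequence


def select_candidate(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the best-scoring candidate if it exceeds `threshold`.

    Returns None when the list is empty or no score is above the threshold
    (an assumed fallback; the source does not describe this case).
    """
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] > threshold else None


# Example: three candidates scored by a (hypothetical) Selector.
chosen = select_candidate([0.2, 0.9, 0.7], threshold=0.5)
rejected = select_candidate([0.2, 0.3], threshold=0.5)
```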

5. XMIDI: Dataset Construction and Characteristics

XMIDI underpins XMusic with a scale and label precision enabling multi-faceted controllability:

Size: 108,023 MIDI files, approximately 5,278 hours in total, average duration 176 s.
Labels: 11 emotions (exciting, warm, happy, romantic, funny, sad, angry, lazy, quiet, fear, magnificent), 6 genres (rock, pop, country, jazz, classical, folk), 17 instrument groups.
Acquisition: Sourced from Internet Archive, GitHub, Reddit.
Processing: Multi-phase cleaning (corrupt/empty removal, MD5/chroma deduplication), expert-driven manual filtering.
Annotation: Seven-step expert annotation, including standardized definitions/demonstrations, about 3 annotators per file, random 500-file QC (about 95% accuracy), and ongoing calibration.
Comparative Scale: XMIDI is approximately 10× larger than existing emotion-labeled symbolic music corpora (e.g., ELMG, 11,528 files).
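
The size figures can be cross-checked with simple arithmetic. The sketch below multiplies the file count by the average duration to estimate total playing time, which can be compared against the total listed under Size, and divides the file counts to get the scale ratio. The variable names are illustrative. Because the average duration is a rounded value, an exact match with the listed total is not expected.

```python
NUM_FILES = 108_023     # XMIDI file count
AVG_DURATION_S = 176    # average duration per file, in seconds (rounded)
ELMG_FILES = 11_528     # file count cited for the comparison corpus

# Total playing time implied by count x average duration, in hours.
total_hours = NUM_FILES * AVG_DURATION_S / 3600

# How many times larger XMIDI is, by file count.
scale_ratio = NUM_FILES / ELMG_FILES

print(f"estimated total: {total_hours:.0f} hours")
print(f"scale ratio: {scale_ratio:.2f}x")
```

The estimate comes out near 5,281 hours and the ratio near 9.4, consistent with "approximately 10×".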

6. Evaluation Protocols and Performance

XMusic’s evaluation encompasses both objective musical metrics and comprehensive human listening studies:

Method	PCE (↓)	GS (↑)	EBR (↓)
CP (uncond.)	2.6025	0.9990	0.0273
EMOPIA	2.6756	0.9989	0.1197
XMusic (uncond.)	2.5174	0.9992	0.0045

For video-conditioned generation:

Method	PCE (↓)	GS (↑)	EBR (↓)
CMT	2.7290	0.6698	0.0321
XMusic (video-cond.)	2.6161	0.9983	0.0078

Objective Metrics (MusPy): Lower PCE indicates stronger tonality, higher GS indicates more stable rhythm, and lower EBR indicates fewer silent passages.
Subjective Evaluation: 31 listeners, blind A/B/C ranking for richness, correctness, structuredness, emotion-match, rhythm-match. XMusic ranks above the compared methods across prompt types (uncond., text, image, video).
Emotion Control: In a quadrant-based evaluation split by valence, XMusic matches the target emotion in 76% of positive-valence cases versus 38% for EMOPIA, and in 70% versus 38% of negative-valence cases.
Ablation Findings: Selector inclusion improves GS and reduces PCE/EBR. Multi-task classification (quality + emotion + genre) yields a quality assessment accuracy of 94.8% versus 83.2% without auxiliary heads. Bar-level Tag control produces superior results (overall rank 1.62 vs. 2.11 for music-level and 2.27 for no control).
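
To make two of the objective metrics concrete, the sketch below computes a pitch-class histogram entropy and an empty beat rate on toy inputs. These are simplified, assumed definitions: entropy in bits over the 12 pitch classes of a flat note list, and the fraction of beats that contain no notes. Published evaluations may compute the entropy over bar windows and average the results, so the numbers are not directly comparable with the tables above.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def pitch_class_entropy(pitches: Iterable[int]) -> float:
    """Shannon entropy (bits) of the 12-bin pitch-class histogram of MIDI pitches."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def empty_beat_rate(notes_per_beat: Sequence[int]) -> float:
    """Fraction of beats that contain no notes."""
    if not notes_per_beat:
        return 0.0
    return sum(1 for n in notes_per_beat if n == 0) / len(notes_per_beat)


# One pitch class only (C in three octaves): entropy is zero.
single_class = pitch_class_entropy([48, 60, 72])
# All twelve pitch classes equally often: entropy reaches its maximum, log2(12).
chromatic = pitch_class_entropy(range(60, 72))
# Two of four beats are silent.
ebr = empty_beat_rate([2, 0, 1, 0])
```

The two extremes bracket the metric: a piece confined to one pitch class scores 0, while a uniform chromatic distribution scores log2(12), about 3.58 bits. Values between those bounds are read as more or less tonally focused.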

A plausible implication is that granular, token-level parameterization and expert-defined annotation at scale are critical for both high-quality generative music modeling and controllable expression.

7. Significance and Context Within Symbolic Music Generation

XMusic distinguishes itself by unifying multi-modal prompt conditioning, fine-grained symbolic representation, and explicit control over affect and genre, addressing previous deficits in controllability and fidelity of AI-generated music. The system provides architectural extensibility through its modular design, scaling effectively via XMIDI and establishing new quality baselines across both automatic and human benchmarks. As the first framework to integrate prompt-based control from five disparate modalities and achieve systematic, bar-level token control, XMusic represents a significant step toward generalized, high-fidelity symbolic music generation (Tian et al., 15 Jan 2025).

References

Tian et al. (15 Jan 2025). XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework.
