TokenSynth: Neural Audio Tokenization

Updated 9 April 2026

TokenSynth is a token-based neural synthesizer enabling instrument cloning, text-to-instrument synthesis, and timbre manipulation via autoregressive Transformer modeling and cross-modal conditioning.
It tokenizes audio using a VQ-VAE codec and MIDI signals with multiple token representations, integrating CLAP embeddings to provide unified timbre control.
Experimental results demonstrate robust zero-shot performance on polyphonic synthesis with competitive metrics in multi-scale spectral loss, CLAP score, and MIDI transcription F-score.

TokenSynth refers to a token-based neural synthesizer architecture designed for instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation. Distinguished by its integration of neural audio tokenization, MIDI symbolic conditioning, and cross-modal timbre embeddings (via CLAP), TokenSynth is realized as a decoder-only Transformer autoregressively modeling discrete audio tokens. The system achieves polyphonic, zero-shot instrument cloning and text-driven timbre generation without fine-tuning and provides a unified solution for diverse audio synthesis and sound design tasks (Kim et al., 13 Feb 2025).

1. Audio and Conditioning Tokenization Pipeline

TokenSynth operates on a pipeline that tokenizes both audio and MIDI inputs and maps timbre-related information into a joint embedding space:

Neural Audio Codec: Utilizes the Descript Audio Codec (DAC), a VQ-VAE with residual vector quantization, to compress continuous raw audio $x\in[-1,1]^T$ into a grid of discrete tokens $a\in\{1,\dots,K_a\}^{N\times D}$. At each frame $n$ and codebook depth $d$, the encoder’s latent $z_{n,d}$ is quantized to its nearest codeword, $a_{n,d} = \arg\min_{j\in\{1,\dots,K_a\}} \|z_{n,d} - c_{d,j}\|^2$, where $c_{d,j}$ is the $j$-th codeword of the depth-$d$ codebook. This results in $D$ token streams capturing coarse-to-fine spectral structure.
MIDI Tokenization: Following the MT3 scheme, every MIDI note is represented as four tokens: absolute onset (500 bins), absolute offset (500 bins), pitch (128 bins), and velocity (4 bins). A polyphonic MIDI segment of $n$ notes therefore yields $M=4n$ tokens $m\in\{1,\dots,K_m\}^M$.
CLAP Timbre Embedding: A pretrained CLAP encoder $f_{\text{CLAP}}$ maps either reference audio or a text prompt into a timbre embedding $e$, which is projected to the model dimension by an MLP: $\tilde{e} = \mathrm{MLP}(e)$.
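The nearest-codeword rule used by the codec can be illustrated with a toy residual vector quantizer. This is a minimal sketch under stated assumptions, not DAC's actual implementation or API: the function names `nearest` and `rvq_encode`, the two-dimensional latent, and the hand-made codebooks are all hypothetical, and indices are 0-based rather than the 1-based convention of the text.

```python
def nearest(vec, codebook):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sqdist(codeword):
        return sum((v - c) ** 2 for v, c in zip(vec, codeword))
    return min(range(len(codebook)), key=lambda j: sqdist(codebook[j]))


def rvq_encode(z, codebooks):
    """Quantize one latent frame with stacked codebooks (residual VQ).

    Each depth quantizes what the previous depths left unexplained,
    so earlier tokens are coarse and later tokens are fine.
    """
    residual = list(z)
    tokens = []
    for codebook in codebooks:
        j = nearest(residual, codebook)
        tokens.append(j)
        residual = [r - c for r, c in zip(residual, codebook[j])]
    return tokens, residual


# Hypothetical toy codebooks (illustration only).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # depth 1: coarse
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # depth 2: fine
]
tokens, residual = rvq_encode([1.3, 0.9], codebooks)
```

Applying `rvq_encode` to every frame of the encoder output would give the $N\times D$ token grid described above, one row per frame and one column per codebook depth.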

This tokenization enables seamless conditioning on both explicit symbolic performance (via MIDI) and implicit or cross-modal timbral descriptors (via audio or text).
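The four-token note representation can be sketched as follows. The bin counts come from the description above; everything else is an assumption made for illustration: the contiguous vocabulary layout, the 5-second segment length used to map times to bins, the uniform velocity binning, and the names `time_to_bin` and `tokenize_notes`. Token ids here are 0-based.

```python
ONSET_BINS, OFFSET_BINS, PITCH_BINS, VELOCITY_BINS = 500, 500, 128, 4

# Assumed vocabulary layout: [onset | offset | pitch | velocity], contiguous ranges.
ONSET_BASE = 0
OFFSET_BASE = ONSET_BASE + ONSET_BINS
PITCH_BASE = OFFSET_BASE + OFFSET_BINS
VELOCITY_BASE = PITCH_BASE + PITCH_BINS
VOCAB_SIZE = VELOCITY_BASE + VELOCITY_BINS


def time_to_bin(t, segment_seconds, bins):
    """Map an absolute time within the segment to a bin index, clamped to range."""
    idx = int(t * bins / segment_seconds)
    return max(0, min(bins - 1, idx))


def tokenize_notes(notes, segment_seconds=5.0):
    """Turn (onset_s, offset_s, pitch, velocity) notes into four tokens each."""
    tokens = []
    for onset, offset, pitch, velocity in sorted(notes):
        tokens += [
            ONSET_BASE + time_to_bin(onset, segment_seconds, ONSET_BINS),
            OFFSET_BASE + time_to_bin(offset, segment_seconds, OFFSET_BINS),
            PITCH_BASE + pitch,
            # Assumed uniform binning of MIDI velocity 0..127 into 4 levels.
            VELOCITY_BASE + min(VELOCITY_BINS - 1, velocity * VELOCITY_BINS // 128),
        ]
    return tokens


# A small polyphonic example: two simultaneous notes, then a third.
chord = [(0.0, 1.0, 60, 100), (0.0, 1.0, 64, 100), (0.5, 2.0, 67, 40)]
midi_tokens = tokenize_notes(chord)
```

Three notes produce twelve tokens, matching $M=4n$. Because onset and offset are absolute rather than relative, overlapping notes need no special handling.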

2. Model Architecture and Input Arrangement

TokenSynth adopts a standard decoder-only Transformer setup with the following characteristics:

Model Hyperparameters:
- Number of Transformer layers $L$
- Model dimension $d_{\text{model}}$
- Number of attention heads $H$
- Feed-forward dimension $d_{\text{ff}}$
- Dropout rate $p_{\text{drop}}$
- Total parameter count (in millions)
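Input Representation: At each inference step, the Transformer receives a concatenated sequence $[\tilde{e};\ m_{1:M};\ a_{<(n,d)}]$, where $\tilde{e}$ is the projected timbre embedding, the MIDI tokens $m_{1:M}$ are embedded via a MIDI token table $E_m$, and the previously generated audio tokens $a_{<(n,d)}$ have their own depth-specific embedding tables $E_a^{(d)}$. Learned positional encodings are added throughout. Delay patterns, as in MusicGen, are used to interleave codebook depths among the audio tokens.
Autoregressive Objective: The model predicts the next audio token $a_{n,d}$ via a softmax over the output of the Transformer stack conditioned on prior tokens, MIDI, and timbre embedding:

$p_\theta(a_{n,d} \mid a_{<(n,d)}, m, \tilde{e}) = \mathrm{softmax}(\ell_{n,d})_{a_{n,d}}$, where $\ell_{n,d}\in\mathbb{R}^{K_a}$ denotes the output logits at position $(n,d)$ and $a_{<(n,d)}$ the audio tokens preceding it in generation order.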

Causal masking guarantees strict autoregressive generation over the audio token stack, while symbolic and timbral tokens are available as conditioning context from the start.
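The delay pattern mentioned above can be sketched as a simple shift of each codebook depth. This follows the MusicGen-style idea in which depth $d$ is delayed by $d$ steps so that all depths can be predicted in one pass per step. How TokenSynth lays out or pads its sequence is not specified here, so the `PAD` id and the function names are hypothetical.

```python
PAD = -1  # hypothetical placeholder id for positions that hold no token


def apply_delay_pattern(tokens):
    """Shift depth d right by d steps: tokens[n][d] -> steps[n + d][d]."""
    n_frames, depth = len(tokens), len(tokens[0])
    steps = [[PAD] * depth for _ in range(n_frames + depth - 1)]
    for n in range(n_frames):
        for d in range(depth):
            steps[n + d][d] = tokens[n][d]
    return steps


def undo_delay_pattern(steps, depth):
    """Invert apply_delay_pattern to recover the frame-aligned token grid."""
    n_frames = len(steps) - depth + 1
    return [[steps[n + d][d] for d in range(depth)] for n in range(n_frames)]


# Toy grid: 3 frames, 3 codebook depths (values encode frame and depth).
frames = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
delayed = apply_delay_pattern(frames)
# delayed[1] holds frame 2's coarse token next to frame 1's second-depth token.
restored = undo_delay_pattern(delayed, depth=3)
```

With this layout, the coarse token of a frame is always generated before its finer tokens, at a cost of only $D-1$ extra steps compared with fully flattening the grid.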

3. Training and Inference Procedures

Training Objective: The principal loss is summed cross-entropy over all frame and codebook positions:

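$\mathcal{L} = -\sum_{n=1}^{N}\sum_{d=1}^{D} \log p_\theta(a_{n,d} \mid a_{<(n,d)}, m, \tilde{e})$

To avoid performance leakage, the CLAP embedding $e$ (projected to $\tilde{e}$ before entering the model) is extracted from reference audio/text that shares the instrument identity with the target waveform but contains an independent performance.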
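The summed cross-entropy over frame and codebook positions can be written out directly. A minimal sketch with toy logits: the function names and the nested-list layout `logits[n][d][k]` are assumptions for illustration, and token ids are 0-based.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of logits."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def summed_cross_entropy(logits, targets):
    """Sum of -log p(target) over all frames n and codebook depths d.

    logits[n][d] is a list of K_a scores; targets[n][d] is the true token id.
    """
    total = 0.0
    for frame_logits, frame_targets in zip(logits, targets):
        for scores, target in zip(frame_logits, frame_targets):
            total -= log_softmax(scores)[target]
    return total


# Toy example: N = 2 frames, D = 2 depths, K_a = 3 codewords.
logits = [
    [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 3.0]],
]
targets = [[0, 1], [1, 2]]
loss = summed_cross_entropy(logits, targets)
```

With uniform logits every position contributes $\log K_a$, which gives a quick sanity check for an untrained model.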

Inference Modes:
- Instrument Cloning: $e$ is extracted from a reference audio clip, $e = f_{\text{CLAP}}(x_{\text{ref}})$. The input MIDI is tokenized. Audio tokens are sampled autoregressively (using nucleus/top-$p$ sampling) and decoded to waveform via the DAC decoder.
- Text-to-Instrument: $e$ is computed from a text prompt via CLAP; otherwise identical to cloning.
- Timbre Manipulation: Leveraging CLAP’s shared embedding space, timbre can be smoothly interpolated:
$e_{\text{mix}} = (1-\alpha)\, e_{\text{audio}} + \alpha\, e_{\text{text}}$

This allows the output timbre to shift continuously between instrument reference and target text descriptions, with $\alpha\in[0,1]$.
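A minimal sketch of such an interpolation, assuming a plain linear blend controlled by a weight `alpha` between 0 and 1. The four-dimensional vectors stand in for CLAP embeddings and are made up, and `interpolate_timbre` is a hypothetical helper, not part of any CLAP library.

```python
def interpolate_timbre(e_audio, e_text, alpha):
    """Linear blend: alpha = 0 gives the audio timbre, alpha = 1 the text timbre."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * t for a, t in zip(e_audio, e_text)]


# Made-up stand-ins for an audio-derived and a text-derived embedding.
e_audio = [1.0, 0.0, 0.0, 0.0]
e_text = [0.0, 1.0, 0.0, 0.0]

# Sweep from pure audio timbre to pure text timbre in five steps.
sweep = [interpolate_timbre(e_audio, e_text, k / 4) for k in range(5)]
```

Each vector in `sweep` would then be projected and used as the timbre condition for one generation, giving a gradual morph between the two descriptions.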
Guidance Techniques: Classifier-free guidance and a first-note guidance heuristic interpolate logits at the onset of the first note, with a guidance weight $w$, to improve timbre adherence.
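The sampling side of inference can be sketched by combining a common formulation of classifier-free guidance with nucleus (top-$p$) sampling. This is an illustration under assumptions: the guidance formula below is the widely used `uncond + w * (cond - uncond)` form rather than a confirmed detail of TokenSynth, the first-note heuristic is not modeled, and all names and numbers are hypothetical.

```python
import math
import random


def guided_logits(cond, uncond, w):
    """Classifier-free guidance: move from unconditional toward conditional logits."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_p_sample(logits, p, rng):
    """Sample from the smallest set of tokens whose total probability reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # random.choices renormalizes the kept probabilities implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits over a 4-token vocabulary, with and without the timbre condition.
cond = [2.0, 1.0, 0.0, -1.0]
uncond = [1.0, 1.0, 1.0, 1.0]
logits = guided_logits(cond, uncond, w=2.0)
token = top_p_sample(logits, p=0.9, rng=random.Random(0))
```

Setting `w = 1.0` recovers the conditional logits unchanged, while larger weights sharpen the distribution toward tokens favored by the condition.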

4. Evaluation Metrics and Experimental Results

TokenSynth is evaluated with three primary metrics:

Metric	Definition	Direction
Multi-Scale Spectral Loss (MSS)	Distance between mel-spectrograms computed at multiple FFT sizes	Lower is better
CLAP Score	Cosine similarity in CLAP embedding space	Higher is better
F-score	Harmonic mean of precision and recall on note onset/offset MIDI matches	Higher is better
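Two of these metrics are simple enough to sketch directly. The cosine similarity below is the standard definition. The note-level F-score is a simplified illustration that matches notes greedily on pitch and onset only, with an assumed 50 ms onset tolerance; it omits the offset criterion and is not the evaluation code used for the reported numbers.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def note_f_score(reference, estimated, onset_tol=0.05):
    """F-score over (onset_seconds, pitch) notes with greedy one-to-one matching."""
    unmatched = list(reference)
    hits = 0
    for onset, pitch in estimated:
        for ref in unmatched:
            if ref[1] == pitch and abs(ref[0] - onset) <= onset_tol:
                unmatched.remove(ref)  # each reference note matches at most once
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(estimated)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Input MIDI notes versus notes transcribed from the synthesized audio (toy data).
reference = [(0.0, 60), (0.5, 64), (1.0, 67)]
estimated = [(0.01, 60), (0.52, 64), (1.4, 67), (2.0, 72)]
f = note_f_score(reference, estimated)
```

In this toy case two of four estimated notes match two of three reference notes, so precision is 1/2, recall is 2/3, and the F-score is 4/7.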

Instrument Cloning: “Dry” training with the true reference yields MSS=0.569, CLAP=0.860, and F=0.643; “Augmented” training raises F to 0.837, while CLAP drops slightly to 0.845. “Wet” augmentation (audio with effects) lowers CLAP scores, which is consistent with CLAP embeddings lacking effect detail.
Text-to-Instrument: TokenSynth reaches CLAP=0.179, lower than in cloning, reflecting the inherent gap between text and audio embeddings. Augmentation increases F (adherence to the input MIDI) but not CLAP, confirming the challenge of cross-modal matching.

A key finding is robust zero-shot capability: the system clones unseen instruments and performs polyphonic synthesis from MIDI without fine-tuning. MSS and transcription F-score reflect accurate timbre and performance, while CLAP score quantifies embedding similarity but underestimates effect (wet/dry) nuances.

5. Strengths, Limitations, and Future Directions

Strengths:

Zero-shot cloning of unseen instruments and direct polyphonic synthesis.
Unification of audio-based cloning, text-guided synthesis, and timbre interpolation under a single Transformer with shared cross-modal conditioning.
Smooth interpolation between audio and text timbre prompts; supports sound design flexibility.

Limitations:

Non-real-time generation: requires the complete MIDI context in advance.
Autoregressive sampling incurs temporal drift from strict MIDI timing.
Velocity quantization limited to 4 discrete levels, dictated by dataset constraints.
CLAP embedding omits detailed audio-effect cues, reducing realism for “wet” sources.

Potential Advances:

Streaming/real-time generation architectures.
Enhanced velocity encoding/granularity.
Richer cross-modal embeddings capturing fine-grained effects or extended timbral nuances.

6. Relation to Broader Token-Based Synthesis Paradigms

TokenSynth exemplifies a broader paradigm shift toward discrete neural acoustic/symbolic token modeling for music and audio generation. This approach is paralleled in non-autoregressive models such as VampNet’s masked token modeling for music (Garcia et al., 2023) and image-style token modulation frameworks in visual synthesis (Zeng et al., 2021). A distinguishing feature of TokenSynth is its explicit cross-modal timbre conditioning (audio/text via CLAP) and capacity for smooth, zero-shot interpolation between modalities, positioning it at the intersection of symbolic control, cross-modal retrieval, and downstream generative synthesis.