TokenSynth: Neural Audio Tokenization
- TokenSynth is a token-based neural synthesizer enabling instrument cloning, text-to-instrument synthesis, and timbre manipulation via autoregressive Transformer modeling and cross-modal conditioning.
- It tokenizes audio using a VQ-VAE codec and MIDI signals with multiple token representations, integrating CLAP embeddings to provide unified timbre control.
- Experimental results demonstrate robust zero-shot performance on polyphonic synthesis with competitive metrics in multi-scale spectral loss, CLAP score, and MIDI transcription F-score.
TokenSynth refers to a token-based neural synthesizer architecture designed for instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation. Distinguished by its integration of neural audio tokenization, MIDI symbolic conditioning, and cross-modal timbre embeddings (via CLAP), TokenSynth is realized as a decoder-only Transformer autoregressively modeling discrete audio tokens. The system achieves polyphonic, zero-shot instrument cloning and text-driven timbre generation without fine-tuning and provides a unified solution for diverse audio synthesis and sound design tasks (Kim et al., 13 Feb 2025).
1. Audio and Conditioning Tokenization Pipeline
TokenSynth operates on a pipeline that tokenizes both audio and MIDI inputs and maps timbre-related information into a joint embedding space:
- Neural Audio Codec: Utilizes the Descript Audio Codec (DAC), a VQ-VAE with residual vector quantization, to compress continuous raw audio into a grid of discrete tokens $a_{t,k}$. At each frame ($t = 1, \dots, T$) and codebook depth ($k = 1, \dots, K$), the encoder’s latent is quantized to the index of the nearest codebook entry, with each depth quantizing the residual left by the previous depths, resulting in $K$ token streams capturing coarse-to-fine spectral structure.
- MIDI Tokenization: Following the MT3 scheme, every MIDI note is represented as four tokens: absolute onset (500 bins), absolute offset (500 bins), pitch (128 bins), and velocity (4 bins), with a polyphonic MIDI segment of $N$ notes yielding $4N$ tokens $m_1, \dots, m_{4N}$.
- CLAP Timbre Embedding: A pretrained CLAP encoder $f_{\text{CLAP}}$ maps either reference audio or a text prompt into a timbre embedding $e$, which is projected to the model dimension by an MLP: $c = \mathrm{MLP}(e)$.
This tokenization enables seamless conditioning on both explicit symbolic performance (via MIDI) and implicit or cross-modal timbral descriptors (via audio or text).
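The four-tokens-per-note scheme can be sketched as follows. This is a minimal illustration, not the authors' tokenizer: only the bin counts (500, 500, 128, 4) come from the text, while the 5-second segment length, the uniform time binning, the equal-width velocity bins, and the vocabulary layout (`ONSET_BASE` and friends) are assumptions made for the example.

```python
from dataclasses import dataclass

# Bin counts taken from the text; everything else below is an assumption.
N_ONSET, N_OFFSET, N_PITCH, N_VELOCITY = 500, 500, 128, 4

# Hypothetical vocabulary layout: each field gets its own contiguous id range.
ONSET_BASE = 0
OFFSET_BASE = ONSET_BASE + N_ONSET
PITCH_BASE = OFFSET_BASE + N_OFFSET
VELOCITY_BASE = PITCH_BASE + N_PITCH
VOCAB_SIZE = VELOCITY_BASE + N_VELOCITY


@dataclass
class Note:
    onset: float    # seconds from segment start
    offset: float   # seconds from segment start
    pitch: int      # MIDI pitch, 0..127
    velocity: int   # MIDI velocity, 0..127


def time_to_bin(t, segment_seconds, n_bins):
    """Map a time in [0, segment_seconds] to an integer bin in [0, n_bins - 1]."""
    frac = min(max(t / segment_seconds, 0.0), 1.0)
    return min(int(frac * n_bins), n_bins - 1)


def tokenize_notes(notes, segment_seconds=5.0):
    """Emit (onset, offset, pitch, velocity) tokens for each note, sorted by onset."""
    tokens = []
    for n in sorted(notes, key=lambda n: (n.onset, n.pitch)):
        tokens += [
            ONSET_BASE + time_to_bin(n.onset, segment_seconds, N_ONSET),
            OFFSET_BASE + time_to_bin(n.offset, segment_seconds, N_OFFSET),
            PITCH_BASE + n.pitch,
            VELOCITY_BASE + min(n.velocity * N_VELOCITY // 128, N_VELOCITY - 1),
        ]
    return tokens


# A three-note polyphonic segment yields 4 * 3 = 12 tokens.
chord = [Note(0.0, 1.0, 60, 100), Note(0.0, 1.0, 64, 100), Note(0.5, 2.0, 67, 40)]
midi_tokens = tokenize_notes(chord)
```

Because onsets and offsets are absolute within the segment, overlapping notes need no special handling: polyphony is simply a longer token list.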
2. Model Architecture and Input Arrangement
TokenSynth adopts a standard decoder-only Transformer setup with the following characteristics:
- Model Hyperparameters:
  - […] Transformer layers
  - […]
  - […] attention heads
  - […]
  - Dropout […]
  - Total parameters: […]M
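- Input Representation: At each inference step, the Transformer receives a concatenated sequence $[c;\ m_1, \dots, m_{4N};\ a_{<t}]$, where $c$ is the projected timbre embedding, the MIDI tokens $m_i$ are embedded via a MIDI token table $E_{\text{MIDI}}$, and the previously generated audio tokens $a_{<t}$ have their own depth-specific embedding tables $E_k$. Learned positional encodings are added throughout. A delay pattern, as in MusicGen, is used to interleave codebook depths among the audio tokens.
- Autoregressive Objective: The model predicts the next audio token $a_{t,k}$ (frame $t$, codebook depth $k$) via a softmax over the output of the Transformer stack, conditioned on prior audio tokens $a_{<t}$, the MIDI tokens $m$, and the timbre embedding $c$:
$$p_\theta(a_{t,k} \mid a_{<t}, m, c) = \mathrm{softmax}(h_{t,k})$$
where $h_{t,k}$ denotes the Transformer's output logits for frame $t$ and depth $k$.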
Causal masking guarantees strict autoregressive generation over the audio token stack, while symbolic and timbral tokens are available as conditioning context from the start.
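The delay pattern mentioned above can be illustrated with a short sketch. It assumes the MusicGen-style layout in which codebook depth `k` is shifted right by `k` steps; whether TokenSynth uses exactly one-step offsets is an assumption here, and `PAD` is a hypothetical placeholder id.

```python
import numpy as np

PAD = -1  # hypothetical placeholder id for positions with no token yet


def apply_delay(codes, pad=PAD):
    """Shift codebook k right by k steps: (K, T) -> (K, T + K - 1)."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out


def undo_delay(delayed):
    """Invert apply_delay, recovering the frame-aligned (K, T) grid."""
    K, total = delayed.shape
    T = total - (K - 1)
    return np.stack([delayed[k, k:k + T] for k in range(K)])


codes = np.arange(12).reshape(3, 4)  # K = 3 codebooks, T = 4 frames
delayed = apply_delay(codes)
restored = undo_delay(delayed)
```

With this layout, each decoding step emits one token per depth, and the token for depth `k` at a given frame is produced only after the coarser depths of that frame are already in the context.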
3. Training and Inference Procedures
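- Training Objective: The principal loss is summed cross-entropy over all frame and codebook positions:
$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{K} \log p_\theta(a_{t,k} \mid a_{<t}, m, c)$$
where $a_{t,k}$ is the audio token at frame $t$ and codebook depth $k$, $m$ denotes the MIDI tokens, and $c$ is the projected timbre embedding.
To avoid performance leakage, the CLAP embedding $e$ is extracted from reference audio or text that shares the target's instrument identity but contains a performance independent of the target waveform.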
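A summed cross-entropy over frame and codebook positions can be sketched in NumPy as below. The shapes (`T` frames, `K` codebooks, vocabulary size `V`) and the random inputs are illustrative assumptions; a real training loop would use an autodiff framework and the model's own logits.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def summed_cross_entropy(logits, targets):
    """logits: (T, K, V) floats; targets: (T, K) integer ids. Returns sum of -log p."""
    logp = log_softmax(logits)
    picked = np.take_along_axis(logp, targets[..., None], axis=-1)[..., 0]
    return -picked.sum()


rng = np.random.default_rng(0)
T, K, V = 6, 3, 16  # illustrative sizes only
logits = rng.normal(size=(T, K, V))
targets = rng.integers(0, V, size=(T, K))

loss = summed_cross_entropy(logits, targets)
# Sanity check: uniform predictions cost log(V) per position.
uniform_loss = summed_cross_entropy(np.zeros((T, K, V)), targets)
```

The uniform-logits case gives `T * K * log(V)`, which is a convenient baseline when checking that a freshly initialized model starts near chance.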
- Inference Modes:
  - Instrument Cloning: The timbre embedding $e$ is extracted from a reference audio clip $x_{\text{ref}}$. The input MIDI is tokenized. Audio tokens are sampled autoregressively (using nucleus, i.e. top-$p$, sampling) and decoded to a waveform via the DAC decoder.
  - Text-to-Instrument: $e$ is computed from a text prompt via CLAP; otherwise identical to cloning.
  - Timbre Manipulation: Leveraging CLAP’s shared embedding space, timbre can be smoothly interpolated:
$$e = (1 - \alpha)\, e_{\text{audio}} + \alpha\, e_{\text{text}}$$
    This allows the output timbre to shift continuously between the instrument reference and the target text description, with $\alpha \in [0, 1]$.
- Guidance Techniques: Classifier-free guidance and a first-note guidance heuristic interpolate logits at the onset of the first note, with a guidance weight controlling the strength, to improve timbre adherence.
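The inference-time operations above can be sketched together: blending an audio and a text embedding, combining conditional and unconditional logits, and nucleus sampling. All three functions are generic textbook forms and are assumptions with respect to TokenSynth itself: the text does not say whether blended embeddings are re-normalized, which exact guidance formula is used, or how the first-note heuristic is implemented, so the latter is omitted.

```python
import numpy as np


def interpolate_timbre(e_audio, e_text, alpha):
    """Linear blend: alpha = 0 gives the audio embedding, alpha = 1 the text one."""
    return (1.0 - alpha) * e_audio + alpha * e_text


def cfg_logits(cond, uncond, w):
    """Classifier-free guidance in its common form; w = 1 returns cond unchanged."""
    return uncond + w * (cond - uncond)


def top_p_sample(logits, p, rng):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                 # most likely tokens first
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]
    kept = probs[keep] / probs[keep].sum()     # renormalize over the nucleus
    return int(rng.choice(keep, p=kept))


rng = np.random.default_rng(0)
e_audio, e_text = rng.normal(size=8), rng.normal(size=8)  # stand-in embeddings
e_mix = interpolate_timbre(e_audio, e_text, alpha=0.25)

cond, uncond = rng.normal(size=32), rng.normal(size=32)   # stand-in logits
token = top_p_sample(cfg_logits(cond, uncond, w=2.0), p=0.9, rng=rng)
```

In a full decoder this sampling step would run once per audio token, with the unconditional logits obtained from a forward pass in which the conditioning is dropped.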
4. Evaluation Metrics and Experimental Results
TokenSynth is evaluated with three primary metrics:
| Metric | Definition | Direction |
|---|---|---|
| Multi-Scale Spectral Loss (MSS) | Distance between mel-spectrograms computed at multiple FFT sizes | Lower is better |
| CLAP Score | Cosine similarity in CLAP embedding space | Higher is better |
| F-score | Precision/recall on note onset/offset MIDI matches | Higher is better |
- Instrument Cloning: “Dry” training with the true reference yields MSS = 0.569, CLAP = 0.860, F = 0.643; “Augmented” training raises F to 0.837, but CLAP drops to 0.845. “Wet” augmentation (audio with effects) lowers CLAP scores, since CLAP embeddings lack effect detail.
- Text-to-Instrument: TokenSynth reaches CLAP = 0.179, lower than in cloning, reflecting the inherent text/audio embedding gap. Augmentation increases F (adherence to the input MIDI) but not CLAP, confirming the challenge of cross-modal matching.
A key finding is robust zero-shot capability: the system clones unseen instruments and performs polyphonic synthesis from MIDI without fine-tuning. MSS and transcription F-score reflect accurate timbre and performance, while CLAP score quantifies embedding similarity but underestimates effect (wet/dry) nuances.
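Two of the metrics are simple enough to sketch directly. The cosine similarity below stands in for the CLAP score once embeddings are available; the note matcher is a deliberately simplified greedy version that checks pitch and onset only, with a hypothetical 50 ms tolerance. The evaluation protocol actually used (tolerances, offset matching, transcription model) is not specified in the text, so treat this as an illustration of how precision, recall, and F-score relate.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def match_notes(ref, est, onset_tol=0.05):
    """Greedy one-to-one matching on equal pitch and onset within onset_tol seconds."""
    used, matched = set(), 0
    for r_onset, r_pitch in ref:
        for j, (e_onset, e_pitch) in enumerate(est):
            if j not in used and e_pitch == r_pitch and abs(e_onset - r_onset) <= onset_tol:
                used.add(j)
                matched += 1
                break
    return matched


def f_score(matched, n_ref, n_est):
    """Harmonic mean of precision (matched / n_est) and recall (matched / n_ref)."""
    if matched == 0:
        return 0.0
    precision, recall = matched / n_est, matched / n_ref
    return 2 * precision * recall / (precision + recall)


# (onset seconds, MIDI pitch) pairs: input MIDI vs. notes transcribed from the output.
ref = [(0.00, 60), (0.50, 64), (1.00, 67), (1.50, 72)]
est = [(0.01, 60), (0.52, 64), (1.20, 67)]

m = match_notes(ref, est)
score = f_score(m, len(ref), len(est))
```

Here two of the three transcribed notes match, so precision is 2/3, recall is 2/4, and the F-score is 4/7.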
5. Strengths, Limitations, and Future Directions
Strengths:
- Zero-shot cloning of unseen instruments and direct polyphonic synthesis.
- Unification of audio-based cloning, text-guided synthesis, and timbre interpolation under a single Transformer with shared cross-modal conditioning.
- Smooth interpolation between audio and text timbre prompts; supports sound design flexibility.
Limitations:
- Non-real-time generation: requires the complete MIDI context in advance.
- Autoregressive sampling incurs temporal drift from strict MIDI timing.
- Velocity quantization limited to 4 discrete levels, dictated by dataset constraints.
- CLAP embedding omits detailed audio-effect cues, reducing realism for “wet” sources.
Potential Advances:
- Streaming/real-time generation architectures.
- Enhanced velocity encoding/granularity.
- Richer cross-modal embeddings capturing fine-grained effects or extended timbral nuances.
6. Relation to Broader Token-Based Synthesis Paradigms
TokenSynth exemplifies a broader paradigm shift toward discrete neural acoustic/symbolic token modeling for music and audio generation. This approach is paralleled in non-autoregressive models such as VampNet’s masked token modeling for music (Garcia et al., 2023) and image-style token modulation frameworks in visual synthesis (Zeng et al., 2021). A distinguishing feature of TokenSynth is its explicit cross-modal timbre conditioning (audio/text via CLAP) and capacity for smooth, zero-shot interpolation between modalities, positioning it at the intersection of symbolic control, cross-modal retrieval, and downstream generative synthesis.