
Duo-Tok: Dual-Codebook Music Tokenizer

Updated 2 December 2025
  • Duo-Tok is a source-aware dual-codebook semantic music tokenizer that decouples vocal and accompaniment tracks for scalable lyrics-to-song synthesis.
  • It employs a four-stage SSL-centered pipeline with Gaussian replacement noise and multi-task supervision to produce LM-friendly token sequences.
  • Empirical evaluations demonstrate that Duo-Tok reduces language model perplexity and maintains high reconstruction quality at ultra-low bitrates.

Duo-Tok is a source-aware, dual-codebook semantic music tokenizer designed for vocal-accompaniment generation within modern lyrics-to-song systems. It directly addresses the central problem in music tokenization: the trade-off between high-fidelity waveform reconstruction and language-model (LM) learnability, also known as the reconstruction-vs-LM dilemma. By introducing dual-track modeling explicitly into the tokenization process, Duo-Tok aims to achieve both low reconstruction error and LM-friendly token sequences, supporting high-quality and controllable music generation (Lin et al., 25 Nov 2025).

1. Reconstruction–LM Modeling Dilemma and Duo-Tok’s Rationale

Existing neural codecs for music, such as SoundStream, Encodec, and DAC, typically produce acoustic tokens that enable faithful waveform reconstruction but require large, high-entropy vocabularies, which inflate LM perplexity. Conversely, semantic tokenizers such as SemantiCodec, WavTokenizer, and X-Codec aggressively compress music into small-vocabulary tokens that facilitate LM training, but they lose critical audio detail and degrade musical structure.

Codec benchmarks consistently reveal a Pareto frontier between audio reconstruction error (e.g., Mel-L1) and LM perplexity (PPL): improving one metric inevitably sacrifices the other. Furthermore, modern lyrics-to-song systems such as YuE and LeVo employ separate vocal and accompaniment streams for improved controllability and data efficiency. However, prior tokenizers either apply a single codebook to the whole mixture (ignoring track structure) or treat stems independently, missing the opportunity for cross-track semantic coupling.

Duo-Tok is designed to explicitly separate vocals and accompaniment at the tokenization stage, preserving sufficient local detail for waveform reconstruction while smoothing fragile acoustic features to support language modeling. This is facilitated through a four-stage, SSL-centered pipeline engineered for source awareness and high downstream performance.

2. Four-Stage SSL-Centered Pipeline

Duo-Tok’s processing pipeline comprises the following four sequential stages:

Stage 1: BEST-RQ–Style SSL Pretraining

  • Input: Log-Mel spectrogram frames at 50 Hz.
  • Architecture: Transformer encoder, BERT-style masking over time frames.
  • Objective: Masked-frame cross-entropy against randomly quantized targets $q_t$, yielding semantically rich, music-aware features (a sketch follows this list).
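
The following is a minimal sketch of this target construction, following the published BEST-RQ recipe (a frozen random projection plus a frozen random codebook); all dimensions and the stand-in encoder output are illustrative, not the paper's configuration:

```python
import torch
import torch.nn.functional as F

# Frozen random projection and codebook: initialized once, never trained,
# used only to produce discrete targets q_t for masked frames.
n_mels, proj_dim, n_targets = 80, 16, 8192   # illustrative sizes
torch.manual_seed(0)
proj = torch.randn(n_mels, proj_dim)
codebook = F.normalize(torch.randn(n_targets, proj_dim), dim=-1)

def rq_targets(log_mel):                      # log_mel: (T, n_mels) frames at 50 Hz
    z = F.normalize(log_mel @ proj, dim=-1)   # project and l2-normalize each frame
    return torch.cdist(z, codebook).argmin(dim=-1)  # nearest random code -> q_t

# One training step: mask a span of frames and classify each masked frame's
# target id with cross-entropy (the encoder is replaced by random logits here).
frames = torch.randn(100, n_mels)
targets = rq_targets(frames)                  # (T,) discrete targets q_t
mask = torch.zeros(100, dtype=torch.bool)
mask[20:40] = True                            # BERT-style time-span masking
logits = torch.randn(100, n_targets, requires_grad=True)  # stand-in encoder head
loss = F.cross_entropy(logits[mask], targets[mask])
```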

Stage 2: Multi-Task Supervision and Gaussian Replacement Noise

  • Gaussian Replacement Noise: At a selected bottleneck layer, hidden vectors $h^{(\ell)}_t$ are replaced by Gaussian noise $\varepsilon_t \sim \mathcal{N}(0, \sigma^2 I)$ with probability $p$, biasing representations toward long-range, LM-friendly structure (see the sketch after this list).
  • Multi-Task Heads:
  1. CTC-based ASR head for lyric alignment.
  2. Mel-spectrogram reconstruction head incorporating spectral convergence and log-magnitude objectives.
  3. Chroma reconstruction head supervising tonal structure.
  4. Music source separation (MSS) mask head.
  • Objective: Weighted sum of the above loss terms, stabilizing and factorizing SSL features.
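
Gaussian replacement noise itself is only a few lines; the sketch below assumes illustrative values for the replacement probability $p$ and noise scale $\sigma$, which this summary does not specify:

```python
import torch

def gaussian_replacement(h, p=0.1, sigma=1.0, training=True):
    """Replace each time step's hidden vector h_t at the bottleneck layer
    with Gaussian noise eps_t ~ N(0, sigma^2 I) with probability p.
    h: (batch, time, dim). p and sigma are assumed, not the paper's values."""
    if not training or p == 0.0:
        return h
    replace = torch.rand(h.shape[:2], device=h.device) < p  # (B, T) Bernoulli mask
    noise = sigma * torch.randn_like(h)
    return torch.where(replace.unsqueeze(-1), noise, h)
```

Because entire frames are stochastically destroyed, later layers cannot rely on fragile per-frame acoustic detail and are pushed toward redundant, longer-range structure.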

Stage 3: Dual-Codebook SimVQ with Hard Routing

  • Motivation: Vocals and accompaniment possess distinct semantic densities and statistics.
  • Implementation: The encoder’s weights are frozen; two vector-quantization codebooks (vocal/accompaniment) of size $K = 32{,}768$ and embedding dimension $d \approx 192$ are trained, with hard routing by track identity.
  • Quantization: Codebook vectors are linearly transformed before nearest-code selection, with a commitment loss encouraging codebook utilization (a minimal sketch follows this list).
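
A minimal sketch of one SimVQ-style codebook with hard routing is given below. The frozen-base-codebook-plus-learned-linear-map parameterization and the straight-through estimator follow the general SimVQ recipe, and the loss placement is an assumption; only $K = 32{,}768$ and $d \approx 192$ come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimVQ(nn.Module):
    """SimVQ-style quantizer: a frozen random codebook reparameterized by a
    learned linear map; the map, not the code vectors, is trained."""
    def __init__(self, K=32768, d=192):
        super().__init__()
        self.register_buffer("base", torch.randn(K, d))  # frozen codebook
        self.W = nn.Linear(d, d, bias=False)             # learned reparameterization

    def forward(self, z):                       # z: (T, d) frozen-encoder features
        codes = self.W(self.base)               # effective codebook
        idx = torch.cdist(z, codes).argmin(-1)  # nearest-code selection
        q = codes[idx]
        vq_loss = F.mse_loss(q, z.detach())     # commitment-style loss; trains W,
                                                # since the encoder is frozen
        q = z + (q - z).detach()                # straight-through gradient
        return q, idx, vq_loss

vocal_vq, accomp_vq = SimVQ(), SimVQ()

def quantize(z, track):                         # hard routing: the track identity
    return (vocal_vq if track == "vocal" else accomp_vq)(z)  # picks the codebook

q, ids, vq_loss = quantize(torch.randn(250, 192), "vocal")   # 5 s at 50 Hz
```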

Stage 4: Latent Diffusion Decoders

  • Process: The waveform is encoded into a low-rate “ear-VAE” latent $y$. A DiT-style diffusion model reconstructs $y$ from Gaussian-noised variants, maximizing SI-SNR improvement and minimizing a noise-prediction loss.
  • Conditioning: The diffusion decoder receives the discrete vocal and accompaniment token sequences as conditioning input (a minimal training-step sketch follows this list).
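
The sketch below illustrates the Stage-4 objective in its simplest form: predict the noise added to the ear-VAE latent $y$, conditioned on embedded vocal and accompaniment tokens. The tiny MLP denoiser, the embedding-based conditioning, and the linear noising schedule are stand-ins for the paper's DiT architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_latent, d_cond, K = 64, 192, 32768            # d_latent is illustrative
vocal_emb = nn.Embedding(K, d_cond)             # embed discrete token conditions
accomp_emb = nn.Embedding(K, d_cond)
denoiser = nn.Sequential(nn.Linear(d_latent + 2 * d_cond + 1, 256),
                         nn.SiLU(), nn.Linear(256, d_latent))

def diffusion_step(y, vocal_ids, accomp_ids):
    """y: (T, d_latent) clean ear-VAE latent; *_ids: (T,) token sequences."""
    t = torch.rand(y.shape[0], 1)               # diffusion time in [0, 1]
    eps = torch.randn_like(y)                   # Gaussian corruption
    y_t = (1 - t) * y + t * eps                 # noised latent (toy linear schedule)
    cond = torch.cat([vocal_emb(vocal_ids), accomp_emb(accomp_ids)], dim=-1)
    eps_hat = denoiser(torch.cat([y_t, cond, t], dim=-1))
    return F.mse_loss(eps_hat, eps)             # noise-prediction loss

loss = diffusion_step(torch.randn(100, d_latent),
                      torch.randint(K, (100,)), torch.randint(K, (100,)))
```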

3. Implementation and Training Details

  • Bitrate and Compression: Duo-Tok reduces the bitrate from 25 kbps to 0.75 kbps (∼30×) while maintaining competitive fidelity at this low rate (see the bitrate check after this list).
  • Dual Codebooks: Each codebook contains 32,768 embeddings of dimension ≈192.
  • Training Data: Stage-1 leverages LibriTTS, FSD50K, FMA, and DISCO-10M. Stage-2/3 utilize music subsets, with stems extracted using Demucs, and data mixtures including vocal, accompaniment, and instrumental-only samples.
  • Optimization: AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.96$) with weight decay 0.1. Learning rates, warmup, scheduler, and batch sizes are stage-specific.
  • Loss Weight Schedules: Stage-2 uses $\lambda_{\mathrm{CTC}} : \lambda_{\mathrm{Mel}} : \lambda_{\mathrm{Chr}} : \lambda_{\mathrm{MSS}} = 0.5 : 1 : 1 : 1$; Stage-3 uses $\lambda_{\mathrm{Mel}} : \lambda_{\mathrm{Chr}} : \lambda_{\mathrm{VQ}} = 1 : 1 : 1$; Stage-4 sets $\lambda_{\mathrm{SI}} = 1.0$.
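
As a back-of-the-envelope check of the reported bitrate (the 25 Hz per-track token rate below is inferred from the numbers, not stated in this summary):

```python
import math

K = 32768                        # codebook size -> bits per token
bits_per_token = math.log2(K)    # 15.0 bits
tokens_per_sec = 2 * 25          # two tracks at an assumed 25 Hz each
print(tokens_per_sec * bits_per_token / 1000)   # 0.75 kbps, matching the text
```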

4. Empirical Evaluation

Duo-Tok’s performance is evaluated on the Codec-Evaluation benchmark, assessing bitrate, music tagging AP (MagnaTagATune), LM perplexity (PPL@1024), perceptual evaluation (PESQ), speech intelligibility (STOI), and Mel-L1 error. Results are summarized below:

| Tokenizer | Bitrate (kbps) | MTT AP ↑ | PPL@1024 ↓ | PESQ ↑ | STOI ↑ | Mel-L1 ↓ |
|---|---|---|---|---|---|---|
| SemantiCodec | 1.30 | 0.32 | 15.5 | 1.32 | 0.60 | 0.98 |
| MuCodec-LeVo | 0.70 | 0.26 | 8.10 | 1.21 | 0.57 | 1.37 |
| Duo-Tok | 0.75 | 0.35 | 4.75 | 1.82* | 0.56* | 0.74* |

*Starred Duo-Tok values are for the vocal track; vocal and accompaniment metrics are reported separately (accompaniment figures appear in the observations below).

Key observations include:

  • Duo-Tok achieves the highest music tagging AP (0.35) at 0.75 kbps.
  • It yields the lowest LM perplexity among the compared codecs (PPL@1024 = 4.75, versus 8.10 for MuCodec-LeVo and 15.5 for SemantiCodec).
  • Reconstruction metrics show PESQ of 1.82 (vocal), 1.21 (accompaniment), Mel L1 of 0.74 (vocal), 1.12 (accompaniment).
  • Pareto analysis reveals Duo-Tok shifting the frontier toward jointly lower PPL and competitive Mel error (Lin et al., 25 Nov 2025).

Dual-track LM probes further demonstrate that Duo-Tok’s tokens expose stronger cross-track dependencies and lower next-token perplexity than the token streams of LeVo or YuE.

Controlled decoder ablations indicate that under equivalent diffusion decoders, Duo-Tok tokens enable higher reconstructability (e.g., vocal PESQ 1.76 vs. 1.28 with MuCodec).
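
For concreteness, a PPL@1024-style probe is conventionally computed as exp(mean next-token NLL) over 1024-token contexts; the sketch below uses an untrained stand-in LM and a shrunken toy vocabulary, so only the bookkeeping, not the resulting number, is meaningful:

```python
import torch
import torch.nn.functional as F

def perplexity(lm, tokens, ctx=1024):
    """tokens: (N,) long tensor of codec ids; lm maps (1, T) -> (1, T, vocab)."""
    total_nll, count = 0.0, 0
    for i in range(0, tokens.numel() - 1, ctx):
        chunk = tokens[i : i + ctx + 1].unsqueeze(0)     # context plus shifted targets
        logits = lm(chunk[:, :-1])
        nll = F.cross_entropy(logits.transpose(1, 2), chunk[:, 1:], reduction="sum")
        total_nll += nll.item()
        count += chunk.shape[1] - 1
    return torch.exp(torch.tensor(total_nll / count)).item()

vocab = 512                                              # toy vocabulary for the demo
lm = lambda x: torch.randn(x.shape[0], x.shape[1], vocab)  # untrained stand-in LM
print(perplexity(lm, torch.randint(vocab, (4096,))))     # roughly vocab-sized here
```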

5. Comparison with Prior Music Codecs

  • Reconstruction Codecs: Models such as Encodec and DAC require higher bitrates (∼6 kbps) to achieve PESQ ⩾ 2.2, but exhibit very poor LM perplexity (PPL@1024 ≫ 100).
  • Semantic Tokenizers: These reach low PPL at 0.5–1.3 kbps but are lossier, with Mel-L1 > 0.9 and PESQ < 1.4.
  • Dual-Track Music Codecs (e.g., MuCodec in LeVo): Attain competitive fidelity (PESQ 1.21, Mel-L1 1.37) at 0.7 kbps, but higher PPL (8.10) and lower music tagging performance (AP = 0.26).
  • Duo-Tok: Achieves a new balance at 0.75 kbps: AP = 0.35, PPL@1024 = 4.75, PESQ ≈ 1.8 (vocal) / 1.2 (accompaniment).

6. Innovations and Significance

Core innovations introduced in Duo-Tok include:

  1. A four-stage pipeline that systematically separates semantic shaping, feature regularization, discrete dual-codebook encoding, and high-fidelity latent diffusion decoding.
  2. Gaussian replacement noise integrated as an architectural regularizer, suppressing fragile details detrimental to LM learnability.
  3. Multi-task supervision with CTC, Mel/chroma spectrogram, and source-separation targets, enhancing preservation of lyrics, timbral, harmonic, and source features.
  4. Dual SimVQ codebooks with hard routing, providing explicit and effective separation of vocals and accompaniment in the token domain.

The combination of these elements enables Duo-Tok to jointly improve both the reconstructive and generative axes of music modeling, supporting more controllable, higher-fidelity downstream synthesis at low bitrates while simplifying LM training and inference.

7. Impact and Implications

Duo-Tok advances the empirical Pareto frontier of music codecs, jointly lowering LM perplexity and maintaining competitive fidelity at very low bitrates. It natively supports dual-track vocoder LMs for end-to-end lyrics-to-song generation, making it appropriate for tasks requiring both musical detail and cross-modal controllability. A plausible implication is that future generative music systems may increasingly adopt dual-codebook or multi-track tokenization strategies to fully exploit semantic decomposability in complex audio generation workflows (Lin et al., 25 Nov 2025).
