
Duo-Tok: Dual-Codebook Music Tokenizer

Updated 2 December 2025
  • Duo-Tok is a source-aware dual-codebook semantic music tokenizer that decouples vocal and accompaniment tracks for scalable lyrics-to-song synthesis.
  • It employs a four-stage SSL-centered pipeline with Gaussian replacement noise and multi-task supervision to produce LM-friendly token sequences.
  • Empirical evaluations demonstrate that Duo-Tok reduces language model perplexity and maintains high reconstruction quality at ultra-low bitrates.

Duo-Tok is a source-aware, dual-codebook semantic music tokenizer designed for vocal-accompaniment generation within modern lyrics-to-song systems. It directly addresses the central problem in music tokenization: the trade-off between high-fidelity waveform reconstruction and language-model (LM) learnability, also known as the reconstruction-vs-LM dilemma. By introducing dual-track modeling explicitly into the tokenization process, Duo-Tok aims to achieve both low reconstruction error and LM-friendly token sequences, supporting high-quality and controllable music generation (Lin et al., 25 Nov 2025).

1. Reconstruction–LM Modeling Dilemma and Duo-Tok’s Rationale

Existing neural codecs for music, such as SoundStream, Encodec, and DAC, typically produce acoustic tokens that enable faithful waveform reconstruction but require large, high-entropy vocabularies, which inflate LM perplexity. Conversely, semantic tokenizers such as SemantiCodec, WavTokenizer, and X-Codec aggressively compress music into small-vocabulary tokens that facilitate LM training, but they lose critical audio detail and degrade musical structure.

Codec benchmarks consistently reveal a Pareto frontier between audio reconstruction error (e.g., Mel-L1) and LM perplexity (PPL): improving one metric inevitably sacrifices the other. Furthermore, modern lyrics-to-song systems such as YuE and LeVo employ separate vocal and accompaniment streams for improved controllability and data efficiency. However, prior tokenizers either apply a single codebook to the whole mixture (ignoring track structure) or treat stems independently, missing the opportunity for cross-track semantic coupling.

Duo-Tok is designed to explicitly separate vocals and accompaniment at the tokenization stage, preserving sufficient local detail for waveform reconstruction while smoothing fragile acoustic features to support language modeling. This is facilitated through a four-stage, SSL-centered pipeline engineered for source awareness and high downstream performance.

2. Four-Stage SSL-Centered Pipeline

Duo-Tok’s processing pipeline comprises the following four sequential stages:

Stage 1: BEST-RQ–Style SSL Pretraining

  • Input: Log-Mel spectrogram frames at 50 Hz.
  • Architecture: Transformer encoder, BERT-style masking over time frames.
  • Objective: Masked-frame cross-entropy against randomly quantized targets $q_t$, yielding semantically rich, music-aware features (a sketch follows this list).
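
The following is a minimal sketch of this target construction, following the published BEST-RQ recipe (a frozen random projection plus a frozen random codebook); all dimensions and the stand-in encoder output are illustrative, not the paper's configuration:

```python
import torch
import torch.nn.functional as F

# Frozen random projection and codebook: initialized once, never trained,
# used only to produce discrete targets q_t for masked frames.
n_mels, proj_dim, n_targets = 80, 16, 8192   # illustrative sizes
torch.manual_seed(0)
proj = torch.randn(n_mels, proj_dim)
codebook = F.normalize(torch.randn(n_targets, proj_dim), dim=-1)

def rq_targets(log_mel):                      # log_mel: (T, n_mels) frames at 50 Hz
    z = F.normalize(log_mel @ proj, dim=-1)   # project and l2-normalize each frame
    return torch.cdist(z, codebook).argmin(dim=-1)  # nearest random code -> q_t

# One training step: mask a span of frames and classify each masked frame's
# target id with cross-entropy (the encoder is replaced by random logits here).
frames = torch.randn(100, n_mels)
targets = rq_targets(frames)                  # (T,) discrete targets q_t
mask = torch.zeros(100, dtype=torch.bool)
mask[20:40] = True                            # BERT-style time-span masking
logits = torch.randn(100, n_targets, requires_grad=True)  # stand-in encoder head
loss = F.cross_entropy(logits[mask], targets[mask])
```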

Stage 2: Multi-Task Supervision and Gaussian Replacement Noise

  • Gaussian Replacement Noise: At a selected bottleneck layer, hidden vectors $h^{(\ell)}_t$ are replaced by Gaussian noise $\varepsilon_t \sim \mathcal{N}(0, \sigma^2 I)$ with probability $p$, biasing representations toward long-range, LM-friendly structure (see the sketch after this list).
  • Multi-Task Heads:
  1. CTC-based ASR head for lyric alignment.
  2. Mel-spectrogram reconstruction head incorporating spectral convergence and log-magnitude objectives.
  3. Chroma reconstruction head supervising tonal structure.
  4. Music source separation (MSS) mask head.
  • Objective: Weighted sum of the above loss terms, stabilizing and factorizing SSL features.
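
Gaussian replacement noise itself is only a few lines; the sketch below assumes illustrative values for the replacement probability $p$ and noise scale $\sigma$, which this summary does not specify:

```python
import torch

def gaussian_replacement(h, p=0.1, sigma=1.0, training=True):
    """Replace each time step's hidden vector h_t at the bottleneck layer
    with Gaussian noise eps_t ~ N(0, sigma^2 I) with probability p.
    h: (batch, time, dim). p and sigma are assumed, not the paper's values."""
    if not training or p == 0.0:
        return h
    replace = torch.rand(h.shape[:2], device=h.device) < p  # (B, T) Bernoulli mask
    noise = sigma * torch.randn_like(h)
    return torch.where(replace.unsqueeze(-1), noise, h)
```

Because entire frames are stochastically destroyed, later layers cannot rely on fragile per-frame acoustic detail and are pushed toward redundant, longer-range structure.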

Stage 3: Dual-Codebook SimVQ with Hard Routing

  • Motivation: Vocals and accompaniment possess distinct semantic densities and statistics.
  • Implementation: The encoder’s weights are frozen; two vector-quantization codebooks (vocal/accompaniment) of size $K = 32{,}768$ and embedding dimension $d \approx 192$ are trained, with hard routing by track identity.
  • Quantization: Codebook vectors are linearly transformed before nearest-code selection, with a commitment loss encouraging codebook utilization (a minimal sketch follows this list).
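
A minimal sketch of one SimVQ-style codebook with hard routing is given below. The frozen-base-codebook-plus-learned-linear-map parameterization and the straight-through estimator follow the general SimVQ recipe, and the loss placement is an assumption; only $K = 32{,}768$ and $d \approx 192$ come from the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimVQ(nn.Module):
    """SimVQ-style quantizer: a frozen random codebook reparameterized by a
    learned linear map; the map, not the code vectors, is trained."""
    def __init__(self, K=32768, d=192):
        super().__init__()
        self.register_buffer("base", torch.randn(K, d))  # frozen codebook
        self.W = nn.Linear(d, d, bias=False)             # learned reparameterization

    def forward(self, z):                       # z: (T, d) frozen-encoder features
        codes = self.W(self.base)               # effective codebook
        idx = torch.cdist(z, codes).argmin(-1)  # nearest-code selection
        q = codes[idx]
        vq_loss = F.mse_loss(q, z.detach())     # commitment-style loss; trains W,
                                                # since the encoder is frozen
        q = z + (q - z).detach()                # straight-through gradient
        return q, idx, vq_loss

vocal_vq, accomp_vq = SimVQ(), SimVQ()

def quantize(z, track):                         # hard routing: the track identity
    return (vocal_vq if track == "vocal" else accomp_vq)(z)  # picks the codebook

q, ids, vq_loss = quantize(torch.randn(250, 192), "vocal")   # 5 s at 50 Hz
```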

Stage 4: Latent Diffusion Decoders

  • Process: The waveform is encoded into a low-rate “ear-VAE” latent $y$. A DiT-style diffusion model reconstructs $y$ from Gaussian-noised variants, maximizing SI-SNR improvement and minimizing a noise-prediction loss.
  • Conditioning: The diffusion decoder receives the discrete vocal and accompaniment token sequences as conditioning input (a minimal training-step sketch follows this list).
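
The sketch below illustrates the Stage-4 objective in its simplest form: predict the noise added to the ear-VAE latent $y$, conditioned on embedded vocal and accompaniment tokens. The tiny MLP denoiser, the embedding-based conditioning, and the linear noising schedule are stand-ins for the paper's DiT architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_latent, d_cond, K = 64, 192, 32768            # d_latent is illustrative
vocal_emb = nn.Embedding(K, d_cond)             # embed discrete token conditions
accomp_emb = nn.Embedding(K, d_cond)
denoiser = nn.Sequential(nn.Linear(d_latent + 2 * d_cond + 1, 256),
                         nn.SiLU(), nn.Linear(256, d_latent))

def diffusion_step(y, vocal_ids, accomp_ids):
    """y: (T, d_latent) clean ear-VAE latent; *_ids: (T,) token sequences."""
    t = torch.rand(y.shape[0], 1)               # diffusion time in [0, 1]
    eps = torch.randn_like(y)                   # Gaussian corruption
    y_t = (1 - t) * y + t * eps                 # noised latent (toy linear schedule)
    cond = torch.cat([vocal_emb(vocal_ids), accomp_emb(accomp_ids)], dim=-1)
    eps_hat = denoiser(torch.cat([y_t, cond, t], dim=-1))
    return F.mse_loss(eps_hat, eps)             # noise-prediction loss

loss = diffusion_step(torch.randn(100, d_latent),
                      torch.randint(K, (100,)), torch.randint(K, (100,)))
```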

3. Implementation and Training Details

  • Bitrate and Compression: Duo-Tok reduces the bitrate from 25 kbps to 0.75 kbps (∼30×) while maintaining competitive fidelity at this low rate (see the bitrate check after this list).
  • Dual Codebooks: Each codebook contains 32,768 embeddings of dimension ≈192.
  • Training Data: Stage-1 leverages LibriTTS, FSD50K, FMA, and DISCO-10M. Stage-2/3 utilize music subsets, with stems extracted using Demucs, and data mixtures including vocal, accompaniment, and instrumental-only samples.
  • Optimization: AdamW optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.96$) with weight decay 0.1. Learning rates, warmup, scheduler, and batch sizes are stage-specific.
  • Loss Weight Schedules: Stage-2 uses $\lambda_{\mathrm{CTC}} : \lambda_{\mathrm{Mel}} : \lambda_{\mathrm{Chr}} : \lambda_{\mathrm{MSS}} = 0.5 : 1 : 1 : 1$; Stage-3 uses $\lambda_{\mathrm{Mel}} : \lambda_{\mathrm{Chr}} : \lambda_{\mathrm{VQ}} = 1 : 1 : 1$; Stage-4 sets $\lambda_{\mathrm{SI}} = 1.0$.
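
As a back-of-the-envelope check of the reported bitrate (the 25 Hz per-track token rate below is inferred from the numbers, not stated in this summary):

```python
import math

K = 32768                        # codebook size -> bits per token
bits_per_token = math.log2(K)    # 15.0 bits
tokens_per_sec = 2 * 25          # two tracks at an assumed 25 Hz each
print(tokens_per_sec * bits_per_token / 1000)   # 0.75 kbps, matching the text
```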

4. Empirical Evaluation

Duo-Tok’s performance is evaluated on the Codec-Evaluation benchmark, assessing bitrate, music tagging AP (MagnaTagATune), LM perplexity (PPL@1024), perceptual evaluation (PESQ), speech intelligibility (STOI), and Mel-L1 error. Results are summarized below:

| Tokenizer | Bitrate (kbps) | MTT AP ↑ | PPL@1024 ↓ | PESQ ↑ | STOI ↑ | Mel-L1 ↓ |
|---|---|---|---|---|---|---|
| SemantiCodec | 1.30 | 0.32 | 15.5 | 1.32 | 0.60 | 0.98 |
| MuCodec-LeVo | 0.70 | 0.26 | 8.10 | 1.21 | 0.57 | 1.37 |
| Duo-Tok | 0.75 | 0.35 | 4.75 | 1.82* | 0.56* | 0.74* |

*Starred Duo-Tok values are for the vocal track; vocal and accompaniment metrics are reported separately (accompaniment figures appear in the observations below).

Key observations include:

  • Duo-Tok achieves the highest music tagging AP (0.35) at 0.75 kbps.
  • It yields the lowest LM perplexity among the compared codecs (PPL@1024 = 4.75, versus 8.10 for MuCodec-LeVo and 15.5 for SemantiCodec).
  • Reconstruction metrics show PESQ of 1.82 (vocal), 1.21 (accompaniment), Mel L1 of 0.74 (vocal), 1.12 (accompaniment).
  • Pareto analysis reveals Duo-Tok shifting the frontier toward jointly lower PPL and competitive Mel error (Lin et al., 25 Nov 2025).

Dual-track LM probes further demonstrate that Duo-Tok’s tokens expose stronger cross-track dependencies and lower next-token perplexity than the token streams of LeVo or YuE.

Controlled decoder ablations indicate that under equivalent diffusion decoders, Duo-Tok tokens enable higher reconstructability (e.g., vocal PESQ 1.76 vs. 1.28 with MuCodec).
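
For concreteness, a PPL@1024-style probe is conventionally computed as exp(mean next-token NLL) over 1024-token contexts; the sketch below uses an untrained stand-in LM and a shrunken toy vocabulary, so only the bookkeeping, not the resulting number, is meaningful:

```python
import torch
import torch.nn.functional as F

def perplexity(lm, tokens, ctx=1024):
    """tokens: (N,) long tensor of codec ids; lm maps (1, T) -> (1, T, vocab)."""
    total_nll, count = 0.0, 0
    for i in range(0, tokens.numel() - 1, ctx):
        chunk = tokens[i : i + ctx + 1].unsqueeze(0)     # context plus shifted targets
        logits = lm(chunk[:, :-1])
        nll = F.cross_entropy(logits.transpose(1, 2), chunk[:, 1:], reduction="sum")
        total_nll += nll.item()
        count += chunk.shape[1] - 1
    return torch.exp(torch.tensor(total_nll / count)).item()

vocab = 512                                              # toy vocabulary for the demo
lm = lambda x: torch.randn(x.shape[0], x.shape[1], vocab)  # untrained stand-in LM
print(perplexity(lm, torch.randint(vocab, (4096,))))     # roughly vocab-sized here
```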

5. Comparison with Prior Music Codecs

  • Reconstruction Codecs: Models such as Encodec and DAC require higher bitrates (∼6 kbps) to achieve PESQ ⩾ 2.2, but exhibit very poor LM perplexity (PPL@1024 ≫ 100).
  • Semantic Tokenizers: These reach low PPL at 0.5–1.3 kbps but are lossier, with Mel-L1 > 0.9 and PESQ < 1.4.
  • Dual-Track Music Codecs (e.g., MuCodec in LeVo): Attain competitive fidelity (PESQ 1.21, Mel-L1 1.37) at 0.7 kbps, but higher PPL (8.10) and lower music tagging performance (AP = 0.26).
  • Duo-Tok: Achieves a new balance at 0.75 kbps: AP = 0.35, PPL@1024 = 4.75, PESQ ≈ 1.8 (vocal) / 1.2 (accompaniment).

6. Innovations and Significance

Core innovations introduced in Duo-Tok include:

  1. A four-stage pipeline that systematically separates semantic shaping, feature regularization, discrete dual-codebook encoding, and high-fidelity latent diffusion decoding.
  2. Gaussian replacement noise integrated as an architectural regularizer, suppressing fragile details detrimental to LM learnability.
  3. Multi-task supervision with CTC, Mel/chroma spectrogram, and source-separation targets, enhancing preservation of lyrics, timbral, harmonic, and source features.
  4. Dual SimVQ codebooks with hard routing, providing explicit and effective separation of vocals and accompaniment in the token domain.

The combination of these elements enables Duo-Tok to jointly improve both the reconstructive and generative axes of music modeling, supporting more controllable, higher-fidelity downstream synthesis at low bitrates while simplifying LM training and inference.

7. Impact and Implications

Duo-Tok advances the empirical Pareto frontier of music codecs, jointly lowering LM perplexity and maintaining competitive fidelity at very low bitrates. It natively supports dual-track vocoder LMs for end-to-end lyrics-to-song generation, making it appropriate for tasks requiring both musical detail and cross-modal controllability. A plausible implication is that future generative music systems may increasingly adopt dual-codebook or multi-track tokenization strategies to fully exploit semantic decomposability in complex audio generation workflows (Lin et al., 25 Nov 2025).
