XCodec2 Audio Tokens: Hybrid Quantization
- XCodec2 audio tokens are discrete representations that combine acoustic signals with self-supervised semantic features to preserve both audio fidelity and linguistic content.
- They use a hybrid architecture that fuses downsampled waveform data with SSL-derived embeddings via residual vector quantization and k-means clustering.
- Empirical results indicate enhanced performance in WER, phonetic discriminability, and perceptual quality, benefiting tasks such as TTS, ASR, and universal vocoding.
XCodec2 audio tokens refer to discrete representations of audio signals derived through a hybrid architecture that integrates semantic information from self-supervised models into the tokenization pipeline, with the goal of enhancing downstream language modeling, speech synthesis, and related generative and discriminative audio tasks. The X-Codec framework, as described in "Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio LLM" (Ye et al., 2024), was developed to address the insufficiency of standard neural codecs (such as EnCodec) in preserving semantic integrity when audio is tokenized for use in audio LLMs. A later body of research (Mousavi et al., 2024) further systematizes the extraction and deployment of discrete tokens from self-supervised models, offering a rigorous recipe that is widely considered representative of “XCodec2” methodology.
1. Architectural Foundations of XCodec2 Tokens
The X-Codec pipeline specified in (Ye et al., 2024) comprises two branches: an acoustic stream obtained by strided-convolutional downsampling of the input waveform, and a semantic stream derived from a frozen self-supervised model such as HuBERT-base-ls960 or WavLM-base-plus. The semantic branch produces frame-level hidden representations, which are projected through convolutional and linear layers and then concatenated with the acoustic features into a unified dimension (typically 1024 or 1536). The fused features feed a residual vector quantization (RVQ) module that applies sequential codebook quantizations (8 layers, 1024 codes per layer). The resulting quantized embedding is passed to two decoders: an acoustic reconstruction network that mirrors the encoder topology, and a semantic decoder tasked with reconstructing the SSL hidden representations.
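The fuse-then-quantize step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `rvq_encode`, the toy dimensions, and the random stand-ins for the projected acoustic and semantic features are all assumptions, and the codebooks here are random rather than learned.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization of frames x (T, D) with a list of (K, D) codebooks.

    Returns integer codes of shape (T, n_layers) and the quantized sum (T, D).
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual frame to every codeword.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen
        residual -= chosen  # the next layer quantizes what is left over
        codes.append(idx)
    return np.stack(codes, axis=1), quantized

rng = np.random.default_rng(0)
T, D_HALF = 50, 16                        # toy sizes, far smaller than 1024 or 1536
acoustic = rng.normal(size=(T, D_HALF))   # stand-in for projected acoustic features
semantic = rng.normal(size=(T, D_HALF))   # stand-in for projected SSL features
fused = np.concatenate([acoustic, semantic], axis=1)  # unified dimension 2 * D_HALF

# Hypothetical random codebooks: 8 layers with 64 codes each (the article cites 8 x 1024).
codebooks = [rng.normal(size=(64, 2 * D_HALF)) for _ in range(8)]
codes, fused_hat = rvq_encode(fused, codebooks)
```

Each row of `codes` is one frame's tuple of RVQ indices; in a trained system `fused_hat` would be what both decoders receive.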
In contrast, the XCodec2 extraction recipe (Mousavi et al., 2024) formalizes semantic token quantization using plain $k$-means clustering over layer-wise features from self-supervised models (WavLM-large or HuBERT-large), operating on selected layers (commonly five out of 24). Each layer's frame-level hidden state $h_t^{(l)}$ is assigned to its nearest codebook centroid $c_k^{(l)}$, resulting in a stack of discrete tokens $z_t^{(l)}$, which are further embedded, combined (optionally via attention), and supplied to downstream generative or discriminative networks.
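As a rough sketch of the per-layer assignment, the snippet below maps frame-level features to nearest-centroid indices and stacks the result across layers. The helper `assign_tokens`, the toy sizes, the random features and centroids, and the specific layer indices are illustrative assumptions; a real pipeline would take hidden states from the SSL model and centroids from a fitted k-means.

```python
import numpy as np

def assign_tokens(hidden, centroids):
    """Map frame-level features (T, D) to nearest-centroid indices (T,)."""
    d = ((hidden[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
T, D, K = 40, 8, 10                 # toy sizes for illustration only
layers = [3, 7, 12, 18, 23]         # assumed layer selection

# Random stand-ins for SSL hidden states and for per-layer k-means centroids.
hidden_states = {l: rng.normal(size=(T, D)) for l in layers}
centroids = {l: rng.normal(size=(K, D)) for l in layers}

# One token per frame and per layer: shape (T, number of layers).
tokens = np.stack(
    [assign_tokens(hidden_states[l], centroids[l]) for l in layers], axis=1
)
```

Every layer keeps its own codebook, so the same integer in two columns of `tokens` refers to two different centroids.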
2. Mathematical Objectives and Losses
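The X-Codec objective extends classical codec reconstruction by incorporating an explicit semantic loss:
$$\mathcal{L} = \mathcal{L}_{\text{acoustic}} + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}} + \lambda_{\text{commit}}\,\mathcal{L}_{\text{commit}}$$
where:
- $\mathcal{L}_{\text{acoustic}}$: sum of mel-spectrogram reconstruction ($\mathcal{L}_{\text{mel}}$), STFT loss ($\mathcal{L}_{\text{stft}}$), and adversarial loss ($\mathcal{L}_{\text{adv}}$).
- $\mathcal{L}_{\text{sem}} = \|\hat{S} - S\|_2^2$: squared error between reconstructed semantic features $\hat{S}$ and targets $S$ from the pre-trained SSL model.
- Commitment loss $\mathcal{L}_{\text{commit}}$: standard in RVQ training; it keeps encoder outputs close to their selected codewords.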
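To make the structure of the combined objective concrete, here is a small sketch of how the semantic and commitment terms could be computed and added to an acoustic loss. The function names and the default weights `w_sem` and `w_commit` are hypothetical placeholders, not values from (Ye et al., 2024).

```python
import numpy as np

def semantic_loss(s_hat, s):
    """Mean squared error between reconstructed and target SSL features."""
    return float(np.mean((s_hat - s) ** 2))

def commitment_loss(z_e, z_q):
    """Mean squared distance between encoder outputs and their selected codewords.

    In training, z_q is treated as a constant (stop-gradient) in this term.
    """
    return float(np.mean((z_e - z_q) ** 2))

def total_loss(l_acoustic, l_sem, l_commit, w_sem=1.0, w_commit=0.25):
    """Weighted sum of the three terms; the default weights are assumptions."""
    return l_acoustic + w_sem * l_sem + w_commit * l_commit

rng = np.random.default_rng(0)
s = rng.normal(size=(20, 8))                    # stand-in SSL targets
s_hat = s + 0.1 * rng.normal(size=s.shape)      # stand-in semantic decoder output
loss = total_loss(l_acoustic=1.0,
                  l_sem=semantic_loss(s_hat, s),
                  l_commit=commitment_loss(s_hat, s))
```

The acoustic term is passed in as a number here because mel, STFT, and adversarial losses need a full codec and discriminator to evaluate.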
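In XCodec2, semantic token extraction uses the $k$-means quantization objective, applied independently to each selected layer $l$:
$$\min_{\{c_k^{(l)}\}_{k=1}^{K}} \; \sum_{t} \min_{k} \left\| h_t^{(l)} - c_k^{(l)} \right\|_2^2$$
where $h_t^{(l)}$ is the frame-level hidden state of layer $l$ at frame $t$ and $c_k^{(l)}$ is the $k$-th centroid of that layer's codebook.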
Centroids are updated by averaging assigned vectors, with an optional VQ-VAE style EMA update and an auxiliary commitment loss to mitigate codebook collapse.
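The centroid update can be illustrated with a single Lloyd iteration plus a simplified EMA rule. This is a sketch under stated assumptions: `kmeans_step`, `kmeans_objective`, and `ema_update` are hypothetical helper names, the data are random, and the EMA shown blends centroids directly instead of tracking per-cluster counts and sums as VQ-VAE-style implementations usually do.

```python
import numpy as np

def kmeans_objective(x, centroids):
    """Sum over frames of the squared distance to the nearest centroid."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return float(d.min(axis=1).sum())

def kmeans_step(x, centroids):
    """One Lloyd iteration: assign frames, then average the assigned vectors."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d.argmin(axis=1)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = x[assign == k]
        if len(members) > 0:          # leave empty clusters where they are
            new_centroids[k] = members.mean(axis=0)
    return assign, new_centroids

def ema_update(centroids, batch_means, decay=0.99):
    """Simplified EMA: move centroids a small step toward the batch estimate."""
    return decay * centroids + (1.0 - decay) * batch_means

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))                             # stand-in for SSL frames
c0 = x[rng.choice(len(x), size=8, replace=False)].copy()  # initialize from data
obj_before = kmeans_objective(x, c0)
_, c1 = kmeans_step(x, c0)
obj_after = kmeans_objective(x, c1)
c_ema = ema_update(c0, c1)
```

A Lloyd iteration never increases the objective, so `obj_after` is at most `obj_before`; the EMA variant trades that guarantee for smoother updates when centroids are refreshed batch by batch.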
3. Token Generation, Indexing, and Vocabulary
For X-Codec (Ye et al., 2024), the per-frame discrete token is a tuple of RVQ assignments $(q_1, \dots, q_{N_q})$, typically flattened. Each quantizer layer has $K = 1024$ centroids; for $N_q = 8$, the vocabulary size is $8 \times 1024 = 8192$ sub-token types. XCodec2 (Mousavi et al., 2024) uses $K = 1000$–$2000$ and typically five SSL layers, resulting in $5K$ possible sub-tokens per frame (e.g., 5000 or 10000).
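One common way to flatten per-layer indices into a single vocabulary is to offset each layer by a multiple of the codebook size. The sketch below assumes that scheme; the source does not state which flattening X-Codec uses, and `flatten_codes` and `unflatten_codes` are hypothetical names.

```python
import numpy as np

def flatten_codes(codes, codebook_size):
    """Turn (T, n_layers) per-layer indices into one stream of unique ids.

    Layer i is shifted by i * codebook_size, so ids never collide across layers.
    """
    codes = np.asarray(codes)
    offsets = np.arange(codes.shape[1]) * codebook_size
    return (codes + offsets).reshape(-1)

def unflatten_codes(flat, n_layers, codebook_size):
    """Inverse of flatten_codes."""
    flat = np.asarray(flat).reshape(-1, n_layers)
    return flat - np.arange(n_layers) * codebook_size

N_Q, K = 8, 1024
vocab_size = N_Q * K                      # 8192 sub-token types
frame_codes = np.array([[5, 0, 1023, 7, 7, 7, 7, 7]])
flat = flatten_codes(frame_codes, K)
restored = unflatten_codes(flat, N_Q, K)
```

With this layout a language model sees `N_Q` ids per frame, all drawn from one vocabulary of size `vocab_size`.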
Token sequences resulting from X-Codec and XCodec2 can be seamlessly fed to LLMs or universal vocoders. Example: in a TTS scenario, semantically enhanced tokens yield tightly clustered code indices for challenging transitions (e.g., /w/→/ɜː/ in "world"), reducing textual hallucination.
4. Empirical Results: Trade-offs and Benchmarking
The X-Codec approach achieves state-of-the-art performance on core audio modeling tasks:
- WER reduction on speech synthesis (zero-shot VALL-E test-clean):
  - EnCodec (AR+NAR): 6.37%
  - Acoustic-only codec: 7.70%
  - X-Codec (HuBERT): 4.07%
  - X-Codec (WavLM): 3.26%
- Phonetic discriminability (ABX error):
  - Baseline: within=20.1%, across=28.3%
  - X-Codec: within=3.3%, across=4.3%
  - HuBERT oracle: within=3.3%, across=4.1%
- Perceptual quality (UTMOS):
  - Baseline: 3.72
  - X-Codec: 4.01
Acoustic fidelity, as measured by Mel and STFT distances, drops only marginally in relative terms, whereas intelligibility and subjective quality, which rely on semantic accuracy, improve substantially. For XCodec2, ASR WER on LibriSpeech is 6.96% ($K = 1000$, WavLM-large tokens), TTS UTMOS is 3.65, and the universal vocoder design enables dynamic bitrate/quality trade-offs.
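The size of the WER gains is easier to judge as a relative reduction against the EnCodec baseline. The snippet below does that arithmetic on the WER figures listed above; the helper `relative_reduction` is introduced here only for illustration.

```python
# WER values (%) as reported in the list above.
wer = {
    "EnCodec (AR+NAR)": 6.37,
    "Acoustic-only codec": 7.70,
    "X-Codec (HuBERT)": 4.07,
    "X-Codec (WavLM)": 3.26,
}

def relative_reduction(baseline, system):
    """Relative WER reduction in percent; negative means the system is worse."""
    return 100.0 * (baseline - system) / baseline

baseline = wer["EnCodec (AR+NAR)"]
reductions = {
    name: relative_reduction(baseline, value)
    for name, value in wer.items()
    if name != "EnCodec (AR+NAR)"
}
```

Under this measure the WavLM variant removes roughly 49% of the baseline's errors and the HuBERT variant roughly 36%, while the acoustic-only codec comes out negative, i.e. worse than EnCodec.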
5. Best Practices and Hyperparameter Guidelines
X-Codec (Ye et al., 2024)
- Semantic Encoder: HuBERT-base-ls960 or WavLM-base-plus; extract hidden states and mean-pool across layers.
- Projections: Each branch projected to half the unified dimension (512 or 768 for the 1024- and 1536-dimensional configurations).
- RVQ: 8 layers, 1024 codes per layer; EMA codebook updates and a straight-through estimator for backpropagation.
- Loss Weights: separate scalar weights on the semantic and commitment terms.
- Training: 400k steps on LibriSpeech with the Adam optimizer; the number of active RVQ layers is randomly sampled per batch for robustness.
XCodec2 (Mousavi et al., 2024)
- SSL Extraction: Use five layers spanning depth (e.g., 3, 7, 12, 18, 23).
- Codebook: $K = 1000$–$2000$ centroids per layer; one learned embedding vector per token.
- Embedding Initialization: Random (not centroid-based); fine-tuned.
- Attention Fusion: Apply attention over embeddings from all layers, trained end-to-end.
- Universal Vocoder: Train HiFi-GAN using "layer dropout"—sample random subsets of tokens per batch to ensure robustness to missing layers.
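The attention fusion and layer dropout items above can be sketched together. This is a simplified illustration under several assumptions: the scoring function is a single learned vector (`score_vec`) rather than whatever network the recipe actually trains, the embedding tables are random, and `fuse_layers` and `sample_layer_mask` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_layers(tokens, emb_tables, score_vec, keep_mask=None):
    """Attention-weighted sum of per-layer token embeddings.

    tokens: (T, L) ints; emb_tables: (L, K, D); score_vec: (D,);
    keep_mask: optional (L,) bool array, False marks a dropped layer.
    Returns the fused features (T, D) and the attention weights (T, L).
    """
    n_layers = tokens.shape[1]
    embs = np.stack(
        [emb_tables[l][tokens[:, l]] for l in range(n_layers)], axis=1
    )                                               # (T, L, D)
    scores = embs @ score_vec                       # (T, L)
    if keep_mask is not None:
        scores = np.where(keep_mask[None, :], scores, -np.inf)
    weights = softmax(scores, axis=1)               # dropped layers get weight 0
    fused = (weights[:, :, None] * embs).sum(axis=1)
    return fused, weights

def sample_layer_mask(n_layers, rng):
    """Layer dropout: keep a random non-empty subset of layers."""
    n_keep = rng.integers(1, n_layers + 1)
    mask = np.zeros(n_layers, dtype=bool)
    mask[rng.choice(n_layers, size=n_keep, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
T, L, K, D = 12, 5, 1000, 16
tokens = rng.integers(0, K, size=(T, L))
emb_tables = rng.normal(size=(L, K, D))   # randomly initialized, as in the recipe
score_vec = rng.normal(size=D)
mask = sample_layer_mask(L, rng)
fused, weights = fuse_layers(tokens, emb_tables, score_vec, keep_mask=mask)
```

Masking scores before the softmax means a vocoder trained this way always sees a properly normalized mixture, whichever layers happen to be present.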
| Component | X-Codec (Ye et al., 2024) | XCodec2 (Mousavi et al., 2024) |
|---|---|---|
| Semantic Source | HuBERT-base/WavLM-base-plus | WavLM-large/HuBERT-large (5 layers) |
| Tokenization Mechanism | RVQ after fusion | Per-layer k-means, stacked/fused |
| Codebook Size | 8 × 1024 (8192 total sub-tokens) | 5 × 1000–2000 |
| Downstream Usage | Audio LLM, universal vocoder | ASR, TTS, Emotion, Speaker ID, SE |
| Attention Mechanism | Not explicit | Layer/fusion attention for selectivity |
6. X-Codec2: Advances and Open Directions
The "X-Codec2" label is not formally established in the originating literature (Ye et al., 2024), but is now used to designate methodologies that integrate multi-layer SSL quantization, per-layer clustering, and universal attention-based fusion, as crystallized in (Mousavi et al., 2024). This approach generalizes the original X-Codec to offer enhanced flexibility: variable bitrates by layer subsampling, task-adaptive fusion via attention, and a single universal vocoder for the entire token stack.
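How layer subsampling changes the bitrate follows from simple arithmetic: each retained layer contributes one index of $\log_2 K$ bits per frame. The sketch below assumes a frame rate of 50 frames per second (one SSL frame every 20 ms), which is not stated in the text above; `token_bitrate` is a hypothetical helper.

```python
import math

def token_bitrate(frame_rate_hz, n_layers, codebook_size):
    """Bits per second needed to transmit n_layers indices per frame."""
    return frame_rate_hz * n_layers * math.log2(codebook_size)

FRAME_RATE_HZ = 50   # assumption: one SSL frame every 20 ms
K = 1000             # codebook size per layer

# Bitrate when keeping 1, 3, or all 5 layers of the token stack.
bitrates = {n: token_bitrate(FRAME_RATE_HZ, n, K) for n in (1, 3, 5)}
```

The bitrate grows linearly with the number of retained layers, which is what lets a single universal vocoder serve several operating points.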
A plausible implication is that increasing codebook or embedding sizes, or leveraging larger SSL models (WavLM-Large), would further improve downstream metrics in content-rich or paralinguistically demanding scenarios, but this is not yet empirically documented in the cited sources. No claims regarding new residual mechanisms, reconstruction objectives, or architectural departures from the recipe outlined in (Mousavi et al., 2024) are made.
7. Relationship to Broader Audio Tokenization Paradigms
XCodec2 tokens occupy a hybrid regime, combining the semantic interpretability and ASR-friendliness of SSL-derived quantization with the low-rate fidelity and waveform robustness of neural codecs. Contrasted with pure "codec" tokens (SoundStream, EnCodec) or single-layer SSL quantization, the XCodec2 philosophy is characterized by:
- Multi-layer, high-compression semantic representation.
- Attention-based selection within downstream models, revealing task-preferred features.
- The capacity to support both discriminative tasks (ASR, speaker ID, emotion recognition) and generative tasks (TTS, music continuation) in a unified token framework.
This duality is central to recent progress in audio LLMs, particularly in tasks where semantic content, speaker information, and paralinguistics must be simultaneously preserved and manipulable in tokenized form (Ye et al., 2024, Mousavi et al., 2024).