
XCodec2 Audio Tokens: Hybrid Quantization

Updated 15 April 2026
  • XCodec2 audio tokens are discrete representations that combine acoustic signals with self-supervised semantic features to preserve both audio fidelity and linguistic content.
  • They use a hybrid architecture that fuses downsampled waveform data with SSL-derived embeddings via residual vector quantization and k-means clustering.
  • Empirical results indicate enhanced performance in WER, phonetic discriminability, and perceptual quality, benefiting tasks such as TTS, ASR, and universal vocoding.

XCodec2 audio tokens refer to discrete representations of audio signals derived through a hybrid architecture that integrates semantic information from self-supervised models into the tokenization pipeline, with the goal of enhancing downstream language modeling, speech synthesis, and related generative and discriminative audio tasks. The X-Codec framework, as described in "Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio LLM" (Ye et al., 2024), was developed to address the insufficiency of standard neural codecs (such as EnCodec) in preserving semantic integrity when audio is tokenized for use in audio LLMs. A later body of research (Mousavi et al., 2024) further systematizes the extraction and deployment of discrete tokens from self-supervised models, offering a rigorous recipe that is widely considered representative of “XCodec2” methodology.

1. Architectural Foundations of XCodec2 Tokens

The X-Codec pipeline as specified in (Ye et al., 2024) comprises two branches: an acoustic stream originating from strided-convolutional downsampling of the input waveform $x \in \mathbb{R}^n$, and a semantic stream derived from a frozen self-supervised model such as HuBERT-base-ls960 or WavLM-base-plus. The semantic branch produces hidden representations $S^*$ ($H_s = 768$ for HuBERT-base, $H_s = 1024$ for WavLM-base-plus), which are further projected via convolutional and linear layers before being concatenated with the acoustic features along a unified dimension $H_u$ (typically 1024 or 1536). Both branches feed into a residual vector quantization (RVQ) module, which applies $M$ sequential codebook quantizations ($M = 8$, $K = 1024$ per layer). The resulting quantized embedding $U_q$ is then exposed to two decoders: an acoustic reconstruction network (mirroring the encoder topology) and a semantic decoder tasked with reconstructing $S^*$.
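The residual quantization step can be illustrated with a minimal sketch. It assumes tiny hand-written codebooks and plain Python lists in place of the learned $M = 8$, $K = 1024$ codebooks, and the names `nearest` and `rvq_encode` are hypothetical, not taken from the X-Codec code.

```python
from typing import List, Sequence, Tuple

Vector = List[float]

def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)),
    )

def rvq_encode(
    codebooks: Sequence[Sequence[Vector]], u: Vector
) -> Tuple[List[int], Vector]:
    """Quantize u layer by layer; each layer encodes the residual left so far."""
    residual = list(u)
    quantized = [0.0] * len(u)
    indices: List[int] = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        # Accumulate the chosen codeword and subtract it from the residual.
        quantized = [q + c for q, c in zip(quantized, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, quantized

# Toy setup (assumed values): 2 layers, 4 codes each, 2-dimensional features.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],      # coarse layer
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],  # refinement layer
]
indices, u_q = rvq_encode(codebooks, [0.9, 0.3])
```

Here `indices` plays the role of the per-frame token tuple and `u_q` the role of the quantized embedding $U_q$; the second layer only has to describe what the first layer missed, which is why later layers can use finer codewords.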

In contrast, the XCodec2 extraction recipe (Mousavi et al., 2024) formalizes semantic token quantization using plain $k$-means clustering over layer-wise features from self-supervised models (WavLM-large or HuBERT-large), operating on selected layers (commonly five out of 24). Each layer's frame-level hidden state $h_t^{(l)}$ is assigned to its nearest codebook centroid $c_k^{(l)}$, resulting in a stack of discrete tokens $(z_t^{(1)}, \dots, z_t^{(L)})$ per frame, which are further embedded, combined (optionally via attention), and supplied to downstream generative or discriminative networks.
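The per-layer assignment step can be sketched as follows. This is a toy illustration under stated assumptions: the centroids are given rather than trained, the features are 2-dimensional, and the helper names `nearest_centroid` and `assign_tokens` are hypothetical.

```python
from typing import Dict, List

Vector = List[float]

def nearest_centroid(centroids: List[Vector], h: Vector) -> int:
    """Index of the centroid closest to h in squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(centroids[k], h)),
    )

def assign_tokens(
    hidden_by_layer: Dict[int, List[Vector]],
    centroids_by_layer: Dict[int, List[Vector]],
) -> List[List[int]]:
    """Return tokens[t][i]: centroid index of frame t at the i-th selected layer."""
    layers = sorted(hidden_by_layer)
    num_frames = len(hidden_by_layer[layers[0]])
    return [
        [
            nearest_centroid(centroids_by_layer[layer], hidden_by_layer[layer][t])
            for layer in layers
        ]
        for t in range(num_frames)
    ]

# Toy data (assumed): two selected layers, three frames, two centroids per layer.
hidden = {
    3: [[0.1, 0.0], [0.9, 1.1], [0.2, 0.1]],
    7: [[2.0, 2.0], [0.0, 0.1], [1.9, 2.2]],
}
centroids = {
    3: [[0.0, 0.0], [1.0, 1.0]],
    7: [[0.0, 0.0], [2.0, 2.0]],
}
tokens = assign_tokens(hidden, centroids)
```

Each row of `tokens` is the stack of discrete tokens for one frame, with one entry per selected layer.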

2. Mathematical Objectives and Losses

The X-Codec objective extends classical codec reconstruction by incorporating an explicit semantic loss:

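$$\mathcal{L} = \mathcal{L}_{\text{acoustic}} + \lambda_{\text{sem}} \, \mathcal{L}_{\text{sem}} + \lambda_{\text{commit}} \, \mathcal{L}_{\text{commit}}$$

where:

  • $\mathcal{L}_{\text{acoustic}}$: sum of mel-spectrogram reconstruction ($\mathcal{L}_{\text{mel}}$), STFT loss ($\mathcal{L}_{\text{STFT}}$), and adversarial loss ($\mathcal{L}_{\text{adv}}$).
  • $\mathcal{L}_{\text{sem}}$: squared error between reconstructed semantic features $\hat{S}$ and targets $S^*$ from the pre-trained SSL model.
  • Commitment loss $\mathcal{L}_{\text{commit}}$: standard in RVQ training; it keeps encoder outputs close to their selected codewords.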
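How the terms of this composite objective combine can be shown with a small numeric sketch. The weights `w_sem` and `w_commit` and all loss values below are placeholders chosen for illustration, not values from (Ye et al., 2024), and averaging the squared error over elements is an assumption.

```python
from typing import List

def semantic_mse(s_hat: List[List[float]], s_star: List[List[float]]) -> float:
    """Mean squared error between reconstructed and target semantic features."""
    total, count = 0.0, 0
    for row_hat, row_star in zip(s_hat, s_star):
        for a, b in zip(row_hat, row_star):
            total += (a - b) ** 2
            count += 1
    return total / count

def total_loss(
    l_mel: float,
    l_stft: float,
    l_adv: float,
    l_sem: float,
    l_commit: float,
    w_sem: float = 1.0,      # hypothetical weight on the semantic term
    w_commit: float = 1.0,   # hypothetical weight on the commitment term
) -> float:
    """Acoustic terms plus weighted semantic and commitment terms."""
    acoustic = l_mel + l_stft + l_adv
    return acoustic + w_sem * l_sem + w_commit * l_commit

# One frame of 2-dim semantic features (assumed toy values).
l_sem = semantic_mse([[1.0, 2.0]], [[1.0, 4.0]])
loss = total_loss(l_mel=0.5, l_stft=0.25, l_adv=0.25, l_sem=l_sem, l_commit=0.5)
```

Raising `w_sem` shifts the trade-off toward semantic reconstruction at the expense of the acoustic terms, which is the balance the semantic loss is meant to control.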

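In XCodec2, semantic token extraction uses the $k$-means quantization objective:

$$\min_{\{c_k\}_{k=1}^{K}} \sum_{t} \min_{k} \left\lVert h_t - c_k \right\rVert_2^2$$

Here $h_t$ denotes a frame-level hidden state of the selected layer and $c_k$ one of its $K$ centroids.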

Centroids are updated by averaging assigned vectors, with an optional VQ-VAE style EMA update and an auxiliary commitment loss to mitigate codebook collapse.
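A single centroid update can be sketched as below. This is a minimal illustration with toy 2-dimensional points: `lloyd_step` performs one assign-and-average pass, and `ema_update` shows the optional moving-average variant. Both names and the decay value are assumptions made for the example.

```python
from typing import List, Tuple

Vector = List[float]

def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_step(
    points: List[Vector], centroids: List[Vector]
) -> Tuple[List[int], List[Vector]]:
    """Assign each point to its nearest centroid, then average each cluster."""
    assignments = [
        min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
        for p in points
    ]
    updated: List[Vector] = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == k]
        if not members:
            updated.append(list(old))  # an empty cluster keeps its centroid
            continue
        updated.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return assignments, updated

def ema_update(
    old: List[Vector], new: List[Vector], decay: float = 0.99
) -> List[Vector]:
    """Blend previous centroids with the new batch means."""
    return [
        [decay * o + (1.0 - decay) * n for o, n in zip(oc, nc)]
        for oc, nc in zip(old, new)
    ]

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
centroids = [[0.0, 1.0], [9.0, 9.0]]
assignments, batch_means = lloyd_step(points, centroids)
smoothed = ema_update(centroids, batch_means, decay=0.5)
```

The plain average moves a centroid all the way to its cluster mean, while the EMA variant moves it only part of the way, which damps noise when updates are computed on mini-batches.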

3. Token Generation, Indexing, and Vocabulary

For X-Codec (Ye et al., 2024), the per-frame discrete token is a tuple of RVQ assignments $(q_1, \dots, q_M)$, typically flattened. Each quantizer layer has $K = 1024$ centroids; for $M = 8$, the vocabulary size is $M \times K = 8192$ sub-token types. XCodec2 (Mousavi et al., 2024) uses $K = 1000$ to $2000$ and typically five SSL layers, resulting in $5K$ possible sub-tokens per frame (e.g., $5 \times 1000 = 5000$ or $5 \times 2000 = 10000$).
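The vocabulary arithmetic and one possible flattening scheme can be sketched as follows. Offsetting layer `m` by `m * K` is a common convention assumed here for illustration; the source does not state which flattening the cited systems use, and the function names are hypothetical.

```python
from typing import List

def vocab_size(num_layers: int, codes_per_layer: int) -> int:
    """Total number of distinct sub-token types across all layers."""
    return num_layers * codes_per_layer

def flatten_frame(codes: List[int], codes_per_layer: int) -> List[int]:
    """Map per-layer indices into one shared vocabulary (layer m is offset by m * K)."""
    for q in codes:
        if not 0 <= q < codes_per_layer:
            raise ValueError(f"code {q} outside [0, {codes_per_layer})")
    return [m * codes_per_layer + q for m, q in enumerate(codes)]

def unflatten_frame(flat_ids: List[int], codes_per_layer: int) -> List[int]:
    """Recover the per-layer indices from flattened ids."""
    return [i % codes_per_layer for i in flat_ids]

x_codec_vocab = vocab_size(8, 1024)     # 8 RVQ layers, 1024 codes each
flat = flatten_frame([3, 0, 1023], 1024)
restored = unflatten_frame(flat, 1024)
```

With this layout each flattened id identifies both the layer and the code, so a language model can consume all layers through a single embedding table.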

Token sequences resulting from X-Codec and XCodec2 can be seamlessly fed to LLMs or universal vocoders. Example: in a TTS scenario, semantically enhanced tokens yield tightly clustered code indices for challenging transitions (e.g., /w/→/ɜː/ in "world"), reducing textual hallucination.

4. Empirical Results: Trade-offs and Benchmarking

The X-Codec approach improves on its baselines across core audio modeling tasks:

  • WER reduction on speech synthesis (zero-shot VALL-E test-clean):
    • EnCodec (AR+NAR): 6.37%
    • Acoustic-only codec: 7.70%
    • X-Codec (HuBERT): 4.07%
    • X-Codec (WavLM): 3.26%
  • Phonetic discriminability (ABX error):
    • Baseline: within=20.1%, across=28.3%
    • X-Codec: within=3.3%, across=4.3%
    • HuBERT oracle: within=3.3%, across=4.1%
  • Perceptual quality (UTMOS):
    • Baseline: 3.72
    • X-Codec: 4.01

Acoustic fidelity, as measured by Mel and STFT distances, drops only marginally in relative terms, whereas intelligibility and subjective quality, which depend on semantic accuracy, improve substantially. For XCodec2, ASR WER on LibriSpeech is 6.96% ($K = 1000$, WavLM-large tokens), TTS UTMOS is 3.65, and the universal vocoder design enables dynamic bitrate/quality trade-offs.
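The WER figures listed in this section can be converted into relative reductions against the EnCodec baseline. The sketch below only rearranges the numbers already quoted; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction of `system` versus `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline

# WER values (%) as listed in this section.
wer = {
    "EnCodec (AR+NAR)": 6.37,
    "Acoustic-only codec": 7.70,
    "X-Codec (HuBERT)": 4.07,
    "X-Codec (WavLM)": 3.26,
}

baseline = wer["EnCodec (AR+NAR)"]
reductions = {
    name: relative_reduction(baseline, value)
    for name, value in wer.items()
    if name.startswith("X-Codec")
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% relative WER reduction vs. EnCodec")
```

On these figures the WavLM variant cuts WER by roughly half relative to EnCodec, and the HuBERT variant by a bit more than a third.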

5. Best Practices and Hyperparameter Guidelines

  • Semantic Encoder: HuBERT-base-ls960 or WavLM-base-plus; extract hidden states and mean-pool across layers.
  • Projections: Each branch projected to half the unified dimension ($H_u / 2$).
  • RVQ: $M = 8$ layers, $K = 1024$ codes per layer, fixed codeword dimension; EMA update and straight-through estimator for backprop.
  • Loss Weights: scalar coefficients that balance the acoustic, semantic, and commitment terms.
  • Training: 400k steps on LibriSpeech with Adam, batch size of roughly 32, and a per-batch randomized setting for robustness.
  • SSL Extraction: Use five layers spanning depth (e.g., 3, 7, 12, 18, 23).
  • Codebook: $K = 1000$ to $2000$ centroids per layer; one fixed-dimension embedding per token.
  • Embedding Initialization: Random (not centroid-based); fine-tuned.
  • Attention Fusion: Apply attention over embeddings from all layers, trained end-to-end.
  • Universal Vocoder: Train HiFi-GAN using "layer dropout"—sample random subsets of tokens per batch to ensure robustness to missing layers.

| Component | X-Codec (Ye et al., 2024) | XCodec2 (Mousavi et al., 2024) |
|---|---|---|
| Semantic Source | HuBERT-base/WavLM-base-plus | WavLM-large/HuBERT-large (5 layers) |
| Tokenization Mechanism | RVQ after fusion | Per-layer k-means, stacked/fused |
| Codebook Size | 8 × 1024 (8192 total sub-tokens) | 5 × 1000–2000 |
| Downstream Usage | Audio LLM, universal vocoder | ASR, TTS, Emotion, Speaker ID, SE |
| Attention Mechanism | Not explicit | Layer/fusion attention for selectivity |
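
The attention fusion named in the guidelines can be sketched for a single frame as below. Dot-product scoring against a query vector is an assumption made for illustration, since the source does not specify the attention form; in a trained model the query would be learned end-to-end rather than fixed, and the names `softmax` and `attention_fuse` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(
    layer_embeddings: List[Vector], query: Vector
) -> Tuple[List[float], Vector]:
    """Weight each layer's token embedding by softmax(query . embedding) and sum."""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in layer_embeddings]
    weights = softmax(scores)
    dim = len(layer_embeddings[0])
    fused = [
        sum(w * emb[d] for w, emb in zip(weights, layer_embeddings))
        for d in range(dim)
    ]
    return weights, fused

# One frame with token embeddings from three layers (assumed toy values).
frame_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# A zero query scores every layer equally, so fusion reduces to the mean.
uniform_weights, mean_like = attention_fuse(frame_embeddings, [0.0, 0.0])

# A query aligned with the third embedding concentrates weight on that layer.
peaked_weights, fused = attention_fuse(frame_embeddings, [3.0, 3.0])
```

Inspecting the weights after training is what allows the fusion to reveal which layers a given task prefers.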

6. X-Codec2: Advances and Open Directions

The "X-Codec2" label is not formally established in the originating literature (Ye et al., 2024), but is now used to designate methodologies that integrate multi-layer SSL quantization, per-layer clustering, and universal attention-based fusion, as crystallized in (Mousavi et al., 2024). This approach generalizes original X-Codec to offer enhanced flexibility: variable bitrates by layer subsampling, task-adaptive fusion via attention, and a single universal vocoder for the entire token stack.
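
Layer subsampling and its effect on bitrate can be sketched as follows. The frame rate of 50 frames per second is an assumption (typical of HuBERT and WavLM on 16 kHz audio), the bitrate formula ignores any entropy coding, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

def sample_layer_subset(layers: Sequence[int], rng: random.Random) -> List[int]:
    """Pick a random non-empty subset of layers, keeping their original order."""
    size = rng.randint(1, len(layers))
    chosen = set(rng.sample(range(len(layers)), size))
    return [layer for i, layer in enumerate(layers) if i in chosen]

def bitrate_bps(
    num_layers: int,
    codes_per_layer: int,
    frames_per_second: float = 50.0,  # assumed SSL frame rate
) -> float:
    """Uncompressed bitrate: frames/s x layers x bits per code."""
    return frames_per_second * num_layers * math.log2(codes_per_layer)

layers = [3, 7, 12, 18, 23]          # example layer choice spanning depth
rng = random.Random(0)               # seeded for reproducibility
subset = sample_layer_subset(layers, rng)

full_rate = bitrate_bps(len(layers), 1000)
reduced_rate = bitrate_bps(len(subset), 1000)
```

Training a vocoder on such random subsets is what lets a single model decode from fewer layers at inference time, trading quality for a lower `reduced_rate`.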

A plausible implication is that increasing codebook or embedding sizes or leveraging larger SSL models (WavLM-Large) would further improve downstream metrics in content-rich or paralinguistically demanding scenarios, but this is not yet empirically documented in the cited sources. No claims regarding new residual mechanisms, reconstruction objectives, or architectural departures from the recipe outlined in (Mousavi et al., 2024) are made.

7. Relationship to Broader Audio Tokenization Paradigms

XCodec2 tokens occupy a hybrid regime, combining the semantic interpretability and ASR-friendliness of SSL-derived quantization with the low-rate fidelity and waveform robustness of neural codecs. Contrasted with pure "codec" tokens (SoundStream, EnCodec) or single-layer SSL quantization, the XCodec2 philosophy is characterized by:

  • Multi-layer, high-compression semantic representation.
  • Attention-based selection within downstream models, revealing task-preferred features.
  • The capacity to support both discriminative tasks (ASR, speaker ID, emotion recognition) and generative tasks (TTS, music continuation) in a unified token framework.

This duality is central to recent progress in audio LLMs, particularly in tasks where semantic content, speaker information, and paralinguistics must be simultaneously preserved and manipulable in tokenized form (Ye et al., 2024, Mousavi et al., 2024).
