MOSS-Audio-Tokenizer: Discrete Audio Tokens

Updated 29 May 2026

MOSS-Audio-Tokenizer is a discrete audio tokenization method that employs causal Transformers and residual vector quantization to convert raw waveforms into semantically rich tokens.
It integrates semantic-acoustic modeling with variable bitrate control, enabling efficient end-to-end streaming for applications like TTS, ASR, and multimodal fusion.
Empirical evaluations demonstrate state-of-the-art reconstruction quality and robustness across speech, music, and general audio, driven by large-scale training and innovative quantization techniques.

MOSS-Audio-Tokenizer is a family of discrete audio tokenization models that provide high-fidelity, scalable, and semantically rich audio representations for large language models (LLMs), generative audio systems, and audio-language foundation models. Built on a homogeneous Transformer pipeline with residual vector quantization (RVQ), it supports efficient audio compression, flexible semantic–acoustic token modeling, fully autoregressive decoding, and multi-modal integration, and it reports state-of-the-art results in waveform reconstruction, representation learning, and downstream generative and understanding tasks (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026). The architecture is designed for end-to-end, causal, streaming operation with variable bitrates and large-batch training, and it integrates into TTS and ASR systems as well as emerging multi-modal fusion pipelines.

1. Core Architecture: Causal Transformer Tokenizer

MOSS-Audio-Tokenizer is primarily implemented as a Causal Audio Tokenizer (CAT) pipeline, consisting of Transformer-based encoders and decoders with a stack of residual quantizers (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026). The complete pipeline processes raw 24 kHz waveforms as input and produces a multistream discrete token sequence by:

Patchifying and striding waveform samples through four hierarchical stages, downsampling to a frame rate of 12.5 Hz.
Encoding with 68 stacked causal Transformer blocks divided into four stages (stages 1–3: 12 blocks each, D=768, 12 heads; stage 4: 32 blocks, D=1280, 20 heads). All blocks use rotary positional embeddings and sliding-window self-attention (10 s window).
Quantizing each frame with an $N_q$-layer RVQ stack (each layer: codebook size 1024, latent dimension 8–10, L2-normalized codes). Quantizer dropout is enabled throughout training so that the decoder remains robust at any active bitrate.
Decoding with a mirrored 68-block causal Transformer, reconstructing the waveform via learned upsampling and inverse-patchify operations.
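The rates quoted in the pipeline above fix how many waveform samples each token frame covers. The following is a minimal sketch of that arithmetic. The constant and function names are illustrative assumptions rather than part of any released API, and because the text does not state how the stride is split across the four stages, only the overall factor is computed.

```python
import math

SAMPLE_RATE_HZ = 24_000   # input waveform rate stated in the text
FRAME_RATE_HZ = 12.5      # token frame rate stated in the text

# Encoder depth from the text: stages 1-3 have 12 blocks each, stage 4 has 32.
BLOCKS_PER_STAGE = [12, 12, 12, 32]


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Overall downsampling factor: waveform samples covered by one token frame."""
    return round(sample_rate_hz / frame_rate_hz)


def num_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of token frames needed to cover `duration_s` seconds (rounded up)."""
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    print(samples_per_frame())      # 1920 samples per token frame
    print(sum(BLOCKS_PER_STAGE))    # 68 encoder blocks in total
    print(num_frames(10.0))         # 125 frames cover 10 s at 12.5 Hz
```

The overall factor of 1920 must equal the product of the four per-stage strides, whatever split the actual model uses.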

This homogeneous end-to-end Transformer architecture allows quantization and semantic supervision (via a 0.5B-parameter decoder LLM head) to be built directly into the pipeline and optimized jointly with the encoder and decoder (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).

2. Tokenization, Quantization, and Variable Bitrate

Discrete tokenization in MOSS-Audio-Tokenizer is performed via stacked residual vector quantizers (RVQ), which map frame-level encoder outputs $z_c$ to codebook vectors $q_c(z_c)$ and emit one token index per quantization layer. The quantizer objective balances two terms, where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator:

Commitment loss: $\mathcal{L}_\mathrm{cmt} = \sum_c \| z_c - \mathrm{sg}(q_c(z_c)) \|_2^2$
Codebook (embedding) loss: $\mathcal{L}_\mathrm{code} = \sum_c \| \mathrm{sg}(z_c) - q_c(z_c) \|_2^2$
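
For reference, the residual structure behind these two terms can be written out explicitly. The recursion below is the standard RVQ formulation, assumed here rather than quoted from the cited papers. It takes $z_1$ to be the encoder output $z$ for one frame and $z_c$ to be the input to layer $c$, writes $\hat{z}$ for the quantized reconstruction, and introduces a hypothetical commitment weight $\beta$ whose value the text does not state.

$$
z_1 = z, \qquad z_{c+1} = z_c - q_c(z_c), \qquad \hat{z} = \sum_{c=1}^{N_q} q_c(z_c)
$$

$$
\mathcal{L}_\mathrm{vq} = \mathcal{L}_\mathrm{code} + \beta \, \mathcal{L}_\mathrm{cmt}
$$

Because $\mathrm{sg}(\cdot)$ blocks gradients, $\mathcal{L}_\mathrm{cmt}$ updates only the encoder side and $\mathcal{L}_\mathrm{code}$ updates only the codebook entries.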

With $N_q=32$ layers at 12.5 Hz, the bitrate can be tuned at inference by truncating the RVQ stack, yielding variable bitrates from 0.125 to 4 kbps without retraining: each layer contributes $\log_2 1024 = 10$ bits per frame, i.e., 125 bps. This property is enabled by quantizer dropout (randomly disabling layers during training), which makes the decoder robust across all operating points (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026). Each layer has its own learned token embedding, and the embeddings of the active codebooks are summed at each frame to form the input to downstream models.
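
The relationship between active RVQ layers and bitrate can be sketched as follows. The function names are hypothetical, and the uniform sampling in `sample_active_layers` is an assumption made for illustration; the text does not specify the dropout schedule used in training.

```python
import math
import random

FRAME_RATE_HZ = 12.5    # frames per second, from the text
CODEBOOK_SIZE = 1024    # entries per RVQ layer, from the text
NUM_LAYERS = 32         # N_q, from the text


def bitrate_bps(active_layers: int) -> float:
    """Bitrate when only the first `active_layers` RVQ layers are kept."""
    if not 1 <= active_layers <= NUM_LAYERS:
        raise ValueError("active_layers must be in [1, NUM_LAYERS]")
    bits_per_token = math.log2(CODEBOOK_SIZE)   # 10 bits for 1024 entries
    return active_layers * bits_per_token * FRAME_RATE_HZ


def truncate_codes(codes, active_layers: int):
    """Keep the first `active_layers` token streams of a [layer][frame] grid."""
    return codes[:active_layers]


def sample_active_layers(rng: random.Random) -> int:
    """Quantizer-dropout sketch: pick how many layers stay active for one
    training example (uniform choice is an assumption, not the papers' rule)."""
    return rng.randint(1, NUM_LAYERS)


if __name__ == "__main__":
    codes = [[0] * 25 for _ in range(NUM_LAYERS)]   # 2 s of dummy tokens
    print(len(truncate_codes(codes, 8)))            # 8 streams kept
    print(bitrate_bps(1), bitrate_bps(8), bitrate_bps(32))   # 125.0 1000.0 4000.0
```

Truncation always drops the last layers first, since each later layer only refines the residual left by the earlier ones.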

3. Training Objectives and Large-Scale Pretraining

The total loss $\mathcal{L}_G$ for the CAT pipeline consists of:

Multi-scale Mel-spectrogram L1 ($\mathcal{L}_\mathrm{rec}$)
Semantic cross-entropy (e.g., ASR/text captioning via the LLM head on quantized codes, $\mathcal{L}_\mathrm{sem}$)
Commitment/codebook losses
Adversarial hinge-loss and feature-matching with multi-period and multi-resolution discriminators

Training proceeds in two stages: non-adversarial pretraining (without the adversarial and feature-matching losses), followed by adversarial fine-tuning. The training set comprises ∼3M hours of diverse speech, general audio, and music sampled from large-scale datasets (AudioSet, MUSDB, multilingual speech corpora). Larger batch sizes, up to 1536 for pretraining and 768 for fine-tuning, are reported to correlate positively with final reconstruction fidelity (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).
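
The two-stage schedule amounts to switching loss terms on and off. The sketch below illustrates this under stated assumptions: the term names, the stage labels, and the unit weights are all hypothetical, since the source does not give the actual weighting.

```python
# Hypothetical loss weights; the source does not state the actual values.
DEFAULT_WEIGHTS = {
    "rec": 1.0,    # multi-scale Mel-spectrogram L1
    "sem": 1.0,    # semantic cross-entropy from the LLM head
    "cmt": 1.0,    # commitment loss
    "code": 1.0,   # codebook loss
    "adv": 1.0,    # adversarial hinge loss (generator side)
    "fm": 1.0,     # discriminator feature matching
}

ADVERSARIAL_TERMS = {"adv", "fm"}


def generator_loss(terms, stage, weights=None):
    """Weighted sum of loss terms; adversarial terms are skipped in 'pretrain'."""
    if stage not in ("pretrain", "finetune"):
        raise ValueError(f"unknown stage: {stage!r}")
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total = 0.0
    for name, value in terms.items():
        if stage == "pretrain" and name in ADVERSARIAL_TERMS:
            continue   # stage 1 excludes adversarial and feature-matching losses
        total += weights[name] * value
    return total


if __name__ == "__main__":
    terms = {"rec": 0.5, "sem": 2.0, "cmt": 0.25, "code": 0.25, "adv": 1.0, "fm": 0.5}
    print(generator_loss(terms, "pretrain"))   # 3.0
    print(generator_loss(terms, "finetune"))   # 4.5
```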

4. Empirical Performance and Ablations

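MOSS-Audio-Tokenizer reports state-of-the-art objective and subjective reconstruction scores among open-source tokenizers across speech, general audio, and music. Representative benchmark results (LibriSpeech/AISHELL/AudioSet/MUSDB) are listed below; EN/CN denote the English and Chinese speech test sets, and A/M denote the general-audio and music test sets: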

Bitrate (bps)	#VQ	SIM (EN/CN)	STOI (EN/CN)	PESQ-NB (EN/CN)	Mel-Loss (A/M)	STFT-Dist (A/M)
1000	8	0.88/0.81	0.94/0.91	3.38/2.96	0.82/0.80	2.16/2.04
4000	32	0.97/0.93	0.97/0.96	3.95/3.71	0.68/0.64	1.96/1.82

Ablation studies on single-quantizer tokenizers indicate that increasing codebook size improves UTMOS and perceptual quality up to a capacity threshold (e.g., 4096 entries perform best, with codebook under-utilization observed beyond 8192), and that attention modules, longer context windows, multi-scale discriminators, and inverse-FFT upsampling all yield measurable improvements (Ji et al., 2024). For MOSS-Audio-Tokenizer, quantizer dropout during training is essential for a smooth quality–bitrate trade-off and removes the need to retrain for each target bitrate (Gong et al., 11 Feb 2026, Gong et al., 18 Mar 2026).

5. Integration in TTS, ASR, and LLM Tasks

The token stream from MOSS-Audio-Tokenizer is directly consumable by downstream autoregressive models:

MOSS-TTS and MOSS-TTS-Local-Transformer: predict 33-stream sequences (text + 32 RVQ layers) for speech synthesis; the models support zero-shot voice cloning, fine-grained duration control, and robust long-form generation (Gong et al., 18 Mar 2026).
ASR: the tokens at each time step are aggregated (by pooling or summing their embeddings) and fed to LLM backbones (e.g., Qwen3, Llama 3) for transcription, enabling ASR without a separate audio encoder (Gong et al., 11 Feb 2026).
Factorized semantic-acoustic frameworks such as UniAudio 2.0 introduce dual-stream (“reasoning” and “reconstruction”) tokenizers, with text-aligned abstractions at low temporal rates for high-level understanding and higher-rate codes for fidelity (Yang et al., 4 Feb 2026).
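
The per-frame aggregation used to feed these models can be sketched as a sum of per-layer embedding lookups. This is a toy illustration in pure Python: the function names, the random initialisation, and the small default sizes are assumptions, and a real system would use learned embedding tables at the backbone's hidden width.

```python
import random


def make_embedding_tables(num_layers=32, codebook_size=1024, embed_dim=8, seed=0):
    """One random [codebook_size][embed_dim] table per RVQ layer
    (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
         for _ in range(codebook_size)]
        for _ in range(num_layers)
    ]


def frame_embedding(tables, frame_tokens):
    """Sum the per-layer embeddings of one frame's active tokens.

    `frame_tokens[c]` is the token index from RVQ layer c; passing fewer
    tokens than there are tables models a truncated, lower-bitrate stream.
    """
    embed_dim = len(tables[0][0])
    summed = [0.0] * embed_dim
    for layer, token in enumerate(frame_tokens):
        for d, value in enumerate(tables[layer][token]):
            summed[d] += value
    return summed


def embed_sequence(tables, codes):
    """Map a [frame][layer] grid of token indices to one vector per frame."""
    return [frame_embedding(tables, frame_tokens) for frame_tokens in codes]


if __name__ == "__main__":
    tables = make_embedding_tables(num_layers=4, codebook_size=16, embed_dim=4)
    codes = [[3, 7, 0, 15], [1, 2]]   # second frame uses only two layers
    vectors = embed_sequence(tables, codes)
    print(len(vectors), len(vectors[0]))   # 2 4
```

Summation keeps the input width independent of the number of active layers, which is what lets one downstream model consume streams of different bitrates.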

All models support frame-by-frame streaming modes due to the strict causality of both encoding and decoding paths, with bounded-latency attention.
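
Bounded-latency causal attention of this kind is usually expressed as a banded lower-triangular mask. The sketch below builds one in pure Python. The function name is hypothetical, and expressing the 10 s window as 125 frames assumes the final 12.5 Hz frame rate; earlier encoder stages run at higher rates that the text does not specify.

```python
def sliding_window_causal_mask(num_frames: int, window: int):
    """Return mask[i][j], True when query frame i may attend to key frame j.

    Frame i sees itself and at most `window - 1` preceding frames,
    and never any future frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    mask = sliding_window_causal_mask(num_frames=6, window=3)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
    # A 10 s window at 12.5 Hz corresponds to 125 frames.
    long_mask = sliding_window_causal_mask(num_frames=200, window=125)
    print(sum(long_mask[199]))   # 125
```

Because no row extends past the diagonal, each new frame can be processed as soon as it arrives, and the key/value cache never needs to hold more than `window` frames.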

6. Multi-Modal Fusion and Extensions

Recent research on video-enhanced audio tokenization demonstrates that integrating visual information with MOSS-Audio-Tokenizer can significantly improve downstream understanding (e.g., AVQA), but only with specialized pre-quantization fusion mechanisms. The Timing-Aware Pre-Quantization Fusion (TAPF) paradigm injects visual features into the continuous encoder representation before quantization, using timing-aware distillation losses. This approach outperforms contrastive and post-quantization fusion, maintaining near-parity in audio reconstruction metrics while raising AVQA accuracy by about 4.7 percentage points, roughly 7% relative (TAPF 0.7208 vs. audio-only 0.6734) (Zhang et al., 13 Apr 2026). TAPF provides its largest gains under heavy compression (e.g., fewer tokens per second), and it still yields significant improvements at higher bitrates.
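
The two AVQA accuracy figures quoted above can be compared in both absolute and relative terms. The helper below is a hypothetical illustration of that arithmetic only.

```python
def accuracy_gain(baseline: float, improved: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = (improved - baseline) * 100.0
    relative_pct = (improved - baseline) / baseline * 100.0
    return round(absolute_pp, 2), round(relative_pct, 2)


if __name__ == "__main__":
    # Audio-only 0.6734 vs. TAPF 0.7208, as quoted in the text.
    print(accuracy_gain(0.6734, 0.7208))   # (4.74, 7.04)
```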

PairAlign introduces an alternative self-alignment framework for compact symbolic audio tokenization, optimizing sequence consistency, length control, and edit similarity via cross-view autoregressive modeling and EMA-teacher objectives, with potential adaptation to MOSS-Audio-Tokenizer as a multi-modal symbolic fusion front end (Banerjee et al., 7 May 2026).

7. Comparative Models and Factorized Tokenizers

MOSS-Audio-Tokenizer’s approach is contrasted with other leading tokenizers:

RVQGAN-based speech tokenizers (Shechtman et al., 2024) achieve transparent speech coding at 150–300 tokens/s (1.5–3 kbps) but rely on convolutional frontends and multi-resolution GAN loss.
WavTokenizer (Ji et al., 2024) employs a single-quantizer paradigm, achieving SOTA at extreme compression rates (40–75 tok/s) through large codebooks (4096+), attention-augmented decoders, and inverse-Fourier upsampling.
UniAudio 2.0 (Yang et al., 4 Feb 2026) advocates two-stream (reasoning/reconstruction) tokenization, separating high-level textual alignment and acoustic fidelity, with groupwise VQ, FiLM cross-conditioning, and flow-based diffusion decoding.
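
Token rates and bitrates in this comparison are related by the number of bits each token carries. The sketch below makes the conversion explicit. The function name is hypothetical, the 1024-entry codebook for the RVQGAN case is inferred from the quoted 150 tokens/s at 1.5 kbps, and the WavTokenizer figures assume exactly 4096 entries.

```python
import math


def token_bitrate_bps(tokens_per_second: float, codebook_size: int,
                      num_streams: int = 1) -> float:
    """Bitrate of `num_streams` parallel token streams at the given rate,
    assuming each token costs log2(codebook_size) bits (no entropy coding)."""
    return tokens_per_second * num_streams * math.log2(codebook_size)


if __name__ == "__main__":
    # RVQGAN-style speech tokenizer: 150-300 tokens/s at 10 bits per token.
    print(token_bitrate_bps(150, 1024), token_bitrate_bps(300, 1024))   # 1500.0 3000.0
    # Single-quantizer tokenizer with 4096 entries at 40-75 tokens/s.
    print(token_bitrate_bps(40, 4096), token_bitrate_bps(75, 4096))     # 480.0 900.0
    # MOSS-Audio-Tokenizer: 12.5 frames/s, 32 streams of 1024-entry codes.
    print(token_bitrate_bps(12.5, 1024, num_streams=32))                # 4000.0
```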

A recurring theme is the growing importance of text-aligned semantics, variable-rate tokenization, and modular adaptation for different downstream tasks and multi-modal fusion requirements.

References:

Gong et al., 11 Feb 2026
Gong et al., 18 Mar 2026
Zhang et al., 13 Apr 2026
Ji et al., 2024
Shechtman et al., 2024
Yang et al., 4 Feb 2026
Banerjee et al., 7 May 2026