Characteristics that make audio tokenizers suitable for native audio language models

Determine the specific architectural and representational characteristics, beyond reconstruction fidelity and domain coverage, that make a discrete audio tokenizer truly suitable as a native interface for autoregressive audio language models, so that such tokenizers can effectively support large-scale end-to-end audio language modeling.

Background

The paper surveys prior discrete audio tokenizers and highlights that many rely on pretrained encoders, distillation, or hybrid CNN–Transformer designs that may introduce fixed inductive biases and scaling bottlenecks. While recent work has improved reconstruction quality and added semantic richness through various strategies, the authors emphasize that the criteria for a tokenizer to be truly suitable for native audio LLMs are not settled.

This uncertainty motivates the proposal of CAT (Causal Audio Tokenizer with Transformer), a homogeneous, end-to-end, causal Transformer-based architecture that jointly optimizes the encoder, quantizer, decoder, and discriminators. The open problem asks for a principled understanding of what properties an audio tokenizer must possess to best serve autoregressive audio language modeling at scale.
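
To make the architectural description concrete, the following is a minimal sketch of a causal Transformer tokenizer of the kind described above: a causal encoder, a quantizer, and a causal decoder built from the same block type and trained end to end. It is illustrative only; the PyTorch modules, the single-codebook vector quantizer with a straight-through estimator, the dimensions, and the omission of the discriminators and loss terms are assumptions made for brevity, not details taken from the CAT design.

```python
# Illustrative sketch only (assumptions: PyTorch, a single-codebook VQ with a
# straight-through estimator, inputs already framed and projected to `dim`;
# discriminators and training losses are omitted). Not the CAT implementation.
import torch
import torch.nn as nn

class CausalTransformerTokenizer(nn.Module):
    def __init__(self, dim=512, codebook_size=1024, n_layers=6, n_heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Homogeneous design: encoder and decoder are stacks of the same block type.
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.codebook = nn.Embedding(codebook_size, dim)  # discrete token codebook

    @staticmethod
    def causal_mask(t, device):
        # True above the diagonal = blocked, so frame t attends only to frames <= t.
        return torch.triu(torch.ones(t, t, dtype=torch.bool, device=device), diagonal=1)

    def quantize(self, z):
        # Nearest-codeword lookup; straight-through estimator keeps gradients flowing.
        flat = z.flatten(0, 1)                                   # (B*T, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        idx = idx.view(z.shape[:2])                              # (B, T) token ids
        zq = self.codebook(idx)
        return z + (zq - z).detach(), idx

    def forward(self, frames):                                   # frames: (B, T, dim)
        mask = self.causal_mask(frames.size(1), frames.device)
        z = self.encoder(frames, mask=mask)                      # causal encoding
        zq, tokens = self.quantize(z)                            # discretization
        recon = self.decoder(zq, mask=mask)                      # causal reconstruction
        return recon, tokens
```

A module of this shape turns an audio stream into a causal sequence of token ids that an autoregressive language model can consume directly; the open problem above asks which properties of that discretization actually matter for language modeling at scale.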

References

Despite these advances, it remains unclear what characteristics make an audio tokenizer truly suitable for native audio LLMs.

MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models (2602.10934 - Gong et al., 11 Feb 2026) in Related Works, Subsection “Discrete Audio Tokenizers”