Discrete Token-based Language Models
- Discrete token-based language models map raw inputs from text, speech, vision, and molecules into sequences of discrete symbols using specialized tokenizers.
- They utilize segmentation, vocabulary learning, encoding, and decoding modules that employ both statistical and learned quantization techniques.
- These models enhance compression and efficiency across applications, though challenges like codebook collapse and limited granularity persist.
A discrete token-based language model (DTLM) follows a paradigm in which raw input data (text, speech, images, molecules, or other modalities) is mapped by dedicated tokenizer modules to sequences of symbols from a finite vocabulary, so that language-model architectures can process and generate uniformly discrete representations. DTLMs draw on advances in both statistical and learned discrete tokenization, are increasingly applied in multimodal and highly compressed settings, and are now prominent in both autoregressive and non-autoregressive generation frameworks.
1. Foundations of Discrete Tokenization
Discrete tokenization is the foundational operation in DTLMs, formally defined as the mapping from a raw input $x$ (text, audio, vision, graph, etc.) to a finite-length sequence of discrete symbols $z = (z_1, \dots, z_T)$, $z_t \in \{1, \dots, K\}$, with $K$ the vocabulary size (Jia et al., 18 Feb 2025). This interface underpins the compatibility of diverse data with the standard next-token prediction objective employed by LLMs.
Discrete tokenizers are generally composed of four modules:
- Segmentation/pre-tokenization: Divides the input into minimal processable units (e.g., subwords, frames, patches).
- Vocabulary learning: Defines a discrete dictionary by data-driven criteria (e.g., frequency-based pair merges in BPE, corpus-likelihood maximization in unigram LM, unsupervised k-means codebooks in VQ).
- Encoding: Maps segmented units to discrete indices, by rule (in text) or via quantization of continuous representations (in audio, vision, molecular structure).
- Decoding: Maps discrete tokens back to the input domain for reconstruction, where applicable.
The vocabulary may be constructed via:
- Statistical subword analysis (BPE, unigram LM): BPE iteratively merges the most frequent adjacent symbol pairs, while unigram LM prunes a large candidate vocabulary by likelihood; both balance token frequency against sequence length.
- Learned vector quantization (VQ-VAE, Product Quantization): Learns codebooks mapping embeddings to discrete indices to optimize joint reconstruction and representation utility (Li et al., 21 Jul 2025).
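The statistical route can be illustrated with a toy BPE learner. This is a minimal sketch, assuming a character-level starting alphabet and a whitespace-free word list; the function names `learn_bpe`, `apply_merge`, and `encode` are illustrative and do not correspond to any particular library.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word is a tuple of symbols weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an
        # arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)
```

Calling `learn_bpe(["low", "low", "lower", "newer"], 5)` and then `encode("lower", merges)` yields subword pieces whose concatenation is the original word; production tokenizers add byte-level fallbacks and pre-tokenization rules that this sketch omits.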
2. Discrete Tokenization Algorithms and Modalities
Tokenization strategies can be classified by heuristic/statistical vs. learned quantization design, and by data modality:
- Text: Uses BPE or unigram LM for vocabulary learning, providing full coverage and strong compression-quality trade-off.
- Speech: Employs vector quantization or k-means clustering over self-supervised features (HuBERT, WavLM), often followed by deduplication and subword modeling to create meta-tokens (Wang et al., 2024, Chen et al., 2023, Wang et al., 2024).
- Vision: Uses VQ-VAE or similar, dividing images into patches and mapping feature vectors to codebooks, sometimes with multi-stage (residual) or product quantization for more expressive representation (Jia et al., 18 Feb 2025, Li et al., 21 Jul 2025).
- Molecular: Employs graph encoders and vector quantization, e.g., Q-Former modules that transform molecular graphs into causally structured token sequences compatible with text LMs (Guo et al., 2024).
- Multimodal fusion: Combines modality-specific tokenizers into a unified vocabulary or interleaved token streams to enable joint modeling (Li et al., 21 Jul 2025).
| Modality | Preprocessing | Tokenizer Type | Typical Vocab Size |
|---|---|---|---|
| Text | BPE/Unigram subword units | Lookup table | 30K–100K |
| Speech | Feature extraction + VQ | K-means/VQ-VAE | 2K–10K |
| Vision | Patchification + VQ-VAE | Codebook/FSQ/LFQ | 4K–64K |
| Molecule | GNN + Q-Former + VQ | Codebook | 1K–2K |
The choice of tokenizer, its capacity, and its inductive biases have a profound impact on the subsequent LM’s ability to model distributional properties, semantic granularity, and cross-modal alignment.
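For the learned-quantization modalities, the encoding step reduces to a nearest-neighbor lookup in a codebook, optionally followed by run-length deduplication as described for speech. The sketch below assumes a codebook that has already been fit (for example by k-means) and uses squared Euclidean distance; the helper names are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector (T, D) to the index of its nearest code (K, D)."""
    # Pairwise squared Euclidean distances, shape (T, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def deduplicate(tokens):
    """Collapse runs of identical consecutive tokens into a single token."""
    tokens = np.asarray(tokens)
    if tokens.size == 0:
        return tokens
    keep = np.concatenate(([True], tokens[1:] != tokens[:-1]))
    return tokens[keep]

def decode(tokens, codebook):
    """Approximate reconstruction: look each index back up in the codebook."""
    return codebook[np.asarray(tokens)]
```

Deduplication shortens the sequence at the cost of discarding duration information, which is one reason the tokenizer's inductive biases matter for the downstream LM.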
3. Autoregressive and Non-Autoregressive Discrete LMs
Autoregressive paradigm
In classical DTLMs, the tokenizer produces a token sequence $z_{1:T} = (z_1, \dots, z_T)$, and the model learns $p_\theta(z_{1:T}) = \prod_{t=1}^{T} p_\theta(z_t \mid z_{<t})$ via maximum likelihood or its variants (Jia et al., 18 Feb 2025). Adaptations to the input pipeline are minimal: the LLM’s embedding and output matrices are expanded to accommodate the larger, potentially multi-modal vocabulary (Guo et al., 2024, Chen et al., 2023).
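Vocabulary expansion of this kind can be sketched in a few lines. The example assumes that new modality tokens are appended after the text vocabulary and that their embedding rows are initialized near the mean of the existing rows; that initialization is a common heuristic chosen here for illustration, not something the cited works are stated to use.

```python
import numpy as np

def expand_embeddings(text_emb, num_new_tokens, rng):
    """Append rows for new modality tokens to an existing (V, D) matrix."""
    # Assumption: initialize new rows as small perturbations of the mean row.
    mean = text_emb.mean(axis=0, keepdims=True)
    noise = 0.01 * rng.standard_normal((num_new_tokens, text_emb.shape[1]))
    return np.concatenate([text_emb, mean + noise], axis=0)

def to_unified_ids(modality_tokens, offset):
    """Shift modality-local codebook indices into the shared vocabulary."""
    return np.asarray(modality_tokens) + offset

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((100, 8))           # toy text vocabulary of 100
unified_emb = expand_embeddings(text_emb, 16, rng)  # add a 16-code codebook
speech_ids = to_unified_ids([0, 3, 15], offset=text_emb.shape[0])
```

The output projection would be extended in the same way, so that the model can both read and emit the new tokens.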
Recent advances redefine the LLM as a discrete latent compressor-decompressor, using adapters (e.g., LoRA) to minimally adjust the backbone for new token types. For example, in Z-token autoencoders (Li et al., 26 Mar 2026), a pretrained LLM is fine-tuned to map long input texts to short, variable-length code sequences, and then to reconstruct the original text exactly or produce downstream outputs directly from those codes. The encoder and decoder are trained jointly, with a Gumbel-Softmax relaxation and a straight-through estimator used to propagate gradients through the discrete variables.
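The forward pass of a Gumbel-Softmax sample can be sketched as follows. This is a generic illustration rather than the cited method: NumPy has no automatic differentiation, so only the sampling step is shown, and the straight-through trick (emit the hard one-hot in the forward pass while taking gradients through the soft sample) would need an autodiff framework.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Return (soft, hard) samples over the last axis of `logits`."""
    # Gumbel(0, 1) noise via the inverse CDF, with u clipped away from 0 and 1.
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled, numerically stable softmax of the perturbed logits.
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)
    soft = np.exp(y)
    soft = soft / soft.sum(axis=-1, keepdims=True)
    # Hard one-hot of the same sample: what the decoder would see under a
    # straight-through estimator.
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]
    return soft, hard
```

Lower values of `tau` push the soft sample toward the one-hot vector, which reduces the mismatch between the relaxed and discrete paths but typically increases gradient variance.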
Non-autoregressive and diffusion-based methods
Discrete diffusion language models (DLMs) offer an alternative, especially for parallel generation. In DLMs, the forward process corrupts each token independently (typically via masking), and a denoising model iteratively reconstructs the clean sequence. Classic masked DLMs operate strictly over hard masks and one-hot decodings (Weligalle, 2 Jul 2025, Jin et al., 27 Dec 2025). Recent work, such as EvoToken-DLM, augments this with progressive soft probabilistic states and continuous trajectory supervision, allowing tokens to move through mixed states before committing, which makes errors revisable and uses model uncertainty more fully (Zhong et al., 12 Jan 2026).
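The classic hard-mask variant can be sketched as a masking step plus a confidence-ordered unmasking loop. The sketch assumes a mask id of `-1`, a linear commit schedule, and a `denoiser` callable that returns per-position probabilities; `make_toy_denoiser` is a hypothetical stand-in for a trained network, and none of these choices is taken from the cited papers.

```python
import numpy as np

MASK = -1  # assumed sentinel id for a masked position

def forward_mask(tokens, t, rng):
    """Corrupt each token independently: mask it with probability t."""
    tokens = np.asarray(tokens)
    corrupt = rng.uniform(size=tokens.shape) < t
    return np.where(corrupt, MASK, tokens)

def iterative_unmask(x, denoiser, num_steps):
    """Fill masked positions over several steps, most confident first."""
    x = np.array(x, copy=True)
    for step in range(num_steps):
        masked = np.flatnonzero(x == MASK)
        if masked.size == 0:
            break
        probs = denoiser(x)                      # shape (T, V)
        conf = probs[masked].max(axis=1)
        pred = probs[masked].argmax(axis=1)
        # Commit an equal share of the remaining masks at each step, so
        # every position is filled by the final step.
        n_commit = int(np.ceil(masked.size / (num_steps - step)))
        order = np.argsort(-conf)[:n_commit]
        x[masked[order]] = pred[order]           # hard, irreversible commit
    return x

def make_toy_denoiser(target, vocab_size):
    """Hypothetical denoiser that puts 0.9 probability on a known target."""
    target = np.asarray(target)
    def denoiser(x):
        probs = np.full((len(target), vocab_size), 0.1 / (vocab_size - 1))
        probs[np.arange(len(target)), target] = 0.9
        return probs
    return denoiser
```

Each committed position is never revisited, which is the hard-decoding behavior that soft-state approaches aim to relax.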
Discrete diffusion models show improved parallel decoding efficiency but have yet to match autoregressive LMs in perplexity and global sequence coherence, due largely to tokenwise marginal training and insensitivity to cross-token dependencies (Weligalle, 2 Jul 2025, Jin et al., 27 Dec 2025).
4. Applications, Performance, and Compression Trade-offs
DTLMs are deployed across a range of applications:
- Text compression and prompt optimization: Z-tokenization achieves up to 18× token-length reduction while allowing faithful reconstruction and efficient long context processing due to compressed attention cost (Li et al., 26 Mar 2026).
- Speech-to-text and TTS: Discrete speech token LMs enable unified modeling of speech and text, achieving strong naturalness and prosodic expressivity in TTS (Bark, AudioLM), but currently lag in intelligibility and speaker consistency due to quantization artifacts (Wang et al., 2024, Wang et al., 2024).
- Zero-resource speech and ASR: LM-aware tokenization (LAST) closes the gap between clustering quality and sequence modeling, improving both spoken LM metrics and ASR WERs versus standard k-means tokenization (Turetzky et al., 2024).
- Molecule–text unification: DTLMs with molecular tokenizers enable bidirectional molecule-to-text and text-to-molecule generation, crucial for property prediction and design tasks (Guo et al., 2024).
- Multimodal retrieval and recommendation: Vector quantization of semantic embeddings underpins high-speed generative document lookup and cold-start recommendation (Jia et al., 18 Feb 2025).
Empirical studies reveal important trade-offs:
- Lower bitrates and token granularity in discrete speech tokens yield much faster convergence and compactness but reduced semantic discrimination; scaling up model size and codebook adaptivity helps narrow this gap (Wang et al., 2024).
- Compression via variable-length, content-adaptive tokenization focuses representation on high-entropy, semantically rich regions while aggressively pruning redundancy (Li et al., 26 Mar 2026).
- Codebook collapse (dead discrete codes), untuned quantization rates, or misaligned token semantics degrade downstream LM perplexity and cross-modality generalization (Li et al., 21 Jul 2025).
5. Challenges and Limitations
Several limitations are recurrently observed:
- Codebook Collapse: In VQ approaches, a significant fraction of codes may go unused; code reset, reparameterization, or soft sampling ameliorate this (Li et al., 21 Jul 2025, Jia et al., 18 Feb 2025).
- Limited Granularity: Discrete tokenization, especially with fixed K, often loses fine detail necessary for high-fidelity semantic or acoustic reconstruction. Increasing K can plateau in returns and create unbalanced code utilization (Wang et al., 2024).
- Structural Mismatch: Tokenizers trained without LM supervision may cluster based on signal density, not on modeling utility for the LM. Joint or LM-aware training partially addresses this (Turetzky et al., 2024).
- Inefficient Dependency Modeling: Standard discrete DLMs are trained with per-token cross-entropy losses, resulting in "marginal trap" phenomena in which joint constraints across tokens are not enforced, leading to incoherent outputs when sampling in parallel (Jin et al., 27 Dec 2025).
- Irreversibility and Lack of Revisability: Hard-decoded states in discrete diffusion models preclude correction of early errors; trajectory supervision and soft state maintenance are recent mitigations (Zhong et al., 12 Jan 2026).
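Of the mitigations listed above, code reset is the simplest to illustrate. The sketch below assumes one particular reset rule, re-seeding unused codes from randomly chosen encoder features in the current batch; other rules exist, and the function name is hypothetical.

```python
import numpy as np

def reset_dead_codes(codebook, features, assignments, rng):
    """Re-seed codes that received no assignments in this batch."""
    usage = np.bincount(assignments, minlength=len(codebook))
    dead = np.flatnonzero(usage == 0)
    new_codebook = codebook.copy()
    if dead.size:
        # Assumption: replace each dead code with a random feature vector,
        # sampling with replacement only if there are too few features.
        picks = rng.choice(len(features), size=dead.size,
                           replace=dead.size > len(features))
        new_codebook[dead] = features[picks]
    return new_codebook, dead
```

In practice, usage is often tracked with a moving average over many batches rather than a single batch, so that rarely used but valid codes are not reset prematurely.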
6. Emerging Directions and Best Practices
Recent and proposed enhancements span:
- Adaptive and hierarchical tokenization: Allowing per-input or per-position codebook sizes, variable-length quantization, and hierarchical vocabularies to better match input complexity and semantic structure (Jia et al., 18 Feb 2025, Li et al., 21 Jul 2025).
- Joint multimodal codebooks and unification: Merging codebooks across modalities, or aligning them in latent embedding space, to enable seamless multimodal generation and comprehension (Li et al., 21 Jul 2025).
- Hybrid continuous–discrete architectures: Combining the control and expressivity of continuous latent variables with the symbolic structure of discrete tokens (e.g., mixing prosody latents with discrete phonetic tokens in TTS) (Wang et al., 2024).
- Structure- and context-aware diffusion: Introducing position- and context-adaptive noising schedules and structured loss functions to align the diffusion process with linguistic dependencies (Jin et al., 27 Dec 2025).
- Revisable and progressive decoding: Facilitating model revisability and uncertainty maintenance throughout sequence generation, moving beyond binary mask paradigms (Zhong et al., 12 Jan 2026).
Practitioners are advised to select tokenization regimes that balance sequence length constraints, codebook utilization (empirically, roughly 80% or more of codes in active use), downstream model size, and the information density of their target domain. Monitoring both reconstruction quality and LM perplexity is essential, as minimal distortion does not guarantee semantic fidelity.
7. Evaluation Metrics and Comparative Benchmarks
DTLMs are primarily evaluated and compared on:
- Compression Ratio/Sequence Length: Reduction relative to original input (up to 18× for Z-tokens on Wikipedia/HotpotQA (Li et al., 26 Mar 2026)), with preservation of task-relevant information.
- Reconstruction Error: Direct loss or application-specific perceptual metrics.
- Language Modeling Metrics: Token-wise perplexity, bits per token (BPT), and negative log-likelihood (NLL) (Weligalle, 2 Jul 2025).
- Downstream Task Accuracy: BLEU, FID, WER, ROC-AUC, retrieval accuracy, and human listening tests, reflecting both generic and application-specific proficiency (Wang et al., 2024, Wang et al., 2024, Guo et al., 2024).
- Codebook Utilization and Efficiency: Fraction of active codes and memory/latency characteristics.
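Several of these quantities follow directly from their definitions. The sketch below assumes natural-log token probabilities as input and defines utilization as the fraction of codes that appear at least once; the function names are illustrative.

```python
import numpy as np

def compression_ratio(original_len, compressed_len):
    """Sequence-length reduction factor, e.g. 180 tokens -> 10 codes is 18x."""
    return original_len / compressed_len

def lm_metrics(token_log_probs):
    """NLL (nats/token), perplexity, and bits per token from log-probs."""
    lp = np.asarray(token_log_probs, dtype=float)
    nll = -lp.mean()
    return {
        "nll": nll,
        "perplexity": np.exp(nll),
        "bits_per_token": nll / np.log(2.0),
    }

def codebook_utilization(assignments, num_codes):
    """Fraction of the codebook that is used at least once."""
    return np.unique(assignments).size / num_codes
```

For a model that assigns probability 0.25 to every observed token, `lm_metrics` gives a perplexity of 4 and 2 bits per token. Note that perplexities are only comparable across models that share a tokenizer, since the per-token normalization depends on sequence length.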
Empirically, while autoregressive DTLMs remain dominant for generation quality and sequence coherence, diffusion-based discrete LMs achieve notable gains in decoding efficiency and parallelization under controlled conditions, albeit with an outstanding gap in cross-token dependency modeling (Weligalle, 2 Jul 2025, Jin et al., 27 Dec 2025). MoI (Mixture of Inputs) approaches illustrate that blending distributional model state with discrete token selection during generation can further enhance downstream performance with negligible computational cost (Zhuang et al., 20 May 2025).
Discrete token-based language modeling has evolved into an ecosystem spanning subword analysis, learned quantization, autoregressive and diffusion-based decoding, operating across text, speech, vision, and structured domains. Continued progress is likely to arise from adaptive, semantically aligned tokenization, hybridized inference, and structure-aware training protocols that reconcile compression efficiency with faithful and coherent output on both single and multimodal tasks.