
Information-Driven Tokenisation

Updated 27 December 2025
  • Information-driven tokenisation is a method that discretizes data using explicit information-theoretic criteria for optimal fidelity and compression.
  • It employs rate–distortion and mutual information objectives to balance semantic alignment, data compression, and encoding efficiency.
  • Key techniques include vector, residual, and product quantization alongside adaptive, semantic approaches to enhance downstream tasks like translation and video compression.

Information-driven tokenization is a paradigm in which the discretization of data—text, audio, images, or video—into tokens is governed by explicit information-theoretic criteria. Unlike traditional methods such as Byte Pair Encoding (BPE) or WordPiece, which rely on surface frequency statistics or fixed heuristics, information-driven approaches balance fidelity, semantic alignment, and compression by optimizing objectives such as entropy, mutual information, or rate–distortion. This class of tokenization underpins advances across language modeling, data compression, semantic understanding, and multimodal systems, and forms the connective interface between raw signals and high-level machine learning architectures.

1. Formal Frameworks and Objectives

Information-driven tokenization operationalizes token learning by minimizing a trade-off between informativeness and compression, formalized either through rate–distortion theory or mutual information maximization. Two canonical formulations are widely used (Jia et al., 18 Feb 2025):

  • Rate–Distortion Objective:

\min_{Enc,\, Q,\, Dec}\; D + \lambda R,

where D = \mathbb{E}_X\left[\| X - Dec(C_{Q}(Enc(X)))\|^2_2\right] (expected reconstruction error), R = H(Q(Enc(X))) (entropy of the token distribution), and \lambda controls the fidelity–compression trade-off.

  • Mutual Information Objective:

\max_{Enc,\, Q}\; I(X; T) \quad \text{s.t.}\quad H(T) \le R_{\max},

with T = Q(Enc(X)) and I(X;T) = H(T) - H(T \mid X) representing the mutual information between the input and its tokens.

The design includes three tightly coupled modules: encoding (feature extraction), quantization (discretization), and supervision (reconstruction or predictive loss), jointly trained to optimize information metrics (Jia et al., 18 Feb 2025).
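
As a concrete illustration, the following minimal NumPy sketch evaluates the rate–distortion objective for a toy scalar quantizer: D is the mean squared reconstruction error, R is the empirical entropy of code usage, and λ weighs the two. The Gaussian signal, the codebook construction, and the λ value are illustrative assumptions, not settings from the cited survey.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)            # toy continuous "signal"

    def rate_distortion(x, codebook, lam):
        # Quantization: assign each sample to its nearest codeword
        idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
        x_hat = codebook[idx]
        # D: expected squared reconstruction error
        D = np.mean((x - x_hat) ** 2)
        # R: empirical entropy (in nats) of the token distribution
        p = np.bincount(idx, minlength=len(codebook)) / len(idx)
        R = -np.sum(p[p > 0] * np.log(p[p > 0]))
        return D + lam * R, D, R

    # Larger codebooks lower the distortion D but raise the rate R;
    # lambda determines where the trade-off settles.
    for K in (4, 16, 64):
        codebook = np.quantile(x, np.linspace(0.01, 0.99, K))  # crude codeword placement
        loss, D, R = rate_distortion(x, codebook, lam=0.05)
        print(f"K={K:3d}  D={D:.4f}  R={R:.3f} nats  D+lambda*R={loss:.4f}")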

2. Algorithmic Mechanisms and Variants

The survey of discrete tokenization mechanisms (Jia et al., 18 Feb 2025) identifies several principal approaches, each differing in encoder architecture and quantization strategy:

  • Vector Quantization (VQ): Assign continuous encoder outputs to learned centroids, with straight-through gradient estimators and commitment loss. VQ-VAE and VQ-GAN exemplify this pattern.
  • Residual Quantization (RQ): Quantize residuals iteratively across multiple stages, reducing overall distortion and allowing hierarchical information partitioning.
  • Product Quantization (PQ): Decompose latent vectors into orthogonal groups, quantizing each with a separate codebook for efficient representation.
  • Finite Scalar and Lookup-Free Quantization (FSQ/LFQ): Quantize low-dimensional projections via rounding or sign mapping, producing large implicit codebooks without explicit codebook storage.
  • Supervision: Decoders reconstruct (or predict) the input, with losses that can combine reconstruction, commitment, and entropy regularization terms.

Algorithmic sketch (Jia et al., 18 Feb 2025):

Input: x; codebook C with K entries
z ← Enc(x)
// Quantization: nearest-codeword assignment
For each zᵢ:
    jᵢ ← argmin_k ||zᵢ - C[k]||²
    tᵢ ← C[jᵢ]
// Reconstruction
x̂ ← Dec({tᵢ})
// Loss: reconstruction + commitment − entropy regularizer on code usage
L ← ||x - x̂||² + β·L_cmt − λ·H({jᵢ})
// Gradients pass through the quantizer via the straight-through estimator
Backpropagate via STE
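
The pseudocode above can be made concrete with a minimal PyTorch sketch. The encoder and decoder here are stand-in linear layers, and the entropy bonus is approximated with differentiable soft code-usage frequencies; these are illustrative choices rather than the survey's exact formulation.

    import torch
    import torch.nn.functional as F

    class ToyVQ(torch.nn.Module):
        """Nearest-codeword quantization with a straight-through estimator (STE)."""
        def __init__(self, num_codes=64, dim=16, beta=0.25, lam=0.01):
            super().__init__()
            self.codebook = torch.nn.Parameter(torch.randn(num_codes, dim))
            self.beta, self.lam = beta, lam

        def forward(self, z):
            d = torch.cdist(z, self.codebook)                 # ||z_i - C[k]||
            j = d.argmin(dim=1)                               # j_i <- argmin_k
            t = self.codebook[j]                              # t_i <- C[j_i]
            # Commitment + codebook terms (VQ-VAE style L_cmt)
            l_cmt = F.mse_loss(z, t.detach()) + F.mse_loss(z.detach(), t)
            # Differentiable stand-in for the entropy bonus H({j_i}): soft usage frequencies
            p = F.softmax(-d, dim=1).mean(dim=0)
            ent = -(p * (p + 1e-9).log()).sum()
            # STE: forward pass uses t, gradients flow to z as if quantization were identity
            t_ste = z + (t - z).detach()
            return t_ste, self.beta * l_cmt - self.lam * ent

    enc, dec, vq = torch.nn.Linear(32, 16), torch.nn.Linear(16, 32), ToyVQ()
    x = torch.randn(8, 32)
    t, aux = vq(enc(x))
    x_hat = dec(t)
    loss = F.mse_loss(x_hat, x) + aux      # L = ||x - x_hat||^2 + beta*L_cmt - lambda*H
    loss.backward()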

3. Tokenization Guided by Source Information Content

Information-theoretic metrics, such as entropy and surprisal, enable adaptive or locally optimal tokenization (Zouhar et al., 2023, Ye et al., 18 Dec 2025, Goriely et al., 23 Jun 2025):

  • Shannon & Rényi Entropy: The average information content of token distributions H=pilogpiH = -\sum p_i \log p_i, and the generalized Rényi entropy Hα(P)=11αlog(piα)H_\alpha(P) = \frac{1}{1-\alpha} \log(\sum p_i^\alpha), are used to define channel efficiency and guide vocabulary selection. Maximizing channel efficiency Eα=Hα(P)/logVE_\alpha = H_\alpha(P)/\log V (with α2.5\alpha \approx 2.5) yields strong prediction of downstream BLEU scores in machine translation (Zouhar et al., 2023).
  • Surprisal-Driven Segmentation (ByteSpan): By leveraging per-byte surprisal or entropy from an external LLM, ByteSpan segments predictable spans into subwords, aligning token boundaries with morpheme structure and maximizing compression (Goriely et al., 23 Jun 2025). The algorithm groups low-surprisal bytes, positing boundaries at unpredictability spikes.
  • Adaptive Token Budgets (InfoTok): For video, allocation of token counts per segment is modulated by local ELBO, approximating minimal sufficient code length according to the data’s information density (Ye et al., 18 Dec 2025). This results in provably near-optimal compression with variable-length token streams.
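
The Rényi efficiency above and a simplified surprisal-threshold boundary rule in the spirit of ByteSpan can be sketched as follows; the threshold rule and the toy counts are illustrative simplifications, not the published algorithms.

    import math
    from collections import Counter

    def renyi_efficiency(token_counts, alpha=2.5):
        """E_alpha = H_alpha(P) / log V for a unigram token distribution."""
        total = sum(token_counts.values())
        p = [c / total for c in token_counts.values()]
        if abs(alpha - 1.0) < 1e-9:                      # Shannon limit
            h = -sum(pi * math.log(pi) for pi in p)
        else:
            h = math.log(sum(pi ** alpha for pi in p)) / (1 - alpha)
        return h / math.log(len(p))

    def surprisal_segment(byte_surprisals, threshold=3.0):
        """Group bytes into spans, opening a new span at surprisal spikes."""
        spans, start = [], 0
        for i, s in enumerate(byte_surprisals[1:], start=1):
            if s > threshold:                            # unpredictability spike -> boundary
                spans.append((start, i))
                start = i
        spans.append((start, len(byte_surprisals)))
        return spans

    counts = Counter({"the": 500, "token": 40, "##iza": 5, "##tion": 5, "entropy": 12})
    print(round(renyi_efficiency(counts, alpha=2.5), 3))
    print(surprisal_segment([4.1, 0.3, 0.2, 3.8, 0.5, 0.4, 0.1, 5.0, 0.9]))
    # -> [(0, 3), (3, 7), (7, 9)]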

4. Linguistically and Semantically Informed Approaches

Hybrid and semantic tokenization frameworks integrate linguistic rules to maximize semantic coherence of token boundaries, a hallmark of information-driven design in language applications:

  • Hybrid Morphological/BPE Tokenization: The Turkish-centric framework "Tokens with Meaning" (Bayram et al., 19 Aug 2025) combines a rule-based longest-match morphological root–affix dictionary (with phonological normalization and allomorph handling) with a statistical BPE fallback. This yields a high Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%) on TR-MMLU, substantially outperforming frequency-based models; a simplified sketch of the longest-match-plus-fallback idea follows this list.
  • Semantic Subword Tokenization: By partitioning the vocabulary into a semantic segment (stems, suffixes) and a BPE segment, and enforcing stem+suffix preference at encoding, the Semantic Tokenizer (Mehta et al., 2023) achieves twice the distinct wordform coverage and notable improvements in word/sentence embedding quality, as well as CoLA and QQP tasks on GLUE.
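
A minimal sketch of the longest-match-plus-fallback idea referenced above, assuming hypothetical root and suffix dictionaries and an arbitrary fallback tokenizer; the published framework additionally performs phonological normalization and allomorph handling, which are omitted here.

    def segment_word(word, roots, suffixes, bpe_fallback):
        """Longest-match root + suffix segmentation with a statistical fallback.

        `roots` and `suffixes` are illustrative sets; `bpe_fallback` is any
        callable mapping a string to a list of subword tokens.
        """
        # Try the longest dictionary root that prefixes the word
        for cut in range(len(word), 0, -1):
            root, rest = word[:cut], word[cut:]
            if root not in roots:
                continue
            pieces = [root]
            # Greedily consume known suffixes (longest match first)
            while rest:
                match = next((s for s in sorted(suffixes, key=len, reverse=True)
                              if rest.startswith(s)), None)
                if match is None:
                    break
                pieces.append("##" + match)
                rest = rest[len(match):]
            if not rest:                        # fully explained by root + suffixes
                return pieces
        return bpe_fallback(word)               # otherwise fall back to BPE pieces

    # Toy Turkish-flavoured example with hypothetical dictionaries
    roots = {"ev", "kitap"}
    suffixes = {"ler", "de", "im"}
    print(segment_word("evlerde", roots, suffixes, bpe_fallback=lambda w: list(w)))
    # -> ['ev', '##ler', '##de']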

5. Multimodal and Unified Information-Driven Tokenization

Recent work generalizes information-driven tokenization across modalities—text, image, audio, and video—using information bottleneck principles:

  • Generative Information Bottleneck (GenIB): For stochastic tokenization, GenIB (Wei et al., 2 Jul 2025) implements a variational rate–distortion objective that minimizes I(X;T) subject to the informativeness constraint I(\hat{T};X) \geq \chi. Variational bounds and neural reparameterization allow efficient training of tokenizers for multimodal data; a schematic loss in this spirit is sketched after this list.
  • σ-GenIB: Addresses variance collapse by fixing the covariance of token distributions and balancing deterministic and stochastic losses, stabilizing representation diversity in autoregressive modeling.
  • Unified Multimodal Next-Token Modeling: Systems process both discrete (text) and continuous (visual, audio) tokens within a transformer under the next-token prediction objective, jointly supporting comprehension and generation (Wei et al., 2 Jul 2025).
  • Adaptive Video Tokenization: The InfoTok system (Ye et al., 18 Dec 2025) routes token budgets per video segment according to the negative ELBO, achieving up to 2.3× compression over heuristic adaptive baselines without performance degradation.
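
A schematic of a variational bottleneck loss for stochastic tokens, in the spirit of GenIB: the KL term against a unit-Gaussian prior upper-bounds the rate I(X;T), while a reconstruction term acts as a Lagrangian stand-in for the informativeness constraint. The Gaussian posterior, the prior, and the weights are generic variational-information-bottleneck choices used for illustration, not the exact GenIB objective.

    import torch
    import torch.nn.functional as F

    def bottleneck_loss(mu, logvar, x_hat, x, chi_weight=1.0, rate_weight=0.1):
        """Schematic variational bottleneck loss for stochastic tokenization.

        Rate term: KL(q(t|x) || N(0, I)) upper-bounds I(X; T).
        Informativeness term: reconstruction error stands in for I(T_hat; X) >= chi.
        """
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=-1).mean()
        recon = F.mse_loss(x_hat, x)
        return rate_weight * kl + chi_weight * recon

    # Usage with reparameterized "stochastic tokens"
    x = torch.randn(8, 32)
    enc = torch.nn.Linear(32, 2 * 16)          # predicts mean and log-variance
    dec = torch.nn.Linear(16, 32)
    mu, logvar = enc(x).chunk(2, dim=-1)
    t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    loss = bottleneck_loss(mu, logvar, dec(t), x)
    loss.backward()

Under this parameterization, the variance collapse addressed by σ-GenIB would correspond to logvar degenerating during training; fixing the covariance of the token distribution amounts to treating logvar as a constant.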

6. Comparative Evaluation and Empirical Results

Information-driven tokenizers demonstrate significant empirical improvements:

Task/Metric | Baseline | Information-driven Variant | Improvement/Key Result
Subword-per-word coverage (text) | WordPiece/BPE | Semantic Tokenizer (Mehta et al., 2023) | 2–3× more wordforms covered in ≤2 tokens
Morphological alignment (macro-F₁, English, 16k vocab) | BPE: 0.694 | ByteSpan: 0.899 (Goriely et al., 23 Jun 2025) | ≈+20 F₁ points vs. BPE/WordPiece
Multilingual Rényi efficiency and fertility | BPE/WordPiece | ByteSpan-mono/combo | Matches or improves; up to 10-point fertility gain in underrepresented scripts
Downstream BLEU (MT, English–German, 1M sentences) | BPE tuning | E₂.₅-efficiency maximization (Zouhar et al., 2023) | Pearson ρ(E₂.₅, BLEU) = 0.78; far stronger predictor than sequence length or Shannon H
GLUE CoLA (MCC, BERT-base) | 52.1 | 77.9 (Semantic Tokenizer) | +25.8 MCC
Video compression (BPP₁₆) | Cosmos | InfoTok (Ye et al., 18 Dec 2025) | 20–50% token savings, ≤0.06 dB from optimal
Transformer inference cost (video) | ElasticTok | InfoTok | 12× reduction in wall-clock time

A plausible implication is that information-driven tokenization provides robust improvements in both intrinsic metrics (alignment, compression) and downstream performance, especially in morphologically complex settings, content-adaptive multimodal scenarios, and high-efficiency domains.

7. Limitations and Research Directions

Despite substantial progress, information-driven tokenization faces several challenges and open directions (Jia et al., 18 Feb 2025, Goriely et al., 23 Jun 2025):

  • Compression–Fidelity Trade-off: High compression can lead to detail loss; optimizing for both generative and discriminative utility is nontrivial.
  • Codebook Collapse and Utilization: VQ-based models risk "dead" codes. Lookup-free quantization strategies partially mitigate this but introduce other distortions.
  • Cross-Modal Alignment: Achieving semantic alignment of text/image/audio/video tokens in a unified representation remains fragile.
  • Language and Morphology Bias: Byte-based models may disadvantage non-Latin scripts unless vocabulary balancing is enforced (Goriely et al., 23 Jun 2025).
  • Adaptive and Dynamic Tokenization: Further research into per-instance entropy-based token allocation and automatic vocabulary resizing is ongoing (Ye et al., 18 Dec 2025).
  • Integration and Scalability: Ensuring robust performance in foundation models and seamless multimodal pipelines remains an active area.

Unified, information-theoretic tokenization principles continue to shape the evolution of both discrete and hybrid representation models, driving improved semantic fidelity, modality alignment, and efficiency in state-of-the-art AI systems.
