Tokenization as Pre-compression

Updated 12 September 2025
  • Tokenization as pre-compression is a method that converts raw sequential data into compact, invertible token representations, reducing redundancy while maximizing information density.
  • It employs algorithmic techniques such as BPE, GreedTok, and PathPiece, guided by information theory principles like Shannon and Rényi efficiency to balance token frequency and performance.
  • In multimodal domains, advanced tokenization methods optimize visual and sequential data compression, enabling scalable hardware acceleration and improved inference speed.

Tokenization as pre-compression is the process by which raw data—most commonly natural language text or, more generally, sequential data—are mapped to a more compact, discrete representation before being processed by models such as LLMs or vision transformers. Its design and optimization derive from principles in information theory and data compression, and it serves as the analog to a back-end compressor within machine learning pipelines. Pre-compression via tokenization aims to minimize sequence length, reduce redundancy, and maximize information density per token, while maintaining invertibility and supporting efficient downstream inference and training.

1. Theoretical Foundations: Tokenization as Dictionary-Based Pre-compression

Tokenization can be formalized as an invertible function $t : \Sigma^* \to D^*$, where $\Sigma$ is the alphabet (e.g., Unicode codepoints) and $D$ is a set of tokens. By requiring that $t$ be bijective (or at least invertible on its range), the process is lossless, establishing a direct parallel to lossless channel coding (Zouhar et al., 2023).

This dictionary-coding process mirrors classical data compression, mapping recurrent patterns into dictionary entries (tokens) that replace longer substrings with compact indices. Prominent tokenization algorithms—such as Byte-Pair Encoding (BPE), Unigram LM, and recent extensions like GreedTok (Lim et al., 8 Jan 2025) or MultiTok (Elias et al., 28 Oct 2024)—are computational analogs of dictionary compressors. Advanced variants, for instance, PathPiece (Schmidt et al., 28 Feb 2024) or BoundlessBPE (Schmidt et al., 31 Mar 2025), explicitly relax or optimize the conventional dictionary constraints to achieve improved compression.
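
To make the dictionary-compression analogy concrete, the toy sketch below greedily learns BPE-style merges from a small word list. It is illustrative only: the corpus, merge count, and character-level base symbols are assumptions, and production tokenizers add pre-tokenization, byte-level fallback, and far larger vocabularies.

```python
# Toy sketch of dictionary-style pre-compression via BPE merges (illustrative;
# the corpus, merge count, and character-level base symbols are assumptions).
from collections import Counter

def train_bpe(corpus, num_merges):
    """Greedily learn merge rules from a list of words, using characters as base symbols."""
    words = Counter(tuple(w) for w in corpus)  # word (as symbol tuple) -> frequency
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the weighted corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged_symbol = best[0] + best[1]
        # Replace every occurrence of the pair with the new, longer token.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged_symbol)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe(["lower", "lowest", "newer", "newest"] * 10, num_merges=5)
print(merges)  # first learned merge on this toy corpus is ('w', 'e')
```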

From an information-theoretic standpoint, tokenization aims to minimize the expected code length when re-encoding language through the induced token distribution $p(\delta)$. The Shannon entropy

$$H(W) = -\sum_{\delta \in D} p(\delta) \log p(\delta)$$

provides a lower bound on the achievable average code length. However, minimization subject to a vocabulary budget ($|D|$) and downstream task constraints (e.g., modeling capability) is a complex optimization problem, shown to be NP-complete both for direct vocabulary selection and for bottom-up merge strategies (Whittington et al., 19 Dec 2024, Lim et al., 8 Jan 2025).
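
As an illustration of this entropy lower bound, a short sketch estimating $H(W)$ from empirical token frequencies; the token stream is made up for the example, and the base-2 logarithm yields bits per token.

```python
# Sketch: empirical Shannon entropy of a token stream, the lower bound on the
# average per-token code length discussed above (the token stream is invented).
import math
from collections import Counter

def shannon_entropy(token_stream):
    """H(W) = -sum_d p(d) * log2 p(d), estimated from token frequencies."""
    counts = Counter(token_stream)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
print(f"H(W) ≈ {shannon_entropy(tokens):.3f} bits per token")
```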

2. Compression Metrics and Information-Theoretic Efficiency

Two entropy-based efficiency measures characterize the pre-compression quality of a tokenizer:

  • Shannon Efficiency: An optimal token-level code achieves an average code length $\mathbb{L}$ satisfying $H(W) \leq \mathbb{L} \leq \lceil H(W) \rceil$. However, high-frequency tokens receive very short codes and rare tokens arbitrarily long ones, potentially leading to poor model generalization because rare tokens are under-exposed (Zouhar et al., 2023).
  • Rényi Efficiency: To penalize distributions with extreme token frequencies (either too high or too low), the Rényi entropy of order $\alpha$

$$H_\alpha(W) = \frac{1}{1-\alpha} \log \Big( \sum_{\delta \in D} p(\delta)^\alpha \Big)$$

is employed. As shown empirically, Rényi efficiency with $\alpha = 2.5$ correlates strongly with downstream performance metrics such as BLEU ($\rho = 0.78$) in machine translation, whereas mere sequence compression (ratio of token sequence lengths) has a weak or negative correlation ($\rho = -0.32$) (Zouhar et al., 2023).

The implication is that optimal tokenizers for learning do not merely aim for minimum sequence length but seek to balance the induced token frequency distribution, avoiding both overwhelmingly common and exceedingly rare tokens.
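
To make the definition concrete, the sketch below computes the Rényi entropy of order $\alpha$ from token counts and derives a normalized efficiency score. The normalization by $\log_2 |D|$ (the maximum attainable entropy for the vocabulary size) is an illustrative assumption; consult Zouhar et al. (2023) for the exact efficiency definition used in the cited results.

```python
# Sketch: Rényi entropy of order alpha and a normalized efficiency score.
# The normalization by log2(vocab_size) is an illustrative choice; see
# Zouhar et al. (2023) for the exact definition behind the cited correlations.
import math
from collections import Counter

def renyi_entropy(token_stream, alpha=2.5):
    """H_alpha(W) = 1/(1 - alpha) * log2(sum_d p(d)^alpha), valid for alpha != 1."""
    counts = Counter(token_stream)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return (1.0 / (1.0 - alpha)) * math.log2(sum(p ** alpha for p in probs))

def renyi_efficiency(token_stream, vocab_size, alpha=2.5):
    # Normalize by the maximum attainable entropy for a vocabulary of this size.
    return renyi_entropy(token_stream, alpha) / math.log2(vocab_size)

tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
print(renyi_efficiency(tokens, vocab_size=6, alpha=2.5))
```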

3. Algorithmic and Optimization Perspectives

The practical construction of a compression-optimal tokenizer is computationally hard. Both (Whittington et al., 19 Dec 2024) and (Lim et al., 8 Jan 2025) establish the NP-completeness of the optimal tokenization problem by reducing classic combinatorial optimization problems (Max-2-SAT, vertex cover) to vocabulary selection and merge sequence identification under compression objectives.

Several classes of algorithms are distinguished:

  • Bottom-Up Merge: BPE and its variants iteratively select the most frequent adjacent pairs or sub-sequences for merging (greedily), building up a token vocabulary. This process maximizes local frequency-based compression but is not globally optimal.
  • Direct Cover/Partition: GreedTok and weighted maximum coverage methods select substrings/tokens based on maximum “coverage” of the character adjacency graph, optimizing for aggregate compression on the corpus (Lim et al., 8 Jan 2025).
  • Optimal Segmentation: The PathPiece algorithm finds the minimum-token segmentation for a document as a shortest-path problem in a directed acyclic graph, achieving the optimal compressed form for a fixed vocabulary (Schmidt et al., 28 Feb 2024).
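
The shortest-path view in the last bullet can be made concrete with a small dynamic program: given a fixed vocabulary, it returns a segmentation using the fewest tokens. This mirrors the idea behind PathPiece but is not that implementation; the vocabulary and maximum token length below are invented for illustration.

```python
# Sketch: minimum-token segmentation over a fixed vocabulary, solved as a
# shortest-path / dynamic program. Mirrors the idea behind PathPiece but is
# not that implementation; the vocabulary and max token length are invented.
def min_token_segmentation(text, vocab, max_token_len=16):
    n = len(text)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [0] * (n + 1)     # back[i] = start index of the last token ending at i
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_token_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i] = best[j] + 1
                back[i] = j
    if best[n] == INF:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers to recover the optimal segmentation.
    tokens, i = [], n
    while i > 0:
        j = back[i]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(min_token_segmentation("unbelievable", vocab))  # ['un', 'believ', 'able']
```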

Despite such advancements, all practical algorithms trade optimality for scalability, relying on approximation or greedy heuristics; given the NP-completeness results above, no polynomial-time algorithm for the globally optimal (fewest-token) vocabulary under compression-only objectives exists unless P = NP.

4. Empirical Findings: Compression and Downstream Model Performance

Systematic ablation studies reveal a nuanced relationship between pre-compression and model quality:

  • Tokenizers achieving better compression, measured by shorter tokenized sequence lengths for the same text, are strongly associated with improved downstream performance, particularly in generative settings (summarization, question generation) and for smaller models (Goldman et al., 10 Mar 2024). This correlation persists across typologically divergent languages (e.g., English versus Turkish), suggesting a universal aspect of tokenization's pre-compression benefits.
  • Over-aggressive compression, such as via maximally long tokens or overly large vocabularies, can degrade downstream accuracy due to poor rare-word coverage and the risk of fragmenting semantically coherent units. For example, the “Identity” pre-tokenization regime can reduce the number of tokens by >30% but yields a drastic drop in code generation performance (Dagan et al., 1 Feb 2024).
  • Tokenization algorithms optimized exclusively for minimum token count (e.g., via PathPiece) do not consistently outperform those with more balanced segmentation (e.g., BPE or Unigram LM) (Schmidt et al., 28 Feb 2024). Key additional factors include pre-tokenization regular expressions, vocabulary initialization (such as BPE or Unigram), the segmentation algorithm itself, and alignment with the underlying linguistic or task structure.

The net effect is that while compression is an excellent intrinsic predictor of tokenizer quality—especially for generation and for low-resource/language-agnostic settings—holistic design must balance compression with linguistic, morphological, and structural properties, as well as vocabulary-management trade-offs.
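
For reference, the intrinsic compression measures cited above (bytes per token and relative sequence length) can be computed for any tokenizer exposed as a callable; the `tokenize(text) -> list` interface and the character/word baselines below are assumptions for illustration, not a specific library API.

```python
# Sketch of the intrinsic compression measures above, for any tokenizer exposed
# as a callable `tokenize(text) -> list`. The interface and the character/word
# baselines are illustrative, not a specific library API.
def bytes_per_token(text, tokenize):
    """Higher is better: more UTF-8 bytes packed into each token on average."""
    return len(text.encode("utf-8")) / len(tokenize(text))

def length_ratio(text, tokenize_a, tokenize_b):
    """Sequence-length ratio of A relative to B (< 1 means A compresses more)."""
    return len(tokenize_a(text)) / len(tokenize_b(text))

char_tokenize = list        # character-level baseline
word_tokenize = str.split   # whitespace-word baseline
sample = "tokenization acts as a pre-compression stage before the model"
print(bytes_per_token(sample, char_tokenize), bytes_per_token(sample, word_tokenize))
print(length_ratio(sample, word_tokenize, char_tokenize))
```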

5. Extensions: Tokenization as Pre-compression in Vision and Multimodal Domains

In vision transformers and image generative models, tokenization and its pre-compression function extend beyond text:

  • Variable-Length and Quality-Controllable Tokenization: Approaches such as One-D-Piece (Miwa et al., 17 Jan 2025) and ResiTok (Liu et al., 3 May 2025) define a latent code space (via a ViT encoder and vector quantization) and then organize the codebook indices into hierarchical or truncated token sequences. The One-D-Piece “Tail Token Drop” mechanism achieves quality-controlled variable compression, concentrating critical image information in the sequence head and allowing tokens to be dropped or truncated for bandwidth-adaptive transmission.
  • Group-wise Discrete Tokenizers: WeTok (Zhuang et al., 7 Aug 2025) introduces group-wise lookup-free quantization to scale codebooks efficiently (lowering memory requirements) while ensuring high-fidelity reconstructions, particularly at extreme compression ratios (e.g., rFID of 0.12 on ImageNet).
  • Layer-wise Token Compression: In vision transformers, direct token reduction during inference and training (via matrix-based aggregation, pruning, or merging) can be seen as operator-level pre-compression. Methods such as Prune and Merge (Mao et al., 30 Mar 2025) and Token Transforming (Zeng et al., 6 Jun 2025) enable substantial computational reduction (e.g., 1.5–1.6× speedup, 40% FLOPs reduction on DeiT-S) by dynamically aggregating tokens while preserving semantically and spatially important information.
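
As a rough illustration of operator-level token reduction in vision transformers, the sketch below repeatedly merges the most similar adjacent tokens by averaging. It is a generic similarity-based merging step under stated assumptions, not the Prune and Merge or Token Transforming operator; a real method would typically protect the class token and reweight merged tokens.

```python
# Sketch: generic similarity-based token merging for a ViT layer, illustrating
# operator-level pre-compression. Not the Prune and Merge or Token Transforming
# operator; a real method would typically protect the class token.
import numpy as np

def merge_adjacent_tokens(tokens, num_merges):
    """tokens: (N, d) array; averages the `num_merges` most similar adjacent pairs."""
    x = tokens.copy()
    for _ in range(num_merges):
        a, b = x[:-1], x[1:]
        # Cosine similarity between each token and its right-hand neighbour.
        sims = (a * b).sum(-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
        )
        i = int(sims.argmax())              # most redundant adjacent pair
        merged = (x[i] + x[i + 1]) / 2.0    # fuse the pair into a single token
        x = np.concatenate([x[:i], merged[None], x[i + 2:]], axis=0)
    return x

tokens = np.random.randn(197, 384)          # e.g., a DeiT-S-like sequence (196 patches + CLS)
reduced = merge_adjacent_tokens(tokens, num_merges=78)
print(reduced.shape)                        # (119, 384): roughly 40% fewer tokens
```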

Compression in these settings is measured both by sequence length reduction and downstream task-specific metrics such as PSNR, rFID, or semantic consistency under lossy token transmission conditions.

6. Implementation Considerations and Hardware Efficiency

Tokenization, as a pre-compression operator, is a critical computational bottleneck, especially at scale:

  • Parallel and Hardware-Accelerated Tokenization: BlockBPE (You, 16 Jul 2025) eliminates regex pre-tokenization, enabling full GPU parallelization of BPE merges and achieving up to 2.5× throughput improvement on high-batch inference over traditional libraries. Its architecture closely mirrors the workflow of block compression schemes, with each merge pass implemented as parallel reductions and index compaction.
  • Compression–Quality Trade-offs: Higher compression (and hence faster batch processing) can come at the cost of linguistic and tokenization quality, especially in sensitive tasks such as math or code generation. Simplifying pre-tokenization to bytes, while computationally optimal, may degrade the treatment of domain-specific or morphologically complex structures.
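
To see what eliminating regex pre-tokenization changes, the sketch below splits text with a simplified GPT-2-style pattern (an approximation, not the production regex) and contrasts it with the raw byte stream over which a byte-level pipeline such as BlockBPE would merge.

```python
# Sketch: regex pre-tokenization splits text into chunks that merges may not
# cross; byte-level pipelines that drop it (as BlockBPE does) operate on the
# raw byte stream. The pattern is a simplified GPT-2-style regex, not the
# exact production pattern.
import re

PRETOKENIZE = re.compile(r" ?\w+| ?[^\w\s]+|\s+")

text = "def add(a, b): return a + b"
chunks = PRETOKENIZE.findall(text)
print(chunks)                            # merges stay within chunks such as ' add', '(', ' return'
print(list(text.encode("utf-8"))[:8])    # byte-level view: merges may span any boundary
```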

In LLM pretraining and inference, improved pre-compression increases the effective input window, reduces latency and compute, and improves scaling efficiency. However, tokenizer design must account for downstream application requirements, hardware deployment constraints, and possible side effects on the data distribution.

7. Directions for Future Research and Tokenization Objectives

The literature converges on several open problems and future directions:

  • Objective Functions Beyond Compression: While token compression metrics (sequence length, bits per byte, entropy) remain central, information-theoretic criteria such as Rényi efficiency (Zouhar et al., 2023, Schmidt et al., 31 Mar 2025) and coverage-based objectives (Lim et al., 8 Jan 2025) better capture learnability and robustness. Investigating hybrid or multi-objective criteria—balancing compression, uniformity, and token informativeness—remains crucial.
  • Approximation and Hybrid Algorithms: With the NP-complete nature of optimal tokenization, efficient approximation algorithms (e.g., GreedTok, coverage relaxations) are necessary. Hybrid methods, combining direct vocabulary selection and bottom-up merges, may yield new tractable optima.
  • Breaking Pre-tokenization Constraints: Methods that relax or bypass traditional pre-tokenization, such as the superword merges of BoundlessBPE (Schmidt et al., 31 Mar 2025), can improve both compression and generalization (as measured by Rényi efficiency), suggesting new research avenues that bridge NLP tokenization and generic dictionary compression.

This synthesis establishes tokenization as a fundamental pre-compression phase in sequential modeling, anchored in information-theoretic, combinatorial, and algorithmic principles. It underscores tokenization's dual role as an efficiency mechanism and as a representational interface between raw data and neural architectures, positioning it as a locus of significant theoretical and practical importance in contemporary machine learning pipelines.
