Bottom-Up Tokenisation Algorithms
- Bottom-up tokenisation is a data-driven method that builds subword tokens from atomic units, supporting rare words and morphologically complex languages.
- It is instantiated by algorithms like BPE, WordPiece, and LiB, each balancing factors such as frequency, statistical likelihood, and compression.
- Because exact optimisation is NP-complete, practical implementations rely on greedy heuristics that trade off token sequence length against vocabulary size.
Bottom-up tokenisation refers to a class of algorithms that construct token vocabularies by starting from atomic units (typically characters or bytes) and iteratively merging adjacent units to form larger “subword” tokens. Unlike traditional top-down or rule-based tokenisers, which operate over fixed wordlists or linguistic heuristics, bottom-up approaches are entirely data-driven and produce open-vocabulary segmentations, enabling robust handling of rare words, neologisms, and morphologically complex languages (Mielke et al., 2021, Karthika et al., 21 Jun 2025).
1. Theoretical Foundations and Formal Definition
Bottom-up tokenisation is formally framed as the process of applying a sequence of pairwise merge operations to a dataset $D \subseteq \Sigma^*$, where $\Sigma$ is a finite alphabet. Each merge $(u, v)$ consists of concatenating adjacent subwords $u$ and $v$ into a new symbol $uv$. A merge sequence $\mu = (m_1, \ldots, m_k)$ induces a vocabulary $V = \Sigma \cup \{uv : (u, v) \in \mu\}$ with $|V| = |\Sigma| + k$. The tokenisation of a string $s$ is $\mathrm{tok}_\mu(s)$, the token sequence obtained by applying the merges of $\mu$ to $s$ in order.
The canonical objective is to minimise the total output length,
$$\sum_{s \in D} |\mathrm{tok}_\mu(s)|,$$
under a budget of at most $k$ merges (i.e., a fixed vocabulary size). The central decision problem is: given $(D, k, \ell)$, does there exist a merge sequence $\mu$ of length at most $k$ such that $\sum_{s \in D} |\mathrm{tok}_\mu(s)| \le \ell$? This is the “bottom-up tokenisation” decision problem (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025).
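This formal model is easy to make concrete. The following is a minimal sketch; the function and variable names are illustrative rather than drawn from the cited papers:

```python
# Minimal sketch of the formal model: apply a merge sequence mu in order,
# then evaluate the total-output-length objective over a dataset D.

def apply_merges(text: str, merges: list[tuple[str, str]]) -> list[str]:
    tokens = list(text)  # atomic units: characters
    for u, v in merges:  # each merge rewrites every adjacent (u, v) pair
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (u, v):
                out.append(u + v); i += 2
            else:
                out.append(tokens[i]); i += 1
        tokens = out
    return tokens

D = ["abab", "aabb"]
mu = [("a", "b"), ("ab", "ab")]                  # k = 2 merges
print(sum(len(apply_merges(s, mu)) for s in D))  # 4 = |[abab]| + |[a, ab, b]|
```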
2. Algorithmic Instantiations: Byte-Pair Encoding, WordPiece, LiB, and Variants
The most widely used bottom-up tokenisation algorithm is Byte-Pair Encoding (BPE), which iteratively selects the most frequent adjacent symbol pair in the corpus and merges it into a new token. Each merge expands the vocabulary by one and the process repeats until the desired vocabulary size is reached. The segmentation process employs a longest-match-first strategy over the constructed vocabulary (Mielke et al., 2021, Karthika et al., 21 Jun 2025).
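A compact, unoptimised version of this training loop might look as follows; production implementations maintain incremental pair counts rather than recounting each round:

```python
# Sketch of the BPE training loop: repeatedly count adjacent pairs and
# merge the most frequent one, expanding the vocabulary by one per step.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    sequences = [list(s) for s in corpus]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (u, v), _ = pairs.most_common(1)[0]  # greedy: most frequent pair
        merges.append((u, v))
        for k, seq in enumerate(sequences):  # rewrite every (u, v) as u+v
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (u, v):
                    out.append(u + v); i += 2
                else:
                    out.append(seq[i]); i += 1
            sequences[k] = out
    return merges

print(train_bpe(["low", "lower", "lowest"], 3))
# e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```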
WordPiece modifies BPE by selecting merges that maximise the likelihood increase of an n-gram language model rather than raw pair frequency, aligning merges more closely with statistical regularities (Mielke et al., 2021).
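The commonly described WordPiece selection rule scores a candidate pair by count(uv) / (count(u) · count(v)), which tracks the likelihood gain of the merge under a unigram-style model. A minimal sketch, with illustrative names:

```python
# WordPiece-style pair scoring: rank candidate merges by
# count(uv) / (count(u) * count(v)) instead of raw pair frequency.
from collections import Counter

def wordpiece_scores(sequences: list[list[str]]) -> dict[tuple[str, str], float]:
    unit_counts, pair_counts = Counter(), Counter()
    for seq in sequences:
        unit_counts.update(seq)
        pair_counts.update(zip(seq, seq[1:]))
    return {p: c / (unit_counts[p[0]] * unit_counts[p[1]])
            for p, c in pair_counts.items()}

seqs = [list("hugging"), list("hug")]
scores = wordpiece_scores(seqs)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('i', 'n') 1.0
```

On this toy input the top-scoring pair is the rare ('i', 'n'), whereas frequency-based BPE would prefer ('h', 'u') or ('u', 'g').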
Less-is-Better (LiB) tokenisation, a cognitively motivated extension, alternates between a memoriser phase (adding n-gram units that yield net token savings exceeding the type penalty) and a forgetter phase (removing units whose presence increases the total cost). The cost function to be minimised is $n_{\text{tokens}} + n_{\text{types}}$, where $n_{\text{tokens}}$ and $n_{\text{types}}$ are the number of tokens and types, respectively. Empirically, LiB yields compression and vocabulary balance improvements over BPE (Yang, 1 Mar 2024).
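A schematic sketch of the memoriser/forgetter alternation under the cost $n_{\text{tokens}} + n_{\text{types}}$; the actual LiB procedure (Yang, 1 Mar 2024) differs in its segmentation and update rules, and all names here are illustrative:

```python
# Toy LiB-style loop: keep a lexicon unit only if it lowers
# cost = (total tokens under greedy longest-match) + (lexicon size).

def segment(text: str, lexicon: set[str]) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        match = next((text[i:j] for j in range(len(text), i, -1)
                      if text[i:j] in lexicon), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

def cost(corpus: list[str], lexicon: set[str]) -> int:
    return sum(len(segment(s, lexicon)) for s in corpus) + len(lexicon)

corpus = ["thecatsat", "thecat"]
lex = set("thecas")                          # atomic units
for unit in ["the", "cat", "xyz"]:           # memoriser: keep units that pay off
    if cost(corpus, lex | {unit}) < cost(corpus, lex):
        lex |= {unit}
for unit in list(lex):                       # forgetter: drop units that now hurt
    if len(unit) > 1 and cost(corpus, lex - {unit}) <= cost(corpus, lex):
        lex -= {unit}
print(sorted(lex))  # multi-character units 'cat' and 'the' survive; 'xyz' does not
```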
Partition-cover approaches, such as GreedTok, reformulate bottom-up tokenisation as maximum-coverage set-function optimisation. Here, the goal is to select a vocabulary $V$ of size at most $k$ maximising a coverage objective $f(V)$, i.e., covering as many adjacent pairs as possible with non-overlapping substrings/tokens (Lim et al., 8 Jan 2025).
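A minimal greedy sketch of this set-function view (not GreedTok's actual implementation): at each step, pick the candidate whose non-overlapping occurrences cover the most not-yet-covered characters:

```python
# Greedy maximum coverage over candidate substrings.

def scan(word: str, covered: list[bool], token: str, mark: bool = False) -> int:
    """Count (or commit, if mark=True) non-overlapping, uncovered occurrences."""
    gain, i = 0, 0
    while (i := word.find(token, i)) != -1:
        span = range(i, i + len(token))
        if not any(covered[j] for j in span):
            if mark:
                for j in span:
                    covered[j] = True
            gain += len(token)
            i += len(token)
        else:
            i += 1
    return gain

def greedy_cover(corpus: list[str], candidates: list[str], k: int) -> list[str]:
    covered = [[False] * len(w) for w in corpus]
    chosen = []
    for _ in range(k):
        gains = {t: sum(scan(w, c, t) for w, c in zip(corpus, covered))
                 for t in candidates if t not in chosen}
        best = max(gains, key=gains.get, default=None)
        if best is None or gains[best] == 0:
            break
        chosen.append(best)
        for w, c in zip(corpus, covered):  # commit the covering occurrences
            scan(w, c, best, mark=True)
    return chosen

print(greedy_cover(["banana", "bandana"], ["an", "ana", "ban", "na", "d"], 2))
# ['an', 'd']: once 'an' is committed, overlapping candidates gain nothing
```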
3. Computational Complexity and Hardness Results
Bottom-up tokenisation is computationally intractable in the worst case. Both “Tokenisation is NP-Complete” (Whittington et al., 19 Dec 2024) and “Tokenisation over Bounded Alphabets is Hard” (Kastreva et al., 19 Nov 2025) rigorously prove that, even under restrictive conditions (e.g., a binary alphabet and a bounded merge budget), the problem is NP-complete and does not admit a polynomial-time approximation scheme unless P = NP. The proofs employ reductions from MAX-2-SAT and VERTEX COVER; for the partition-cover variant, NP-hardness is shown directly via a VERTEX COVER reduction (Lim et al., 8 Jan 2025).
Moreover, even approximating the minimal token sequence length within a factor of $1 + \epsilon$, for some constant $\epsilon > 0$, is NP-hard, a fundamental barrier that explains why all practical bottom-up tokenisation algorithms are heuristic and greedy (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025, Lim et al., 8 Jan 2025). In special cases where the merge budget or alphabet is fixed and small, brute-force enumeration is possible, but this is of theoretical rather than practical significance (Whittington et al., 19 Dec 2024).
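To make the brute-force remark concrete: for a toy corpus and a tiny merge budget, exhaustive search over merge sequences is feasible, but the search tree branches over $|V|^2$ candidate pairs at each of $k$ levels (names are illustrative):

```python
# Exhaustive search over merge sequences for tiny budgets; exponential in k.
from itertools import product

def apply_merge(tokens: list[str], u: str, v: str) -> list[str]:
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (u, v):
            out.append(u + v); i += 2
        else:
            out.append(tokens[i]); i += 1
    return out

def best_merge_sequence(corpus: list[str], k: int):
    best = (sum(map(len, corpus)), [])       # baseline: character tokenisation
    def search(seqs, merges):
        nonlocal best
        total = sum(len(s) for s in seqs)
        if total < best[0]:
            best = (total, merges)
        if len(merges) == k:
            return
        vocab = {t for s in seqs for t in s}
        for u, v in product(vocab, repeat=2):  # try every candidate pair
            search([apply_merge(s, u, v) for s in seqs], merges + [(u, v)])
    search([list(w) for w in corpus], [])
    return best

print(best_merge_sequence(["abab", "abba"], 2))
# total 4, e.g. [('a', 'b'), ('ab', 'ab')]
```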
4. Empirical Performance and Cross-Linguistic Considerations
Empirically, merge-based tokenisers such as BPE and its variants show consistently strong performance in language modeling and translation tasks across a variety of languages—particularly for those with high morphological complexity or limited annotated data (Mielke et al., 2021, Karthika et al., 21 Jun 2025). Bottom-up tokenisation yields open vocabularies, allowing any out-of-vocabulary sequence to be decomposed to the atomic level without mapping to “unk.”
Comparative studies across 17 Indian languages reveal nuanced trade-offs: at small vocabulary sizes, Unigram LM (a top-down probabilistic model) slightly outperforms BPE in aligning with morpheme boundaries and minimising word fragmentation, particularly in highly inflected languages. However, as vocabulary size increases (128K+), BPE closes this gap and can even surpass Unigram LM on fertility and word fragmentation rate. Larger vocabularies reliably yield fewer tokens per word and increased subword length, decreasing the sequence length required by the LLM but increasing memory and compute cost in the embedding and output layers (Karthika et al., 21 Jun 2025).
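For reference, fertility (average tokens per word) and word fragmentation rate (share of words split into two or more tokens) are straightforward to compute; `tokenise` below is a stand-in for any trained tokeniser:

```python
# Standard per-word tokeniser metrics used in the comparison above.

def fertility_and_fragmentation(words: list[str], tokenise) -> tuple[float, float]:
    counts = [len(tokenise(w)) for w in words]
    fertility = sum(counts) / len(counts)            # avg tokens per word
    fragmentation = sum(c > 1 for c in counts) / len(counts)  # words split 2+
    return fertility, fragmentation

# Toy example: a character tokeniser has maximal fertility.
print(fertility_and_fragmentation(["cat", "sat"], list))  # (3.0, 1.0)
```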
In multilingual settings, bottom-up tokenisation on pooled corpora typically biases vocabulary construction toward high-resource languages. Cluster-based or language-adaptive vocabulary strategies mitigate this effect, improving segmentation fairness and downstream performance for low-resource languages. Zero-shot transfer of tokenisers from high-resource to low-resource related languages is empirically viable, with reasonable fertility and compression retained (Karthika et al., 21 Jun 2025).
5. Byte-Level and Partition Approaches
Byte-level bottom-up tokenisation, as instantiated by UTF8Tokenizer, bypasses traditional subword merging entirely: each UTF-8 byte is treated as its own atomic token, mapping directly to its numerical value in $\{0, \ldots, 255\}$. No out-of-range or auxiliary IDs are used. C0 ASCII control bytes (0x00–0x1F, 0x7F) serve as unambiguous markers for special functions (padding, sequence boundaries, segment demarcation). The embedding table is a simple $256 \times d$ matrix, trivial to share across models. Enhancements such as bit-biased embeddings exploit internal byte structure at training time without increasing inference cost. This approach yields 14× faster tokenisation and 8× lower host-to-device memory requirements than standard BPE-style methods, and modest improvements in language modeling convergence (Moryossef et al., 19 Oct 2025).
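A sketch of the byte-level scheme; the exact control-byte conventions of UTF8Tokenizer may differ, and `PAD = 0x00` is an assumption for illustration:

```python
# Byte-level tokenisation: token IDs are just the UTF-8 byte values,
# so no vocabulary, merge table, or out-of-range special IDs are needed.

def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))   # each byte maps to its own ID

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

PAD = 0x00  # one plausible C0 control byte repurposed as a padding marker

ids = encode("héllo")
print(ids)                  # [104, 195, 169, 108, 108, 111]
assert decode(ids) == "héllo"
```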
Partition-cover approaches, typified by GreedTok, select tokens (substrings) to maximise non-overlapping coverage under a size constraint. GreedTok empirically achieves 3–5% better compression than BPE at comparable vocabulary sizes and approaches the theoretical maximum coverage approximation ratio. The selected token sets diverge substantially from those of BPE beyond the first few hundred merges, reflecting the restrictive local optimality of frequency-based merging (Lim et al., 8 Jan 2025).
| Algorithm | Greedy/Heuristic | Objective Optimised | Notable Empirical Result |
|---|---|---|---|
| BPE | Yes | Most frequent pair merging | 30k–50k merges optimal for English (Mielke et al., 2021) |
| LiB | Yes | Token count plus type count | Lowest bits-per-character on CTB8/BRphon (Yang, 1 Mar 2024) |
| GreedTok | Yes | Partition-cover set function | 3–5% fewer tokens/word vs. BPE (Lim et al., 8 Jan 2025) |
| UTF8Tokenizer | Not applicable | None—bytes are tokens | 14× speedup, 8× less memory (Moryossef et al., 19 Oct 2025) |
6. Practical Implications and Best Practices
The computational intractability of optimal bottom-up tokenisation necessitates empirical, greedy, or approximation-based solutions. BPE, LiB, partition-cover relaxations, and other heuristic models all operate under this constraint. For LLM pretraining, partition-cover approaches such as GreedTok provide measurable reductions in token counts and vocabulary sizes, while also supporting integration of domain- or language-specific substrings. Bottom-up frameworks naturally support open vocabulary, straightforward handling of neologisms, and compositional robustness.
Practitioners should tune vocabulary size to the morphological properties of the target language; e.g., 30–50k merges suffice for English, while lower sizes yield better generalisation in polysynthetic languages. For multilingual applications, cluster-based or language-adaptive variants are recommended. In non-Latin scripts lacking explicit word boundaries, cognitively motivated models such as LiB efficiently recover “implicit” word and multiword units without additional annotation (Yang, 1 Mar 2024, Karthika et al., 21 Jun 2025).
Pure byte-level tokenisation (e.g., UTF8Tokenizer) is particularly suited for hardware-constrained or streaming scenarios, achieving maximal speed and minimal memory cost at the expense of longer sequences—an acceptable trade-off for many robust LLMs (Moryossef et al., 19 Oct 2025).
7. Connections to Compression and Morphological Segmentation
Bottom-up tokenisation is deeply connected to dictionary-based and grammar-based compression, as well as unsupervised morphological segmentation. The reduction of the tokenisation problem to straight-line programs (context-free grammars in Chomsky normal form) and set cover/maximum coverage instantiates these links formally (Whittington et al., 19 Dec 2024, Lim et al., 8 Jan 2025). Bottom-up tokenisers can be seen as greedy grammar compressors under a rule-budget. Empirical segmentation quality, however, depends on the linguistic fit—pure unsupervised morph-based segmentation (e.g., Morfessor) can yield linguistically coherent units but may not improve downstream model performance over simpler BPE-style merges (Mielke et al., 2021).
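The grammar connection can be stated concretely: each merge $(u, v) \to uv$ is exactly a binary (Chomsky-normal-form) rule of a straight-line program whose terminals are the atomic symbols, as in this illustrative sketch:

```python
# Merges as a straight-line program: one binary rule per merge, with
# terminals being the atomic characters.

def merges_to_slp(merges: list[tuple[str, str]]) -> dict[str, tuple[str, str]]:
    return {u + v: (u, v) for u, v in merges}

def expand(symbol: str, rules: dict[str, tuple[str, str]]) -> str:
    if symbol not in rules:          # terminal: an atomic character
        return symbol
    u, v = rules[symbol]
    return expand(u, rules) + expand(v, rules)

rules = merges_to_slp([("a", "b"), ("ab", "c"), ("abc", "abc")])
for lhs, (u, v) in rules.items():
    print(f"{lhs} -> {u} {v}")       # ab -> a b, abc -> ab c, abcabc -> abc abc
print(expand("abcabc", rules))       # derivation bottoms out in a, b, c
```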
A plausible implication is that advances in text compression algorithms and submodular optimisation could inform the development of more effective approximation strategies for bottom-up tokenisation, within the provable performance and efficiency limits delineated by the NP-completeness and inapproximability results.