Bottom-Up Tokenisation Algorithms

Updated 21 November 2025
  • Bottom-up tokenisation is a data-driven method that builds subword tokens from atomic units, supporting rare words and morphologically complex languages.
  • It is instantiated by algorithms like BPE, WordPiece, and LiB, each balancing factors such as frequency, statistical likelihood, and compression.
  • Because exact optimisation is NP-complete, practical implementations rely on greedy heuristics that trade off token-sequence length against vocabulary size.

Bottom-up tokenisation refers to a class of algorithms that construct token vocabularies by starting from atomic units (typically characters or bytes) and iteratively merging adjacent units to form larger “subword” tokens. Unlike traditional top-down or rule-based tokenisers, which operate over fixed wordlists or linguistic heuristics, bottom-up approaches are entirely data-driven and produce open-vocabulary segmentations, enabling robust handling of rare words, neologisms, and morphologically complex languages (Mielke et al., 2021, Karthika et al., 21 Jun 2025).

1. Theoretical Foundations and Formal Definition

Bottom-up tokenisation is formally framed as the process of applying a sequence of pairwise merge operations to a dataset $D=\{w_1,\ldots,w_N\}\subset\Sigma^*$, where $\Sigma$ is a finite alphabet. Each merge $m=\langle s_1, s_2\rangle$ consists of concatenating adjacent subwords $s_1$ and $s_2$ into a new symbol $s_1\circ s_2$. A merge sequence $M=[m_1,\ldots, m_\delta]$ induces a vocabulary $V(M)=\Sigma\cup\{s_1\circ s_2:\langle s_1,s_2\rangle\in M\}$ with $|V(M)|=|\Sigma|+\delta$. The tokenisation of a string $w$ is $\mathrm{tok}^\uparrow_M(w)=(\mathrm{merge}_{m_\delta}\circ \cdots\circ \mathrm{merge}_{m_1})(w)$.

The canonical objective is to minimise the total output length,

$$L(D;M) = \sum_{w\in D} \left|\mathrm{tok}^\uparrow_M(w)\right|$$

under a budget of at most $\delta$ merges (i.e., fixed vocabulary size). The central decision problem is: given $(D, \delta, K)$, does there exist a merge sequence $M$ of length at most $\delta$ such that $L(D;M)\leq K$? This is the "bottom-up tokenisation" decision problem (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025).
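
Under these definitions, tokenising a word amounts to applying the merges in order, and the objective simply counts the resulting tokens. A minimal Python sketch (function and variable names are illustrative, not taken from the cited papers):

```python
def apply_merge(tokens, pair):
    """Replace each adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def tokenise(word, merges):
    """tok_M(word): start from characters, apply merges m_1 ... m_delta in order."""
    tokens = list(word)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

def total_length(corpus, merges):
    """L(D; M): total number of tokens over the dataset."""
    return sum(len(tokenise(w, merges)) for w in corpus)

M = [("l", "o"), ("lo", "w"), ("e", "r")]
print(tokenise("lower", M))                           # ['low', 'er']
print(total_length(["lower", "lowest", "newer"], M))  # 2 + 4 + 4 = 10
```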

2. Algorithmic Instantiations: Byte-Pair Encoding, WordPiece, LiB, and Variants

The most widely used bottom-up tokenisation algorithm is Byte-Pair Encoding (BPE), which iteratively selects the most frequent adjacent symbol pair in the corpus and merges it into a new token. Each merge expands the vocabulary by one and the process repeats until the desired vocabulary size is reached. The segmentation process employs a longest-match-first strategy over the constructed vocabulary (Mielke et al., 2021, Karthika et al., 21 Jun 2025).
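
A compact sketch of this training loop, operating on a toy word-frequency table (illustrative code, not a reference implementation):

```python
from collections import Counter

def pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for tokens, freq in word_freqs.items():
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += freq
    return counts

def merge_word(tokens, pair):
    """Apply one merge to a word represented as a tuple of symbols."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn a merge list by repeatedly merging the most frequent pair."""
    word_freqs = Counter(tuple(w) for w in words)  # words as character tuples
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(word_freqs)
        if not counts:
            break
        best = counts.most_common(1)[0][0]         # most frequent adjacent pair
        merges.append(best)
        new_freqs = Counter()
        for t, f in word_freqs.items():
            new_freqs[merge_word(t, best)] += f
        word_freqs = new_freqs
    return merges

print(train_bpe(["low", "low", "lower", "newest", "widest"], 4))
# [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```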

WordPiece modifies BPE by selecting merges that maximise the increase in likelihood under an n-gram language model rather than raw pair frequency, aligning merges more closely with statistical regularities (Mielke et al., 2021).
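
The likelihood-gain criterion is commonly approximated by scoring each pair as count(ab) / (count(a) · count(b)), so that pairs whose parts rarely occur apart are preferred over merely frequent ones. A hedged sketch of that scoring rule (a simplification; a full n-gram likelihood computation would be heavier):

```python
from collections import Counter

def wordpiece_scores(word_freqs):
    """Score each adjacent pair by count(ab) / (count(a) * count(b))."""
    pair_c, unit_c = Counter(), Counter()
    for tokens, freq in word_freqs.items():
        for t in tokens:
            unit_c[t] += freq
        for a, b in zip(tokens, tokens[1:]):
            pair_c[(a, b)] += freq
    return {p: c / (unit_c[p[0]] * unit_c[p[1]]) for p, c in pair_c.items()}

freqs = {("h", "u", "g"): 10, ("p", "u", "g"): 12, ("h", "a", "t"): 6}
scores = wordpiece_scores(freqs)
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))  # ('a', 't'): 'a' and 't' only co-occur
```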

Less-is-Better (LiB) tokenisation, a cognitive-scientifically motivated extension, alternates between a memoriser phase (adding n-gram units that yield net token savings exceeding the type penalty) and a forgetter phase (removing units whose presence increases the total cost). The cost function to be minimised is $C(V) = \alpha T(D,V) + \beta M(V)$, where $T(D,V)$ and $M(V)$ are the number of tokens and types, respectively. Empirically, LiB yields compression and vocabulary-balance improvements over BPE (Yang, 1 Mar 2024).
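
A small sketch of evaluating the LiB cost for a candidate vocabulary, assuming a greedy longest-match segmenter (the segmentation rule here is an illustrative simplification, not the paper's exact procedure):

```python
def greedy_tokens(text, vocab):
    """Longest-match-first segmentation; `vocab` must contain every single
    character of `text` so segmentation always succeeds."""
    out, i, max_len = [], 0, max(map(len, vocab))
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in vocab:
                out.append(text[i:i + n])
                i += n
                break
    return out

def lib_cost(corpus, vocab, alpha=1.0, beta=1.0):
    """C(V) = alpha * T(D, V) + beta * M(V): token count plus type count."""
    T = sum(len(greedy_tokens(w, vocab)) for w in corpus)
    M = len(vocab)
    return alpha * T + beta * M

corpus = ["thecat", "thedog"]
chars = set("".join(corpus))               # atomic units only
print(lib_cost(corpus, chars))             # 12 tokens + 8 types  = 20.0
print(lib_cost(corpus, chars | {"the"}))   # 8 tokens + 9 types   = 17.0
```

In the example, memorising "the" saves four tokens at the cost of one extra type, so the memoriser phase would keep it.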

Partition-cover approaches, such as GreedTok, reformulate bottom-up tokenisation as maximum-coverage set-function optimisation. Here, the goal is to select a vocabulary $S$ of size at most $k$ maximising $F(S)=\sum_{W} c_W\,\mathrm{cover}(W,S)$, i.e., covering as many adjacent pairs as possible with non-overlapping substrings/tokens (Lim et al., 8 Jan 2025).
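
The following sketch illustrates the greedy max-coverage flavour of this formulation; for simplicity it scores coverage in characters rather than the paper's pair counts, and all names are illustrative:

```python
from collections import Counter

def cover(word, S):
    """Max characters of `word` coverable by non-overlapping occurrences of
    strings in S (dynamic programme over prefixes)."""
    best = [0] * (len(word) + 1)
    for i in range(1, len(word) + 1):
        best[i] = best[i - 1]  # option: leave character i-1 uncovered
        for s in S:
            if i >= len(s) and word.startswith(s, i - len(s)):
                best[i] = max(best[i], best[i - len(s)] + len(s))
    return best[-1]

def greedy_select(words, candidates, k):
    """Pick up to k tokens, each time adding the candidate with the largest
    marginal gain in frequency-weighted coverage."""
    freqs, S = Counter(words), set()
    for _ in range(k):
        base = {w: cover(w, S) for w in freqs}
        gain, pick = 0, None
        for t in candidates - S:
            g = sum(f * (cover(w, S | {t}) - base[w]) for w, f in freqs.items())
            if g > gain:
                gain, pick = g, t
        if pick is None:
            break
        S.add(pick)
    return S

print(greedy_select(["banana", "bandana", "banana"],
                    {"ban", "ana", "an", "na", "and"}, 2))
# e.g. {'an', 'ban'} (tie-breaking depends on iteration order)
```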

3. Computational Complexity and Hardness Results

Bottom-up tokenisation is computationally intractable in the worst case. Both “Tokenisation is NP-Complete” (Whittington et al., 19 Dec 2024) and “Tokenisation over Bounded Alphabets is Hard” (Kastreva et al., 19 Nov 2025) rigorously prove that, even under restrictive conditions (e.g., a binary alphabet $\Sigma$ and a limited number of merges), the problem is NP-complete and does not admit a polynomial-time approximation scheme unless P=NP. The proofs employ reductions from MAX-2-SAT and VERTEX COVER; for the partition-cover variant, NP-hardness is shown directly via a VERTEX COVER reduction (Lim et al., 8 Jan 2025).

Moreover, even approximating the minimal token sequence within a factor of $1+\epsilon$ for any constant $\epsilon>0$ is NP-hard, a fundamental barrier that explains why all practical bottom-up tokenisation algorithms are heuristic and greedy (Whittington et al., 19 Dec 2024, Kastreva et al., 19 Nov 2025, Lim et al., 8 Jan 2025). In special cases where the merge budget or alphabet is fixed and small, brute-force enumeration is possible, but this is of theoretical rather than practical significance (Whittington et al., 19 Dec 2024).

4. Empirical Performance and Cross-Linguistic Considerations

Empirically, merge-based tokenisers such as BPE and its variants show consistently strong performance in language modeling and translation tasks across a variety of languages—particularly for those with high morphological complexity or limited annotated data (Mielke et al., 2021, Karthika et al., 21 Jun 2025). Bottom-up tokenisation yields open vocabularies, allowing any out-of-vocabulary sequence to be decomposed to the atomic level without mapping to “unk.”

Comparative studies across 17 Indian languages reveal nuanced trade-offs: at small vocabulary sizes, Unigram LM (a top-down probabilistic model) slightly outperforms BPE in aligning with morpheme boundaries and minimising word fragmentation, particularly in highly inflected languages. However, as vocabulary size increases (128K+), BPE closes this gap and can even surpass Unigram LM on fertility and word fragmentation rate. Larger vocabularies reliably yield fewer tokens per word and increased subword length, decreasing the sequence length required by the LLM but increasing memory and compute cost in the embedding and output layers (Karthika et al., 21 Jun 2025).
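
For reference, fertility (mean tokens per word) and word fragmentation rate (share of words split into two or more tokens) can be computed as below; `tokenise` stands in for any trained subword tokeniser, and the toy splitter is purely illustrative:

```python
def fertility(words, tokenise):
    """Mean number of subword tokens per word."""
    return sum(len(tokenise(w)) for w in words) / len(words)

def fragmentation_rate(words, tokenise):
    """Fraction of words split into two or more tokens."""
    return sum(len(tokenise(w)) > 1 for w in words) / len(words)

toy = lambda w: [w] if len(w) <= 4 else [w[:4], w[4:]]    # stand-in tokeniser
print(fertility(["cat", "tokenisation"], toy))            # 1.5
print(fragmentation_rate(["cat", "tokenisation"], toy))   # 0.5
```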

In multilingual settings, bottom-up tokenisation on pooled corpora typically biases vocabulary construction toward high-resource languages. Cluster-based or language-adaptive vocabulary strategies mitigate this effect, improving segmentation fairness and downstream performance for low-resource languages. Zero-shot transfer of tokenisers from high-resource to low-resource related languages is empirically viable, with reasonable fertility and compression retained (Karthika et al., 21 Jun 2025).

5. Byte-Level and Partition Approaches

Byte-level bottom-up tokenisation, as instantiated by UTF8Tokenizer, bypasses traditional subword merging entirely: each UTF-8 byte is treated as its own atomic token, mapping directly to its numerical value in $[0,255]$. No out-of-range or auxiliary IDs are used. C0 ASCII control bytes (0x00–0x1F, 0x7F) serve as unambiguous markers for special functions (padding, sequence boundaries, segment demarcation). The embedding is a simple $256 \times d$ matrix, trivial to share across models. Enhancements such as bit-biased embeddings exploit internal byte structure at training time without increasing inference cost. This approach yields 14× faster tokenisation and 8× lower host-to-device memory requirements than standard BPE-style methods, along with modest improvements in language modeling convergence (Moryossef et al., 19 Oct 2025).
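
Since token IDs coincide with byte values, the entire encoder/decoder reduces to a few lines; the specific control byte chosen as a separator below is illustrative:

```python
def encode(text: str) -> list[int]:
    """Token IDs are exactly the UTF-8 bytes, all in [0, 255]."""
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

SEP = 0x1D  # a C0 control byte used here as a segment marker (illustrative)
ids = encode("naïve") + [SEP] + encode("日本")
print(ids)  # [110, 97, 195, 175, 118, 101, 29, 230, 151, 165, 230, 156, 172]
print(decode(ids))
```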

Partition-cover approaches, typified by GreedTok, select tokens (substrings) to maximise non-overlapping coverage under a size constraint. GreedTok empirically achieves 3–5% better compression than BPE at comparable vocabulary sizes and approaches the theoretical $(1 - 1/e)$ maximum-coverage approximation ratio. The selected token sets diverge substantially from those of BPE beyond the first few hundred merges, reflecting the restrictive local optimality of frequency-based merging (Lim et al., 8 Jan 2025).

| Algorithm | Greedy/Heuristic | Objective Optimised | Notable Empirical Result |
|---|---|---|---|
| BPE | Yes | Most frequent pair merging | 30k–50k merges optimal for English (Mielke et al., 2021) |
| LiB | Yes | $\alpha T + \beta M$ | Lowest bits-per-character on CTB8/BRphon (Yang, 1 Mar 2024) |
| GreedTok | Yes | Partition-cover set function | 3–5% fewer tokens/word vs. BPE (Lim et al., 8 Jan 2025) |
| UTF8Tokenizer | Not applicable | None (bytes are tokens) | 14× speedup, 8× less memory (Moryossef et al., 19 Oct 2025) |

6. Practical Implications and Best Practices

The computational intractability of optimal bottom-up tokenisation necessitates empirical, greedy, or approximation-based solutions. BPE, LiB, partition-cover relaxations, and other heuristic models all operate under this constraint. For LLM pretraining, partition-cover approaches such as GreedTok provide measurable reductions in token counts and vocabulary sizes, while also supporting integration of domain- or language-specific substrings. Bottom-up frameworks naturally support open vocabulary, straightforward handling of neologisms, and compositional robustness.

Practitioners should tune vocabulary size to the morphological properties of the target language; e.g., 30–50k merges suffice for English, while lower sizes yield better generalisation in polysynthetic languages. For multilingual applications, cluster-based or language-adaptive variants are recommended. In non-Latin scripts lacking explicit word boundaries, cognitively motivated models such as LiB efficiently recover “implicit” word and multiword units without additional annotation (Yang, 1 Mar 2024, Karthika et al., 21 Jun 2025).

Pure byte-level tokenisation (e.g., UTF8Tokenizer) is particularly suited for hardware-constrained or streaming scenarios, achieving maximal speed and minimal memory cost at the expense of longer sequences—an acceptable trade-off for many robust LLMs (Moryossef et al., 19 Oct 2025).

7. Connections to Compression and Morphological Segmentation

Bottom-up tokenisation is deeply connected to dictionary-based and grammar-based compression, as well as unsupervised morphological segmentation. The reduction of the tokenisation problem to straight-line programs (context-free grammars in Chomsky normal form) and set cover/maximum coverage instantiates these links formally (Whittington et al., 19 Dec 2024, Lim et al., 8 Jan 2025). Bottom-up tokenisers can be seen as greedy grammar compressors under a rule-budget. Empirical segmentation quality, however, depends on the linguistic fit—pure unsupervised morph-based segmentation (e.g., Morfessor) can yield linguistically coherent units but may not improve downstream model performance over simpler BPE-style merges (Mielke et al., 2021).

A plausible implication is that advances in text compression algorithms and submodular optimisation could inform the development of more effective approximation strategies for bottom-up tokenisation, within the provable performance and efficiency limits delineated by the NP-completeness and inapproximability results.
