- The paper establishes NP-completeness and inapproximability (no PTAS unless P=NP) for direct and bottom-up tokenisation over binary alphabets, and strong NP-completeness for direct tokenisation over unary alphabets.
- It formalizes compression-based tokenisation as an optimization problem and proves hardness via reductions from 3-occurrence MAX2SAT and vertex cover.
- The findings justify heuristic methods such as BPE: no polynomial-time procedure can guarantee optimal, or even near-optimal, compression.
Computational Hardness of Tokenisation Over Bounded Alphabets
Overview
"Tokenisation over Bounded Alphabets is Hard" (2511.15709) delivers a comprehensive complexity-theoretic analysis of optimal tokenisation under constrained alphabet settings. The paper formalizes two variants of tokenisation (direct and bottom-up) and establishes strong NP-completeness and inapproximability results even for binary and unary alphabets. This work closes key gaps left by prior research, which focused almost exclusively on unbounded alphabets, and has noteworthy implications for subword segmentation algorithms in NLP and deep learning.
Tokenisation, typically the first step in NLP pipelines, maps character strings over a finite alphabet Σ to subword sequences, using either a vocabulary S (direct tokenisation) or a sequence of merges (bottom-up tokenisation, e.g., BPE). The paper focuses on the compression objective, i.e., minimizing the total subword-sequence length over a corpus, which reflects real-world desiderata such as inference throughput and training cost. The two optimization targets are:
- Direct tokenisation: Find a vocabulary $S \subseteq \Sigma^+$ of fixed size $|S|$ such that the dataset is maximally compressed, i.e., the total number of tokens used to represent it is minimized.
- Bottom-up tokenisation: Find an optimal sequence of merges, constructing new subword tokens to minimize total token sequence length.
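To make the objective concrete, here is a minimal sketch (in Python, with an invented toy corpus and helper names not taken from the paper) of evaluating the direct-tokenisation cost of a fixed vocabulary. For a given vocabulary, the best segmentation of each string is computable in polynomial time by dynamic programming; the hardness results below concern choosing the vocabulary itself.

```python
def min_tokens(word: str, vocab: set[str]) -> float:
    """Fewest tokens from `vocab` needed to cover `word` exactly
    (the inner objective of direct tokenisation)."""
    INF = float("inf")
    n = len(word)
    dp = [0] + [INF] * n          # dp[i] = fewest tokens for word[:i]
    for i in range(1, n + 1):
        for j in range(i):
            if dp[j] != INF and word[j:i] in vocab:
                dp[i] = min(dp[i], dp[j] + 1)
    return dp[n]

def corpus_cost(corpus: list[str], vocab: set[str]) -> float:
    """Total token count over the corpus -- the quantity the
    optimal-vocabulary problem asks us to minimise."""
    return sum(min_tokens(w, vocab) for w in corpus)

# Binary-alphabet toy example (corpus and vocabularies are illustrative):
# the single characters are always available, so the question is which
# additional subwords a size-limited vocabulary should contain.
corpus = ["0101", "010101", "0110"]
print(corpus_cost(corpus, {"0", "1"}))        # 14 tokens: character-level baseline
print(corpus_cost(corpus, {"0", "1", "01"}))  # 8 tokens: one added subword
```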
Complexity Results on Bounded Alphabets
The core contribution is the establishment of computational hardness for these tokenisation objectives with bounded, i.e., small or fixed-size, alphabets. The results are presented in two main sections:
Binary Alphabets
- NP-hardness and PTAS inapproximability: Both direct and bottom-up tokenisation over binary alphabets are NP-complete. Moreover, the paper rules out any polynomial-time approximation scheme (PTAS) for these problems unless P=NP: no polynomial-time algorithm can guarantee solutions arbitrarily close to the optimum.
- Technique: The reductions go through the 3-occurrence MAX2SAT problem, exploiting its known APX-hardness and gap-preservation properties. This yields explicit thresholds: even minuscule approximation ratios (e.g., $1.000002$ for direct, $1.0000001$ for bottom-up tokenisation) cannot be guaranteed in polynomial time.
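Restated in symbols (with $\mathrm{OPT}(D)$ denoting the minimum achievable total token count on a corpus $D$; the notation is ours, not the paper's), the gap results above say that, unless P=NP, no polynomial-time tokeniser $A$ can guarantee for every corpus $D$:

```latex
% OPT(D): minimum achievable total token count on corpus D (notation assumed here).
% Unless P = NP, no polynomial-time tokeniser A satisfies, for all corpora D:
\[
  \mathrm{cost}_A(D) \;\le\; 1.000002 \cdot \mathrm{OPT}(D) \quad \text{(direct)},
  \qquad
  \mathrm{cost}_A(D) \;\le\; 1.0000001 \cdot \mathrm{OPT}(D) \quad \text{(bottom-up)}.
\]
```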
Unary Alphabets
- Direct tokenisation is strongly NP-complete: Via a reduction from the vertex cover problem, direct tokenisation over a unary (size-one) alphabet is strongly NP-complete, i.e., it remains intractable even when inputs are given as explicit unary strings rather than as compactly encoded lengths.
- Equivalence to optimal coin-system design: The unary direct tokenisation problem is shown to correspond exactly to the general optimal-denomination problem for coinage, and is thus equally hard (a small sketch of this correspondence follows this list).
- Bottom-up variant: The paper proves that unary optimal pair encoding (OPE) tokenisation is at least weakly NP-complete, via a reduction from the addition-chain problem. The complexity of standard unary bottom-up tokenisation remains open.
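A minimal sketch of the coin-system correspondence, assuming a toy unary corpus given by its string lengths (the lengths and candidate token sets are illustrative, not taken from the paper's reduction). Over a unary alphabet a token is fully described by its length, so segmenting a string of length $n$ with a vocabulary of token lengths is exactly change-making for the amount $n$, and choosing the vocabulary is choosing the denominations.

```python
def min_coins(amount: int, denoms: set[int]) -> int:
    """Fewest "coins" (unary tokens, identified by their lengths)
    summing exactly to `amount`; the classic change-making DP."""
    INF = float("inf")
    dp = [0] + [INF] * amount
    for value in range(1, amount + 1):
        for d in denoms:
            if d <= value and dp[value - d] + 1 < dp[value]:
                dp[value] = dp[value - d] + 1
    return dp[amount]

def unary_corpus_cost(lengths: list[int], token_lengths: set[int]) -> int:
    """Total token count for a unary corpus given as string lengths.
    Length 1 (the single alphabet symbol) is always available."""
    denoms = token_lengths | {1}
    return sum(min_coins(n, denoms) for n in lengths)

# Choosing which token lengths to add, subject to a size budget, is the
# optimal-denomination question identified as strongly NP-complete.
corpus_lengths = [9, 10, 14]
print(unary_corpus_cost(corpus_lengths, {3, 7}))  # 7 tokens
print(unary_corpus_cost(corpus_lengths, {2, 5}))  # 9 tokens
```

Evaluating one fixed set of denominations is easy; the hardness concerns selecting the best fixed-size set of denominations for the whole corpus.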
Implications for Practical Tokenisation Algorithms
These results imply that the computational difficulty of optimal tokenisation is intrinsic, not an artefact of large alphabets or elaborate token formation. Specifically:
- Heuristic necessity: Algorithms like BPE and UnigramLM, which are greedy or rely on stochastic heuristics, are not mere compromises but effectively the only feasible options for practical workloads, even when the underlying data uses byte, Unicode, or binary encodings (a minimal greedy-merge sketch follows this list).
- Compression-performance tradeoffs: While compression correlates positively with downstream model performance, there are fundamental limits on how closely any polynomial-time procedure can approach optimal compression. This suggests that further gains in LLM throughput or training cost from better subword design are likely to be marginal.
- Algorithmic focus: Attention should shift toward approximation algorithms with provable bounds, or toward more tractable relaxations and alternative formulations, since exact optimisation is provably intractable unless P=NP.
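The greedy loop referenced above can be sketched as follows; this is a bare-bones illustration of BPE-style bottom-up merging, not the paper's formalisation, and the toy corpus is invented for the example.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Minimal greedy BPE sketch: repeatedly merge the most frequent
    adjacent token pair. Greedy choices carry no optimality guarantee,
    consistent with the hardness results above."""
    seqs = [list(w) for w in corpus]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for i, seq in enumerate(seqs):  # apply the chosen merge everywhere
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == a and seq[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return merges

print(bpe_merges(["0101", "010101", "0110"], num_merges=2))
# [('0', '1'), ('01', '01')]
```

Each merge is chosen myopically by pair frequency, which is exactly the kind of polynomial-time heuristic the hardness results say cannot, in general, be upgraded to a guaranteed near-optimal procedure.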
Theoretical Extensions and Open Questions
The work leaves several open directions:
- Approximation bounds: It is unclear whether polynomial-time constant-factor approximations (with factors above the thresholds implied by the gap reductions) exist for binary or general alphabets.
- Unary bottom-up tokenisation: While direct and OPE variants are classified, the complexity of standard unary bottom-up tokenisation remains unresolved.
- Alternate objectives: Hardness is currently demonstrated only for compression-style objectives; analysing other criteria (e.g., token frequency, entropy, Rényi efficiency) could reveal objective-dependent tractability.
Conclusion
This paper rigorously demonstrates that optimal tokenisation over bounded alphabets, a setting central to the efficiency of NLP pipelines, LLMs, and foundation models, is computationally intractable even in the most restrictive settings. These results not only generalize previous complexity analyses but also explain why existing tokenisation methods are inherently heuristic, marking a fundamental limit for tokeniser research. Future work should prioritize provably good approximation schemes and the study of alternative objectives under complexity-theoretic constraints.