
Tokenisation over Bounded Alphabets is Hard (2511.15709v1)

Published 19 Nov 2025 in cs.CL, cs.DS, and cs.LG

Abstract: Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and points toward approximation algorithms being an important path going forward for tokenisation research.

Summary

  • The paper establishes NP-completeness and strict inapproximability for both direct and bottom-up tokenisation over binary and unary alphabets.
  • It establishes these results via reductions from 3-occurrence MAX2SAT (binary alphabets) and vertex cover (unary alphabets), within a compression-based optimization framework for tokenisation.
  • The findings imply that heuristic methods like BPE are necessary despite inherent limits on achieving optimal compression.

Computational Hardness of Tokenisation Over Bounded Alphabets

Overview

"Tokenisation over Bounded Alphabets is Hard" (2511.15709) delivers a comprehensive complexity-theoretic analysis of optimal tokenisation under constrained alphabet settings. The paper formalizes two variants of tokenisation (direct and bottom-up) and establishes strong NP-completeness and inapproximability results even for binary and unary alphabets. This work closes key gaps left by prior research, which focused almost exclusively on unbounded alphabets, and has noteworthy implications for subword segmentation algorithms in NLP and deep learning.

Formalization of Tokenisation Objectives and Structures

Tokenisation, typically the first step in NLP pipelines, maps character strings over a finite alphabet $\Sigma$ to subword sequences, using either a vocabulary $\mathcal{S}$ (direct tokenisation) or a sequence of merges (bottom-up tokenisation, e.g., BPE). The paper focuses on the compression objective, i.e., minimizing the total subword-sequence length over a corpus, reflecting real-world desiderata such as efficient throughput and model training. The main optimization targets are:

  • Direct tokenisation: Find a vocabulary $\mathcal{S} \subseteq \Sigma^+$ of fixed size $\vert\mathcal{S}\vert$ such that the dataset is maximally compressed (a minimal sketch of evaluating this compression objective follows the list below).
  • Bottom-up tokenisation: Find an optimal sequence of merges, constructing new subword tokens to minimize total token sequence length.
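
The hardness in both variants lies in the search over vocabularies or merge sequences, not in evaluating the objective itself: given a fixed vocabulary, the minimum number of tokens needed to cover a string can be computed with a standard dynamic program. The sketch below is our own illustration (not code from the paper) of that distinction on a binary-alphabet toy dataset.

```python
# Illustrative sketch (not from the paper): evaluating the compression
# objective for a *fixed* vocabulary is easy via dynamic programming.
# The NP-hard part is choosing the vocabulary (direct tokenisation) or
# the merge sequence (bottom-up tokenisation).

def min_tokens(text: str, vocab: set[str]) -> int:
    """Minimum number of vocabulary tokens whose concatenation equals `text`.
    Assumes every single character is in `vocab`, so a segmentation exists."""
    max_len = max(len(t) for t in vocab)
    INF = float("inf")
    best = [0] + [INF] * len(text)   # best[i] = fewest tokens covering text[:i]
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
    return best[len(text)]

# Binary-alphabet toy dataset; the objective sums the per-string minima.
vocab = {"0", "1", "01", "101", "0011"}
dataset = ["0011101", "10101", "0101"]
print(sum(min_tokens(s, vocab) for s in dataset))
```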

Complexity Results on Bounded Alphabets

The core contribution is the establishment of computational hardness for these tokenisation objectives with bounded, i.e., small or fixed-size, alphabets. The results are presented in two main sections:

Binary Alphabets

  • NP-hardness and PTAS inapproximability: Both direct and bottom-up tokenisation over binary alphabets are NP-complete. Moreover, the paper proves the absence of any polynomial-time approximation scheme (PTAS) for these problems unless P=NP, i.e., no efficient solution can guarantee arbitrary closeness to the optimum.
  • Technique: The reductions use the 3-occurrence MAX2SAT problem, exploiting its known APX-hardness and gap-preservation properties. This yields explicit inapproximability constants: even approximation factors barely above one (e.g., $1.000002$ for direct, $1.0000001$ for bottom-up tokenisation) cannot be guaranteed in polynomial time; the bounds are restated compactly below.
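
Stated compactly (a restatement in our notation, with $\mathrm{OPT}(D)$ the minimum achievable total token count on dataset $D$ under the respective variant and the constants as quoted above):

```latex
% Binary-alphabet inapproximability (our notation; constants as reported above).
% OPT(D): minimum achievable total token count on dataset D for the variant.
\text{Unless } \mathrm{P}=\mathrm{NP}: \quad
\forall \text{ poly-time } \mathcal{A}\ \exists D:\
\mathrm{cost}_{\mathcal{A}}(D) > c \cdot \mathrm{OPT}(D),
\quad c = \begin{cases}
  1.000002  & \text{(direct)}\\
  1.0000001 & \text{(bottom-up)}
\end{cases}
```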

Unary Alphabets

  • Direct tokenisation is strongly NP-complete: Via a reduction from the vertex cover problem, direct tokenisation over a unary (single-character) alphabet is strongly NP-complete, i.e., it remains intractable even when inputs are given as explicit unary strings.
  • Equivalence to optimal coin system design: The unary direct tokenisation problem is shown to be isomorphic to the general optimal coin denomination problem, and is therefore equally hard; a toy sketch of this correspondence follows the list below.
  • Bottom-up variant: The paper proves that unary optimal pair encoding (OPE) tokenisation is at least weakly NP-complete, via a reduction from the addition chain problem; the complexity of standard unary bottom-up tokenisation remains open.
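
To make the coin-system correspondence concrete: over a unary alphabet a token is determined by its length alone, so choosing a vocabulary of token lengths that minimizes the total number of tokens used on the dataset is exactly choosing coin denominations that minimize the number of coins needed to pay a given multiset of amounts. The brute-force sketch below is our own toy example, not the paper's reduction.

```python
# Illustrative brute-force sketch of unary direct tokenisation (our example,
# not the paper's construction). Strings over a unary alphabet are just
# lengths, so the vocabulary is a set of "denominations".
from itertools import combinations

def coins_needed(amount: int, denoms: tuple[int, ...]) -> int:
    """Fewest 'coins' (tokens) summing to `amount`; denomination 1 is always
    included, so a solution always exists."""
    INF = float("inf")
    best = [0] + [INF] * amount
    for a in range(1, amount + 1):
        for d in denoms:
            if d <= a and best[a - d] + 1 < best[a]:
                best[a] = best[a - d] + 1
    return best[amount]

def best_vocabulary(lengths: list[int], k: int) -> tuple[tuple[int, ...], int]:
    """Exhaustively pick k token lengths (besides 1) minimizing total tokens.
    Exponential in k -- fine for a toy instance, infeasible in general."""
    candidates = range(2, max(lengths) + 1)
    best_vocab, best_cost = (1,), sum(lengths)   # character-only baseline
    for extra in combinations(candidates, k):
        denoms = (1,) + extra
        cost = sum(coins_needed(n, denoms) for n in lengths)
        if cost < best_cost:
            best_vocab, best_cost = denoms, cost
    return best_vocab, best_cost

# Dataset of unary strings, represented by their lengths.
print(best_vocabulary([6, 9, 14, 23], k=2))
```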

Implications for Practical Tokenisation Algorithms

These results imply that the computational difficulty of optimal tokenisation is intrinsic and not merely a consequence of alphabet size or elaborate token formation. Specifically:

  • Heuristic necessity: Algorithms like BPE and UnigramLM, which are greedy or rely on stochastic heuristics, are not mere concessions but the only feasible options for practical workloads, even when the underlying data uses byte, Unicode, or binary encodings (a sketch of the greedy BPE loop follows this list).
  • Compression-performance tradeoffs: While compression correlates positively with downstream model performance, there are fundamental limits on how close any polynomial-time procedure can come to optimal compression. This supports empirical observations that further improvements in LLM throughput or training cost via subword design will be marginal.
  • Algorithmic focus: Attention should shift toward approximation algorithms with provable bounds, or toward more tractable relaxations and alternatives, since exact optimization is intractable unless P=NP.
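
For concreteness, the greedy loop at the heart of BPE training looks roughly as follows (a minimal sketch of the standard algorithm, not code from the paper): each step merges the currently most frequent adjacent pair, which is locally optimal but carries no guarantee about the final compression, and the hardness results show that no polynomial-time method can offer such a guarantee (unless P=NP).

```python
# Minimal sketch of the standard greedy BPE training loop (not from the
# paper): repeatedly merge the most frequent adjacent token pair.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    sequences = [list(text) for text in corpus]   # start from characters/bytes
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(
            (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
        )
        if not pair_counts:
            break
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        merged = best_pair[0] + best_pair[1]
        # Apply the merge greedily, left to right, in every sequence.
        for seq in sequences:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best_pair:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(train_bpe(["0011101", "10101", "0101"], num_merges=3))
```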

Theoretical Extensions and Open Questions

The work leaves several open directions:

  • Approximation bounds: It is open whether polynomial-time constant-factor approximation algorithms exist, with factors above the thresholds implied by the gap reductions, for binary or general alphabets.
  • Unary bottom-up tokenisation: While direct and OPE variants are classified, the complexity of standard unary bottom-up tokenisation remains unresolved.
  • Alternate objectives: Hardness is currently demonstrated only for compression-style objectives; analysis of other criteria (e.g., token frequency, entropy, Rényi efficiency) could reveal objective-dependent tractability.

Conclusion

This paper rigorously demonstrates that optimal tokenisation over bounded alphabets, which is central to the efficiency of NLP pipelines, LLMs, and foundation models, is computationally intractable even in the most restrictive settings. These results not only generalize previous complexity analyses but also explain why existing tokenisation methods are inherently heuristic, marking a fundamental limit for tokeniser research. Future work should prioritize provably good approximation schemes and the study of alternative objectives under complexity-theoretic constraints.
