Constant-factor approximability of binary tokenisation optimisation problems

Determine whether a polynomial-time constant-factor approximation algorithm exists for the binary direct tokenisation optimisation problem and for the binary bottom-up tokenisation optimisation problem under the compressed-length objective. That is, either exhibit a constant c > 1 and a polynomial-time algorithm that achieves approximation ratio at most c on all instances, or establish that no such constant-factor approximation is achievable.

Background

The paper proves that, over a binary alphabet, neither the direct nor the bottom-up tokenisation optimisation problem admits a polynomial-time approximation scheme (PTAS) unless P = NP, via gap-preserving reductions from 3-OCC-MAX2SAT. These results rule out arbitrarily good polynomial-time approximation but leave open whether any fixed constant-factor approximation is attainable.
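
The gap between what is ruled out and what remains open can be written in standard approximation-theory notation. The following is a sketch, assuming the compressed-length objective is treated as a minimisation problem; the symbols A, cost, OPT and ε₀ are introduced here for illustration and are not taken from the paper.

```latex
% Assumed notation (not from the paper): I is an instance, A(I) is the
% solution returned by algorithm A, cost is the compressed length, and
% OPT(I) is the minimum compressed length over all feasible solutions.

% A is a c-approximation algorithm (c >= 1) if
\[
  \mathrm{cost}\bigl(A(I)\bigr) \;\le\; c \cdot \mathrm{OPT}(I)
  \qquad \text{for every instance } I .
\]

% A PTAS is a family (A_\varepsilon)_{\varepsilon > 0} of polynomial-time
% algorithms such that each A_\varepsilon is a (1+\varepsilon)-approximation.

% Ruled out (unless P = NP): a PTAS. Equivalently, for some fixed
% \varepsilon_0 > 0 there is no polynomial-time algorithm A with
\[
  \mathrm{cost}\bigl(A(I)\bigr) \;\le\; (1+\varepsilon_0) \cdot \mathrm{OPT}(I)
  \qquad \text{for every instance } I .
\]

% Open: whether there exist a constant c and a polynomial-time algorithm A with
\[
  \mathrm{cost}\bigl(A(I)\bigr) \;\le\; c \cdot \mathrm{OPT}(I)
  \qquad \text{for every instance } I .
\]
```

In this notation, the hardness results exclude every ratio below 1 + ε₀ for some small ε₀ > 0, while the open question concerns whether any finite ratio c is achievable.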

The authors provide very small inapproximability constants (strictly larger than 1) and suggest that these lower bounds could likely be improved. However, they explicitly state uncertainty about the existence of any constant-factor approximation at all, motivating a precise classification of approximability for these binary tokenisation problems.

References

A number of open questions remain, however, in particular with respect to approximability. For instance, while we showed that the binary tokenisation optimisation problems cannot be approximated arbitrarily well (unless P = NP)—and while it seems likely that the lower bound provided in the proof of \cref{thm:dbtok_hardapx} can be significantly lifted—it is unclear whether any constant approximation ratio can even be obtained.

— Tokenisation over Bounded Alphabets is Hard (2511.15709 - Kastreva et al., 19 Nov 2025) in Conclusion and Limitations (Section 6)
