Hardness for alternative objectives and tokenisation variants beyond direct and bottom-up

Establish the computational hardness, for both the decision and the approximation versions, of tokenisation under optimisation objectives other than compression and for tokenisation variants other than direct and bottom-up tokenisation. In particular, classify the complexity and approximability of these alternative objectives and variants over bounded alphabets.

Background

Throughout the paper, the analysis takes compression as the objective and considers two variants: direct and bottom-up tokenisation. While these choices cover common settings, other objectives (such as those used by probabilistic tokenisers) and other tokenisation paradigms were not analysed.

The authors explicitly state that the hardness of both alternative objectives and other tokenisation variants remains open, which leaves a gap in the current complexity landscape of tokenisation.

References

Finally, the results of our work are limited in that we consider (i) compression as objective, and (ii) bottom-up and direct tokenisation only; the hardness of both other objectives and variants remains open.

Tokenisation over Bounded Alphabets is Hard (2511.15709 - Kastreva et al., 19 Nov 2025) in Conclusion and Limitations (Section 6)