Complexity results for other tokenisation variants and objectives

Establish computational complexity results for additional tokenisation variants beyond the direct and bottom-up compression formulations, particularly for variants that employ alternative objective functions such as unigram log-probability or Rényi efficiency.
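To make the difference between these objectives concrete, the following is a minimal sketch that scores a single, already-tokenised sequence under each of them. It is not taken from the paper: the function names, the use of maximum-likelihood unigram estimates computed from the sequence itself, the normalisation of Rényi entropy by the log of the vocabulary size, and the default `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter


def compression_length(tokens):
    """Compression objective: number of tokens in the sequence (lower is better)."""
    return len(tokens)


def unigram_log_prob(tokens):
    """Log-probability of the sequence under a unigram model.

    Assumption: token probabilities are maximum-likelihood estimates
    taken from the same sequence (higher is better).
    """
    counts = Counter(tokens)
    total = len(tokens)
    return sum(c * math.log(c / total) for c in counts.values())


def renyi_efficiency(tokens, vocab_size, alpha=2.5):
    """Renyi entropy of the empirical token distribution over log(vocab_size).

    Assumptions: alpha != 1, vocab_size > 1, and this normalisation;
    the default alpha is a hypothetical choice.
    """
    if alpha == 1 or vocab_size < 2:
        raise ValueError("requires alpha != 1 and vocab_size >= 2")
    counts = Counter(tokens)
    total = len(tokens)
    power_sum = sum((c / total) ** alpha for c in counts.values())
    entropy = math.log(power_sum) / (1 - alpha)
    return entropy / math.log(vocab_size)


# Toy example: the string "ababcc" segmented with a hypothetical vocabulary {"ab", "c"}.
example_tokens = ["ab", "ab", "c", "c"]
example_scores = (
    compression_length(example_tokens),
    unigram_log_prob(example_tokens),
    renyi_efficiency(example_tokens, vocab_size=2),
)
```

Evaluating any of these scores for a given segmentation is cheap. The open question concerns the cost of finding the vocabulary and segmentation that optimise them.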

Background

The paper proves NP-completeness for direct and bottom-up tokenisation under a compression objective, leaving open whether similar hardness (or tractability) results hold for other tokenisation paradigms or objective functions.

Settling the complexity of these variants would guide algorithm design and indicate whether exact optimisation is feasible for a broader class of tokenisation objectives.

References

While we investigated the complexity of two forms of tokenisation, similar results for other variants (e.g., with other objective functions) remain open; this would be exciting future work.

Whittington et al., "Tokenisation is NP-Complete" (arXiv:2412.15210, 19 Dec 2024), Conclusion.