Efficiently finding an optimal tokeniser given a specified objective function

Determine whether there exists an efficient (e.g., polynomial-time) algorithm that, given a specified tokenisation objective function and a dataset, constructs a tokeniser that maximises the objective.

Background

Common tokenisation algorithms such as byte pair encoding (BPE) and UnigramLM are heuristic or greedy and do not guarantee optimality under their respective objectives.

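The authors partially address this by proving NP-completeness for two variants of tokenisation under a compression objective: direct tokenisation, which selects a vocabulary, and bottom-up tokenisation, which selects a sequence of merge operations. For those variants, no polynomial-time exact algorithm exists unless P = NP. The general question of efficiently finding an optimal tokeniser for an arbitrary specified objective remains explicitly open.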
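To make the gap between a greedy heuristic and an exact search concrete, here is a minimal sketch. It assumes a compression objective of the bottom-up kind: minimise the total number of tokens in the dataset after at most k merges. The function names (`apply_merge`, `pair_counts`, `greedy_bpe`, `exhaustive_bottom_up`) are hypothetical and are not taken from the paper. The greedy routine is a simplified BPE-style procedure, and the exhaustive routine is a brute-force baseline whose running time grows exponentially in k, so it is usable only on toy inputs.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace left-to-right, non-overlapping occurrences of `pair` with one token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def pair_counts(corpus):
    """Count adjacent token pairs over all sequences in the corpus."""
    counts = Counter()
    for seq in corpus:
        counts.update(zip(seq, seq[1:]))
    return counts


def greedy_bpe(corpus, k):
    """Simplified BPE: repeatedly merge the currently most frequent pair."""
    corpus = [list(s) for s in corpus]
    merges = []
    for _ in range(k):
        counts = pair_counts(corpus)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # ties: first pair seen wins
        merges.append(pair)
        corpus = [apply_merge(s, pair) for s in corpus]
    return merges, sum(len(s) for s in corpus)


def exhaustive_bottom_up(corpus, k):
    """Brute force over all merge sequences of length <= k (exponential in k)."""
    start = [list(s) for s in corpus]
    best_len = sum(len(s) for s in start)
    best_merges = []

    def search(cur, merges):
        nonlocal best_len, best_merges
        total = sum(len(s) for s in cur)
        if total < best_len:
            best_len, best_merges = total, list(merges)
        if len(merges) == k:
            return
        for pair in pair_counts(cur):
            search([apply_merge(s, pair) for s in cur], merges + [pair])

    search(start, [])
    return best_merges, best_len
```

Both functions return a pair `(merges, total_tokens)`. Because the exhaustive search also visits the merge sequence that the greedy routine would choose, its token count is never larger than the greedy one. Comparing the two on small corpora shows whether greedy selection happens to be optimal for that input, but it says nothing about how to find the optimum efficiently.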

References

Another open question is how to—given such an objective function—efficiently find a tokeniser which maximises it.

Tokenisation is NP-Complete (2412.15210 - Whittington et al., 19 Dec 2024) in Section 1 (Introduction)