
Characterize what makes a good tokeniser for language modelling

Determine which properties of subword vocabularies, and of the tokenisations they induce, lead to strong downstream language modelling performance. Specifically, identify the characteristics that a tokeniser's produced subwords should have for it to be an effective starting point for language modelling, so that an explicit evaluation objective can be defined.


Background

The paper motivates the need to understand tokenisation quality because LLMs are trained over subword sequences rather than raw characters. Despite tokenisation's centrality, there is no agreed-upon characterisation of what constitutes a good tokeniser in terms of the properties of the subwords it produces.

Clarifying these properties would enable the formulation of a principled objective function for evaluating and selecting tokenisers, beyond current practice, which relies on proxy objectives.
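One widely used proxy objective of the kind alluded to here is a compression-style statistic: how many characters each subword covers on average, on the intuition that a tokeniser producing fewer, longer subwords yields shorter sequences for the language model. The sketch below is purely illustrative and not from the paper; the function name and the toy whitespace tokeniser are assumptions standing in for a trained subword model.

```python
def compression_ratio(text: str, tokenise) -> float:
    """Mean characters per subword token for a given tokeniser.

    A common proxy metric: higher values mean the tokeniser packs the
    text into fewer, longer subwords (shorter model input sequences).
    This does not directly measure downstream modelling quality, which
    is exactly the gap the open question targets.
    """
    tokens = tokenise(text)
    return len(text) / max(len(tokens), 1)


# Toy whitespace "tokeniser" used only for illustration; a real
# evaluation would plug in a trained BPE/Unigram model instead.
def toy_tokeniser(text: str) -> list[str]:
    return text.split()


ratio = compression_ratio("the cat sat on the mat", toy_tokeniser)
```

Such proxies are cheap to compute over a corpus, but the open question is precisely whether they (or any explicit objective) track the characteristics of subwords that actually make a tokeniser a good starting point for language modelling.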

References

We still do not know, for instance, what makes a good tokeniser \citep{gowda-may-2020-finding,cognetta-etal-2024-two}: which characteristics should its produced subwords $\subwords$ have to be a good starting point for language modelling?

Tokenisation is NP-Complete (2412.15210 - Whittington et al., 19 Dec 2024) in Section 1 (Introduction)