Characterise what makes a good tokeniser for language modelling

Determine which properties of subword vocabularies, and of the tokenisations they produce, lead to strong downstream language-modelling performance. Specifically, identify the characteristics that the produced subwords should have for a tokeniser to be an effective starting point for language modelling, so that an explicit evaluation objective can be defined.

Background

The paper motivates the need to understand tokenisation quality: LLMs are trained over subword sequences rather than raw characters, yet despite tokenisation’s centrality there is no agreed-upon characterisation of what constitutes a good tokeniser in terms of the properties of the subwords it produces.
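To make this concrete, here is a minimal sketch of subword tokenisation, assuming a toy greedy longest-match scheme over a hypothetical hand-picked vocabulary (neither is from the paper): the language model only ever sees the resulting subword sequence, never the underlying characters.

```python
# Toy illustration (not the paper's construction): greedy longest-match
# tokenisation over a hypothetical subword vocabulary.

def tokenise(text: str, vocab: set[str]) -> list[str]:
    """Greedily emit the longest vocabulary item at each position.

    Assumes every single character appears in the vocabulary,
    so the scan always makes progress.
    """
    subwords = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                subwords.append(text[i:j])
                i = j
                break
    return subwords

# Hypothetical vocabulary: every character plus a few merged subwords.
vocab = set("language modelling") | {"lang", "uage", "model", "ling"}
print(tokenise("language modelling", vocab))
# ['lang', 'uage', ' ', 'model', 'ling'] -- this is what the LM is trained on
```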

Clarifying these properties would enable the formulation of a principled objective function for evaluating and selecting tokenisers, beyond the current practice of relying on proxy objectives such as corpus compression.
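As an example of such a proxy, the sketch below scores a vocabulary by corpus compression, measured as subword tokens per character (lower is better). It reuses the tokenise function and the illustrative vocabulary from the sketch above; the metric's exact form and the toy corpus are assumptions for illustration, not the paper's definitions.

```python
# Illustrative proxy objective (an assumption, not the paper's definition):
# corpus compression as average subword tokens per character.
# Reuses tokenise() from the sketch above.

def compression_score(corpus: list[str], vocab: set[str]) -> float:
    """Average subword tokens per character (lower = more compression)."""
    total_tokens = sum(len(tokenise(text, vocab)) for text in corpus)
    total_chars = sum(len(text) for text in corpus)
    return total_tokens / total_chars

corpus = ["language modelling", "good tokenisers matter"]
chars_only = set("".join(corpus))                       # character baseline
merged = chars_only | {"lang", "uage", "model", "ling", "token"}

print(compression_score(corpus, chars_only))  # 1.0: one token per character
print(compression_score(corpus, merged))      # 0.575: merges shorten sequences
```

A principled objective of the kind the excerpt calls for would refine or replace such proxies with criteria tied directly to downstream language-modelling quality.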

References

We still do not know, for instance, what makes a good tokeniser \citep{gowda-may-2020-finding,cognetta-etal-2024-two}: which characteristics should its produced subwords $\subwords$ have to be a good starting point for language modelling?

Tokenisation is NP-Complete (arXiv:2412.15210, Whittington et al., 19 Dec 2024), Section 1 (Introduction)