Characterise what makes a good tokeniser for language modelling
Determine which properties of subword vocabularies, and of the tokenisations they produce, lead to strong downstream language modelling performance. Specifically, identify the characteristics the produced subwords must have for a tokeniser to be an effective starting point for language modelling, so that an explicit evaluation objective can be defined.
References
We still do not know, for instance, what makes a good tokeniser \citep{gowda-may-2020-finding,cognetta-etal-2024-two}: which characteristics should its produced subwords $\subwords$ have to be a good starting point for language modelling?
— Tokenisation is NP-Complete
(2412.15210 - Whittington et al., 19 Dec 2024) in Section 1 (Introduction)