
Measuring downstream language modelling performance without training a model

Develop a computationally feasible measure or estimator of downstream language modelling performance that can be used to evaluate or optimise tokenisers without fully training a language model.


Background

Ideally, tokenisers would be selected to maximise downstream performance, but evaluating that directly would require fully training a language model for each candidate tokeniser, which is computationally prohibitive.

In practice, researchers resort to proxy objectives such as unigram log-probability, Rényi efficiency, or compression ratio; a sketch of one such proxy appears below. A principled, computationally efficient measure of downstream performance would allow tokenisers to be optimised directly for the end goal.
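To make these proxies concrete, here is a minimal sketch of one of them, Rényi efficiency, computed from a tokenised corpus. This is an illustrative implementation, not code from the paper: the function name, the alpha = 2.5 default, and the normalisation H_alpha(p) / log |V| are assumptions about the commonly used formulation.

```python
import math
from collections import Counter

def renyi_efficiency(tokens, alpha=2.5, vocab_size=None):
    """Renyi efficiency of the unigram token distribution.

    Computes H_alpha(p) / log(V), where p is the empirical unigram
    distribution of `tokens` and V is the vocabulary size.  Scores
    near 1 mean the tokeniser uses its vocabulary close to uniformly.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    # If no vocabulary size is given, use the number of observed types.
    v = vocab_size if vocab_size is not None else len(counts)
    if alpha == 1.0:
        # alpha -> 1 recovers ordinary Shannon entropy.
        h = -sum(p * math.log(p) for p in probs)
    else:
        h = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return h / math.log(v)

# Toy usage: compare two hypothetical tokenisations of the same text.
balanced = ["the", "cat", "sat", "on", "the", "mat"]
skewed = ["the", "the", "the", "the", "cat", "mat"]
print(renyi_efficiency(balanced))  # closer to 1
print(renyi_efficiency(skewed))    # noticeably lower
```

A proxy like this is cheap to compute over a tokenised corpus, but it remains a stand-in: whether any such quantity reliably tracks downstream language modelling performance is exactly the open question posed here.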

References

"Unfortunately, we do not know how to measure such performance without fully training a model, making its direct maximisation computationally infeasible."

Whittington et al., "Tokenisation is NP-Complete" (arXiv:2412.15210, 19 Dec 2024), Section 2 (How to Choose a Tokeniser?)