Impact of linguistically aligned tokenization on model performance
Ascertain whether tokenizers that produce linguistically aligned tokens (e.g., morpheme-like or morphologically coherent units) lead to improved downstream language model performance compared to tokenizers that segment text into less linguistically aligned units.
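To make "linguistically aligned" concrete, one common operationalization is boundary recall against a gold morphological segmentation: what fraction of morpheme boundaries does the tokenizer reproduce? The sketch below is illustrative only; the function name, metric choice, and example segmentations are assumptions, not taken from the referenced paper.

```python
def boundary_positions(segments):
    """Character offsets of the internal boundaries in a segmentation."""
    pos, out = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        out.add(pos)
    return out

def morph_alignment_score(tokens, morphemes):
    """Boundary recall: fraction of gold morpheme boundaries that the
    tokenization reproduces. 1.0 = fully morphologically coherent."""
    gold = boundary_positions(morphemes)
    pred = boundary_positions(tokens)
    if not gold:  # monomorphemic word: any single-token split is aligned
        return 1.0
    return len(gold & pred) / len(gold)

# Hypothetical segmentations of "unhappiness":
gold_morphs = ["un", "happi", "ness"]   # illustrative morphological analysis
aligned     = ["un", "happiness"]       # keeps the prefix boundary
bpe_like    = ["unh", "app", "iness"]   # crosses both morpheme boundaries

print(morph_alignment_score(aligned, gold_morphs))   # 0.5
print(morph_alignment_score(bpe_like, gold_morphs))  # 0.0
```

A study of the question above would correlate a score like this, computed over a tokenizer's vocabulary or a corpus, with downstream task performance.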
References
It remains unclear, however, whether having linguistically-aligned tokens is clearly linked to better LLM performance.
— Explaining and Mitigating Crosslingual Tokenizer Inequities
(2510.21909 - Arnett et al., 24 Oct 2025) in Section 6 (Discussion — The Link Between Compression and Performance)