Impact of linguistically aligned tokenization on model performance

Ascertain whether tokenizers that produce linguistically aligned tokens (e.g., morpheme-like or morphologically coherent units) lead to improved downstream language model performance compared to tokenizers that segment text into less linguistically aligned units.

Background

Unigram tokenizers can yield more linguistically aligned segments than BPE, but this paper finds that Unigram achieves worse compression across languages, and prior work offers mixed evidence on the performance implications.

Clarifying the effect of linguistic alignment on performance would guide tokenizer selection and design, especially for languages with complex morphology.
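To make the contrast concrete, the sketch below trains a BPE and a Unigram tokenizer on the same toy corpus with the Hugging Face tokenizers library, then compares their segmentations of morphologically complex words and a simple compression proxy (tokens per character). The corpus, vocabulary size, and example words are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch (assumed setup, not the paper's experiments): contrast BPE and
# Unigram segmentations and a crude compression proxy on a toy English corpus.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy corpus chosen so that morpheme boundaries (un-, re-, -ness, -ing) recur.
corpus = [
    "unhappiness unkindness unfairness",
    "restarting rereading rebuilding",
    "happiness kindness fairness",
    "starting reading building",
] * 50  # repeat so the trainers have enough statistics

def train(model, trainer):
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

bpe = train(models.BPE(unk_token="[UNK]"),
            trainers.BpeTrainer(vocab_size=80, special_tokens=["[UNK]"]))
uni = train(models.Unigram(),
            trainers.UnigramTrainer(vocab_size=80, unk_token="[UNK]",
                                    special_tokens=["[UNK]"]))

# Qualitative check: does either tokenizer recover morpheme-like units?
for word in ["unhappiness", "restarting"]:
    print(word, "| BPE:", bpe.encode(word).tokens,
          "| Unigram:", uni.encode(word).tokens)

# Compression proxy: mean tokens per character (lower = better compression).
def tokens_per_char(tok):
    n_tokens = sum(len(tok.encode(line).ids) for line in corpus)
    n_chars = sum(len(line) for line in corpus)
    return n_tokens / n_chars

print("BPE tokens/char:    ", round(tokens_per_char(bpe), 3))
print("Unigram tokens/char:", round(tokens_per_char(uni), 3))
```

In a real study, alignment would be scored against gold morphological segmentations rather than inspected by eye, and compression would be measured on held-out text; this toy run only illustrates the trade-off the question asks about.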

References

It remains unclear, however, whether having linguistically-aligned tokens is clearly linked to better LLM performance.

Explaining and Mitigating Crosslingual Tokenizer Inequities (arXiv:2510.21909, Arnett et al., 24 Oct 2025), Section 6 (Discussion: The Link Between Compression and Performance)