Compression–performance relationship in language models

Determine how tokenization-driven compression relates to language model performance across languages and downstream tasks, explicitly assessing whether higher compression (fewer tokens for equivalent content) improves, harms, or has no effect on performance.

Background

The paper reviews conflicting evidence on whether tokenizer compression correlates with downstream performance: some studies report positive correlations, while others find no relationship once tokenizer properties are carefully controlled for.

Because compression directly affects training and inference costs and may influence the amount of information per sequence, establishing a clear relationship is important for both fairness and efficiency across languages.

References

"How compression relates to LLM performance on downstream tasks remains uncertain."

Arnett et al. (24 Oct 2025). Explaining and Mitigating Crosslingual Tokenizer Inequities. arXiv:2510.21909, Section 2.3 (Related Work — The Relationship Between Compression and Performance).