Compression–performance relationship in language models
Determine how tokenization-driven compression relates to language model performance on downstream tasks, across languages and task types, explicitly assessing whether higher compression (fewer tokens for equivalent content) improves, harms, or has no effect on performance.
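The compression quantity at issue can be operationalized as tokens per UTF-8 byte of text, so that fewer tokens for the same content means higher compression. A minimal sketch, using toy whitespace- and character-level tokenizers as stand-ins (these are illustrative assumptions, not the subword tokenizers studied in the cited work):

```python
# Sketch: comparing tokenization-driven compression across tokenizers.
# "Compression" here = tokens per UTF-8 byte; lower values mean higher
# compression (fewer tokens for equivalent content).

def tokens_per_byte(tokenize, text: str) -> float:
    """Return token count normalized by UTF-8 byte length of `text`."""
    n_bytes = len(text.encode("utf-8"))
    return len(tokenize(text)) / n_bytes

sample = "Tokenization-driven compression varies across languages."

# Toy tokenizers for illustration only (hypothetical, not from the paper):
whitespace_tok = lambda s: s.split()   # coarse: few tokens, high compression
char_tok = lambda s: list(s)           # fine: many tokens, low compression

# The whitespace tokenizer compresses this text more than the character one.
assert tokens_per_byte(whitespace_tok, sample) < tokens_per_byte(char_tok, sample)
```

Comparing this metric across languages on parallel text is one way to quantify the cross-lingual tokenizer inequities that the research question ties to downstream performance.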
References
How compression relates to LLM performance on downstream tasks remains uncertain.
— Explaining and Mitigating Crosslingual Tokenizer Inequities
(Arnett et al., arXiv:2510.21909, 24 Oct 2025), Section 2.3 (Related Work: The Relationship Between Compression and Performance)