Fair crosslingual computation of fertility

Develop a fair, crosslinguistically valid procedure for computing fertility—defined as the average number of tokens per word—that can be consistently applied across languages with differing notions of wordhood and orthographic conventions.

Background

Fertility is a common compression metric (tokens per word), but computing it requires an operational definition of wordhood. Such definitions vary widely across languages, and in some writing systems word boundaries are not marked at all (e.g., languages written without whitespace).

Given that languages encode different amounts of information per word, a fair, language-agnostic calculation of fertility is necessary for equitable crosslingual evaluation of tokenizer compression.
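As a rough illustration of the problem, the following Python sketch computes fertility using whitespace-delimited words. The character-level `char_tokens` function is a hypothetical stand-in for a real tokenizer, and `whitespace_words` is only one possible word definition; neither is taken from the cited paper.

```python
from typing import Callable, List


def whitespace_words(text: str) -> List[str]:
    # Assumption: a "word" is any whitespace-delimited span.
    return text.split()


def char_tokens(text: str) -> List[str]:
    # Hypothetical toy tokenizer: one token per non-whitespace character.
    return [ch for ch in text if not ch.isspace()]


def fertility(
    text: str,
    tokenize: Callable[[str], List[str]],
    split_words: Callable[[str], List[str]],
) -> float:
    # Fertility = number of tokens / number of words.
    words = split_words(text)
    if not words:
        raise ValueError("text contains no words under this word definition")
    return len(tokenize(text)) / len(words)


english = "the cat sat"    # 3 whitespace-delimited words, 9 characters
chinese = "猫坐在垫子上"    # no whitespace, so the whole string counts as 1 "word"

print(fertility(english, char_tokens, whitespace_words))  # 3.0
print(fertility(chinese, char_tokens, whitespace_words))  # 6.0
```

With the same tokenizer, the Chinese sentence receives a higher fertility purely because whitespace splitting treats the whole sentence as a single word. The number reflects the word definition rather than how well the tokenizer compresses the text.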

References

We thus opt not to use fertility, as it is unclear how to fairly calculate this across different languages.

Explaining and Mitigating Crosslingual Tokenizer Inequities (2510.21909 - Arnett et al., 24 Oct 2025) in Section 2.2 (Related Work — Measuring Compression)