Effect of baseline language choice in parallel-data tokenizer comparisons
Determine whether using a baseline language other than English, when comparing tokenizers trained on parallel data, changes the observed compression results and the conclusions drawn about token premium effects.
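To make the role of the baseline concrete, the following minimal sketch treats a token premium as the ratio of a language's total token count on parallel text to the baseline language's total token count. This ratio definition, the function name `token_premiums`, and the counts in `toy_counts` are assumptions made for illustration. They are not taken from the cited paper and do not reproduce its experimental setup.

```python
from typing import Dict


def token_premiums(token_counts: Dict[str, int], baseline: str) -> Dict[str, float]:
    """Return each language's token count divided by the baseline's count.

    token_counts maps a language code to the total number of tokens needed
    to encode the same parallel content in that language (hypothetical input).
    A value above 1.0 means the language needs more tokens than the baseline.
    """
    if baseline not in token_counts:
        raise KeyError(f"baseline language {baseline!r} has no token count")
    base = token_counts[baseline]
    if base <= 0:
        raise ValueError("baseline token count must be positive")
    return {lang: count / base for lang, count in token_counts.items()}


# Invented counts for illustration only; not measured results.
toy_counts = {"en": 100, "es": 120, "fi": 150}

premiums_vs_en = token_premiums(toy_counts, baseline="en")
premiums_vs_es = token_premiums(toy_counts, baseline="es")

for lang in toy_counts:
    print(lang, round(premiums_vs_en[lang], 3), round(premiums_vs_es[lang], 3))
```

The reported premium for each language depends on which baseline is chosen, so the same counts yield different numbers under an English baseline than under a Spanish one. The sketch only illustrates that dependence; it does not address whether the conclusions of a full tokenizer comparison would change.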
References
It is unclear whether comparisons with different languages would lead to different results.
— Explaining and Mitigating Crosslingual Tokenizer Inequities
(2510.21909 - Arnett et al., 24 Oct 2025) in Section 8 (Limitations)