Effect of baseline language choice in parallel-data tokenizer comparisons

Determine whether using baseline languages other than English when comparing tokenizers trained on parallel data changes the observed compression outcomes and conclusions about token premium effects.

Background

Because of limited data availability, the paper compares each target language only against an English baseline for tokenizers trained on parallel data, and notes that this constraint may limit cross-language comparability.

Understanding whether baseline choice affects observed compression differences would strengthen conclusions about the efficacy of training on parallel data.
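To make the question concrete, the following minimal sketch computes a token premium against each possible baseline language. It assumes the premium is defined as the ratio of total token counts on the same parallel sentences (target over baseline). The function name `token_premiums` and the toy counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def token_premiums(token_counts: Dict[str, List[int]], baseline: str) -> Dict[str, float]:
    """Return the token premium of every non-baseline language.

    token_counts maps a language code to per-sentence token counts over the
    same parallel sentences. The premium is assumed here to be
    total_tokens(target) / total_tokens(baseline).
    """
    base_total = sum(token_counts[baseline])
    if base_total == 0:
        raise ValueError("baseline language has no tokens")
    return {
        lang: sum(counts) / base_total
        for lang, counts in token_counts.items()
        if lang != baseline
    }


# Hypothetical per-sentence token counts for three parallel sentences.
toy_counts = {
    "en": [10, 12, 8],
    "de": [12, 15, 9],
    "fi": [15, 18, 12],
}

# Recompute the premiums with each language taking a turn as the baseline.
premiums_by_baseline = {b: token_premiums(toy_counts, b) for b in toy_counts}

for baseline, premiums in premiums_by_baseline.items():
    print(baseline, {lang: round(p, 3) for lang, p in premiums.items()})
```

With a single fixed set of counts, changing the baseline only rescales the ratios. The open question concerns the harder case in which tokenizers are trained on parallel data for each comparison, so the counts themselves may differ from one baseline to another.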

References

It is unclear whether comparisons with different languages would lead to different results.

— Explaining and Mitigating Crosslingual Tokenizer Inequities (2510.21909 - Arnett et al., 24 Oct 2025) in Section 8 (Limitations)
