Causal Estimation of Tokenisation Bias (2506.03149v1)

Published 3 Jun 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Modern LLMs are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser -- which maps character-strings to subwords -- should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as tokenisation bias. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., $\langle hello \rangle$) in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters (i.e., \textit{``hello''}). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first $K$ to a tokeniser's vocabulary, where $K$ is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.

Summary

  • The paper introduces a causal inference approach using regression discontinuity design to isolate tokenisation bias in language models.
  • It shows that including subwords in the vocabulary can increase the probability of their characters by up to 17 times in small models.
  • Findings indicate that larger models exhibit reduced tokenisation bias, emphasizing implications for model fairness and tokeniser design.

Causal Estimation of Tokenisation Bias in LLMs

The paper "Causal Estimation of Tokenisation Bias" explores the nuanced issue of tokenisation bias in LMs. Authored by researchers from the University of Cambridge and ETH Zurich, this work explores how subword representation in tokenisers affects a model's learned probability distributions over character strings. Their approach incorporates causal inference techniques to quantitatively measure the extent of this bias, advancing our understanding of tokeniser design choices on model behavior.

Overview and Methodology

LLMs are typically trained over subword sequences produced by a tokeniser, yet they ultimately define probabilities over character strings. The researchers focus on a specific type of tokenisation bias: how including or excluding a subword from the tokeniser's vocabulary affects the probability the trained model assigns to its corresponding characters. Estimating this effect is challenging because each model is trained with a single tokeniser, making a direct comparison impractical.
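
To make the quantity of interest concrete, the following is a minimal sketch (not the paper's code) of how one might score a word's characters under a causal LM: the character-string log-probability is approximated by summing the log-probabilities of the word's subword tokens in context. The checkpoint, prompt, and helper function are illustrative assumptions.

```python
# Sketch: approximate the probability a causal LM assigns to a word's characters
# by summing the log-probabilities of the subwords its tokeniser produces.
# The checkpoint ("gpt2"), prompt, and helper name are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def char_string_logprob(prefix: str, word: str) -> float:
    """Log-probability of `word`'s subword tokens given `prefix` (a proxy for p(characters))."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    full_ids = tok(prefix + word, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = model(full_ids).logits.log_softmax(dim=-1)
    total = 0.0
    # Each token at position i is predicted by the logits at position i - 1.
    for i in range(prefix_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

# A word that starts with a space keeps the prefix tokenisation unchanged.
print(char_string_logprob("She waved and said", " hello"))
```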

To address this, the authors employ a causal framework based on a regression discontinuity design. This approach leverages the fact that tokenisation algorithms rank subwords and add the first K of them to the vocabulary, where K is an essentially arbitrary cutoff. By comparing similar subwords just above and below this cutoff, the paper isolates and quantifies the causal effect of vocabulary inclusion on model outputs.
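
The sketch below illustrates the regression discontinuity idea on synthetic data (the cutoff, bandwidth, and effect size are invented for illustration, not taken from the paper): the running variable is a subword's rank, treatment is inclusion in the vocabulary (rank below K), the outcome is the log-probability of the subword's characters, and the estimate is the jump between local linear fits on the two sides of the cutoff.

```python
# Sharp regression discontinuity sketch: the running variable is a subword's
# rank under the tokenisation algorithm; subwords with rank < K enter the
# vocabulary (treated), the rest do not (control). All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
K = 32_000                                             # illustrative vocabulary cutoff
ranks = rng.integers(K - 2000, K + 2000, size=4000)    # subwords near the cutoff
treated = ranks < K
# Synthetic outcome: log-prob of the subword's characters, with a jump at the cutoff.
logprob = -8.0 - 1e-4 * (ranks - K) + 2.0 * treated + rng.normal(0, 0.5, size=ranks.size)

def rdd_estimate(x, y, cutoff, bandwidth):
    """Local linear RDD: fit a line on each side within the bandwidth, return the jump at the cutoff."""
    left = (x >= cutoff - bandwidth) & (x < cutoff)
    right = (x >= cutoff) & (x < cutoff + bandwidth)
    fit_left = np.polyfit(x[left] - cutoff, y[left], 1)
    fit_right = np.polyfit(x[right] - cutoff, y[right], 1)
    # Treated limit (rank just below K) minus control limit (rank just above K).
    return np.polyval(fit_left, 0.0) - np.polyval(fit_right, 0.0)

print(f"estimated effect of vocabulary inclusion: {rdd_estimate(ranks, logprob, K, 1000):+.2f} nats")
```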

Key Findings

The empirical results demonstrate that tokenisation bias is significant and consistent across models, tokenisers, and vocabulary sizes. For instance, a subword's presence in a small model's vocabulary can increase the probability assigned to its characters by up to 17 times. The analyses use models trained with Byte Pair Encoding (BPE) and WordPiece (WP) tokenisers across various scales.
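
For intuition on how a BPE tokeniser produces the ranked subword list and cutoff exploited above, here is a toy, from-scratch sketch (a simplification for illustration, not the tokenisers or training setup used in the paper): adjacent symbol pairs are merged greedily by frequency, each merge adds one subword to an ordered list, and only the first K entries make it into the vocabulary.

```python
# Toy BPE sketch: merges are applied greedily by pair frequency, producing an
# ordered list of subwords; a vocabulary keeps only the first K of them.
from collections import Counter

def toy_bpe_ranking(words, num_merges):
    """Return new subwords in the order BPE would create them on a tiny corpus."""
    corpus = Counter(tuple(w) for w in words)   # each word as a tuple of symbols
    ranked = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]     # most frequent adjacent pair
        ranked.append(a + b)
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return ranked

ranked = toy_bpe_ranking(["hello", "hello", "help", "yellow"], num_merges=10)
K = 4                                           # an arbitrary cutoff, as in the paper's setup
print("in vocabulary:", ranked[:K])
print("just outside: ", ranked[K:K + 2])
```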

Moreover, the effect of tokenisation bias grows over the course of training, which is counterintuitive: an ideal LLM would assign the same probability to a character string regardless of tokenisation and so display no bias at all. Further, larger models exhibit less tokenisation bias than smaller ones, suggesting an inverse relationship between model capacity and susceptibility to tokenisation bias.

Theoretical and Practical Implications

The findings have several theoretical implications. They emphasize the importance of considering tokenisation choices during model pretraining, as these influence both the reliability and fairness of the model outputs. Especially in multilingual contexts, where tokenisation length and complexity can affect performance, understanding tokenisation bias is critical for achieving more equitable model behavior across languages.

Practically, these insights could inform future tokeniser design, providing a basis for more informed vocabulary selection. The causal analysis framework proposed could be employed to evaluate tokenisation efficiency and improve LLM generalization across lexical variants.

Future Directions

This research opens avenues for further exploration into optimizing tokenisation strategies. Future studies could explore tokenisation biases in other architectures and training paradigms, particularly focusing on the balance between computational efficiency and model performance. Additionally, investigating the trade-offs between subword granularity and semantic richness could yield valuable insights into achieving the optimal tokenisation scheme for specific applications.

Conclusion

In conclusion, this paper presents a methodologically rigorous investigation into tokenisation bias in LLMs, employing causal inference to shed light on the impact of tokeniser vocabulary decisions. The researchers offer a novel perspective on a well-recognized phenomenon, providing actionable insights for both the practical development of language technologies and the theoretical advancement of LLM evaluations. The paper’s emphasis on causal effects encourages a deeper understanding of how tokenisation strategies can shape the capabilities and limitations of LLMs.
