- The paper introduces a causal inference approach, based on a regression discontinuity design, to isolate tokenisation bias in language models.
- It shows that including a subword in the tokeniser's vocabulary can increase the probability of its characters by up to 17 times in small models.
- Findings indicate that larger models exhibit reduced tokenisation bias, with implications for model fairness and tokeniser design.
Causal Estimation of Tokenisation Bias in LLMs
The paper "Causal Estimation of Tokenisation Bias" explores the nuanced issue of tokenisation bias in LMs. Authored by researchers from the University of Cambridge and ETH Zurich, this work explores how subword representation in tokenisers affects a model's learned probability distributions over character strings. Their approach incorporates causal inference techniques to quantitatively measure the extent of this bias, advancing our understanding of tokeniser design choices on model behavior.
Overview and Methodology
LLMs operate over subword sequences produced by a tokeniser, yet at inference time they are often queried and evaluated in terms of probabilities over character strings. The researchers focus on a specific form of tokenisation bias: how including or excluding a subword in the tokeniser's vocabulary changes the probability the trained model assigns to that subword's characters. Measuring this is difficult because each model is trained with a single, fixed tokeniser, so the same model cannot be observed both with and without a given subword in its vocabulary.
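As a rough intuition for what is being compared, the sketch below scores the same character string under its canonical tokenisation and under a forced multi-piece split. This is only a simplified proxy for the paper's estimand, which contrasts models whose vocabularies differ; the checkpoint, prefix, word, and split here are illustrative assumptions.

```python
# Minimal sketch (not the paper's method): compare the log-probability a causal LM
# assigns to the same character string under two different tokenisations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed checkpoint; any causal LM with a subword tokeniser works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def string_logprob(prefix: str, continuation_ids: list) -> float:
    """Sum of log p(token | context) over a fixed tokenisation of the continuation."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids[0].tolist()
    ids = torch.tensor([prefix_ids + continuation_ids])
    with torch.no_grad():
        log_probs = model(ids).logits.log_softmax(-1)
    total = 0.0
    for pos, tid in enumerate(continuation_ids, start=len(prefix_ids)):
        total += log_probs[0, pos - 1, tid].item()  # token at pos is predicted from pos - 1
    return total

prefix = "The weather today is quite"
canonical = tok(" pleasant", add_special_tokens=False).input_ids      # e.g. a single subword
split = (tok(" plea", add_special_tokens=False).input_ids
         + tok("sant", add_special_tokens=False).input_ids)           # same characters, more pieces

print("canonical tokenisation:", string_logprob(prefix, canonical))
print("forced multi-piece split:", string_logprob(prefix, split))
```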
To address this, the authors adopt a causal framework based on a regression discontinuity design. The approach exploits the fact that tokenisation algorithms rank candidate subwords and add them to the vocabulary only up to a predefined cutoff. By comparing otherwise similar subwords that fall just above and just below this cutoff, the paper isolates and quantifies the causal effect of vocabulary inclusion on model outputs.
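A minimal sketch of the regression discontinuity idea on synthetic data is shown below: the running variable is a subword's rank in the tokeniser's ordering, the cutoff is the vocabulary size, and the outcome is the log-probability assigned to the subword's characters. All numbers are simulated for illustration and do not come from the paper.

```python
# Regression discontinuity sketch on simulated data: estimate the jump in the
# outcome at the vocabulary cutoff using local linear fits on each side.
import numpy as np

rng = np.random.default_rng(0)
cutoff = 32_000                                   # assumed vocabulary-size cutoff
ranks = rng.integers(cutoff - 2_000, cutoff + 2_000, size=4_000)
in_vocab = (ranks < cutoff).astype(float)         # "treatment": subword made it into the vocabulary

# Simulated outcome: smooth trend in rank plus a jump at the cutoff (the bias).
true_effect = 1.5                                 # arbitrary jump in log-probability for the demo
logprob = -0.0004 * (ranks - cutoff) + true_effect * in_vocab + rng.normal(0, 0.5, ranks.size)

def fit_at_cutoff(x, y, at):
    """Least-squares line through (x, y), evaluated at `at` (its intercept)."""
    X = np.column_stack([np.ones_like(x, dtype=float), (x - at).astype(float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0]

bandwidth = 500                                   # only use subwords ranked near the cutoff
left = (ranks >= cutoff - bandwidth) & (ranks < cutoff)    # just inside the vocabulary
right = (ranks >= cutoff) & (ranks < cutoff + bandwidth)   # just outside it

effect = fit_at_cutoff(ranks[left], logprob[left], cutoff) - fit_at_cutoff(ranks[right], logprob[right], cutoff)
print(f"estimated effect at the cutoff: {effect:.2f} (simulated true effect: {true_effect})")
```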
Key Findings
The empirical results show that tokenisation bias is substantial and consistent across models, tokenisers, and vocabulary sizes. For instance, in a model with a small vocabulary, the presence of a subword in the vocabulary can increase the probability of its characters by up to 17 times. The analyses cover models trained with Byte Pair Encoding (BPE) and WordPiece (WP) tokenisers at various scales.
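To make the in-vocabulary vs. out-of-vocabulary distinction concrete, the short check below reports whether a word surfaces as a single subword under a BPE and a WordPiece tokeniser; the checkpoints named are assumptions for illustration, not the tokenisers used in the paper.

```python
# Check a word's "treatment status": is it a single subword in the vocabulary,
# or does the tokeniser split it into several pieces?
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")               # byte-level BPE (assumed checkpoint)
wp = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece (assumed checkpoint)

for word in ["pleasant", "tokenisation"]:
    for name, tok in [("BPE", bpe), ("WordPiece", wp)]:
        pieces = tok.tokenize(word)
        status = "one subword in the vocabulary" if len(pieces) == 1 else f"split into {pieces}"
        print(f"{name:9s} {word!r}: {status}")
```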
Moreover, the effect of tokenisation bias was found to grow over the course of training, a counterintuitive result given that an ideal LM should, in theory, exhibit no tokenisation bias. Further, larger models showed less tokenisation bias than smaller ones, suggesting an inverse relationship between model capacity and susceptibility to this bias.
Theoretical and Practical Implications
The findings have several theoretical implications. They underscore the importance of tokenisation choices made before pretraining, since these influence both the reliability and the fairness of model outputs. In multilingual settings especially, where tokenisation length and granularity vary across languages and can affect performance, understanding tokenisation bias is critical for achieving more equitable model behavior across languages.
Practically, these insights could inform future tokeniser design, providing a principled basis for vocabulary selection. The proposed causal analysis framework could also be used to evaluate tokenisers and to improve LLM generalization across lexical variants.
Future Directions
This research opens avenues for further exploration into optimizing tokenisation strategies. Future studies could explore tokenisation biases in other architectures and training paradigms, particularly focusing on the balance between computational efficiency and model performance. Additionally, investigating the trade-offs between subword granularity and semantic richness could yield valuable insights into achieving the optimal tokenisation scheme for specific applications.
Conclusion
In conclusion, this paper presents a methodologically rigorous investigation into tokenisation bias in LLMs, employing causal inference to shed light on the impact of tokeniser vocabulary decisions. The researchers offer a novel perspective on a well-recognized phenomenon, providing actionable insights for both the practical development of language technologies and the theoretical understanding of LLM evaluation. The paper's emphasis on causal effects encourages a deeper understanding of how tokenisation strategies can shape the capabilities and limitations of LLMs.