Causal Estimation of Tokenisation Bias
- The paper introduces a causal estimation framework employing regression discontinuity to quantify how tokeniser subword inclusion affects model output probabilities.
- It details experiments on LLaMA-style Transformers with various tokenisers and vocabulary sizes, showing significant probability jumps and aggregate bias metrics.
- The study highlights practical implications for multilingual fairness and economic costs while proposing algorithmic debiasing strategies like BPTree to mitigate tokenisation bias.
Causal estimation of tokenisation bias concerns quantifying the effect that a tokeniser's subword vocabulary construction has on language model (LM) behavior, specifically on the probabilities assigned to character-strings. While idealized LMs would be invariant to tokenisation choices, in practice the presence or absence of specific subwords in the vocabulary exerts a causal influence on model output, resulting in systematic disparities, termed tokenisation bias, across contexts, languages, and model architectures. Recent work rigorously formalizes this bias, primarily via a regression discontinuity design leveraging the constructional properties of standard subword tokenisers (Lesci et al., 3 Jun 2025). Complementary research investigates the downstream consequences for multilingual fairness, inference efficiency, and economic costs (Lundin et al., 5 Sep 2025), while related studies detail general sampling biases arising from token-based autoregressive decoders (Phan et al., 2024).
1. Formal Definition of Tokenisation Bias
Tokenisation bias is defined as the change in the probability a model assigns to a character-string due solely to the inclusion or exclusion of its corresponding subword in the tokeniser's vocabulary. Let $t$ index subwords and let $\sigma(t)$ denote the character-string spanned by subword $t$. Define the treatment indicator $D_t = \mathbb{1}[t \in \mathcal{V}]$, where $\mathcal{V}$ is the active vocabulary. The potential outcomes are

$$Y_t(d) = \mathbb{E}_{\mathbf{c}}\!\left[\log p_\theta\big(\sigma(t) \mid \mathbf{c}\big) \,\middle|\, D_t = d\right], \qquad d \in \{0, 1\},$$

where the expectation is over all preceding contexts $\mathbf{c}$. The individual tokenisation bias (individual treatment effect) is $\tau_t = Y_t(1) - Y_t(0)$, and the average tokenisation bias is $\tau = \mathbb{E}[\tau_t]$. This structure frames tokenisation bias explicitly as a causal estimand, focusing on the direct effect of subword presence in the vocabulary (Lesci et al., 3 Jun 2025).
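The estimand can be illustrated with a toy computation (all numbers below are hypothetical; `Y1` and `Y0` stand in for the expected log-probabilities under inclusion and exclusion):

```python
import statistics

# Hypothetical expected log-probabilities of each subword's character-string:
# Y1[t] = Y_t(1), the outcome when t is included in the vocabulary;
# Y0[t] = Y_t(0), the outcome when t is excluded.
Y1 = {"token": -2.1, "isation": -3.0, "bias": -2.5}
Y0 = {"token": -5.0, "isation": -5.8, "bias": -5.4}

# Individual tokenisation bias (individual treatment effect): tau_t = Y_t(1) - Y_t(0).
ite = {t: Y1[t] - Y0[t] for t in Y1}

# Average tokenisation bias: tau = E[tau_t].
ate = statistics.mean(ite.values())
```

In practice only one potential outcome is observed per subword, which is precisely why the regression discontinuity design of the next section is needed.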
2. Causal Estimation via Regression Discontinuity
Bottom-up tokenisation algorithms (e.g., BPE, WordPiece) construct a ranked sequence of candidate merges $m_1, m_2, \dots$; only the first $k$ merges constitute the active vocabulary. For each subword $t$ with rank $r_t$, the treatment assignment is $D_t = \mathbb{1}[r_t \le k]$. The cutoff $k$ (the vocabulary size limit) is exogenous, yielding a natural regression discontinuity (RD) framework.
The causal RD estimand is:

$$\tau_{\mathrm{RD}} = \lim_{r \uparrow k} \mathbb{E}[Y_t \mid r_t = r] - \lim_{r \downarrow k} \mathbb{E}[Y_t \mid r_t = r].$$

In practice, a local linear regression is fitted within rank bandwidth $h$ about $k$:

$$Y_t = \alpha + \tau D_t + \beta (r_t - k) + \gamma D_t (r_t - k) + \varepsilon_t, \qquad |r_t - k| \le h.$$

The coefficient $\tau$ estimates the local average treatment effect at the cutoff.
To ensure identification, the primary assumption is the continuity of the expected potential outcomes $\mathbb{E}[Y_t(d)]$ in $r_t$ around $k$. Empirical robustness is checked by varying $h$, employing kernel-weighted fits, and applying LOESS curves on each side of the cutoff. Exclusion of nested subwords is required to satisfy non-interference (SUTVA).
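As a minimal sketch of the estimator, the local linear fit can be run on simulated data with a known jump (the cutoff, bandwidth, slope, and noise level here are all assumed for illustration, not taken from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: merge ranks within bandwidth h of a cutoff k,
# with a true discontinuity of tau = 3.0 nats for included subwords.
k, h, tau = 32_000, 2_000, 3.0
ranks = np.arange(k - h, k + h)                 # ranks near the cutoff
D = (ranks <= k).astype(float)                  # treatment: in-vocabulary
Y = -4.0 + tau * D - 1e-4 * (ranks - k) + rng.normal(0.0, 0.1, ranks.size)

# Local linear regression: Y = a + tau*D + b*(r-k) + g*D*(r-k) + eps.
r = ranks - k
X = np.column_stack([np.ones_like(Y), D, r, D * r])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
tau_hat = coef[1]                               # estimated jump at the cutoff
```

With enough subwords inside the bandwidth, `tau_hat` recovers the simulated discontinuity closely; robustness checks would repeat the fit over several bandwidths $h$.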
3. Empirical Quantification and Key Findings
Experiments employ LLaMA-style Transformers (6–24 layers, 57M–850M non-embedding parameters) trained on an English corpus (“MiniPile”). Multiple tokenisers are used: vanilla BPE, WordPiece, and a composite “BPE2WP” (WordPiece applied over a BPE vocabulary), with several vocabulary sizes (up to $128$k).
For each model, in-context log-probabilities are computed for all vocabulary-eligible subwords on a held-out validation set, and the RD estimator is applied within a fixed rank bandwidth $h$ around the cutoff.
Key results:
- For a 57M-parameter model with BPE, the estimated jump $\hat{\tau}$ is large and positive, corresponding to an order-of-magnitude higher probability for subwords present in the vocabulary compared to those just excluded.
- Confidence intervals for $\hat{\tau}$ exclude zero by a wide margin.
- Smaller vocabularies yield somewhat lower bias; larger vocabularies ($128$k) show similar effects.
- Across tokenisers, estimated jumps range from $2.5$ to $3.2$ nats.
- Increasing model size reduces the bias, with estimates declining from the $340$M to the $850$M model.
- Training dynamics: at initialization, the measured gap reflects only the entropy of the untrained model; it declines sharply over the initial steps, then rises steadily to its asymptotic value by step $50$k.
Subwords included in the vocabulary exhibit substantially higher stability in output log-probability across contexts than those marginally excluded (Lesci et al., 3 Jun 2025).
4. Downstream Impact: Length, Accuracy, and Economic Effects
Tokenisation bias manifests as a “length bias”: strings requiring multiple subwords to encode receive systematically lower probabilities than those mapped to a single subword, after controlling for frequency and context. This directly affects downstream metrics such as perplexity, generation stability, and, crucially, fairness in multilingual scenarios.
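Length bias follows directly from autoregressive scoring: a string's log-probability is the sum of its subwords' log-probabilities, so encodings requiring more subwords tend to accumulate lower totals. A toy illustration (all log-probabilities are hypothetical):

```python
# Hypothetical model scores for the same string under two encodings:
# once as a single subword, once split into two subwords.
logp_single = {"tokenisation": -6.0}
logp_split = {"token": -4.5, "isation": -3.5}

# The string's total log-probability is the sum over its subwords, so the
# multi-subword encoding typically ends up with a lower score.
lp_one = sum(logp_single.values())
lp_two = sum(logp_split.values())
```

Here the single-subword encoding scores $-6.0$ nats versus $-8.0$ for the split encoding, mirroring the systematic penalty described above.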
In multilingual LLM evaluation, token inflation, measured via fertility $f = T / W$ (where $T$ is the total token count for $W$ whitespace-separated words), serves as a proxy for tokenisation inefficiency. Lundin et al. show that higher fertility predicts lower MCQA accuracy across African languages, with negative OLS regression slopes: each additional token per word correlates with an $8$–$18$ point drop in accuracy. Fertility explains $20$–$50$\% of accuracy variance per (model, subject) pair (Lundin et al., 5 Sep 2025).
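The fertility metric itself is straightforward to compute; a minimal sketch (the `fertility` helper and the per-word token counts are illustrative):

```python
def fertility(token_counts_per_word):
    """Fertility f = T / W: total tokens T over W whitespace-separated words."""
    W = len(token_counts_per_word)
    T = sum(token_counts_per_word)
    return T / W

# Hypothetical token counts for a five-word sentence under some tokeniser:
# f = 10 tokens / 5 words = 2.0, i.e., two tokens per word on average.
f = fertility([1, 2, 3, 1, 3])
```

A fertility of $1.0$ means every word maps to a single token; morphologically rich or under-served languages often see far higher values.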
Given transformers’ quadratic scaling in sequence length, doubling fertility quadruples training cost and time. For instance, if English ($f = 1$) requires \$105M in training cost, a language with $f = 2$ incurs \$420M for equivalent data. Inference latency and per-token cost also scale with fertility.
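The quadratic cost relationship can be sketched directly (the `scaled_cost` helper is illustrative, not from the cited work):

```python
def scaled_cost(base_cost, base_fertility, fertility):
    """Quadratic attention cost: sequence length grows linearly with
    fertility, so compute for the same underlying text grows with its
    square. Doubling fertility therefore quadruples cost."""
    return base_cost * (fertility / base_fertility) ** 2

# Figures from the text: English at f=1 costs $105M; a language at f=2
# incurs four times that for equivalent data.
cost = scaled_cost(105e6, 1.0, 2.0)
```

This yields \$420M at $f = 2$, matching the example above; the same scaling applies to inference latency for long prompts.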
A plausible implication is that tokenisation-induced disparities disproportionately penalize morphologically rich, low-resource languages, both in model accuracy and in economic resources.
5. Causal Mechanisms and Algorithmic Mitigation
Tokenisation bias emerges via the interaction between tokeniser design and autoregressive decoding. Standard encoding schemes such as BPE and maximum prefix encoding (MPE) introduce an irreducible bias: the model’s estimated next-character distribution $\hat{p}(c \mid \mathbf{t})$ (where $\mathbf{t}$ is the token sequence encoding the prompt) can diverge systematically from the token-free ground truth $p(c \mid s)$.
Explicitly, for prompt $s$ (encoded as $\mathbf{t}$) and character $c$:

$$\hat{p}(c \mid s) = \sum_{w \in \mathcal{V}(c)} p_\theta(w \mid \mathbf{t}),$$

where $\mathcal{V}(c)$ is the set of subwords starting with $c$. Because the encoding scheme excludes some valid segmentations from ever being produced, this quantity generally differs from $p(c \mid s)$.
Algorithmic debiasing strategies are feasible. The Branch-and-Pass Tree (BPTree) estimator reconstructs unbiased character-level probabilities from any tokenized LM without retraining, operating via brute-force marginalization over subwords starting with specified character suffixes. This approach has been empirically validated: in controlled Markov chain settings, BPTree recovers character-level transition probabilities, whereas direct token prompting exhibits substantial systematic bias (Phan et al., 2024).
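A simplified sketch of the marginalization step is shown below; the `next_char_prob` helper is hypothetical and ignores the encoding constraints that the full BPTree procedure must handle:

```python
def next_char_prob(next_token_probs, char):
    """Marginalize a next-token distribution to a next-character probability:
    sum the probabilities of all subwords whose surface form starts with
    the given character. A faithful BPTree implementation must additionally
    account for which token sequences the encoder can actually produce."""
    return sum(p for tok, p in next_token_probs.items() if tok.startswith(char))

# Hypothetical next-token distribution from a tokenized LM:
probs = {"the": 0.4, "to": 0.2, "a": 0.3, "an": 0.1}
p_t = next_char_prob(probs, "t")   # mass on "the" and "to"
```

The cost of this step grows with the number of subwords sharing the queried prefix, which is the scalability concern noted below for large vocabularies.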
Mitigation via BPTree is practical for moderate vocabulary sizes and low-latency applications, but summation over large subword sets can present scalability challenges. The approach extends naturally to BPE as well as WordPiece tokenisation.
6. Limitations, Assumptions, and Open Challenges
The causal estimation framework rests on several assumptions:
- The cutoff rank $k$ is exogenous, with no hidden confounding near the threshold.
- The expected potential outcomes $\mathbb{E}[Y_t(d)]$ are continuous in $r_t$ around $k$.
- Experiments are limited to English text; extension to other scripts and languages with higher fertility remains necessary.
- Results pertain to local effects: the RD estimator captures average treatment effects for subwords near the vocabulary boundary, not global properties of the vocabulary.
- SUTVA may be violated if inclusion/exclusion of one subword affects the likelihood of others (especially nested or overlapping subwords).
- In observational regression frameworks (e.g., (Lundin et al., 5 Sep 2025)), unobserved confounders (training data volume, domain divergence, script effects) may bias causal estimates.
A plausible implication is that tokenisation bias may be somewhat mitigated by increasing model size or adopting morphologically aware tokenisers, but cannot be entirely eliminated without fundamental architectural shifts or token-free modeling paradigms.
7. Practical Implications and Recommendations
Tokenisation bias is a major determinant of both the output probabilities and practical deployment characteristics of LMs:
- Tokeniser choice alters the probabilities of character-strings by orders of magnitude (several nats in log-probability), explaining phenomena such as length bias and accuracy gaps for morphologically rich or low-resource languages.
- Regression discontinuity estimators, as proposed by Lesci et al., allow accurate measurement of vocabulary design impacts without retraining (Lesci et al., 3 Jun 2025).
- For multilingual and low-resource scenarios, token inflation directly depresses accuracy and multiplies computational cost, underscoring the need for morphologically aware tokenisation and fair resource allocation (Lundin et al., 5 Sep 2025).
- Algorithmic post-hoc debiasing enables token-free sampling from standard LMs, improving evaluation fairness and adaptability to diverse tokenisation schemes (Phan et al., 2024).
The selection of vocabulary size, tokeniser architecture, and downstream evaluation benchmarks should be guided by explicit measurement of local treatment effects (e.g., via the RD estimate $\hat{\tau}$). Extension of these methods to other languages, alternative model families, and study of global (rather than local) bias remains a key future direction.