Tokenization Bias in Language Models
- Tokenization bias is a systematic disparity in how language models segment text into tokens, affecting performance and fairness.
- It arises from pre-tokenization rules and vocabulary selection choices, leading to over-fragmentation or under-fragmentation of text.
- Quantitative metrics and task-aware proxies demonstrate its impact, guiding practical recommendations for equitable model deployment.
Tokenization bias in LLMs refers to systematic and quantifiable disparities in how input text is segmented into tokens, resulting from the design of tokenization schemes, pre-tokenization rules, and vocabulary selection. Such bias directly impacts model efficiency, accuracy, fairness, and robustness—particularly for language varieties, rare forms, and morphologically complex or dialectal inputs. Tokenization bias is both a structural property of the tokenizer–language pair and a key confounder in the deployment of state-of-the-art LLMs, driving cost, performance, and downstream inclusiveness (Wegmann et al., 21 Feb 2025, Yang et al., 2024, Lesci et al., 3 Jun 2025).
1. Formal Properties of Tokenization and Pre-tokenizer Variants
Tokenization in LLMs is primarily operationalized via Byte Pair Encoding (BPE), WordPiece, or Unigram LM algorithms. BPE iteratively merges the most frequent symbol pair in a corpus, forming a vocabulary of a fixed target size. The initial token units and permissible merges are determined by a pre-tokenizer, which applies regular-expression-based partitioning to constrain merge boundaries.
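As a concrete illustration, the following minimal sketch (plain Python; the toy corpus and merge count are illustrative and not drawn from the cited work) implements the greedy BPE merge loop, with merges confined to pre-tokenized word boundaries:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: greedily merge the most frequent adjacent symbol pair.

    `words` maps each pre-tokenized word to its corpus frequency; merges
    never cross word boundaries, mirroring the role of the pre-tokenizer.
    """
    # Represent each word as a tuple of single-character symbols.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the working vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

# Illustrative toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = bpe_train(corpus, num_merges=10)
print(merges)     # e.g. [('e', 's'), ('es', 't'), ...]
print(segmented)  # final subword segmentation of each word
```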
The core pre-tokenizer variants explored are:
- no: No pre-tokenization, merges permitted everywhere.
- ws: Splits on every whitespace character.
- _ws: Groups whitespace runs, keeps leading space with tokens.
- llama3: Distinguishes Unicode categories (letters, numbers, punctuation), leveraging patterns for contractions and groupings.
- gpt2: Aggressively splits by category, retaining leading spaces, and differentiates “ word” from “word”.
These pre-tokenizers regulate which linguistic or orthographic features are preserved as atomic tokens, directly influencing bias for rare forms, dialects, or concatenative morphologies (Wegmann et al., 21 Feb 2025).
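To make the contrast concrete, the sketch below compares a plain whitespace split with the well-known GPT-2 pre-tokenization regex on a dialectal string (the example text is illustrative; the third-party `regex` package is required for Unicode character classes):

```python
import regex  # third-party `regex` package, needed for \p{L} / \p{N} classes

# GPT-2's pre-tokenization pattern: it peels off common English contractions,
# groups letters and digits, and keeps a leading space attached to the
# following word. (GPT-2's byte-level mapping step is omitted here.)
GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = "She was doin' fine, y'all!"

# `ws`-style pre-tokenization: split on every whitespace character.
print(text.split())
# ['She', 'was', "doin'", 'fine,', "y'all!"]

# `gpt2`-style pre-tokenization: category-aware splits with leading spaces.
print(regex.findall(GPT2_PATTERN, text))
# ['She', ' was', ' doin', "'", ' fine', ',', ' y', "'", 'all', '!']
```

The category-aware pattern isolates apostrophes and punctuation and distinguishes " word" from "word", which is exactly the kind of orthographic signal the form-sensitive tasks below rely on.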
2. Measuring Tokenization Bias and Impact on Downstream Tasks
Tokenization bias emerges through both over-fragmentation (splitting rare or out-of-distribution forms into excessive subwords) and under-fragmentation (failure to segment morphologically complex forms appropriately). Standard metrics for quantifying this phenomenon include:
- Token Length Ratio: the number of tokens produced for a text relative to a reference segmentation (e.g., another tokenizer or a parallel reference text)
- Fertility: average tokens per word, $\text{fertility} = \frac{|\text{tokens}(x)|}{|\text{words}(x)|}$
- Characters per Token (CPT), Vocabulary Allocation (AR), and Vocabulary Overlap (measured via Jensen–Shannon divergence, JSD); fertility and CPT are illustrated in the sketch below
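A minimal computation sketch for fertility and CPT, assuming only a `tokenize(text) -> list[str]` callable for the candidate tokenizer (the helper names and sample texts are illustrative):

```python
def fertility(texts, tokenize, word_split=str.split):
    """Average number of tokens per word over a sample of texts.

    `tokenize` is any callable returning a list of tokens for a string
    (e.g. the encode/tokenize method of a candidate tokenizer).
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(word_split(t)) for t in texts)
    return n_tokens / n_words

def chars_per_token(texts, tokenize):
    """Average characters covered by each token (higher = less fragmentation)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_chars / n_tokens

# Illustrative usage with a whitespace "tokenizer" as a stand-in:
sample = ["she was doin' fine", "tokenization bias matters"]
print(fertility(sample, str.split))        # 1.0 for the trivial tokenizer
print(chars_per_token(sample, str.split))  # roughly the average word length
```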
Bias manifests differently for tasks:
- Semantic robustness (e.g., NLI, paraphrase): Penalized by fragmentation of rare/dialectal forms (e.g., "doin'" split into "do" + "in"); best mitigated by the gpt2 pre-tokenizer with a mid-size vocabulary (32k).
- Form-sensitive tasks (e.g., authorship, register, dialect ID): Benefit from tokenizers preserving orthographic signals (suffixes, contractions), best served by larger vocabularies (≥64k) and pure-letter splits (llama3/gpt2) (Wegmann et al., 21 Feb 2025).
A task-aware intrinsic proxy, namely logistic-regression accuracy over binary token-presence features, achieves high correlation (≈0.86) with actual LLM fine-tuning performance, surpassing count-based proxies such as Rényi efficiency or raw corpus token count.
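A minimal sketch of such a proxy, assuming scikit-learn and a candidate tokenizer exposed as a callable; the function name `proxy_score` and the toy data are illustrative rather than the authors' released code:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def proxy_score(texts, labels, tokenize, cv=5):
    """Task-aware intrinsic proxy: cross-validated accuracy of a logistic
    regression over binary token-presence features produced by a candidate
    tokenizer. Higher scores suggest the tokenizer preserves task-relevant
    distinctions in its vocabulary."""
    # Binary bag-of-tokens features; the candidate tokenizer supplies the tokens.
    vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None,
                                 lowercase=False, binary=True)
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, np.asarray(labels), cv=cv,
                           scoring="accuracy").mean()

# Illustrative usage: `candidate_tokenize` would be the encode function of a
# trained candidate tokenizer; here a whitespace split stands in for it.
texts = ["she was doin' fine", "she was doing fine",
         "y'all seen it?", "have you seen it?"] * 10
labels = [1, 0, 1, 0] * 10  # e.g. dialectal vs. standard form
print(proxy_score(texts, labels, str.split, cv=5))
```

In practice one would run such a probe on a labelled sample of the target task for each candidate tokenizer configuration and rank the candidates by cross-validated accuracy.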
3. Quantitative Findings in Language Variation, Efficiency, and Fairness
Tokenization bias is a principal source of downstream inequity for less-represented linguistic forms and languages. Empirically:
- Pre-tokenizer choice has the largest effect size, dominating the choice of fitting corpus and vocabulary size (Wegmann et al., 21 Feb 2025).
- On robust semantic tasks, the best configuration (gpt2 pre-tokenizer at 32k vocab) achieves up to 8% gain over naïve tokenization; for style tasks, the llama3 pre-tokenizer with 64k vocab provides maximal sensitivity to orthographic detail.
- Twitter-trained tokenizers capture more orthographic/dialectal variance, helping style tasks while yielding only near-ties on semantic ones.
- Without aggressive splitting, rare or dialectal forms suffer under-representation, manifesting as higher error rates and lower consistency.
Intrinsic metrics correlate poorly with BERT fine-tuning performance unless they are aligned with the downstream task (e.g., via the logistic-regression proxy).
4. Tokenization-Induced Bias: Mechanisms and Practical Implications
Tokenization settings modulate the propensity to break or preserve rare surface forms:
- Tokenizers fitted on standard corpora often split rare words, which hurts semantic robustness, whereas a larger vocabulary or aggressive category-aware splitting can preserve form signals.
- Pre-tokenizer design is paramount, as it defines the set of allowed merges and hence the granularity of the token vocabulary.
- Vocabulary size interacts with task needs; insufficient size fragments forms, while excessive size unnecessarily inflates model parameters.
Bias is further exacerbated if tokenizer fitting and LLM pre-training corpora are misaligned: rare language variation present in the downstream domain may not be atomic in the tokenizer, degrading model utility unless specifically addressed (Wegmann et al., 21 Feb 2025).
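One quick way to observe this effect is to compare token counts for standard versus dialectal spellings under an off-the-shelf tokenizer; the sketch below uses the `tiktoken` package with its GPT-2 encoding, and the word pairs are illustrative:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

pairs = [
    ("doing", "doin'"),
    ("nothing", "nothin'"),
    ("you all", "y'all"),
]

for standard, dialectal in pairs:
    s_tokens = enc.encode(standard)
    d_tokens = enc.encode(dialectal)
    # Dialectal or non-standard spellings typically fragment into more,
    # rarer subwords than their standard counterparts.
    print(f"{standard!r}: {len(s_tokens)} token(s)   "
          f"{dialectal!r}: {len(d_tokens)} token(s)")
```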
5. Recommendations for Mitigating Tokenization Bias
Research consensus supports these strategies for bias reduction:
- Pre-tokenizer selection: Choose pre-tokenization rules that reflect the expected language and orthographic diversity (gpt2/llama3 for pure-letter splits).
- Vocabulary size adjustment: Scale according to downstream sensitivity—larger vocabularies for style or form-based tasks, moderate sizes for semantic tasks.
- Proxy-guided selection: Run logistic regression probes using candidate vocabularies as binary features to estimate likely downstream performance, enabling early filtering of suboptimal tokenizers (see the selection sketch below).
- Task-aligned fitting corpus: When corpus–domain shift is present, err toward larger, form-sensitive vocabularies and pre-tokenizers prioritizing orthographic forms.
These practical prescriptions are derived from systematic experimental evidence that aligns tokenization settings with task-specific model robustness, language coverage, and fairness constraints (Wegmann et al., 21 Feb 2025).
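Putting the recommendations together, a proxy-guided selection loop might look like the sketch below, which reuses the hypothetical `proxy_score` helper from Section 2; the candidate callables and the labelled development sample are placeholders named purely for illustration:

```python
# Placeholder callables standing in for real candidate tokenizers (in practice,
# the encode functions of BPE tokenizers trained with different pre-tokenizers
# and vocabulary sizes).
candidates = {
    "ws-split": str.split,
    "char-level": list,
    "char-bigram": lambda t: [t[i:i + 2] for i in range(0, len(t), 2)],
}

# A small labelled development sample of the target task (illustrative).
dev_texts = ["she was doin' fine", "she was doing fine",
             "y'all seen it?", "have you seen it?"] * 10
dev_labels = [1, 0, 1, 0] * 10

# Rank candidates by the task-aware proxy (reuses `proxy_score` from the
# earlier sketch) and keep the best-scoring tokenizer.
scores = {name: proxy_score(dev_texts, dev_labels, tok)
          for name, tok in candidates.items()}
best = max(scores, key=scores.get)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
print("Proxy-selected tokenizer:", best)
```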
6. Broader Context and Limitations
Findings reflect that tokenization is not merely a preprocessing artifact but a locus of inductive bias with material impact on generalization, computational efficiency, and equitable language representation in LLMs. The effect generalizes across BPE and related subword algorithms. Task-aware proxies allow for rapid, low-compute evaluation of candidate tokenizers before large-scale pre-training. However, optimal settings remain domain- and task-dependent; no universal best tokenizer exists, and continued research is necessary to account for emerging language variation and data shifts.
In summary, explicit design and evaluation of tokenization are critical for fair, efficient, and robust LLM deployment—particularly as language variation, domain adaptation, and typological diversity grow increasingly central in real-world LLM applications (Wegmann et al., 21 Feb 2025).