Understanding and Mitigating Tokenization Bias in Language Models (2406.16829v2)

Published 24 Jun 2024 in cs.CL, cs.AI, and cs.LG

Abstract: State-of-the-art LLMs are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the LLMs for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any LLM trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized LLM. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the LLM.

Citations (1)

Summary

  • The paper identifies an intrinsic next-character sampling bias that causes tokenized models to produce skewed conditional probability estimates.
  • It introduces a Branch and Pass algorithm that simulates token-free behavior from a tokenized model, with a number of model runs that scales linearly with sequence length.
  • Empirical validation in a Markov-chain setting shows the method recovers the true transition probabilities without any fine-tuning.

Understanding and Mitigating Tokenization Bias in Language Models

The paper "Understanding and Mitigating Tokenization Bias in LLMs" explores the limitations and potential biases introduced by tokenization processes in contemporary LLMs, which typically operate on subword units called tokens. These LLMs, exemplified by architectures such as GPTs, Llama, and Gemini, rely heavily on efficient tokenization to manage vocabulary constraints and processing efficiency. Tokenization schemes, particularly those rooted in maximum prefix matching, induce a sampling bias on conditional probability estimates that cannot be simply resolved by augmenting training data or further training.

The authors identify a specific failure mode they call next-character sampling bias: the next-character probabilities read off from a model operating on tokenized inputs diverge from the true conditional probabilities at the character level. Notably, they show that this bias persists regardless of how well the model is trained, so it is intrinsic to the tokenization process rather than a symptom of insufficient data or capacity.
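
To make the bias concrete, the following minimal sketch (illustrative, not taken from the paper) uses a toy three-token vocabulary and a greedy longest-prefix tokenizer to show why next-token probabilities are biased estimates of next-character probabilities.

```python
# Minimal illustration of next-character sampling bias under maximum prefix
# encoding (MPE). The three-token vocabulary and strings are hypothetical toy
# choices used only for illustration.

def mpe_encode(text, vocab):
    """Greedy longest-prefix-match tokenization (maximum prefix encoding)."""
    tokens, i = [], 0
    while i < len(text):
        # Take the longest vocabulary entry matching at position i.
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"a", "b", "ab"}

# Under MPE, any "a" immediately followed by "b" is merged into the token
# "ab", so the token pair ("a", "b") never appears in training data.
print(mpe_encode("aab", vocab))                      # ['a', 'ab']
print(mpe_encode("aab", vocab) == ["a", "a", "b"])   # False: unreachable under MPE

# A model trained on such data therefore assigns (near) zero probability to
# the token "b" directly after the token "a", even when the true
# character-level probability P(next char = 'b' | ...'a') is large. Reading
# next-token probabilities as next-character probabilities is biased, and no
# amount of extra data or training removes the bias.
```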

To address this, the authors propose algorithms that correct the bias without any model fine-tuning. The methods estimate a token-free model's behavior from its tokenized counterpart, in effect simulating an LLM that operates directly on characters. For maximum prefix encoding, this is achieved with a Branch and Pass algorithm that incrementally computes character-level probabilities while respecting the token structure; its cost, measured in model runs, scales linearly with sequence length, which keeps it practical for real-world deployments.
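
As a rough intuition for how such a correction aggregates token-level outputs, the sketch below sums a model's next-token probabilities over every vocabulary token that begins with a candidate character. This is a simplified view only, not the paper's Branch and Pass procedure (which additionally handles tokens that span the boundary of the encoded prefix); the `next_token_probs` interface and the toy distribution are assumed placeholders.

```python
# Simplified sketch: convert a next-token distribution into a next-character
# distribution by marginalizing over tokens that start with each character.
# NOT the paper's Branch and Pass algorithm; `next_token_probs` is an assumed
# placeholder for "query the tokenized LLM once".

def next_char_probs(prefix_tokens, vocab, next_token_probs):
    """Estimate P(next character | prefix) by aggregating the probabilities of
    all candidate next tokens whose string starts with that character."""
    token_dist = next_token_probs(prefix_tokens)  # dict: token -> probability
    char_probs = {}
    for token, p in token_dist.items():
        if token in vocab and token:
            first_char = token[0]
            char_probs[first_char] = char_probs.get(first_char, 0.0) + p
    # Renormalize in case the model places mass on tokens outside `vocab`.
    total = sum(char_probs.values())
    return {c: p / total for c, p in char_probs.items()} if total else char_probs

# Toy usage with a hand-specified token distribution (purely illustrative).
toy_vocab = {"a", "b", "ab"}

def fake_model(prefix_tokens):
    # Hypothetical next-token probabilities returned by a tokenized LLM.
    return {"a": 0.2, "b": 0.1, "ab": 0.7}

print(next_char_probs(["a"], toy_vocab, fake_model))  # {'a': 0.9, 'b': 0.1}
```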

The implications of this research are significant. Practically, it allows tokenized models to be used in settings that would otherwise require token-free operation, extending their reach to domains poorly served by existing tokenization methods. This comes without the complications of fine-tuning, a notable advantage given the computational and expertise barriers of retraining. Theoretically, the work suggests that LLMs implicitly capture character-level dependencies even though they are trained exclusively on token sequences.

This insight opens several avenues for future research. One direction is extending the analysis to tokenization schemes beyond the maximum prefix matching and byte-pair encoding (BPE) cases treated in the paper. The approach could also improve model interpretability by clarifying what models implicitly learn about underlying character structure from tokenized inputs. Further work could assess the method's impact on models used to distinguish nuanced or sensitive content, an area where tokenization bias could critically skew outcomes.

The methodology was validated through experiments in a controlled Markov-chain setting, where the proposed approach accurately recovered the chain's transition probabilities, in contrast to the skewed estimates obtained by directly prompting the model with tokenized context. This empirical validation underscores both the correctness of the algorithm and its potential as a tool for building more robust LLMs in future AI research.
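
The following self-contained sketch mimics that style of validation at toy scale, without a trained model: it samples from a two-state Markov chain over the characters "a" and "b" and compares the true transition probability P(b | a) with the naive estimate read off from MPE token statistics (the empirical analogue of prompting with tokenized context). The chain, vocabulary, and sample size are illustrative choices, not the paper's experimental configuration.

```python
import random

random.seed(0)
# Hypothetical two-state Markov chain over characters; true P(b | a) = 0.5.
P = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.3, "b": 0.7}}
vocab = {"a", "b", "ab"}

def sample_chain(length):
    """Sample a character string from the Markov chain, starting at 'a'."""
    chars = ["a"]
    for _ in range(length - 1):
        prev = chars[-1]
        chars.append("a" if random.random() < P[prev]["a"] else "b")
    return "".join(chars)

def mpe_encode(text):
    """Greedy longest-prefix-match tokenization (maximum prefix encoding)."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

text = sample_chain(200_000)

# Unbiased character-level estimate of P(b | a).
char_pairs = list(zip(text, text[1:]))
n_a = sum(1 for x, _ in char_pairs if x == "a")
n_ab = sum(1 for x, y in char_pairs if x == "a" and y == "b")
print("character-level P(b | a):", n_ab / n_a)   # close to 0.5

# Naive token-level estimate: condition on the previous token being "a" and
# read the first character of the following token.
tokens = mpe_encode(text)
tok_pairs = list(zip(tokens, tokens[1:]))
t_a = sum(1 for x, _ in tok_pairs if x == "a")
t_ab = sum(1 for x, y in tok_pairs if x == "a" and y[0] == "b")
print("token-level P(b | a):", t_ab / t_a)       # exactly 0.0
# Under MPE the token "a" is never followed by a token starting with "b",
# so the naive token-conditioned estimate is biased regardless of data size.
```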
