Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles (2410.09303v1)

Published 11 Oct 2024 in cs.CL and cs.LG

Abstract: Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing and comparing the stochastic behavior of tokenized models with their byte-level, or token-free, counterparts. We discover that, even when the two models are statistically equivalent, their predictive distributions over the next byte can be substantially different, a phenomenon we term "tokenization bias". To fully characterize this phenomenon, we introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution. From this result, we develop a next-byte sampling algorithm that eliminates tokenization bias without requiring further training or optimization. In other words, this enables zero-shot conversion of tokenized LMs into statistically equivalent token-free ones. We demonstrate its broad applicability with two use cases: fill-in-the-middle (FIM) tasks and model ensembles. In FIM tasks where input prompts may terminate mid-token, leading to out-of-distribution tokenization, our method mitigates performance degradation and achieves an approximately 18% improvement in FIM coding benchmarks, consistently outperforming the standard token healing fix. For model ensembles where each model employs a distinct vocabulary, our approach enables seamless integration, resulting in improved performance (up to 3.7%) over individual models across various standard baselines in reasoning, knowledge, and coding.

An Analysis of Exact Byte-Level Probabilities in LLMs

The paper "Exact Byte-Level Probabilities from Tokenized LLMs for FIM-Tasks and Model Ensembles" addresses the inherent challenges posed by tokenization in LLMs (LMs) and proposes a method to mitigate the resulting tokenization bias. Tokenization, while crucial for scaling capabilities of transformers, introduces non-intuitive biases when predicting next tokens, especially in cases like fill-in-the-middle (FIM) tasks.

Key Concepts and Contributions

The authors define tokenization bias as the discrepancy between the next-byte predictive distribution of a tokenized LM and that of its byte-level (token-free) counterpart. They show that even when the two models are statistically equivalent, their predictive distributions over the next byte can differ substantially.
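
To make the phenomenon concrete, the following toy simulation (our own construction, not an experiment from the paper) uses an assumed vocabulary {"a", "b", "ab"} with greedy BPE-style merging: the token pair ("a", "b") never appears in tokenized data, so a model fit to that data implicitly rules out the byte "b" after the token "a", even though the underlying byte process emits it half the time.

```python
import random
from collections import Counter

random.seed(0)

# Assumed ground-truth byte process: i.i.d. uniform over the bytes "a" and "b",
# so the true next-byte distribution after any prefix is 50/50.
def sample_bytes(n):
    return "".join(random.choice("ab") for _ in range(n))

# Greedy, BPE-style tokenizer with the toy vocabulary {"ab", "a", "b"}:
# the pair "a"+"b" is always merged into the single token "ab".
def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        if text.startswith("ab", i):
            tokens.append("ab")
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# Count which byte follows the *token* "a" in tokenized training data.
follows_token_a = Counter()
for _ in range(2_000):
    tokens = tokenize(sample_bytes(20))
    for prev, nxt in zip(tokens, tokens[1:]):
        if prev == "a":
            follows_token_a[nxt[0]] += 1

# The byte "b" never follows the token "a" here (it would have been merged
# into "ab"), so a tokenized LM fit to this data predicts "b" with
# probability ~0 after the token "a", even though the true byte-level
# conditional is 0.5 -- that gap is the tokenization bias.
print(follows_token_a)
```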

Byte-Token Representation Lemma

Central to the paper is the Byte-Token Representation Lemma, which provides a framework for computing exact byte-level probabilities from tokenized LMs. The lemma states that any byte-level prefix is covered by a set of valid token encodings, termed cover encodings, and that the byte-level probability of the prefix is obtained by aggregating the token-level probabilities of these cover encodings. By deriving an efficient search procedure for cover encodings, the authors can convert any tokenized model into its byte-level equivalent without retraining.
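
The following sketch illustrates the resulting marginalization under our own simplifying assumptions: a toy four-token vocabulary, a callable `token_prefix_prob` that returns the model's probability of starting its token stream with a given sequence, and brute-force enumeration in place of the paper's efficient search.

```python
from itertools import product

# Toy vocabulary (assumed): token id -> the byte string it decodes to.
VOCAB = {0: b"a", 1: b"b", 2: b"ab", 3: b"ba"}

def is_cover_encoding(tokens, prefix):
    """A token sequence covers `prefix` if its decoding starts with `prefix`
    while the decoding of all but its last token is still shorter than
    `prefix`, i.e. the final token straddles the prefix boundary."""
    decoded = b"".join(VOCAB[t] for t in tokens)
    head = b"".join(VOCAB[t] for t in tokens[:-1])
    return decoded.startswith(prefix) and len(head) < len(prefix)

def byte_prefix_prob(prefix, token_prefix_prob, max_tokens=4):
    """Byte-level probability of `prefix`: the sum, over all cover encodings,
    of the model's probability that its token stream starts with that
    encoding.  `token_prefix_prob` (assumed) returns that probability, e.g.
    a product of next-token conditionals.  Brute-force enumeration here;
    the paper derives an efficient search over cover encodings instead."""
    total = 0.0
    for k in range(1, max_tokens + 1):
        for tokens in product(VOCAB, repeat=k):
            if is_cover_encoding(tokens, prefix):
                total += token_prefix_prob(tokens)
    return total
```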

Advanced Sampling Algorithm

The paper also presents an efficient sampling algorithm that predicts the next byte directly from the model's token-level output. By marginalizing over the cover encodings, the method performs a zero-shot conversion of token-level predictions into byte-level ones, removing tokenization bias at inference time without any retraining.
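
Building on the `byte_prefix_prob` sketch above, next-byte prediction reduces to a ratio of two byte-prefix probabilities; the snippet below is an illustrative sketch of that step (assuming the same toy alphabet), not the authors' optimized sampler.

```python
import random

ALPHABET = (b"a", b"b")  # toy byte alphabet, matching the vocabulary above

def next_byte_distribution(prefix, token_prefix_prob):
    """Exact next-byte conditional: P(b | prefix) = P(prefix + b) / P(prefix).
    (End-of-sequence handling is ignored for brevity.)"""
    denom = byte_prefix_prob(prefix, token_prefix_prob)
    return {b: byte_prefix_prob(prefix + b, token_prefix_prob) / denom
            for b in ALPHABET}

def sample_next_byte(prefix, token_prefix_prob):
    """Draw one byte from the exact byte-level conditional distribution."""
    dist = next_byte_distribution(prefix, token_prefix_prob)
    symbols, weights = zip(*dist.items())
    return random.choices(symbols, weights=weights, k=1)[0]
```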

Empirical Findings

The practical implications of this work are demonstrated through notable improvements in FIM tasks and model ensemble scenarios:

  • Fill-in-the-Middle (FIM) Tasks:

The method substantially mitigates the performance degradation that occurs when prompts terminate mid-token, a common scenario in coding benchmarks. It achieves an approximately 18% improvement on FIM coding benchmarks and consistently outperforms the standard token-healing fix, demonstrating the effectiveness of the bias correction.

  • Model Ensembles:

By moving predictions to the byte level, the approach enables seamless integration of models that use different token vocabularies. This yields gains of up to 3.7% over the individual models on standard reasoning, knowledge, and coding benchmarks; a minimal byte-level ensembling sketch follows this list.
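
As referenced above, here is a minimal ensembling sketch under our own assumptions: each wrapped model exposes a `next_byte_distribution(prefix)` over the shared byte space (as in the earlier sketch), and the ensemble simply mixes these distributions with fixed weights; the paper's actual combination rule may differ.

```python
def ensemble_next_byte(prefix, models, weights=None):
    """Mix exact next-byte distributions from models with different token
    vocabularies.  Each element of `models` is assumed to expose a
    `next_byte_distribution(prefix)` method returning {byte: probability}
    in the shared byte space."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    mixed = {}
    for model, weight in zip(models, weights):
        for b, p in model.next_byte_distribution(prefix).items():
            mixed[b] = mixed.get(b, 0.0) + weight * p
    return mixed
```

Because the mixture lives in byte space, the models' tokenizers never need to agree on a common vocabulary.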

Theoretical and Practical Implications

This research offers theoretical insight into the nature and impact of tokenization in LMs. By formalizing the gap between the token and byte domains, it prompts a reconsideration of current tokenization strategies as models scale to longer contexts and more nuanced language. Practically, the findings promise improved performance on tasks requiring accurate FIM predictions and robust model ensembling, without the overhead of additional training.

Future Developments

The paper opens intriguing pathways for future research, such as extending byte-level ensembling or further refining the integration of models with disparate tokenizers. Building on this framework could yield better abstractions for handling tokens and bytes, potentially influencing the design of more general and flexible LMs.

In conclusion, the paper methodically uncovers and addresses a latent bias of tokenization, presenting a principled way to obtain unbiased predictions at the byte level. This holds potential not only for enhancing current LLM applications but also for laying the groundwork for future advances in language modeling.

Authors (6)
  1. Buu Phan (13 papers)
  2. Brandon Amos (49 papers)
  3. Itai Gat (30 papers)
  4. Marton Havasi (18 papers)
  5. Matthew Muckley (12 papers)
  6. Karen Ullrich (24 papers)