- The paper shows, using importance sampling, that the marginal probability of a string over all tokenizations is generally close to the probability of its canonical tokenization, while non-canonical tokenizations still carry usable hidden signal.
- It introduces a sequential importance sampling estimator that makes this marginal tractable to approximate, since computing it exactly is #P-hard (and finding the most likely tokenization is NP-hard).
- Results indicate that leveraging non-canonical signals can enhance performance on benchmarks like HellaSwag and SocialIQA.
Overview
The paper "Where is the signal in tokenization space?" (2408.08541) explores the potential information carried by non-canonical tokenizations in autoregressive LLMs. Typically, LLMs assume a fixed canonical sequence for tokenization due to the deterministic nature of tokenizers like Byte-Pair Encoding (BPE). However, this paper investigates the possibility of extracting additional signal from the variety of non-canonical tokenizations, which are often ignored in practice.
Tokenization Challenges
Canonical vs. Non-Canonical Tokenizations
Tokenization is a fundamental step in preparing text for LLMs: it splits text into subword tokens drawn from a fixed vocabulary, which the model then processes autoregressively. Canonical tokenization is produced by deterministic algorithms such as Byte-Pair Encoding (BPE), which apply a learned table of merge rules to yield a single, so-called canonical sequence for any input string. Non-canonical tokenizations are alternative decompositions of the same string into valid vocabulary tokens; the tokenizer never emits them, but the model can still assign them probability.
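The distinction is easy to see with a toy vocabulary (a deliberately simplified sketch, not the paper's tokenizer or any real BPE vocabulary): several segmentations of the same string are valid, but a deterministic tokenizer would emit only one of them.

```python
# Toy illustration: a small subword vocabulary admits several segmentations of
# the same string; a real BPE tokenizer would deterministically pick one of
# them as the canonical sequence.
VOCAB = {"un", "believ", "able", "u", "n", "believable", "abl", "e", "b", "eliev"}

def segmentations(s):
    """Enumerate every way to split s into tokens from VOCAB."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in VOCAB:
            for rest in segmentations(s[i:]):
                yield [piece] + rest

for tokens in segmentations("unbelievable"):
    print(tokens)
# One of these (e.g. ['un', 'believ', 'able']) would be the canonical output;
# the others are non-canonical tokenizations of the identical string.
```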
Computational Complexity
The paper provides a rigorous analysis showing that both finding the most likely tokenization of a string under an autoregressive LLM and computing the marginal probability over all of its tokenizations are computationally hard: the former is NP-hard and the latter is #P-hard. This underscores the difficulty of exploring tokenization space beyond the canonical sequence. Given the exponential number of possible tokenizations, direct enumeration is infeasible, pointing researchers and practitioners towards approximation strategies.
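Concretely, the quantity whose exact computation is #P-hard is the marginal probability of a string, obtained by summing the autoregressive model's probability over every tokenization that decodes back to that string (standard notation, not taken verbatim from the paper):

```latex
P(s) \;=\; \sum_{\mathbf{t} \,:\, \mathrm{decode}(\mathbf{t}) = s} P(\mathbf{t})
      \;=\; \sum_{\mathbf{t} \,:\, \mathrm{decode}(\mathbf{t}) = s} \;
            \prod_{i=1}^{|\mathbf{t}|} P\!\left(t_i \mid t_{<i}\right).
```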
Methodology
Importance Sampling for Marginal Probability
To approximate the marginal probability over tokenizations, the authors implement a sequential importance sampling estimator whose proposal distribution uses a one-step look-ahead: at each step, the next token is drawn only from vocabulary entries compatible with the untokenized remainder of the string, with probabilities renormalized accordingly. This prunes invalid tokenizations and concentrates sampling effort on feasible candidates.
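A minimal sketch of such an estimator is below. It assumes a hypothetical hook `log_prob_next(tokens)` that returns the model's next-token log-probabilities as a dict keyed by token string, ignores end-of-sequence handling, and restricts each proposal step to tokens that are a prefix of the remaining text; the accumulated normalizers give the importance weight P(t)/q(t). It illustrates the general technique rather than the paper's exact implementation.

```python
import math
import random

def sample_tokenization(s, vocab, log_prob_next, rng=random):
    """Draw one tokenization of s from a one-step look-ahead proposal.

    Returns (tokens, log_weight), where log_weight = log p(tokens) - log q(tokens).
    `log_prob_next(tokens)` is a hypothetical hook returning a dict mapping each
    vocabulary token to the LLM's next-token log-probability given `tokens`.
    """
    tokens, log_weight, rest = [], 0.0, s
    while rest:
        logp = log_prob_next(tokens)
        # Proposal support: only tokens compatible with the remainder of s.
        valid = [v for v in vocab if rest.startswith(v)]
        weights = [math.exp(logp[v]) for v in valid]
        z = sum(weights)                      # normalizer of the restricted proposal
        choice = rng.choices(valid, weights=weights, k=1)[0]
        tokens.append(choice)
        log_weight += math.log(z)             # p(t)/q(t) is the product of normalizers
        rest = rest[len(choice):]
    return tokens, log_weight

def estimate_log_marginal(s, vocab, log_prob_next, num_samples=32):
    """Importance-sampling estimate of log P(s), the log-marginal over tokenizations."""
    log_ws = [sample_tokenization(s, vocab, log_prob_next)[1]
              for _ in range(num_samples)]
    m = max(log_ws)                           # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(w - m) for w in log_ws) / num_samples)
```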
Empirical Analysis
Through empirical investigation with LLMs such as Gemma-2B, Llama2-7B, and Mamba-130M, the paper reports that the estimated marginal probabilities are generally close to the canonical tokenization probabilities, despite the enormous space of non-canonical possibilities. Nevertheless, non-canonical tokenizations were found to carry meaningful signal for inference, yielding performance gains on multiple-choice question answering benchmarks.
Results and Implications
The findings demonstrate that even minimal signal from non-canonical tokenizations can contribute to improved performance in downstream tasks such as question answering. By leveraging ensemble strategies that aggregate probabilities from these tokenizations, LLMs showed consistent improvements across benchmarks like HellaSwag and SocialIQA, compared to conventional approaches relying solely on canonical sequences.
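One straightforward way to realize such an ensemble is sketched below: it mixes each candidate answer's canonical log-probability with its estimated marginal log-probability (for example, from `estimate_log_marginal` above) and takes the argmax. The mixing weight and the numbers in the usage example are assumptions for illustration, not the paper's aggregation rule or results.

```python
def pick_answer(choices, canonical_logp, marginal_logp, alpha=0.5):
    """Score each multiple-choice candidate by mixing its canonical log-probability
    with an estimated marginal log-probability over tokenizations, then pick the best.
    `alpha` is an illustrative mixing weight, not the paper's setting."""
    def score(choice):
        return alpha * canonical_logp[choice] + (1 - alpha) * marginal_logp[choice]
    return max(choices, key=score)

# Call shape only; the log-probabilities here are made up for illustration.
choices = ["ending A", "ending B"]
canonical = {"ending A": -12.3, "ending B": -11.8}
marginal = {"ending A": -11.9, "ending B": -12.1}
print(pick_answer(choices, canonical, marginal))   # -> "ending B"
```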
Future Research Directions
The paper suggests promising pathways for further exploration into tokenization spaces, highlighting the potential utility of non-canonical sequences in enhancing LLM capabilities. This approach may redefine conventional views on model evaluation and inference efficiency, encouraging the development of models that can effectively harness non-canonical tokenizations.
Conclusion
In summary, while computational challenges limit direct use of non-canonical tokenizations, their potential contribution to improved model performance cannot be overlooked. The paper advocates for greater attention to tokenization diversity, urging advancements in approximation techniques and evaluation metrics that consider the broader signal present within tokenization spaces. This could lead to more robust and accurate LLMs capable of outperforming current standards with relatively straightforward modifications to inference strategies.