Where is the signal in tokenization space?

Published 16 Aug 2024 in cs.CL and cs.LG | (2408.08541v2)

Abstract: LLMs are typically shipped with tokenizers that deterministically encode text into so-called canonical token sequences, to which the LLMs assign probability values. One common assumption is that the probability of a piece of text is the probability of its canonical token sequence. However, the tokenization of a string is not unique: e.g., the Llama2 tokenizer encodes Tokens as [Tok,ens], but [Tok,en,s] also represents the same text. In this paper, we study non-canonical tokenizations. We prove that, given a string, it is computationally hard to find the most likely tokenization for an autoregressive LLM, as well as to compute the marginal probability over all possible tokenizations. We then show how the marginal is, in most cases, indistinguishable from the canonical probability. Surprisingly, we then empirically demonstrate the existence of a significant amount of signal hidden within tokenization space. Notably, by simply aggregating the probabilities of non-canonical tokenizations, we achieve improvements across a range of LLM evaluation benchmarks for a variety of architectures, including transformers and state space models.

Citations (2)

Summary

  • The paper demonstrates, via importance sampling, that the marginal probability over all tokenizations is in most cases close to the canonical probability, yet non-canonical tokenizations still carry hidden signal.
  • It proves that finding the most likely tokenization is NP-hard and computing the marginal is #P-hard, and introduces a sequential importance sampling estimator to approximate the marginal efficiently.
  • Results indicate that aggregating probabilities from non-canonical tokenizations improves performance on benchmarks such as HellaSwag and SocialIQA.

Overview

The paper "Where is the signal in tokenization space?" (2408.08541) explores the potential information carried by non-canonical tokenizations in autoregressive LLMs. Typically, LLMs assume a fixed canonical sequence for tokenization due to the deterministic nature of tokenizers like Byte-Pair Encoding (BPE). However, this paper investigates the possibility of extracting additional signal from the variety of non-canonical tokenizations, which are often ignored in practice.

Tokenization Challenges

Canonical vs. Non-Canonical Tokenizations

Tokenization is a fundamental step in preparing text for LLMs: it breaks strings into subword tokens that the model processes and recombines during inference. Canonical tokenization is produced by deterministic tokenizers such as Byte-Pair Encoding (BPE), which merge token pairs according to a learned merge table to yield the so-called canonical sequence. Non-canonical tokenizations are alternative decompositions of the same text built from valid vocabulary tokens; they represent the text just as faithfully, but the deterministic tokenizer never emits them, so they are ignored in practice.
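As a concrete illustration, the short Python sketch below enumerates every segmentation of a string into vocabulary tokens; the toy vocabulary is made up for the example and is not the actual Llama2 vocabulary.

```python
# Toy illustration (not the paper's code): enumerate every way a string can be
# segmented into tokens from a subword vocabulary. A real tokenizer such as
# Llama2's has tens of thousands of tokens; this vocabulary is hypothetical.

def segmentations(text, vocab):
    """Yield every token sequence over `vocab` that decodes back to `text`."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        piece = text[:i]
        if piece in vocab:
            for rest in segmentations(text[i:], vocab):
                yield [piece] + rest

vocab = {"Tok", "en", "ens", "s", "Token", "Tokens"}
for tokens in segmentations("Tokens", vocab):
    print(tokens)
# Prints ['Tok', 'en', 's'], ['Tok', 'ens'], ['Token', 's'], ['Tokens'].
# [Tok, ens] is the canonical Llama2 split from the paper's example; the
# others are non-canonical tokenizations of the same text.
```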

Computational Complexity

The paper provides a rigorous analysis showing that determining the most likely tokenization, or computing the marginal probability over all possible tokenizations for a given string using an autoregressive LLM, is computationally hard. Specifically, these problems are NP-hard and #P-hard, respectively. This insight underscores the difficulty in exploring tokenization space beyond canonical sequences. Given the exponential number of possible tokenizations, direct computation or enumeration of these alternatives is infeasible, pointing researchers and practitioners towards approximation strategies.
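In notation chosen here for clarity (consistent with the paper's setup but not copied from it), the two hard problems can be written as follows.

```latex
% s: a string; t = (t_1, ..., t_n): a token sequence; decode(t): the text t represents.
% Finding the most likely tokenization of s under the LM is NP-hard:
\[
  t^{\star}(s) \;=\; \operatorname*{arg\,max}_{t \,:\, \mathrm{decode}(t) = s} \; P_{\theta}(t)
\]
% Computing the marginal probability of s over all tokenizations is #P-hard:
\[
  P_{\theta}(s) \;=\; \sum_{t \,:\, \mathrm{decode}(t) = s} P_{\theta}(t),
  \qquad
  P_{\theta}(t) \;=\; \prod_{i=1}^{n} P_{\theta}(t_i \mid t_{<i})
\]
```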

Methodology

Importance Sampling for Marginal Probability

To approximate the marginal probability over tokenizations, the authors implement a sequential importance sampling estimator with a one-step look-ahead proposal distribution. At each step, the proposal restricts the model's next-token distribution to tokens that are consistent with (i.e., prefixes of) the portion of the string not yet covered, effectively pruning invalid tokenizations and concentrating the sampling effort on feasible candidates.
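The Python sketch below illustrates the general idea of such a one-step look-ahead sampler; the `lm.next_token_logprobs` interface and the vocabulary handling are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def sample_tokenization(text, lm, vocab):
    """Draw one tokenization of `text` and return (tokens, log importance weight).

    Proposal: at each step, restrict the LM's next-token distribution to tokens
    that are prefixes of the not-yet-covered part of `text` (one-step look-ahead)
    and renormalize. The importance weight corrects for that restriction, so
    averaging exp(log_w) over many samples estimates the marginal probability
    of `text` over all tokenizations.
    """
    tokens, consumed, log_w = [], 0, 0.0
    while consumed < len(text):
        remaining = text[consumed:]
        logprobs = lm.next_token_logprobs(tokens)  # assumed: dict token -> log p(token | prefix)
        # Assumes `vocab` contains every single character, so `valid` is never empty.
        valid = [t for t in vocab if remaining.startswith(t)]
        log_z = math.log(sum(math.exp(logprobs[t]) for t in valid))
        # Sample from the renormalized proposal q(t) = p(t | prefix) / Z.
        r, acc, choice = random.random(), 0.0, valid[-1]
        for t in valid:
            acc += math.exp(logprobs[t] - log_z)
            if r <= acc:
                choice = t
                break
        tokens.append(choice)
        consumed += len(choice)
        log_w += log_z  # p(choice | prefix) / q(choice | prefix) = Z at every step
    return tokens, log_w
```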

Empirical Analysis

Through empirical investigation with models such as Gemma-2B, Llama2-7B, and Mamba-130M, the paper reports that marginal probability estimates are generally close to the canonical tokenization probabilities, despite the enormous space of non-canonical possibilities. Nevertheless, the experiments also show that non-canonical tokenizations carry significant signal for inference tasks, yielding performance gains on multiple-choice question answering benchmarks.

Results and Implications

Improved Downstream Task Performance

The findings demonstrate that the signal carried by non-canonical tokenizations, even when individually small, translates into improved performance on downstream tasks such as question answering. By using ensemble strategies that aggregate probabilities across sampled tokenizations, LLMs showed consistent improvements on benchmarks such as HellaSwag and SocialIQA compared to conventional scoring of the canonical sequence alone.
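A rough sketch of how such an aggregation can be wired up for multiple-choice scoring is shown below; `estimate_logprobs` is a placeholder for a sampler like the one sketched earlier, and none of these names come from the paper.

```python
import math

def score_candidates(prompt, candidates, estimate_logprobs, num_samples=32):
    """Pick the candidate whose aggregated probability over sampled tokenizations is highest.

    `estimate_logprobs(text, num_samples)` is assumed to return a list of
    per-sample log importance weights for `text` (e.g. from sample_tokenization).
    """
    best, best_score = None, -math.inf
    for answer in candidates:
        log_ws = estimate_logprobs(prompt + " " + answer, num_samples)
        # log-mean-exp aggregates the samples into one marginal estimate.
        m = max(log_ws)
        score = m + math.log(sum(math.exp(lw - m) for lw in log_ws) / len(log_ws))
        if score > best_score:
            best, best_score = answer, score
    return best
```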

Future Research Directions

The paper suggests promising pathways for further exploration into tokenization spaces, highlighting the potential utility of non-canonical sequences in enhancing LLM capabilities. This approach may redefine conventional views on model evaluation and inference efficiency, encouraging the development of models that can effectively harness non-canonical tokenizations.

Conclusion

In summary, while computational challenges limit direct use of non-canonical tokenizations, their potential contribution to improved model performance cannot be overlooked. The paper advocates for greater attention to tokenization diversity, urging advancements in approximation techniques and evaluation metrics that consider the broader signal present within tokenization spaces. This could lead to more robust and accurate LLMs capable of outperforming current standards with relatively straightforward modifications to inference strategies.
