Toward a Theory of Tokenization in LLMs (2404.08335v1)

Published 12 Apr 2024 in cs.CL and cs.LG

Abstract: While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as a starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.

Uncovering the Value of Tokenization in Transformers for Modeling Markovian Data

Introduction

LLMs traditionally separate tokenization from neural network training, with tokenization serving as a critical preliminary step. This separation has prompted extensive research into tokenization's efficacy and its impact on LLM performance. This paper investigates tokenization from a theoretical perspective, examining its influence on transformer models trained on data drawn from Markov processes. By studying both the necessity and the effectiveness of tokenization, the paper provides a comprehensive analysis of its role in transformer-based language modeling.

Theoretical Investigation into Tokenization

The paper makes several key observations regarding the performance of transformers on Markovian data, highlighting the fundamental importance of tokenization. Key insights from the research include:

  • Empirical Observations: Transformers trained without tokenization on data drawn from $k^{\text{th}}$-order Markov processes tend to predict characters according to a unigram distribution, which prevents them from capturing the true data distribution and results in a higher cross-entropy loss than the optimal predictor achieves (see the numerical sketch after this list).
  • Impact of Tokenization: Introducing tokenization significantly improves transformers' ability to model Markovian data accurately. The paper shows that with appropriate tokenization, even simple unigram models over tokens can effectively approximate the probability of sequences, thereby achieving near-optimal cross-entropy loss.
  • Analysis of Tokenization Techniques: The research provides an in-depth analysis of various tokenization methods, including a theoretical examination of a toy tokenizer and practical tokenizers like LZW and BPE. It is demonstrated that tokenizers which efficiently capture patterns in the data allow for unigram models to achieve near-optimal modeling of sequence probabilities with much smaller dictionary sizes than the toy tokenizer.
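
To make the first bullet concrete, the following sketch (an illustration written for this summary, not code from the paper; the order-2 transition table and all constants are invented for demonstration) samples characters from a binary second-order Markov source and compares the cross-entropy of the best character-level unigram model with the source's entropy rate, i.e., the loss an optimal predictor attains.

```python
# A minimal sketch, assuming a hypothetical binary order-2 Markov source.
# It contrasts the per-character loss of the best unigram model over raw
# characters with the source's entropy rate (the optimal per-character loss).
import itertools
import numpy as np

rng = np.random.default_rng(0)

k, alphabet = 2, [0, 1]
p_next_is_1 = {                      # P(X_t = 1 | X_{t-2}, X_{t-1}); illustrative values
    (0, 0): 0.9, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.9,
}

def sample(n):
    """Draw a length-n character sequence from the order-2 source."""
    seq = [0, 1]
    for _ in range(n - 2):
        ctx = (seq[-2], seq[-1])
        seq.append(int(rng.random() < p_next_is_1[ctx]))
    return seq

def unigram_cross_entropy(seq):
    """Per-character cross-entropy (bits) of the best unigram model over characters."""
    p1 = np.mean(seq)
    return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

def entropy_rate():
    """Entropy rate (bits/char) via the stationary distribution of the pair chain."""
    states = list(itertools.product(alphabet, repeat=k))
    T = np.zeros((len(states), len(states)))
    for i, (a, b) in enumerate(states):
        for c in alphabet:
            p = p_next_is_1[(a, b)] if c == 1 else 1 - p_next_is_1[(a, b)]
            T[i, states.index((b, c))] += p
    pi = np.ones(len(states)) / len(states)
    for _ in range(10_000):            # power iteration to the stationary law
        pi = pi @ T
    cond_entropies = [
        -sum(q * np.log2(q) for q in (p_next_is_1[s], 1 - p_next_is_1[s]) if q > 0)
        for s in states
    ]
    return float(np.dot(pi, cond_entropies))

seq = sample(200_000)
print(f"character-level unigram cross-entropy ~ {unigram_cross_entropy(seq):.3f} bits/char")
print(f"source entropy rate (optimal)         ~ {entropy_rate():.3f} bits/char")
```

With this transition table the character marginal is uniform, so the character-level unigram model pays a full bit per character, while the entropy rate of the source is substantially lower; the gap between the two printed numbers is the loss that, per the paper, tokenizer-free transformers fail to close.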

Practical and Theoretical Implications

Implications for Language Modeling

The findings underscore the critical role of tokenization in the development of efficient LLMs, particularly in the context of transformer architectures. By facilitating a significant reduction in cross-entropy loss, tokenization enables transformers to model complex data distributions more accurately without requiring an increase in model complexity.
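
Informally, and glossing over the precise constants and conditions in the paper's theorems, the comparison can be summarized as follows, where $H(\pi)$ denotes the entropy of the source's stationary character marginal and $H_\infty$ its entropy rate:

$$
\underbrace{\tfrac{1}{n}\,\mathcal{L}_{\text{char-unigram}}}_{\text{no tokenization}} \;\approx\; H(\pi) \;\ge\; H_\infty,
\qquad
\underbrace{\tfrac{1}{n}\,\mathcal{L}_{\text{token-unigram}}}_{\text{suitable tokenizer}} \;\lesssim\; (1+\varepsilon)\, H_\infty .
$$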

Insights into Tokenizer Efficiency

The analysis of different tokenization strategies reveals the efficiency of data-driven tokenizers (e.g., LZW and BPE) in achieving low cross-entropy loss with smaller dictionaries. This efficiency is especially pronounced when compared to the toy tokenizer, highlighting the importance of dictionary size and tokenization strategy in model performance.
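
As a rough illustration of this point, the sketch below (again a simplified construction for this summary, not the paper's: it reuses the hypothetical order-2 source from the earlier snippet, learns a plain LZW-style dictionary on a training sample, and encodes a test sample by greedy longest match) fits a unigram model over the resulting tokens and reports its loss per character, which can be compared against the character-level unigram baseline and the entropy rate printed above.

```python
# A rough sketch of a data-driven, LZW-style tokenizer followed by a unigram
# model over tokens. The source and all constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
p_next_is_1 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}

def sample(n):
    """Draw a length-n character string from the hypothetical order-2 source."""
    seq = [0, 1]
    for _ in range(n - 2):
        seq.append(int(rng.random() < p_next_is_1[(seq[-2], seq[-1])]))
    return "".join(map(str, seq))

def lzw_dictionary(text, max_size):
    """Learn an LZW-style dictionary of substrings from a training sample."""
    dictionary, phrase = {"0", "1"}, ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            if len(dictionary) < max_size:
                dictionary.add(phrase + ch)
            phrase = ch
    return dictionary

def encode(text, dictionary):
    """Greedily encode text into the longest matching tokens from the dictionary."""
    tokens, i, max_len = [], 0, max(map(len, dictionary))
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens

train, test = sample(200_000), sample(200_000)
vocab = lzw_dictionary(train, max_size=256)
tokens = encode(test, vocab)

# Maximum-likelihood unigram model over tokens, evaluated per *character*.
counts = {}
for t in tokens:
    counts[t] = counts.get(t, 0) + 1
total = sum(counts.values())
bits = -sum(c * np.log2(c / total) for c in counts.values())
print(f"token-level unigram cross-entropy ~ {bits / len(test):.3f} bits/char "
      f"(dictionary size {len(vocab)})")
```

Because multi-character tokens absorb the local Markov structure, the token frequencies alone typically push the per-character loss well below the character-level unigram baseline even with a modest dictionary; the paper's analysis formalizes this mechanism for LZW and BPE dictionaries and characterizes the dictionary sizes needed to approach the entropy rate.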

Future Directions

The paper opens up several avenues for future research, including:

  • Exploration of Other Metrics: While the focus of this research is on cross-entropy loss, future work could explore tokenization's impact on other metrics such as BLEU or ROUGE, which are relevant to tasks like machine translation.
  • Finite Sample Considerations: Further investigation into the finite-sample behavior of transformers and the impact of tokenization on model training and generalization would be valuable.

Concluding Remarks

This paper provides a rigorous theoretical examination of tokenization's role in transformer-based language modeling, particularly when dealing with Markovian data. The research highlights the necessity of tokenization and its effectiveness in improving model performance, offering insights that could guide the future development of more efficient LLMs.

Authors (3)
  1. Nived Rajaraman (21 papers)
  2. Jiantao Jiao (83 papers)
  3. Kannan Ramchandran (129 papers)
Citations (15)