Subword Tokenization in LLM Pretraining: Evaluating BPE and Unigram LM Methods
The paper "Byte Pair Encoding is Suboptimal for LLM Pretraining" by Kaj Bostrom and Greg Durrett addresses the critical task of subword tokenization in LLM (LM) pretraining, comparing the widely-used byte-pair encoding (BPE) method with the unigram LLM (LM) tokenization. Pretrained transformers have demonstrated significant efficacy in various natural language processing tasks, necessitating an exploration of different pretraining setups, particularly in regard to subword tokenization.
Tokenization Methods
Subword tokenization plays a pivotal role in handling the open-vocabulary problem, allowing language models to break rare or unseen words into smaller subword units. BPE, one of the most popular algorithms, builds its vocabulary greedily: starting from characters, it repeatedly merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. Unigram LM tokenization works in the opposite direction: it fits a unigram language model to the corpus, starts from a large candidate vocabulary, and iteratively prunes the tokens whose removal least reduces the corpus likelihood, as illustrated in the sketch below.
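To make the contrast concrete, here is a minimal, self-contained sketch of the greedy merge step at the heart of BPE training. The toy corpus, the merge budget, and the omission of end-of-word markers are illustrative simplifications for this summary, not the paper's actual setup.

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE trainer: greedily merge the most frequent adjacent symbol pair.

    corpus_words: dict mapping pretokenized words to their frequencies.
    Returns the list of learned merges (pairs of symbols), in order.
    Real implementations typically add an end-of-word marker; omitted here for brevity.
    """
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(word): count for word, count in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Illustrative toy corpus (word -> frequency); real BPE trains on a large corpus.
merges = bpe_train({"lower": 5, "lowest": 3, "newer": 6, "wider": 2}, num_merges=10)
print(merges)
```

Unigram LM training is not shown here; its key difference is that it scores whole segmentations probabilistically rather than committing to a fixed greedy merge order.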
Comparative Analysis
The authors undertake a comparative analysis of BPE and unigram LM methods by examining both tokenization quality and downstream task performance. The experimental design involves pretraining transformer models with an architecture akin to RoBERTa-base on English and Japanese corpora, varying only the tokenizer. The unigram LM tokenizer aligns more closely with morphological structure, producing tokens that reflect linguistic units such as prefixes and suffixes more faithfully than BPE.
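As an illustration of this kind of comparison, the sketch below trains one BPE and one unigram LM tokenizer with the SentencePiece library on the same corpus and prints how each segments a morphologically complex word. The corpus file `corpus.txt`, the vocabulary size, and the example word are assumptions chosen for illustration; this is not a reproduction of the paper's exact tooling or data.

```python
import sentencepiece as spm

# Train a BPE and a unigram LM tokenizer on the same (hypothetical) corpus file.
# 'corpus.txt' and vocab_size=8000 are illustrative choices, not the paper's setup.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

# Inspect how each tokenizer segments a morphologically complex word.
word = "unhappiness"
print("BPE:     ", bpe.encode(word, out_type=str))
print("Unigram: ", uni.encode(word, out_type=str))
```

In this kind of side-by-side inspection, the qualitative pattern the authors report is that unigram LM segmentations look more like prefix/stem/suffix splits than BPE's frequency-driven merges.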
Empirical Results
Quantitatively, the paper reports that unigram LM tokenization improves downstream task performance on benchmarks including English MNLI, SQuAD, and CoNLL NER, as well as Japanese TyDi QA. Notably, models pretrained with unigram LM tokenization show improvements of up to 10% over BPE-pretrained models on Japanese QA. These results point to unigram LM's more effective use of the vocabulary budget and its more compositional subword embeddings.
Implications and Future Directions
The findings suggest that the choice of tokenization method introduces a significant inductive bias that shapes the ultimate effectiveness of pretrained language models. Consequently, unigram LM tokenization may be a better default choice for future models. The implications extend to designing tokenizers that better capture morphological structure, improving generalization across typologically diverse languages.
For AI development, these insights can inform tokenization choices in model design, potentially leading to more structured and efficient language representations. Future research could probe tokenization effects in a broader set of languages and examine novel tokenization techniques or hybrid approaches that combine the strengths of BPE and unigram LM.