
Byte Pair Encoding is Suboptimal for Language Model Pretraining (2004.03720v2)

Published 7 Apr 2020 in cs.CL

Abstract: The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.

Subword Tokenization in Language Model Pretraining: Evaluating BPE and Unigram LM Methods

The paper "Byte Pair Encoding is Suboptimal for LLM Pretraining" by Kaj Bostrom and Greg Durrett addresses the critical task of subword tokenization in LLM (LM) pretraining, comparing the widely-used byte-pair encoding (BPE) method with the unigram LLM (LM) tokenization. Pretrained transformers have demonstrated significant efficacy in various natural language processing tasks, necessitating an exploration of different pretraining setups, particularly in regard to subword tokenization.

Tokenization Methods

Subword tokenization plays a pivotal role in handling the open-vocabulary problem, allowing language models to break rare words into subword units. BPE, one of the most popular algorithms, builds its vocabulary bottom-up by greedily merging the most frequent adjacent symbol pairs in the corpus. Unigram LM tokenization instead works top-down: it fits a unigram language model over a large candidate vocabulary and iteratively prunes the tokens whose removal least reduces the likelihood of the corpus.
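
To make the contrast concrete, below is a minimal sketch of BPE's greedy merge procedure in Python. The toy word list and the number of merges are illustrative assumptions, not values from the paper; real implementations additionally handle word boundaries and frequency-weighted training corpora.

```python
# Minimal sketch of BPE vocabulary construction via greedy bigram merging.
from collections import Counter

def bpe_merges(words, num_merges):
    # Start with each word as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy choice: most frequent bigram
        merges.append(best)
        # Apply the chosen merge everywhere before counting again.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

print(bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=5))
```

Because each merge is chosen only by local bigram frequency, the resulting vocabulary can commit to unintuitive token boundaries early on, which is the greedy-construction issue the paper contrasts with unigram LM's likelihood-based pruning.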

Comparative Analysis

The authors undertake a comparative analysis of the BPE and unigram LM methods by examining their effect on tokenization quality and downstream task performance. The experimental design pretrains identical transformer masked LMs, with architectures akin to RoBERTa-base, on English and Japanese corpora, varying only the tokenizer. The unigram LM tokenizer exhibits a stronger alignment with morphological structure, producing tokens that reflect linguistic units such as prefixes and suffixes more faithfully than BPE.
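
As a hedged illustration of this qualitative comparison (not the paper's exact training setup), the sketch below uses the SentencePiece library, which implements both algorithms, to train the two tokenizer types on the same corpus and compare their segmentations. The file name corpus.txt, the vocabulary size, and the probe word are assumptions made for the example.

```python
# Sketch: train a BPE and a unigram LM tokenizer on the same corpus with
# SentencePiece, then compare how each segments a morphologically complex word.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",                 # one sentence per line (assumed to exist)
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,                    # illustrative vocabulary size
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

word = "unhappiness"
print("BPE:       ", bpe.encode(word, out_type=str))
print("Unigram LM:", uni.encode(word, out_type=str))
```

Inspecting segmentations of affixed words in this way is a lightweight proxy for the morphological-alignment analysis described above.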

Empirical Results

Quantitatively, the paper reports that unigram LM tokenization matches or improves downstream task performance across standard benchmarks, including English MNLI, SQuAD, and CoNLL NER, as well as Japanese TyDi QA. Notably, models pretrained with unigram LM tokenization show a performance gain of up to 10% over BPE on the Japanese QA task. This underscores unigram LM's more efficient use of the vocabulary budget and its support for more compositional subword embeddings.

Implications and Future Directions

The findings suggest that the choice of tokenization method contributes a significant inductive bias that affects the ultimate effectiveness of pretrained language models. Consequently, unigram LM tokenization may be the preferable default for future models. Theoretically, the results motivate tokenization schemes that better capture morphological structure, which could enhance generalization across typologically diverse languages.

For AI development, these insights can inform tokenization choices in model design, potentially leading to more structured and efficient language representations. Future research could probe tokenization effects in a broader set of languages and examine novel or hybrid approaches that combine the strengths of BPE and unigram LM.

Authors (2)
  1. Kaj Bostrom (7 papers)
  2. Greg Durrett (117 papers)
Citations (178)