- The paper introduces Picky BPE, which integrates an Intersection over Self metric to refine vocabulary during tokenizer training.
- It removes under-trained tokens while preserving or enhancing text compression and translation quality across various vocabulary sizes.
- Experiments on machine translation show that Picky BPE matches or exceeds standard BPE across vocabulary sizes, indicating robustness that should carry over to other NLP applications.
Efficient Vocabulary Refinement in Tokenizer Training through Picky BPE
Tokenization is a pivotal yet often understudied component impacting the efficiency and performance of LLMs. Byte-Pair Encoding (BPE) is widely utilized due to its simplicity and reliability. However, traditional BPE suffers from issues such as the presence of under-trained tokens and sub-optimal compression, which adversely affect downstream tasks. The paper "BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training" by Chizhov et al. introduces Picky BPE, an enhanced BPE variant designed to refine vocabulary during tokenizer training, addressing these shortcomings.
Methodological Innovation
The central innovation of Picky BPE is the integration of a stepwise vocabulary refinement process using an Intersection over Self (IoS) metric. The IoS metric quantifies the relative usage of a token within larger merges compared to its overall frequency, enabling the identification and elimination of intermediate tokens—tokens that are rarely used independently but frequently appear within larger tokens.
The IoS metric is formalized as:

$$\mathrm{IoS}(x_1 \mid x_1, x_2) = \frac{f_p(x_1, x_2)}{f_t(x_1)}$$
Here, x_1 and x_2 are the tokens being merged, f_t(x_1) is the frequency of x_1 as an individual token, and f_p(x_1, x_2) is the frequency of the pair. A high IoS value means that most occurrences of x_1 happen inside the newly merged token; if it exceeds a predefined threshold T, the token is classified as intermediate and considered for removal.
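As a minimal illustration (not the authors' reference implementation), the IoS check reduces to a ratio of two corpus counts; the counters and the default threshold of 0.9 below are assumptions for the sketch:

```python
from collections import Counter

def is_intermediate(x1: str, x2: str,
                    token_freq: Counter, pair_freq: Counter,
                    threshold: float = 0.9) -> bool:
    """Flag x1 as intermediate if most of its occurrences happen
    inside the pair (x1, x2) that was just merged."""
    if token_freq[x1] == 0:
        return False
    ios = pair_freq[(x1, x2)] / token_freq[x1]  # IoS(x1 | x1, x2)
    return ios >= threshold
```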
The algorithm proceeds as follows:
- Initialization: The text is split into characters, initializing the vocabulary with unique symbols.
- Iteration: The most frequent token pairs are iteratively merged. After each merge, the IoS metric is computed for each of the two constituent tokens of the merge.
- Refinement: If the IoS value for a token surpasses the threshold T, the token is removed from the vocabulary.
Picky BPE also maintains an event order array recording the sequence of merges and removals, so that tokenization at inference replays exactly the operations performed during training; a minimal sketch of this loop follows.
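The following is a compact, self-contained sketch of the training loop under simplifying assumptions (greedy merge of the single most frequent pair, no re-splitting of removed tokens in the corpus, and hypothetical function and variable names); it is meant to convey the control flow, not the paper's exact implementation:

```python
from collections import Counter

def picky_bpe_train(corpus: list[list[str]], vocab_size: int, T: float = 0.9):
    """Simplified Picky BPE: greedy merges plus IoS-based removal of
    intermediate tokens, all recorded in one ordered event log."""
    vocab = {sym for seq in corpus for sym in seq}
    events = []  # ordered log of ("merge", (a, b)) and ("remove", token)

    while len(vocab) < vocab_size:
        tokens, pairs = Counter(), Counter()
        for seq in corpus:
            tokens.update(seq)
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), pair_freq = pairs.most_common(1)[0]
        new_token = a + b

        # Apply the merge across the corpus.
        for i, seq in enumerate(corpus):
            merged, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == a and seq[j + 1] == b:
                    merged.append(new_token)
                    j += 2
                else:
                    merged.append(seq[j])
                    j += 1
            corpus[i] = merged
        vocab.add(new_token)
        events.append(("merge", (a, b)))

        # Picky step: IoS(x | a, b) = pair_freq / token_freq; remove
        # constituents that occur almost exclusively inside the merge.
        # (A full implementation would also re-split any remaining
        # occurrences of a removed token into existing tokens.)
        for x in {a, b}:
            if len(x) > 1 and tokens[x] > 0 and pair_freq / tokens[x] >= T:
                vocab.discard(x)
                events.append(("remove", x))

    return vocab, events
```

Because removals free vocabulary slots, the loop keeps merging until the target size is reached, filling those slots with higher-quality tokens; this is the core intuition behind refining the vocabulary during training rather than trimming it afterwards.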
Experimental Validation
The efficacy of Picky BPE was evaluated on a series of machine translation (MT) tasks: English to German (EN–DE), German to Estonian (DE–ET), and Ukrainian to Estonian (UK–ET). Various threshold values T were tested to determine their impact on downstream performance.
The translation tasks demonstrated that Picky BPE achieves results comparable to or better than traditional BPE in conditions with constrained vocabulary sizes:
- EN–DE Translation: With a vocabulary size of 8192, Picky BPE achieved similar or slightly better BLEU and COMET scores than traditional BPE across different thresholds.
- Large Vocabulary Sizes: For larger vocabularies (16384, 32768, and 65536), Picky BPE maintained its performance without degradation, suggesting robustness across different settings.
Token Quality and Compression
Evaluation metrics also included token frequency distribution, token length, and the proportion of word-initial tokens. Findings indicated that Picky BPE effectively removes low-frequency under-trained tokens, replacing them with higher-quality tokens while preserving or even improving text compression efficiency. For instance:
- Low-frequency Token Removal: Removed tokens typically exhibited lower L2 norms of their embeddings, a sign of under-training, while added tokens were more frequent and had higher embedding norms (see the diagnostic sketch after this list).
- Text Compression: Compression rates were either maintained or slightly improved with Picky BPE, contrary to some post-processing trimming methods that degraded compression.
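In the same spirit as the paper's embedding analysis, under-trained tokens can be spotted by their embedding norms. Below is a small diagnostic sketch, assuming an already-trained embedding matrix and an id-to-token mapping (both names are hypothetical):

```python
import numpy as np

def flag_undertrained(embeddings: np.ndarray,
                      id_to_token: dict[int, str],
                      quantile: float = 0.05) -> list[str]:
    """Return tokens whose embedding L2 norm falls in the lowest
    quantile, a common symptom of under-training."""
    norms = np.linalg.norm(embeddings, axis=1)  # one L2 norm per token
    cutoff = np.quantile(norms, quantile)
    return [id_to_token[i] for i in np.where(norms <= cutoff)[0]]
```

Tokens flagged this way correspond to the low-norm population that Picky BPE tends to remove during training, consistent with the observation above.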
Implications and Future Directions
The introduction of Picky BPE addresses key limitations of the traditional BPE method by refining vocabulary during training, mitigating the presence of under-trained tokens, and enhancing the quality and efficiency of token representation. This has significant implications for both practical applications and theoretical advancements in NLP:
- Enhanced Safety and Reliability: By reducing under-trained tokens, Picky BPE contributes to safer and more reliable LLMs, potentially mitigating issues like hallucinations.
- Scalability: Because its performance holds at larger vocabulary sizes even under constrained data, Picky BPE is a promising fit for larger LLMs and more extensive training data regimes.
Future research could explore the integration of Picky BPE with various LLM architectures to further validate its utility and investigate additional downstream tasks beyond MT. Expanding evaluations to include a more diverse set of languages and scripts would also provide deeper insights into the generalizability of Picky BPE.
In conclusion, Picky BPE represents a methodologically sound and practically beneficial refinement to existing tokenization techniques, poised to contribute to the development of more efficient and effective NLP models.