Introduction
Tokenization, the process of transforming raw text into tokens, is a critical yet often overlooked aspect of LLM development. How a text is tokenized has a considerable effect on the model's subsequent learning and generation capabilities, influencing generation speed, effective context size, and memory usage. Despite this, the tokenizer is frequently not optimized, or is left unchanged when fine-tuning a pre-trained model, which may be suboptimal for specific domains or tasks.
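To make this concrete, the short sketch below compares how two publicly available tokenizers segment the same code snippet. It assumes the `transformers` and `tiktoken` packages are installed; the tokenizers and the snippet are illustrative choices, not the paper's setup. Fewer tokens for the same text means more code fits in the context window and fewer decoding steps are needed at generation time.

```python
# Sketch: comparing token counts of two public tokenizers on the same snippet.
# The tokenizers and the code snippet are illustrative only.
import tiktoken
from transformers import AutoTokenizer

code = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)\n"

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")   # ~50k BPE vocabulary
cl100k = tiktoken.get_encoding("cl100k_base")      # GPT-4's encoding, ~100k vocabulary

gpt2_ids = gpt2_tok.encode(code)
cl100k_ids = cl100k.encode(code)

# Fewer tokens for the same bytes = better compression: longer effective
# context and fewer autoregressive steps per generated program.
print(f"UTF-8 bytes:   {len(code.encode('utf-8'))}")
print(f"GPT-2 tokens:  {len(gpt2_ids)}")
print(f"cl100k tokens: {len(cl100k_ids)}")
```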
Tokenization and its Impact
In this analysis, the authors examine the impact of tokenizer characteristics (vocabulary size, pre-tokenization regular expressions, and training data) on model performance, with a focus on code generation tasks. The performance of tokenizers specialized for code, mainly Byte-Pair Encoding (BPE) variants, is investigated through ablation studies.
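As a rough illustration of where these three knobs appear in practice, here is a minimal sketch of training a BPE tokenizer with the Hugging Face `tokenizers` library. The corpus file, 32k vocabulary size, and special token are hypothetical placeholders, not the paper's configuration.

```python
# Minimal BPE tokenizer training sketch (Hugging Face `tokenizers` library).
# Corpus path, vocabulary size, and special token are placeholders.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())

# Knob 1: pre-tokenization. Byte-level splitting is used here; regex-based
# schemes like those studied in the paper would plug in at this step instead.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Knob 2: vocabulary size. Knob 3: the training data passed to train().
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<|endoftext|>"])
tokenizer.train(files=["in_domain_code_corpus.txt"], trainer=trainer)

tokenizer.save("code_bpe_32k.json")
```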
Findings and Recommendations
The paper's ablations yield several notable findings:
- Adjustments to the pre-tokenization scheme can substantially affect both generation speed and compression metrics (see the metric sketch after this list).
- Vocabulary size, perhaps surprisingly, shows a limited effect on coding performance, challenging the common belief that larger vocabularies might hinder a model's ability to generalize.
- Fine-tuning experiments on existing models indicate that the tokenizer can be changed with minimal impact on performance, provided fine-tuning uses a sufficiently large dataset (over 50 billion tokens).
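To pin down what "compression" means in these findings, the sketch below shows two metrics commonly used to compare tokenizers: average UTF-8 bytes covered per token, and sequence length normalized against a baseline tokenizer. The function names and the corpus argument are illustrative, not taken from the paper.

```python
# Sketch: two simple compression metrics for comparing tokenizers.
# `encode` is any callable mapping a string to a list of token ids;
# `docs` is a hypothetical list of documents, not the paper's evaluation set.
from typing import Callable, List

def bytes_per_token(encode: Callable[[str], List[int]], docs: List[str]) -> float:
    """Average number of UTF-8 bytes covered by one token (higher = better compression)."""
    total_bytes = sum(len(d.encode("utf-8")) for d in docs)
    total_tokens = sum(len(encode(d)) for d in docs)
    return total_bytes / total_tokens

def normalized_sequence_length(encode: Callable[[str], List[int]],
                               baseline_encode: Callable[[str], List[int]],
                               docs: List[str]) -> float:
    """Token count relative to a baseline tokenizer (lower = better compression)."""
    return sum(len(encode(d)) for d in docs) / sum(len(baseline_encode(d)) for d in docs)
```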
Tokenization Optimization Strategies
The paper proposes methods for increasing tokenizer compression while maintaining or enhancing downstream performance. Specifically:
- Using in-domain data to train tokenizers can improve domain compression.
- Regular-expression-based pre-tokenization schemes can positively influence both compression and downstream performance. For instance, GPT-4's pre-tokenization regex strikes a strong balance between compression efficiency and downstream performance, closely followed by the Punct variant (a sketch of regex-based pre-tokenization follows this list).
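The sketch below illustrates the mechanism: a pre-tokenization regex first carves text into chunks, and BPE merges are then learned and applied only within those chunks, never across them. The pattern shown is a simplified, illustrative one that mimics the flavor of GPT-4-style rules (digits split into runs of at most three, punctuation separated, leading spaces attached to words); it is not the exact regex used by GPT-4 or by the Punct variant.

```python
# Illustrative pre-tokenization: the regex splits text into chunks, and BPE
# merges are learned/applied only within a chunk, never across chunk borders.
# NOTE: this pattern is a simplified stand-in, not the real GPT-4 or Punct regex.
import regex  # third-party `regex` module: supports \p{...} Unicode categories

ILLUSTRATIVE_PATTERN = r" ?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+|\s+"

def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-tokens: words (with leading space), short digit runs,
    punctuation runs, and whitespace."""
    return regex.findall(ILLUSTRATIVE_PATTERN, text)

print(pre_tokenize("x = foo(12345) + bar_baz;"))
# ['x', ' =', ' foo', '(', '123', '45', ')', ' +', ' bar', '_', 'baz', ';']
```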
Conclusion
The implications for practitioners are clear: re-evaluating the tokenizer when fine-tuning pre-trained LLMs can lead to significant performance gains without undue cost. The findings and methodologies laid out in this paper are critical stepping stones towards efficiently leveraging LLMs for domain-specific applications.