Exploring Zero-Shot Tokenizer Transfer for LLMs
Introduction to Tokenizer Transfer Challenges
Large language models (LMs), the engines behind many of today's AI achievements, carry an inherent limitation tied to their tokenizer, the component that splits text into the smaller units, called tokens, that the model actually processes. Once an LM has been trained with a specific tokenizer, swapping in a different one is akin to fitting a square peg into a round hole. And because most LMs are trained primarily on English text, their tokenizers handle other languages or specialized domains such as programming code inefficiently, and performance suffers.
Prior research has explored adapting LMs to new tokenizers by re-learning the LM's embeddings, the vectors that represent each token. That process typically relies on continued training, which demands substantial data and compute. The paper discussed here proposes a different goal: Zero-Shot Tokenizer Transfer (ZeTT), the ability to swap in an arbitrary new tokenizer on the fly, without any training on data tokenized with it.
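To make the limitation concrete, here is a minimal sketch of the mismatch ZeTT has to resolve. It uses Hugging Face's `transformers`, with `gpt2` and `bert-base-uncased` standing in for any two models with different vocabularies; neither checkpoint is prescribed by the paper.

```python
# A minimal sketch of the underlying mismatch. The checkpoints ("gpt2",
# "bert-base-uncased") are placeholders for any two models whose tokenizers
# have different vocabularies.
from transformers import AutoTokenizer

orig_tok = AutoTokenizer.from_pretrained("gpt2")
new_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers segment the same text very differently."
orig_ids = orig_tok(text, add_special_tokens=False)["input_ids"]
new_ids = new_tok(text, add_special_tokens=False)["input_ids"]

# The same string maps to unrelated id sequences, so the original embedding
# matrix (indexed by the original ids) cannot make sense of ids produced by
# the new tokenizer: new embeddings have to come from somewhere.
print(orig_ids)
print(new_ids)
```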
The Innovative Approach: Hypernetworks
To enable ZeTT, the researchers introduce a hypernetwork: a neural network that, given an arbitrary tokenizer, predicts the embedding parameters the LM needs for that tokenizer's vocabulary. By learning to produce embeddings that work well across many tokenizers and languages during training, the hypernetwork decouples the LM from the single tokenizer it was trained with. This removes most of the retraining time and data that adapting an LM to a new domain or language normally requires.
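The following is a simplified PyTorch sketch of that idea, not the paper's exact architecture: each token of the new tokenizer is decomposed into pieces of the original vocabulary, and a small transformer over those pieces predicts an embedding for the new token. The layer sizes, the mean-pooling, and the single (input-only) embedding head are assumptions made for the sketch.

```python
# Simplified hypernetwork sketch: predict an embedding for each token of a
# *new* tokenizer from its decomposition into original-vocabulary pieces.
import torch
import torch.nn as nn

class EmbeddingHypernetwork(nn.Module):
    def __init__(self, orig_vocab_size: int, d_model: int = 512, n_layers: int = 3):
        super().__init__()
        # Embeddings for pieces drawn from the *original* tokenizer's vocabulary.
        self.piece_embed = nn.Embedding(orig_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Map the pooled representation to the LM's embedding space
        # (assumed here to share the hypernetwork's width).
        self.to_embedding = nn.Linear(d_model, d_model)

    def forward(self, piece_ids: torch.Tensor, piece_mask: torch.Tensor) -> torch.Tensor:
        # piece_ids:  (num_new_tokens, max_pieces), each new token decomposed
        #             into original-vocabulary pieces, padded to max_pieces
        # piece_mask: same shape, True where a piece is real rather than padding
        h = self.encoder(self.piece_embed(piece_ids),
                         src_key_padding_mask=~piece_mask)
        # Mean-pool over the real pieces to get one vector per new token.
        pooled = (h * piece_mask.unsqueeze(-1)).sum(dim=1) / piece_mask.sum(dim=1, keepdim=True)
        return self.to_embedding(pooled)  # (num_new_tokens, d_model)

# Toy usage: 10 "new" tokens, each decomposed into 4 original-vocabulary pieces.
hypernet = EmbeddingHypernetwork(orig_vocab_size=32000)
piece_ids = torch.randint(0, 32000, (10, 4))
piece_mask = torch.ones(10, 4, dtype=torch.bool)
new_input_embeddings = hypernet(piece_ids, piece_mask)  # shape: (10, 512)
```

In practice the decomposition step would run the original tokenizer over each new token's surface string; the random ids above just keep the sketch self-contained.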
Achieving Tokenizer Transfer: Impressive Results
The paper put this approach to the test with experiments on both an encoder LM (XLM-R) and a decoder LM (Mistral-7B). The results were promising:
- With hypernetwork-predicted embeddings, the transferred models came close to the original LMs' performance, losing only a small amount of accuracy across a diverse set of tasks.
- When transferring to tokenizers tailored to specific languages, the same inputs were encoded into shorter token sequences, yielding up to 16% faster inference (a quick way to see this effect is sketched after this list).
- A brief round of continued training on fewer than 1 billion tokens closed any remaining performance gap.
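To illustrate why shorter sequences translate into faster inference, the snippet below compares how many tokens an English-centric tokenizer and a multilingual one need for the same non-English sentence. The checkpoints are common public ones chosen purely for illustration; they are not the model and tokenizer pairs evaluated in the paper.

```python
# Compare sequence lengths under two tokenizers for the same non-English text.
# "gpt2" (English-centric) and "xlm-roberta-base" (multilingual) are
# illustrative choices, not the pairs evaluated in the paper.
from transformers import AutoTokenizer

text = "Die Verarbeitung natürlicher Sprache wird immer leistungsfähiger."
for name in ["gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n_tokens} tokens")

# Fewer tokens per input means fewer decoding steps and less attention
# computation, which is where the reported inference-time savings come from.
```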
Practical Implications and Beyond
Beyond the benchmark numbers, this research has significant implications for deploying LMs in practice:
- Versatility: LMs can now adapt seamlessly to varied linguistic materials without extensive retraining.
- Cost-efficiency: Reduced need for data and compute resources when extending model capabilities across tokenizers and languages.
- Sustainability in AI: More efficient use of resources aligns with contemporary needs for environmentally and economically sustainable AI technologies.
Speculations on Future Developments
As AI continues to make inroads into global markets and multilingual environments, the flexibility that ZeTT offers could play a vital role in deploying LMs universally. Future work might refine the hypernetwork to close the small remaining gaps observed after tokenizer transfer. Integrating this technology could also simplify model adaptation for developers, opening new avenues for personalized and localized AI applications without heavy computational overhead.
In conclusion, the exploration of Zero-Shot Tokenizer Transfer represents a substantial step forward in our ongoing journey with AI and language understanding. It presents a glimpse into a future where LLMs transcend linguistic barriers effortlessly and more sustainably, empowered by advanced techniques like hypernetworks.