Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning (2505.09738v1)

Published 14 May 2025 in cs.CL and cs.AI

Abstract: Pretrained LLMs are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges, and standard methods to overcome it often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.

The paper "Achieving Tokenizer Flexibility in LLMs through Heuristic Adaptation and Supertoken Learning" (Sharthak et al., 14 May 2025 ) addresses the significant challenge of "tokenizer lock-in" in LLMs. LLMs are typically trained with a fixed tokenizer, which can lead to inefficiencies like token fragmentation and poor performance when applied to text with different characteristics, such as multilingual data or specialized domains (code, math). Adapting an LLM to a new tokenizer traditionally requires extensive and computationally expensive continued pre-training or fine-tuning. The authors propose a framework called TokenAdapt to enable more efficient tokenizer transplantation.

TokenAdapt focuses on replacing the original tokenizer entirely with a new, target-specific one and effectively initializing the embeddings for tokens present in the new vocabulary but not in the original. This initialization is crucial for preserving the model's pre-trained knowledge and minimizing subsequent training requirements, ideally enabling good zero-shot performance.

The core of the TokenAdapt framework is a novel hybrid heuristic strategy for synthesizing the embeddings of these unique new tokens. For a new token string, its embedding is created by combining two estimates:

  1. Local Heuristic (Compositional): This method breaks down the new token string using the original tokenizer. It then uses an external text embedding model (an "auxiliary embedding space") to calculate semantic similarities between the full new token string and each of its constituent sub-token strings generated by the old tokenizer. These similarities, combined with a length normalization factor for each sub-token, are used as weights to combine the original embeddings of the sub-tokens. The idea is to reconstruct the new token's meaning from its semantic parts as understood by the original tokenizer.
  2. Global Heuristic (Neighborhood): This method uses the same external auxiliary embedding model to find the k nearest neighbors of the new token string within the entire original vocabulary based on semantic similarity. It then weights the original embeddings of these neighboring tokens based on their similarity scores. This captures the new token's semantic relationships to existing tokens in the original embedding space.

The final embedding for a new token is a weighted combination of these local and global estimates, controlled by a hyperparameter λ ∈ [0, 1] (the hybrid combination equation in the paper). If only one heuristic yields a valid result (e.g., the new token cannot be decomposed by the old tokenizer, or no sufficiently similar neighbors are found), that estimate is used alone. If neither is valid, the embedding is initialized randomly. An interesting empirical finding noted by the authors is that applying a minimum similarity threshold during the global heuristic calculation surprisingly increased perplexity, suggesting a more nuanced interaction in the embedding space than simple filtering allows.
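
A minimal sketch of how such a hybrid initialization could be implemented is shown below. It assumes a NumPy embedding matrix for the original model, a sentence-transformers-style auxiliary encoder with an `encode` method, and a precomputed array `old_vocab_aux_vecs` of auxiliary embeddings aligned with the rows of the original embedding matrix; the function names, the length-normalization scheme, and the random fallback scale are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np


def _cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def local_estimate(token_str, old_tokenizer, old_emb, aux_encoder):
    """Compositional estimate: recombine old sub-token embeddings, weighted by
    semantic similarity to the full string and by sub-token length."""
    sub_ids = old_tokenizer.encode(token_str, add_special_tokens=False)
    if not sub_ids:
        return None
    sub_strs = [old_tokenizer.decode([i]) for i in sub_ids]
    full_vec = aux_encoder.encode(token_str)
    sims = np.array([max(_cosine(full_vec, aux_encoder.encode(s)), 0.0) for s in sub_strs])
    lengths = np.array([len(s) for s in sub_strs], dtype=float)
    weights = sims * (lengths / lengths.sum())  # similarity x length normalization
    if weights.sum() <= 0:
        return None
    weights /= weights.sum()
    return (weights[:, None] * old_emb[sub_ids]).sum(axis=0)


def global_estimate(token_str, old_emb, old_vocab_aux_vecs, aux_encoder, k=10):
    """Neighborhood estimate: similarity-weighted average of the k original
    tokens closest to the new string in the auxiliary embedding space."""
    query = aux_encoder.encode(token_str)
    sims = np.array([_cosine(query, v) for v in old_vocab_aux_vecs])
    top = np.argsort(-sims)[:k]
    weights = np.clip(sims[top], 0.0, None)
    if weights.sum() <= 0:
        return None
    weights /= weights.sum()
    return (weights[:, None] * old_emb[top]).sum(axis=0)


def hybrid_init(token_str, old_tokenizer, old_emb, old_vocab_aux_vecs, aux_encoder,
                k=10, lam=0.5):
    """Blend the two estimates with weight lam; fall back to whichever is valid,
    or to a small random vector if neither applies."""
    e_local = local_estimate(token_str, old_tokenizer, old_emb, aux_encoder)
    e_global = global_estimate(token_str, old_emb, old_vocab_aux_vecs, aux_encoder, k)
    if e_local is not None and e_global is not None:
        return lam * e_local + (1.0 - lam) * e_global
    if e_local is not None:
        return e_local
    if e_global is not None:
        return e_global
    return np.random.default_rng(0).normal(scale=0.02, size=old_emb.shape[1])
```

A closure over this function (e.g., `init_fn = lambda tok: hybrid_init(tok, old_tokenizer, old_emb, old_vocab_aux_vecs, aux_encoder)`) could then serve as the per-token initializer in the transplantation loop sketched later.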

The overall TokenAdapt workflow involves three phases (a code sketch follows the list below):

  1. Transfer Shared Tokens: Embeddings for tokens present in both the original and new vocabularies are directly copied.
  2. Synthesize Unique Tokens: For each token unique to the new tokenizer, the hybrid local+global heuristic is applied to generate its embedding. These new embeddings are added to the new embedding matrix. This process handles both input and output embedding layers, respecting weight tying configurations if applicable.
  3. Finalize Model: The model's original embedding layers are replaced with the newly constructed embedding matrix, and weight tying is re-applied if necessary.
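
The sketch below shows how these three phases could be wired together for a Hugging Face-style causal LM. Here `init_fn` stands in for the hybrid heuristic above, and the handling of an untied output head is simplified; treat this as an assumption-laden outline rather than the authors' code.

```python
import torch


def transplant_tokenizer(model, old_tokenizer, new_tokenizer, init_fn):
    """Three-phase transplantation sketch. `init_fn(token_str)` is assumed to
    return a 1-D embedding vector for a token unique to the new vocabulary."""
    old_emb = model.get_input_embeddings().weight.data.clone()
    old_vocab = old_tokenizer.get_vocab()   # token string -> old id
    new_vocab = new_tokenizer.get_vocab()   # token string -> new id

    new_emb = torch.empty(len(new_vocab), old_emb.shape[1], dtype=old_emb.dtype)
    for tok, new_id in new_vocab.items():
        if tok in old_vocab:
            # Phase 1: tokens shared by both vocabularies are copied directly.
            new_emb[new_id] = old_emb[old_vocab[tok]]
        else:
            # Phase 2: tokens unique to the new vocabulary are synthesized
            # with the hybrid local+global heuristic.
            new_emb[new_id] = torch.as_tensor(init_fn(tok), dtype=old_emb.dtype)

    # Phase 3: swap in the new matrix and restore weight tying if the model
    # uses it. (An untied LM head would be rebuilt the same way from the
    # original output matrix; omitted here to keep the sketch short.)
    model.resize_token_embeddings(len(new_vocab))
    model.get_input_embeddings().weight.data.copy_(new_emb)
    if getattr(model.config, "tie_word_embeddings", False):
        model.tie_weights()
    return model
```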

In addition to TokenAdapt, the paper also explores a complementary approach to improve the new tokenizer itself: Supertoken Learning. This method trains tokenizers (specifically BPE) to learn multi-word units ("supertokens") that enhance compression and reduce fragmentation, which is particularly beneficial for complex languages or domains. The technique involves a probabilistic pre-tokenization step during training, sketched below, in which texts are stochastically chunked and a special separator string is inserted between chunks. The tokenizer is then trained to split on this separator before applying standard byte-level and BPE merges, effectively encouraging merges within the probabilistically defined multi-word chunks.
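
A rough sketch of this probabilistic chunking step, using the Hugging Face `tokenizers` library, might look as follows. The separator character, the chunk-length distribution, and the exact pre-tokenizer configuration are illustrative assumptions, not the paper's recipe.

```python
import random

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

SEP = "\uE000"  # illustrative private-use separator; the paper's choice may differ


def stochastic_chunk(text, min_words=1, max_words=4, rng=None):
    """Probabilistic pre-tokenization: group words into random-length chunks and
    join them with SEP so that later merges stay within a chunk."""
    rng = rng or random.Random(0)
    words, chunks, i = text.split(), [], 0
    while i < len(words):
        n = rng.randint(min_words, max_words)
        chunks.append(" ".join(words[i:i + n]))
        i += n
    return SEP.join(chunks)


def train_supertoken_bpe(corpus_lines, vocab_size=128_000):
    """Train a byte-level BPE tokenizer whose merges may cross spaces inside a
    chunk (yielding multi-word supertokens) but never cross the separator."""
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
        # Split on (and drop) the chunk separator first...
        pre_tokenizers.Split(SEP, behavior="removed"),
        # ...then map to bytes without the usual regex word split, so merges
        # can span whitespace within a chunk.
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
    ])
    trainer = trainers.BpeTrainer(vocab_size=vocab_size,
                                  initial_alphabet=pre_tokenizers.ByteLevel.alphabet())
    tokenizer.train_from_iterator((stochastic_chunk(line) for line in corpus_lines), trainer)
    return tokenizer
```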

The authors empirically validated TokenAdapt using Llama-3.2-3B and Qwen2.5-3B base models, transplanting their tokenizers to a standard target tokenizer (QTK-81K) and a custom-trained supertoken tokenizer (Adi-Bun-128K). Evaluation was performed using zero-shot perplexity on a diverse dataset covering English, Hindi, code, math, and Hinglish. The reported perplexity-ratio comparisons show that TokenAdapt, particularly the hybrid version, significantly outperforms simple baselines such as Mean Initialization and ReTok (Gu et al., 6 Oct 2024), as well as more complex methods like TransTokenizer (Remy et al., 8 Aug 2024), in maintaining zero-shot performance (lower perplexity ratios). TokenAdapt hybrid achieved up to approximately a 2-fold improvement in overall perplexity ratio compared to ReTok across different base models and target tokenizers.

The supertoken learning approach also demonstrated practical effectiveness. Analysis of tokenization outputs reported in the paper shows that the trained supertoken tokenizer (Adi-Bun-128K) incorporates a higher proportion of multi-word units and achieves better compression ratios than standard tokenizers across various domains, supporting the idea that learning larger semantic units can improve representational efficiency. Visual examples in the paper further illustrate how supertoken tokenizers segment text into fewer, longer tokens than standard BPE-based tokenizers such as those used by GPT-4o or Llama 3.

In summary, TokenAdapt offers a practical and computationally efficient method for overcoming tokenizer lock-in by providing a robust hybrid heuristic for initializing new token embeddings during tokenizer transplantation, enabling strong zero-shot performance. Complementarily, the supertoken learning technique provides a way to create more efficient target tokenizers that learn multi-word units. These contributions lower the barrier to adapting powerful pre-trained LLMs for use in diverse languages and specialized domains without the prohibitive cost of full continued pre-training.

Authors (4)
  1. Shaurya Sharthak (1 paper)
  2. Vinayak Pahalwan (1 paper)
  3. Adithya Kamath (1 paper)
  4. Adarsh Shirawalmath (2 papers)