
Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

Published 7 Jun 2025 in cs.CL, cs.AI, and cs.LG | (2506.06607v1)

Abstract: We present a training-free method to transplant tokenizers in pretrained LLMs by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token's representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model's embedding space. On two challenging cross-tokenizer tasks--Llama$\to$Mistral NeMo (12B) and Qwen$\to$Llama (1B)--we show that OMP achieves best zero-shot preservation of the base model's performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.

Summary

  • The paper introduces a zero-shot tokenizer transplantation method via Orthogonal Matching Pursuit that re-aligns embeddings without additional training.
  • It demonstrates strong preservation of language model accuracy across benchmarks, especially in question answering and classification tasks.
  • The approach efficiently handles unseen tokens using sparse linear decompositions, making it suitable for real-world AI adaptations.

Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

This paper introduces a novel approach to adapting an LLM to a new tokenizer without additional training, using Orthogonal Matching Pursuit (OMP) to realign embeddings. The method's zero-shot capability enables cross-tokenizer transplantation that preserves most of the base model's performance without the resource-intensive retraining such scenarios typically require.

Introduction and Problem Statement

Tokenizers strongly influence an LLM's performance because they fix the vocabulary used to segment text. When a model trained with one tokenizer is integrated into an ecosystem that uses a different tokenizer, its performance can degrade significantly due to mismatches between the two token representation systems. Conventional remedies require either expensive retraining or zero-shot heuristics, which generally cause substantial performance drops, especially on tasks sensitive to token-embedding alignment such as question answering or mathematical reasoning.

OMP-Based Approach for Tokenizer Transplantation

The method uses OMP to express each token of the new vocabulary as a sparse combination of shared token embeddings, proceeding in two phases (formalized just below the list):

  • Shared Tokens: Directly copied from the base model's embeddings.
  • Unseen Tokens: Approximated in the donor space as a sparse combination over a small dictionary of shared anchor tokens; the same coefficients are then applied to those anchors' base-model embeddings, so no retraining is needed.
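Concretely, using notation introduced here for this summary (not drawn verbatim from the paper): let e_t^donor be the donor-space embedding of an unseen token t, and let E_S^donor and E_S^base be the matrices of shared-anchor embeddings in the donor and base models.

```latex
% Phase 1: sparse coding of the unseen token in the donor space (k non-zero coefficients)
\alpha_t = \arg\min_{\|\alpha\|_0 \le k} \left\| e_t^{\mathrm{donor}} - E_S^{\mathrm{donor}} \alpha \right\|_2
% Phase 2: reuse the same sparse coefficients over the base model's anchor embeddings
e_t^{\mathrm{base}} = E_S^{\mathrm{base}} \alpha_t
```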

In this way, even unseen tokens receive embeddings that lie in the base model's embedding space, accommodating a new vocabulary while leaving the model's weights untouched (Figure 1).

Figure 1: Sparse linear decompositions of selected tokens from Qwen 2.5's vocabulary. Each token is decomposed into a weighted sum of k=8 basis tokens, with coefficients colored by magnitude (green for positive, red for negative).
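The reconstruction can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' mergekit-tokensurgeon code; it assumes the shared-anchor embedding matrices of donor and base model are row-aligned (same token order), uses the full shared vocabulary as the OMP dictionary, and fixes k=8 non-zero coefficients as in Figure 1.

```python
# Minimal sketch of two-phase OMP transplantation (assumptions noted above).
import numpy as np

def omp_coefficients(target, dictionary, k=8):
    """Greedy OMP: approximate `target` (d_donor,) with at most k rows of
    `dictionary` (n_shared, d_donor); returns selected indices and weights."""
    norms = np.linalg.norm(dictionary, axis=1) + 1e-12
    residual = target.astype(np.float64)
    selected, coefs = [], np.zeros(0)
    for _ in range(k):
        # Pick the anchor whose normalized correlation with the residual is largest.
        scores = (dictionary @ residual) / norms
        scores[selected] = 0.0                  # never reselect an atom
        selected.append(int(np.argmax(np.abs(scores))))
        # Re-fit all selected coefficients by least squares, then update the residual.
        A = dictionary[selected].T              # (d_donor, |selected|)
        coefs, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coefs
    return np.asarray(selected), coefs

def transplant_token(donor_vec, donor_anchors, base_anchors, k=8):
    """Phase 1: sparse-code the unseen token in the donor space.
    Phase 2: rebuild it from the base model's embeddings of the same anchors."""
    idx, coefs = omp_coefficients(donor_vec, donor_anchors, k)
    return base_anchors[idx].T @ coefs          # vector of size d_base
```

Applying transplant_token to every out-of-vocabulary token, and copying shared tokens' embeddings directly, yields a complete embedding matrix for the new tokenizer without any gradient updates.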

Experimental Results

Llama → Mistral NeMo (12B)

OMP's effectiveness is demonstrated by transplanting Llama's tokenizer into Mistral NeMo (12B). The approach preserves accuracy across multiple benchmarks and clearly outperforms zero-shot heuristics such as zero or mean embedding initialization.

Qwen → Llama (1B)

Because the two vocabularies overlap heavily on English tokens, transplanting Qwen's tokenizer into Llama (1B) proves especially robust, maintaining near-parity on perplexity and classification benchmarks. OMP's ability to bridge disparate tokenizer systems is evident in how it handles large vocabulary differences without any fine-tuning.

Analysis of Trade-offs and Influences

A notable performance drop occurs on mathematical reasoning tasks when the two tokenizers use mismatched numerical tokenization schemes. The authors attribute this to differing geometric representations of numeric tokens, which affects arithmetic-heavy tasks. When the tokenization schemes are structurally similar, mathematical performance is retained, indicating that OMP is most effective where the numeric handling of the two tokenizers aligns.

Computational Efficiency and Practical Applications

The computational efficiency of OMP makes it highly applicable in real-world scenarios where tokenizer transplantation is required. For instance, in speculative decoding or domain-specific adaptation, the training-free approach allows rapid deployment and integration:

  • Knowledge Distillation: Enables teacher-student setups whose token vocabularies are aligned (see the sketch after this list).
  • Speculative Decoding: Facilitates interoperability in model pipelines without a separate vocabulary-harmonization step.
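As an illustration of the first point, once both models emit logits over the same transplanted vocabulary, a standard token-level distillation loss can be applied directly, with no cross-vocabulary mapping. The function name and temperature below are illustrative choices, not part of the paper.

```python
# Illustrative token-level distillation once teacher and student share a vocabulary.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over the shared vocabulary dimension."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)
```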

Conclusion

The training-free transplantation approach based on OMP provides a robust alternative to conventional retraining, bridging vocabulary discrepancies while opening new options for model adaptation. The work also points to further exploration of other sparse-coding techniques and of modifications that better handle inherently problematic tokenization mismatches.

By removing the computational cost of retraining, the paper ultimately supports more flexible and interoperable model ecosystems.
