Cross-Tokenizer Distillation

Updated 22 May 2026

Cross-tokenizer distillation is a set of methods that transfer model knowledge between tokenizers with different vocabularies and segmentation schemes.
Techniques such as probabilistic realignment, optimal transport, and byte-level mapping overcome sequence misalignment and enable effective teacher–student knowledge transfer.
These approaches facilitate rapid domain adaptation, model ensembling, and compressed deployment while minimizing retraining costs and performance loss.

Cross-tokenizer distillation encompasses a growing family of methodologies designed to transfer model knowledge (logits, preferences, or internal states) between large language models (LLMs) or tokenizers with incompatible vocabularies. Traditional distillation techniques assume identical tokenization across teacher and student, but modern architectures are increasingly heterogeneous in tokenizer design (vocabulary size, segmentation, or even modality composition), which calls for new alignment and transfer methods. Recent approaches address this problem through explicit probabilistic realignment, optimal transport on logit distributions, span-based alignment, byte-level interfaces, and dynamic mapping. These methods enable efficient knowledge distillation, tokenizer transplantation, model ensembling, and deployment flexibility across model architectures and domain specializations.

1. Motivation and Problem Definition

Cross-tokenizer distillation targets the transfer of knowledge from a teacher model with tokenizer $T_{\text{teacher}}$ to a student model with tokenizer $T_{\text{student}}$, where the vocabularies $\mathcal{V}_{\text{teacher}}$ and $\mathcal{V}_{\text{student}}$ are not aligned. Vocabulary misalignment arises both from non-overlapping units (BPE, unigram, or byte-level segmentations) and from different numerical, symbolic, or domain-specific tokenization schemes. The challenge is twofold: sequence misalignment (differing token boundaries and sequence lengths) and vocabulary misalignment (no one-to-one mapping between token ID spaces), both of which preclude classical token- or logit-level KL-divergence losses.

A central objective of cross-tokenizer distillation is to bridge this gap without introducing significant performance degradation or requiring large-scale retraining from scratch. Motivations include:

Rapid adaptation of pretrained models to new tokenizers or domains
Reducing deployment memory and latency via vocabulary trimming
Enabling teacher–student knowledge transfer, on-policy or off-policy, between architectural families
Facilitating ensembling, merging, and speculative decoding across tokenizer boundaries
Aligning model preference distributions in RLHF and DPO-style preference distillation

Empirical evidence demonstrates that naive approaches (zero-initialization, mean-initialization, or token overlap masking) fail in the presence of significant tokenization mismatch, especially for mathematical, code, or domain-specialized reasoning tasks (Goddard et al., 7 Jun 2025, Singh et al., 8 Apr 2026).
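To make the sequence misalignment defined at the start of this section concrete, the following minimal sketch segments one string with two toy vocabularies. The greedy longest-match tokenizer (`greedy_tokenize`) and both vocabularies are hypothetical and chosen only for illustration; real BPE or unigram tokenizers differ in detail.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    tokens, i = [], 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy vocabularies for the two models.
teacher_vocab = {"tokenizer", "s"}
student_vocab = {"tok", "en", "izers"}

text = "tokenizers"
teacher_tokens = greedy_tokenize(text, teacher_vocab)  # ['tokenizer', 's']
student_tokens = greedy_tokenize(text, student_vocab)  # ['tok', 'en', 'izers']

# Same string, different lengths and boundaries: position i in one sequence
# has no counterpart at position i in the other.
print(len(teacher_tokens), len(student_tokens))
```

Because the two sequences differ in length and share only the final boundary, a per-position KL term between teacher and student logits is undefined here, which is the situation the methods in the next section address.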

2. Methodological Foundations

Diverse strategies have been proposed to address cross-tokenizer alignment. They can be grouped by the principles that undergird their alignment and transfer mechanisms:

A. Embedding-Space Realignment

Orthogonal Matching Pursuit (OMP): Constructs unseen student token embeddings as sparse linear combinations of shared “anchor” token embeddings using OMP in the teacher’s embedding space, then applies the same coefficients to the corresponding anchor embeddings in the student space. This nonparametric approach enables training-free tokenizer transplantation, achieving strong zero-shot preservation on reasoning and language tasks when the two tokenizers segment numbers in compatible ways (Goddard et al., 7 Jun 2025).
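The OMP idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation from the cited work: the function names, the default sparsity `k`, and the toy matrices are all hypothetical, and rows of `teacher_anchors` and `student_anchors` are assumed to correspond to the same shared tokens.

```python
import numpy as np


def omp_coefficients(anchors, target, k):
    """Greedy OMP: pick up to k anchor rows and least-squares coefficients for target."""
    anchors = np.asarray(anchors, dtype=float)
    target = np.asarray(target, dtype=float)
    norms = np.linalg.norm(anchors, axis=1) + 1e-12
    residual = target.copy()
    selected, coef = [], np.zeros(0)
    for _ in range(min(k, len(anchors))):
        # Score each anchor by its normalized correlation with the residual.
        scores = np.abs(anchors @ residual) / norms
        if selected:
            scores[selected] = -np.inf  # never pick the same anchor twice
        selected.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly (the "orthogonal" step).
        basis = anchors[selected].T
        coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = target - basis @ coef
        if np.linalg.norm(residual) < 1e-10:
            break
    return selected, coef


def transplant_embedding(teacher_anchors, student_anchors, teacher_target, k=8):
    """Reuse coefficients found in the teacher space on the student's anchor rows."""
    selected, coef = omp_coefficients(teacher_anchors, teacher_target, k)
    return np.asarray(student_anchors, dtype=float)[selected].T @ coef


# Toy data: 4 shared anchor tokens, teacher dim 4, student dim 3 (all hypothetical).
teacher_anchors = np.eye(4)
student_anchors = np.arange(12).reshape(4, 3)
new_token_in_teacher_space = np.array([0.0, 2.0, 0.0, -1.0])
print(transplant_embedding(teacher_anchors, student_anchors, new_token_in_teacher_space))
```

No gradient step is involved: the only computation is a sequence of small least-squares solves per new token, which is what makes the approach training-free.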

B. Probabilistic Realignment and Marginalization

Cross-Tokenizer Likelihood Scoring: Leverages the recursive structure of BPE algorithms to marginalize teacher token probabilities into student-token likelihoods. In the subset regime ($\mathcal{V}_{\text{student}} \subseteq \mathcal{V}_{\text{teacher}}$), next-token probabilities are computed via a binary mapping matrix; in the general case, lossless recursion or efficient beam search yields next-token approximations (Phan et al., 16 Dec 2025).
Byte-Level Distillation: Projects the teacher’s output distribution into byte-level probabilities (shared across all tokenizers), attaches a shallow byte-level decoder to the student, and distills by minimizing byte-level KL divergence. This sidesteps the need for vocabulary alignment and shows competitive performance with more complex approaches (Singh et al., 8 Apr 2026).
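A simplified view of why bytes give a shared comparison space is sketched below. It marginalizes a next-token distribution onto the first byte of each candidate token and compares teacher and student there. This is only an illustration under strong assumptions: the prefix is taken to end on a token boundary for both models, alternative tokenizations of the prefix are ignored, and the toy distributions and function names are hypothetical. The cited methods handle the general case with recursion, beam search, or a learned byte-level decoder.

```python
import math
from collections import defaultdict


def first_byte_marginal(token_probs):
    """Sum next-token probability mass by the first UTF-8 byte of each token."""
    byte_probs = defaultdict(float)
    for token, prob in token_probs.items():
        encoded = token.encode("utf-8")
        if encoded:
            byte_probs[encoded[:1]] += prob
    return dict(byte_probs)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the keys of p, flooring q at eps to avoid division by zero."""
    return sum(
        pv * math.log(pv / max(q.get(key, 0.0), eps))
        for key, pv in p.items()
        if pv > 0
    )


# Hypothetical next-token distributions over two incompatible vocabularies.
teacher_next = {"the": 0.5, "to": 0.2, "a": 0.3}
student_next = {"t": 0.6, "an": 0.4}

teacher_bytes = first_byte_marginal(teacher_next)  # mass on b"t" and b"a"
student_bytes = first_byte_marginal(student_next)
loss = kl_divergence(teacher_bytes, student_bytes)
print(teacher_bytes, student_bytes, loss)
```

The token-level distributions have no common support, but their byte-level marginals live over the same 256 symbols, so a standard KL term becomes well defined.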

C. Sequence and Distributional Alignment via Optimal Transport

Universal Logit Distillation (ULD): Formulates distillation as optimal transport between teacher and student logit distributions without explicit token alignment. The OT distance (under a uniform cost) between sorted probability vectors provides a tractable, efficient Wasserstein metric for large vocabularies (Boizard et al., 2024).
Multi-Level Optimal Transport (MultiLevelOT): Extends OT alignment to both token and sequence levels, assembling cost matrices on a truncated, sequence-aware token support. The Sinkhorn distance is used for sequence-level alignment, integrating both holistic and local distributional information (Cui et al., 2024).
CoT2Align: Introduces chain-of-thought data augmentation with OT-based sequence and layer-wise alignment to enforce reasoning-aware transfer, surpassing DSKD and ULD in reasoning-heavy tasks (Le et al., 24 Feb 2025).
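The sorted-vector form of the ULD objective described above can be sketched as follows. This is a minimal single-position illustration, assuming both inputs are already softmax probabilities; the function name `uld_loss` and the toy vectors are hypothetical, and batching, temperature, and any auxiliary cross-entropy term are omitted.

```python
import numpy as np


def uld_loss(teacher_probs, student_probs):
    """L1 distance between descending-sorted, zero-padded probability vectors."""
    size = max(len(teacher_probs), len(student_probs))
    teacher_sorted = np.zeros(size)
    student_sorted = np.zeros(size)
    # Sorting discards token identity, so no vocabulary mapping is needed.
    teacher_sorted[: len(teacher_probs)] = np.sort(teacher_probs)[::-1]
    student_sorted[: len(student_probs)] = np.sort(student_probs)[::-1]
    return float(np.abs(teacher_sorted - student_sorted).sum())


# Vocabularies of different sizes (3 vs. 4 entries), purely illustrative.
teacher = np.array([0.1, 0.6, 0.3])
student = np.array([0.5, 0.25, 0.25, 0.0])
print(uld_loss(teacher, student))
```

The cost is dominated by the two sorts, which is why this form remains practical for large vocabularies. The trade-off is that the loss only matches the shape of the two distributions, not which tokens carry the mass.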

D. Tokenizer-Agnostic Alignment via Span and Chunk Representations

Span Representation Alignment (SRA): Defines tokenizer-agnostic spans via LCS of token ending offsets, pools token representations within each span weighted by attention (center-of-mass), and applies pairwise-geometry and logit-alignment losses at the span level (Dao et al., 2 May 2026).
SimCT: Recovers lost supervision in on-policy distillation by constructing the minimal aligned units (jointly tokenizable substrings according to tokenizer boundary graphs), using these as the atomic comparison units for KL loss (Sun et al., 8 May 2026).
Reverse CALM: Pools teacher and student token log-probabilities into aligned byte-level “chunks” and applies a reverse-direction binary cross-entropy, stabilizing gradients and filtering noise during cross-architecture diffusion model distillation (Zhang et al., 29 Apr 2026).
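The span and chunk constructions in this group all start from positions where both tokenizations end a token. A minimal sketch of that step is shown below, using character offsets and hypothetical helper names. Because token ending offsets are strictly increasing, their longest common subsequence is simply the sorted set intersection, which is what the sketch computes; representation pooling and the losses themselves are not shown.

```python
def end_offsets(tokens):
    """Character offset at which each token ends."""
    offsets, pos = [], 0
    for token in tokens:
        pos += len(token)
        offsets.append(pos)
    return offsets


def shared_spans(teacher_tokens, student_tokens):
    """Character spans delimited by boundaries common to both tokenizations."""
    common = sorted(set(end_offsets(teacher_tokens)) & set(end_offsets(student_tokens)))
    spans, start = [], 0
    for end in common:
        spans.append((start, end))
        start = end
    return spans


def tokens_per_span(tokens, spans):
    """Group token indices by the shared span that contains them."""
    groups, pos, span_idx = [[] for _ in spans], 0, 0
    for i, token in enumerate(tokens):
        groups[span_idx].append(i)
        pos += len(token)
        if pos == spans[span_idx][1] and span_idx < len(spans) - 1:
            span_idx += 1
    return groups


teacher_tokens = ["token", "izer", "s"]
student_tokens = ["tok", "en", "izers"]
spans = shared_spans(teacher_tokens, student_tokens)
print(spans)                                   # [(0, 5), (5, 10)]
print(tokens_per_span(teacher_tokens, spans))  # [[0], [1, 2]]
print(tokens_per_span(student_tokens, spans))  # [[0, 1], [2]]
```

Each span now has a group of teacher tokens and a group of student tokens covering exactly the same text, so hidden states or log-probabilities can be pooled per span and compared.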

E. Contextual and Dynamic Realignment

Contextual Dynamic Mapping (CDM): Applies entropy-weighted DTW for sequence alignment, dynamic vocabulary mapping based on edit distance, and dual-direction mapping for robust, context-aware transfer at the logit level (Chen et al., 16 Feb 2025).
Dual-Space Weighting and Time-Warped Alignment (DWA-KD): Combines asymmetric, entropy-based token weighting (prioritizing difficult or confident tokens) with banded Soft-DTW alignment of both embeddings and final hidden states, allowing robust alignment of lexical and semantic sequence structure (Vu et al., 25 Feb 2026).
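The alignment step shared by these methods is dynamic time warping over the two token sequences. The sketch below is plain DTW with an edit-distance local cost, written under stated assumptions: it omits the entropy weighting used by CDM and the banded Soft-DTW relaxation used by DWA-KD, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dtw_align(teacher_tokens, student_tokens):
    """Return (total cost, path) where path pairs teacher and student token indices."""
    n, m = len(teacher_tokens), len(student_tokens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = edit_distance(teacher_tokens[i - 1], student_tokens[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return cost[n][m], path[::-1]


total, path = dtw_align(["hello", "world"], ["hello", "wor", "ld"])
print(total, path)  # "world" is matched to both "wor" and "ld"
```

The resulting path tells a logit-level loss which teacher position supervises which student position. Weighting the local cost by token entropy, as the cited methods do, would bias the path toward positions where the models are more or less certain.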

F. Preference Distillation and Human Alignment

CTPD: Aligns teacher and student via character-level span projection, adapting token-level importance sampling (TIS-DPO) to the span level, and uses a teacher-anchored DPO loss. Importance weighting corrects for label noise and the variance inherent in preference datasets (Nguyen et al., 17 Jan 2026).
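For orientation, the standard DPO objective that such preference losses build on depends only on sequence-level log-probabilities, which are scalars for a whole response and therefore comparable across tokenizers. The sketch below shows that standard form for a single preference pair. It is not CTPD's loss: the reference terms are generic placeholders, and the teacher anchoring and span-level importance weights described above are not reproduced. The function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities of the chosen and rejected responses.
no_preference = dpo_loss(-8.0, -8.0, -8.0, -8.0)   # margin 0
learned_pref = dpo_loss(-5.0, -12.0, -8.0, -8.0)   # policy favors the chosen response
print(no_preference, learned_pref)
```

A span-level variant would decompose each sequence log-probability into a weighted sum over aligned character spans before forming the margin.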

The following table summarizes salient properties:

Method	Alignment Unit	Realignment Principle
OMP	Embedding, token	Sparse projection, zero-shot
ULD/MultiLevelOT	Token / sequence logits	Optimal transport (Wasserstein)
ALM, CTPD, SRA	Byte/character spans/chunks	Span / chunk aggregation
SimCT, Reverse CALM	Minimal aligned units, chunks	Graph-theoretic / byte-aligned
CDM, DWA-KD	Contextual (entropy-weighted)	DTW / Soft-DTW, importance weight
Byte-Level Distillation	Byte	Mutual byte marginalization
CoT2Align	Sequence / layer-wise	OT plus CoT data augmentation

3. Algorithmic Implementations

Many cross-tokenizer distillation procedures feature modular steps: alignment, realignment/mapping, and loss accumulation. Key algorithmic considerations include:

Efficient alignment: LCS on token offsets for span construction (Dao et al., 2 May 2026), byte-aligned chunk partitioning (Zhang et al., 29 Apr 2026), or DTW with contextual entropy weights (Chen et al., 16 Feb 2025, Vu et al., 25 Feb 2026).
Explicit or implicit probabilistic projections: OMP coefficients into new embedding spaces (Goddard et al., 7 Jun 2025); BPE-induced aggregation matrices (Phan et al., 16 Dec 2025); beam search for byte-level marginalization (Singh et al., 8 Apr 2026).
Loss computation: OT distances (often reduced to ℓ₁ over sorted vectors) (Boizard et al., 2024), binarized chunk-level KL or BCE (Minixhofer et al., 25 Mar 2025, Zhang et al., 29 Apr 2026), DPO-style contrastive preference loss (Nguyen et al., 17 Jan 2026).
Training regimes: Training-free transplantation (OMP), LoRA/adapter-based fine-tuning (Singh et al., 8 Apr 2026), or full-model fine-tuning depending on student size (Vu et al., 25 Feb 2026).
Efficient computational kernels: Incremental QR in OMP for sublinear least-squares solves; vectorized sort and index-operations for OT (Goddard et al., 7 Jun 2025, Boizard et al., 2024).
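The alignment and loss-accumulation steps listed above can be combined in a short sketch of a chunk-level binary loss. Token log-probabilities are summed within chunks that end on boundaries shared by both tokenizations, and each chunk is then treated as a Bernoulli event. This is an illustration under stated assumptions, not the exact objective of any cited method: the boundaries are given by hand, the log-probabilities are made up, and details such as temperature scaling or the direction of the divergence are omitted.

```python
import math


def pool_logprobs(tokens, logprobs, boundaries):
    """Sum token log-probs within each chunk ending at a shared character boundary."""
    chunks, total, pos = [], 0.0, 0
    targets = iter(boundaries)
    target = next(targets, None)
    for token, logprob in zip(tokens, logprobs):
        pos += len(token)
        total += logprob
        if pos == target:
            chunks.append(total)
            total = 0.0
            target = next(targets, None)
    return chunks


def binary_kl(p, q, eps=1e-9):
    """KL between Bernoulli(p) and Bernoulli(q), with clamping for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


boundaries = [5, 10]  # shared character boundaries for "tokenizers"
teacher_chunks = pool_logprobs(
    ["token", "izer", "s"], [math.log(0.5), math.log(0.8), math.log(0.9)], boundaries
)
student_chunks = pool_logprobs(
    ["tok", "en", "izers"], [math.log(0.6), math.log(0.5), math.log(0.7)], boundaries
)
loss = sum(
    binary_kl(math.exp(t), math.exp(s)) for t, s in zip(teacher_chunks, student_chunks)
)
print(loss)
```

Summing log-probabilities inside a chunk is exact under the chain rule, so the only approximation introduced at this stage is collapsing each chunk to a binary outcome.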

Selected empirical results:

Task/Pair	Best Zero-Shot/Distillation Score	Reference
Llama→Mistral NeMo (12B) MMLU	62.22% (OMP), baseline 64.52%	(Goddard et al., 7 Jun 2025)
Qwen→Llama1B GSM8K	1.44% (OMP), baseline 6.75%	(Goddard et al., 7 Jun 2025)
Qwen2.5-14B→Llama-3.1-8B Preference (Avg)	67.42 (CTPD), SFT 64.54	(Nguyen et al., 17 Jan 2026)
Qwen2.5-1.5B: 32k-vocab trimmed, GSM8K	58.6% (PKL), SFT 54.0%, full 60.2%	(Phan et al., 16 Dec 2025)
Math reasoning, OpenMath2-Llama8B→Gemma2-2B	65.1% GSM8K (ALM+SFT)	(Minixhofer et al., 25 Mar 2025)
SRA (Qwen1.5→GPT2-120M, ROUGE-L)	17.97 (SRA), vs. 15.35 (DSKD)	(Dao et al., 2 May 2026)
CALM vs Reverse CALM (HumanEval)	43.90 (CALM), 49.39 (Reverse CALM)	(Zhang et al., 29 Apr 2026)

4. Failure Modes, Limitations, and Structural Requirements

Cross-tokenizer distillation methods are sensitive to structural mismatches:

Numerical tokenization mismatch: Divergent schemes (e.g., digit-by-digit vs. chunked representation) severely degrade performance in mathematical tasks. OMP, as an example, is incapable of reconstructing the requisite numeric subspace if the schemes differ (Goddard et al., 7 Jun 2025).
Supervision signal dilution: On-policy distillation (OPD) methods that discard all non-shared tokens silently lose up to 70% of the teacher signal. Methods like SimCT recover this loss by exploiting the minimal jointly-tokenizable units (Sun et al., 8 May 2026).
Chunking granularity: Coarsening aligned units (e.g., longer spans than the minimal aligned units) blurs critical distinctions necessary for fine-grained supervision, empirically reducing distillation quality (Sun et al., 8 May 2026).
Domain adaptation: Performance degrades more noticeably in code generation and mathematical reasoning than in conversational or summarization tasks, emphasizing the necessity of structural alignment and reasoning-aware losses (Cui et al., 2024, Le et al., 24 Feb 2025).
Computational complexity: While OT solvers naively scale cubically in vocabulary size, closed-form ℓ₁-on-sorted algorithms and effective pruning (e.g., beam search) render real-world distillation tractable (Boizard et al., 2024, Phan et al., 16 Dec 2025).
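The first two failure modes can be made tangible with a small measurement: the fraction of student tokens that also appear as a teacher token over exactly the same characters. The sketch below uses hypothetical tokenizations and a hypothetical helper name; it illustrates the mechanism only and does not reproduce the figures reported in the cited work.

```python
def token_spans(tokens):
    """Set of (start, end) character spans covered by each token."""
    spans, pos = set(), 0
    for token in tokens:
        spans.add((pos, pos + len(token)))
        pos += len(token)
    return spans


def shared_token_fraction(teacher_tokens, student_tokens):
    """Fraction of student tokens whose exact span is also a teacher token."""
    student = token_spans(student_tokens)
    if not student:
        return 0.0
    return len(student & token_spans(teacher_tokens)) / len(student)


# Digit-by-digit vs. chunked number tokenization: no token is shared.
print(shared_token_fraction(["1", "2", "3", "4", "5"], ["123", "45"]))

# Ordinary text: one of the three student tokens is shared.
print(shared_token_fraction(["the", " cat"], ["the", " c", "at"]))
```

A loss that supervises only shared tokens would receive no signal at all on the numeric example, whereas span- or chunk-level units still cover the full string.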

5. Applications and Practical Integration

Cross-tokenizer distillation now underlies a range of LLM workflows:

Tokenizer transplantation and vocabulary expansion: OMP enables post hoc transplantation without weight retraining, facilitating rapid domain adaptation and support for new vocabularies (Goddard et al., 7 Jun 2025).
On-policy and off-policy distillation: Methods such as SimCT and Approximate Likelihood Matching (ALM) enable matching of teacher and student distributions, including under the student’s own policy rollouts (Sun et al., 8 May 2026, Minixhofer et al., 25 Mar 2025).
Preference distillation and RLHF alignment: CTPD introduces character-level span projections and teacher-anchored DPO objectives, allowing fine-grained transfer of human preference signals across token boundaries (Nguyen et al., 17 Jan 2026).
Model ensembling and merging: Aligning models to a common tokenizer supports inference-time ensemble averaging of logits or probabilities, boosting downstream metrics (Minixhofer et al., 25 Mar 2025).
Compressed deployment and edge adaptation: Tokenizer realignment and vocabulary trimming reduce memory footprint (e.g., by 9–13.5% in LM-head size) while preserving task accuracy (Phan et al., 16 Dec 2025).

These techniques have been integrated into open-source toolchains—for example, mergekit-tokensurgeon implements OMP-based transplantation (Goddard et al., 7 Jun 2025).

6. Comparative Analysis and Benchmarking

Recent empirical studies benchmark cross-tokenizer procedures on diverse teacher–student pairs, task domains, and architectures. Key findings include:

Tokenizer-agnostic approaches (SRA, MultiLevelOT, ULD) outperform classical OT or KL losses where large vocabulary mismatch exists (Cui et al., 2024, Dao et al., 2 May 2026, Boizard et al., 2024).
Training-free methods (OMP, embedding-projector hypernetworks) match or closely approach full retraining in most zero-shot NLP metrics, except where semantic or numerical segmentation diverges (Goddard et al., 7 Jun 2025, Minixhofer et al., 25 Mar 2025).
Span-based, entropy-weighted, and dynamic mapping (DWA-KD, CDM) produce further gains in summarization, code generation, and out-of-domain instruction following (Chen et al., 16 Feb 2025, Vu et al., 25 Feb 2026).
Byte-level interface methods set a new robustness baseline, especially in BPE→byte or subword→byte transfer scenarios; however, they often underperform on structured outputs (e.g., structured instructions, code) (Singh et al., 8 Apr 2026).

Robustness studies indicate that hybrid approaches (e.g., dual-teacher distillation with both same- and cross-tokenizer loss) give additive benefits and may close the gap to shared-tokenizer upper bounds (Chen et al., 16 Feb 2025, Minixhofer et al., 25 Mar 2025). Ablation studies reinforce the need for fine-grained alignment units and indicate that each modular component contributes on its own.

7. Open Challenges and Future Directions

Despite demonstrable advances, cross-tokenizer distillation remains an open research area. Key challenges and frontiers include:

Optimal design of chunking and span alignment for non-whitespace, morphologically-rich or multilingual tokenizers (Minixhofer et al., 25 Mar 2025, Dao et al., 2 May 2026).
Recipe selection for highly heterogeneous teacher–student pairs and for extreme compression (e.g., sub-100M parameter students) (Boizard et al., 2024).
Automated loss balancing in multi-objective setups (e.g., ALM + SFT + hidden state alignment), with growing model families and data scaling (Minixhofer et al., 25 Mar 2025).
Theoretical understanding of information loss and supervision granularity in minimal-aligned-unit extraction (e.g., SimCT) (Sun et al., 8 May 2026).
Extension to modalities beyond language (e.g., visual-semantic tokenizers; see WinTok for hybrid pixel/semantic token alignment (Guo et al., 18 May 2026)), as well as architecture-agnostic transfer (e.g., text diffusion models (Zhang et al., 29 Apr 2026)).
Efficient, scalable OT solvers for very large, non-overlapping vocabularies without reduction to byte-level representations.

A plausible implication is that further progress will require combining advances in probabilistic marginalization, span and sequence alignment, and information-theoretic analysis, with an emphasis on computational tractability and data-efficient adaptation. Cross-pollination between language and multimodal tokenization schemes is expected to yield further methodological innovation.

For comprehensive algorithmic and implementation details, refer to "Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit" (Goddard et al., 7 Jun 2025), "Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching" (Minixhofer et al., 25 Mar 2025), and "Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping" (Chen et al., 16 Feb 2025), among others cited in this article.