Token-Level Heterogeneous-Vocab Ensembles

Updated 17 June 2026

Token-level heterogeneous-vocab ensembles integrate predictions from models with mismatched tokenizations, and the cited work reports state-of-the-art results in reasoning, mathematics, QA, and translation.
These methods use sparse vocabulary alignment, orthogonal mapping, top-k union, and adaptive gating to aggregate token probabilities efficiently and accurately.
They mitigate early error compounding and high computational cost by ensembling only critical tokens and by using expert routing, which improves robustness across NLP tasks.

Token-level heterogeneous-vocab ensembles are a family of inference-time algorithms and frameworks that enable the fine-grained ensembling of LLMs with differing tokenization schemes and vocabulary sets. These methods aggregate token-level probability distributions, exploiting the complementary strengths of multiple models while overcoming the limitations imposed by mismatched vocabularies. Their variants include direct probability averaging via vocabulary alignment, surface-form agreement strategies, top-k probability fusion, probabilistic mapping through shared semantic anchors, route-based expert selection, and selective/gated ensembling triggered only at information-critical positions. This approach has established new state-of-the-art performance ceilings on standard benchmarks, particularly in reasoning, mathematics, knowledge QA, and machine translation tasks (Yu et al., 2024, Xu et al., 2024, Yao et al., 2024, Huang et al., 2024, Wicks et al., 28 Feb 2025, Yun et al., 17 Oct 2025, Xiong et al., 8 Jan 2026).

1. Foundational Problems and Motivation

The motivation for token-level ensembling is twofold. First, token-level aggregation allows for immediate correction of errors, mitigating “snowballing” effects that arise if a model outputs an incorrect token early in a generation sequence. Second, the thriving open-source LLM community predominantly relies on models with divergent, non-aligned vocabularies, inhibiting naïve token-level ensemble methods that assume consistent subword tokenization across all models (Yu et al., 2024). This vocabulary heterogeneity historically confined ensembles to post-hoc output reranking or scoring, underutilizing the granular information present in models’ stepwise distributions.

Three primary technical challenges arise:

Vocabulary mismatch: Different LLMs tokenize the same surface string into different subword tokens, leading to incompatible probability vector spaces.
Event-space alignment: Token-level distributions are only comparable after being mapped or projected into a shared event space.
Efficiency: Full-vocabulary alignment is computationally prohibitive due to vocab sizes (32k–128k+ tokens per model).
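
The vocabulary-mismatch problem can be made concrete with a toy example. The sketch below is illustrative only: `greedy_tokenize`, `VOCAB_A`, and `VOCAB_B` are hypothetical stand-ins, and greedy longest-match segmentation is a simplification of what real subword tokenizers do. It shows the same surface string split into different token sequences, so the two models' per-step distributions range over different events.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation (toy stand-in for a subword tokenizer)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical vocabularies for two models.
VOCAB_A = {"token", "ization", "s"}
VOCAB_B = {"tok", "en", "iz", "ation"}

tokens_a = greedy_tokenize("tokenization", VOCAB_A)
tokens_b = greedy_tokenize("tokenization", VOCAB_B)
print(tokens_a)  # ['token', 'ization']
print(tokens_b)  # ['tok', 'en', 'iz', 'ation']
```

Both sequences spell the same string, but model A needs two steps and model B needs four, and no token is shared between them. This is why the distributions must first be mapped into a common event space.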

2. Methodological Taxonomy

Several strategies have been proposed to realize token-level heterogeneous-vocab ensembling:

Vocabulary Alignment and Projection

Generation-as-Classification (GaC): Treats every next-token prediction as a classification over the union vocabulary $V^{U} = \bigcup_{i=1}^n V^i$, and uses sparse mapping matrices $M^i$ to project each model’s local softmax outputs into this shared space. Probabilities are averaged and renormalized, enabling stepwise aggregation (Yu et al., 2024).
EVA: Learns an orthogonal Procrustes mapping between embedding spaces for overlapping tokens, creating a similarity matrix $W^{B \to A}$ to project non-pivot model logits into a pivot model’s vocabulary, followed by sparse alignment and confidence-based filtering (Xu et al., 2024).
Relative-Space Transformation (DeePEn): Utilizes a shared anchor set drawn from the intersection of model vocabularies, projecting each model’s probability distribution into a relative “anchor” space via cosine similarities. The relative representations are fused, then an inverse search finds a token distribution in a main model’s vocabulary whose anchor projection best matches the ensemble (Huang et al., 2024).
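
A minimal sketch of union-vocabulary projection in the style described for GaC, assuming tokens are matched by exact surface string. The function names and the toy vocabularies are hypothetical, and the scatter loop plays the role of multiplying by a sparse 0/1 mapping matrix $M^i$; it is not taken from the GaC implementation.

```python
def build_union_vocab(vocabs):
    """Return the sorted union vocabulary and a token -> index lookup."""
    union = sorted(set().union(*vocabs))
    return union, {tok: idx for idx, tok in enumerate(union)}


def project_to_union(probs, vocab, union_index, union_size):
    """Scatter a local distribution into the union space (acts like a sparse M^i)."""
    out = [0.0] * union_size
    for tok, p in zip(vocab, probs):
        out[union_index[tok]] += p
    return out


def union_average(dists, vocabs):
    """Project every model into the union space, average, and renormalize."""
    union, index = build_union_vocab(vocabs)
    projected = [project_to_union(p, v, index, len(union))
                 for p, v in zip(dists, vocabs)]
    mean = [sum(col) / len(projected) for col in zip(*projected)]
    total = sum(mean)
    return union, [m / total for m in mean]


# Toy vocabularies and next-token distributions (assumed values).
vocab_a, probs_a = ["Paris", "Par", "London"], [0.6, 0.3, 0.1]
vocab_b, probs_b = ["Paris", "Lon", "London"], [0.5, 0.1, 0.4]

union, ensemble = union_average([probs_a, probs_b], [vocab_a, vocab_b])
best = union[ensemble.index(max(ensemble))]
print(best)  # Paris
```

Tokens that exist in only one vocabulary ("Par", "Lon") receive mass from that model alone, so the average is effectively halved for them. This is one reason the later sections discuss proxies and sharpening.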

Agreement and Event-Space Consensus

Agreement-Based Ensembling (ABE): Formulates surface-form agreement as the central event space. Only string extensions $\Delta$ that can be realized as a single token in all models are considered, with ensemble scores determined over the cross-product of token surfaces via cube-pruned search (Wicks et al., 28 Feb 2025).
Top-k Union Ensembling (UniTE): Aggregates probabilities over the union of each model’s top-k candidate tokens at each step, using each model’s local tokenizer to approximate or decompose tokens that are not present. This limits computation to highly likely events (Yao et al., 2024).
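
The top-k union idea can be sketched as follows. This is a simplified illustration rather than the UniTE implementation: distributions are plain dictionaries, and `proxy_prob` is a hypothetical stand-in for the tokenizer-based approximation, using the longest in-vocabulary prefix of a missing token.

```python
def top_k(dist, k):
    """Return the k most probable tokens of a {token: prob} distribution."""
    return sorted(dist, key=dist.get, reverse=True)[:k]


def proxy_prob(token, dist):
    """Assumed proxy: probability of the longest prefix the model knows (or 0)."""
    for end in range(len(token), 0, -1):
        if token[:end] in dist:
            return dist[token[:end]]
    return 0.0


def topk_union_ensemble(dists, k):
    """Average per-model scores, each normalized over the top-k union."""
    candidates = sorted({tok for d in dists for tok in top_k(d, k)})
    fused = {tok: 0.0 for tok in candidates}
    for d in dists:
        scores = {tok: proxy_prob(tok, d) for tok in candidates}
        z = sum(scores.values()) or 1.0  # guard against an all-zero row
        for tok in candidates:
            fused[tok] += scores[tok] / z / len(dists)
    return fused


# Toy distributions (assumed values); model B has no single token "answer".
dist_a = {"answer": 0.5, "ans": 0.25, "result": 0.15, "the": 0.1}
dist_b = {"ans": 0.6, "result": 0.3, "the": 0.1}

fused = topk_union_ensemble([dist_a, dist_b], k=2)
best = max(fused, key=fused.get)
print(best)  # answer
```

Only the candidate set (three tokens here) is scored, which is the source of the reduced per-step cost compared with full-vocabulary alignment.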

Routing and Gated Composition

FusionRoute: Incorporates a lightweight router network that, at each step, assigns the token prediction task to the most appropriate expert LLM, then fuses the expert’s logits with a learned correction term via logit addition. Support for heterogeneous vocabularies is achieved by union mapping and appropriate projection (Xiong et al., 8 Jan 2026).
Selective (Key-Token/Gated) Ensembling: Instead of aggregating at every token, only “hard” tokens (with low model confidence or low consensus) trigger full ensembling, reducing redundant computation and maintaining high throughput (Yu et al., 2024, Yun et al., 17 Oct 2025). SAFE (Stable and Fast LLM Ensembling) further identifies tokenization mismatches (OOV-like tokens) and consensus positions to trigger ensembling only when necessary (Yun et al., 17 Oct 2025).
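
A minimal sketch of confidence-gated ensembling, assuming the two distributions already live in a shared event space and that a fixed threshold on the primary model's top probability is the gate. The threshold value, `gated_step`, and `secondary_model` are hypothetical; the cited methods use their own gating criteria.

```python
def gated_step(primary_dist, secondary_fn, threshold=0.5):
    """Return (token, ensembled?). The secondary model runs only on hard tokens."""
    top_token = max(primary_dist, key=primary_dist.get)
    if primary_dist[top_token] >= threshold:
        # Confident step: skip the secondary forward pass entirely.
        return top_token, False
    secondary_dist = secondary_fn()
    tokens = set(primary_dist) | set(secondary_dist)
    fused = {t: 0.5 * (primary_dist.get(t, 0.0) + secondary_dist.get(t, 0.0))
             for t in tokens}
    return max(fused, key=fused.get), True


calls = []  # counts how often the secondary model is invoked


def secondary_model():
    calls.append(1)
    return {"4": 0.8, "5": 0.2}


easy = gated_step({"4": 0.9, "5": 0.1}, secondary_model)
hard = gated_step({"4": 0.40, "5": 0.45, "6": 0.15}, secondary_model)
print(easy, hard, len(calls))  # ('4', False) ('4', True) 1
```

In the second call the primary model alone would have picked "5"; the ensemble overturns that choice, while the first call costs no extra forward pass.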

3. Mathematical Formulation and Algorithms

The table below summarizes the main token-level heterogeneous-vocab ensembling frameworks:

Method	Event Space	Projection/Alignment	Aggregation
GaC	Union vocab	Sparse $M^i$, $p^i_U$	Mean over $p^i_U$
EVA	Pivot vocab	Orthogonal mapping $W$	Average aligned $p_\ell$
DeePEn	Relative anchor	Anchor sim, inverse search	Average anchors, search main
ABE	Surface-forms	Cross-product cube prune	Max consensus $\Delta$
UniTE	Top-k union	Tokenizer proxy	Mean over normalized scores
FusionRoute	Union vocab	Expert logits lift, union	Logit addition + router LM
SAFE	Adaptive union	Tokenization OOV check	Conditional aggregation

Detailed algorithmic steps:

GaC: For $n$ models, each computes a next-token distribution $p^i$ over its own vocabulary $V^i$. $p^i$ is projected as $p^i_U = p^i M^i$ ($|V^U|$-dimensional). Ensembling is $p_U = \frac{1}{n} \sum_{i=1}^n p^i_U$; decoding proceeds via argmax or sampling (Yu et al., 2024).
EVA: For each non-pivot model $B$, align with $W^{B \to A}$ over the pivot vocab, then average only “faithful” models determined by an alignment-based indicator (Xu et al., 2024).
DeePEn: Each model’s $p^i$ is projected to anchors as $r^i$. Ensembling averages the $r^i$; the main model’s probability vector $\hat{p}$ is inferred via gradient-based optimization to match the relative anchor projection (Huang et al., 2024).
ABE: For each model, obtain top-N token candidates and their scores. Using a min-heap/cube-pruned search, select the highest-scoring consensus $\Delta$ that is a realizable surface form for all models, enforcing sequencing via “stalling” for asynchrony (Wicks et al., 28 Feb 2025).
UniTE: Collect each model’s top-k tokens $T^i_k$ and associated probabilities. Construct the union $V^U_k = \bigcup_{i} T^i_k$ and, for each token $w \in V^U_k$, compute each model’s probability $p^i(w)$. If $w \notin V^i$, use tokenizer proxies (Yao et al., 2024).
SAFE: For each predicted token, check for tokenization mismatch via an OOV-like test across models, and measure consensus. Ensemble only when no mismatch is detected and consensus is low. Incorporates probability “sharpening” by consolidating split subwords (Yun et al., 17 Oct 2025).
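
The anchor projection described for DeePEn in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: the embeddings are toy values, the anchors are taken to be the first two tokens of each vocabulary, and both the anchor normalization and the gradient-based inverse search are omitted. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def relative_matrix(embeddings, anchor_ids):
    """Row t holds the cosine similarity of token t to every anchor token."""
    return [[cosine(e, embeddings[a]) for a in anchor_ids] for e in embeddings]


def to_relative(probs, rel):
    """Project a distribution into anchor space: r_j = sum_t p_t * rel[t][j]."""
    return [sum(p * row[j] for p, row in zip(probs, rel))
            for j in range(len(rel[0]))]


def average_relative(reps):
    """Fuse relative representations by coordinate-wise averaging."""
    return [sum(col) / len(col) for col in zip(*reps)]


# Model A: 3 tokens, 2-d embeddings. Model B: 4 tokens, 3-d embeddings.
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
rel_a = relative_matrix(emb_a, anchor_ids=[0, 1])
rel_b = relative_matrix(emb_b, anchor_ids=[0, 1])

r_a = to_relative([0.7, 0.2, 0.1], rel_a)
r_b = to_relative([0.6, 0.2, 0.1, 0.1], rel_b)
fused = average_relative([r_a, r_b])
```

Although the two models differ in vocabulary size and embedding dimension, both relative representations have one coordinate per anchor, which is what makes them directly averageable.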

4. Empirical Performance and Benchmarking

Token-level heterogeneous-vocab ensembles deliver consistent gains over single models and output-level ensembles across domains. Key metrics:

GaC: Multi-benchmark improvements (OpenChat+SOLAR: avg 55.63 → 59.66, Mixtral+Yi: 58.44 → 61.56, Qwen1.5-72B+Llama3-70B: 68.30 → 71.08). Gated key-token ensembling reduces wall-clock time by 30–50%, with only 6–10% ensemble steps required (Yu et al., 2024).
EVA: Arithmetic reasoning (GSM8K): +10.61 points over best single; machine translation (Zh→En BLEU): +1.98 over single; consistent performance on NLG and reasoning tasks. Filtering improves robustness by excluding outlier models at each step (Xu et al., 2024).
DeePEn: 2–3 point accuracy gains on MMLU, ARC-C, and GSM8K over strongest single or output-fusion baselines. Ablations reveal critical role of anchor normalization (Huang et al., 2024).
ABE: BLEU and COMET improvements in machine translation for both same- and different-vocab pairs (up to +2.7 BLEU for encoder-decoder+LLM ensembles). Cube-pruned search yields low search overhead with top-N=32–128 (Wicks et al., 28 Feb 2025).
UniTE: Outperforms full-vocab methods (e.g., on GSM8K: LLaMA3+Qwen2: +3.39%; three model: +4.21%), with per-step computation O(Mk), negligible compared with O(M|V|). Over 95% of true tokens present in top-10 union (Yao et al., 2024).
SAFE: Ensembles <20% of tokens, often 1–5% for math, yet outperforms per-token brute-force methods. E.g., MATH500: UniTE@every-token: 59.6% (OOV-induced degradation), SAFE+UniTE: 77.4% (+5.0 points over single) (Yun et al., 17 Oct 2025).
FusionRoute: Route-based framework achieves 56.6% average accuracy across Llama-3 benchmarks (vs. 53.6% fine-tuned baseline, 50.2% prior Collab), with ablations showing loss of 4–5 points if logit addition is ablated. Demonstrates scalability to multiple domain-specialized experts (Xiong et al., 8 Jan 2026).

5. Model Selection, Compatibility, and Failure Modes

Best-practice ensembles require compatibility across several axes:

Performance gap: Gains degrade if models are >10 percentage points apart in accuracy on a given task (empirically confirmed by UniTE and GaC) (Yao et al., 2024, Yu et al., 2024).
Vocabulary size: Vocab size (32k–128k) has negligible direct impact. The majority of generated tokens are “common” words shared identically across tokenizers (Yao et al., 2024, Xu et al., 2024).
Response style: Divergent styles (e.g., chain-of-thought vs. direct answer) cause misaligned output lengths, voting bias, or splitting of correct tokens among several subwords, requiring sharpening or special handling (Yao et al., 2024, Yun et al., 17 Oct 2025).

Recommended selection procedure: start with the highest-accuracy model for the target task, then iteratively add models with an accuracy gap of at most 10 percentage points and an output-length ratio of at most 2× (Yao et al., 2024). SAFE identifies positions where consensus is high or tokenization mismatch is detected, skipping ensemble calculations to avoid degradation (Yun et al., 17 Oct 2025).
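
The selection procedure can be sketched as a simple filter. This is one possible reading, assuming both the accuracy gap and the output-length ratio are measured against the best model; the candidate names and numbers are hypothetical.

```python
def select_members(candidates, max_gap=10.0, max_len_ratio=2.0):
    """Pick ensemble members from (name, accuracy_pct, mean_output_len) tuples."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best = ranked[0]
    chosen = [best]
    for cand in ranked[1:]:
        gap = best[1] - cand[1]  # accuracy gap in percentage points
        ratio = max(cand[2], best[2]) / min(cand[2], best[2])
        if gap <= max_gap and ratio <= max_len_ratio:
            chosen.append(cand)
    return [name for name, _, _ in chosen]


# Hypothetical candidates: (name, task accuracy in %, mean output length in tokens).
pool = [
    ("model-a", 80.0, 200),
    ("model-b", 74.0, 350),
    ("model-c", 65.0, 210),  # rejected: 15-point accuracy gap
    ("model-d", 78.0, 500),  # rejected: 2.5x output-length ratio
]

members = select_members(pool)
print(members)  # ['model-a', 'model-b']
```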

Failure modes:

Strong-weak model mixing can reduce overall accuracy (Yu et al., 2024).
Tokenization-induced “OOV-like” errors if insufficient event-space mapping is performed (Yun et al., 17 Oct 2025).
Nonuniform output style can cause early termination or stalling, particularly in ABE and output-alignment-based approaches (Wicks et al., 28 Feb 2025).
Cube-pruned search overhead in large-vocab or highly divergent model settings, although mitigated with open tokenizers.

6. Theoretical and Practical Efficiency

Token-level ensembling is subject to several computational trade-offs:

Full vocabulary alignment (e.g., GaC, DeePEn): O(M|V|) per token, but possible to parallelize. Prone to high memory use and latency (>250 ms/token for GaC/DeePEn on large models) (Yao et al., 2024).
Top-k union (UniTE): O(Mk) per step, where k ≈ 10–20 suffices to capture >95% of correct tokens. Yields latency ≈ 88 ms/token, only marginally slower than a single model (Yao et al., 2024).
SAFE and Key-token Gating: Ensemble triggered only where necessary, typically under 10% of steps; remaining steps require only single-model forward passes. Yields latency ≈ 60–88 ms/token (Yu et al., 2024, Yun et al., 17 Oct 2025).
Anchor-based mapping (DeePEn): Anchor matrices can be precomputed, but the search-based inverse step requires 5–10 gradient iterations per output token (Huang et al., 2024).
Cube-pruned ABE: Search heap operations constrained by a small N (beam width), limiting runtime overhead (Wicks et al., 28 Feb 2025).
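
A back-of-the-envelope comparison of these costs, using figures quoted in this article (two models, a 128k vocabulary, k = 10, and ensembling on 10% of steps). The functions are hypothetical helpers that count only aggregation entries and forward passes, not real latency.

```python
def aggregation_ops(num_models, events_per_model):
    """Entries touched per step when aggregating: O(M * events)."""
    return num_models * events_per_model


def mean_forward_passes(num_models, ensemble_fraction):
    """Average forward passes per token when only some steps are ensembled."""
    return 1 + (num_models - 1) * ensemble_fraction


full_vocab_ops = aggregation_ops(2, 128_000)  # O(M|V|)
top_k_ops = aggregation_ops(2, 10)            # O(Mk)
gated_passes = mean_forward_passes(2, 0.10)   # versus 2.0 without gating

print(full_vocab_ops, top_k_ops, full_vocab_ops // top_k_ops)  # 256000 20 12800
print(round(gated_passes, 2))  # 1.1
```

Under these assumptions the top-k union touches about four orders of magnitude fewer entries per step, and gating brings the average number of forward passes close to that of a single model.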

Most training-free methods allow plug-and-play deployment across diverse LLMs, supporting vLLM, DeepSpeed, and quantized models, assuming tokenizers have broad coverage (open or byte-level) and no fundamental OOV restrictions.

7. Future Directions, Limitations, and Unsolved Problems

While heterogeneous-vocab token-level ensembles have overcome many prior limitations, open challenges remain:

Global quality/robustness: If one ensemble member is of much lower quality, its probabilities can unduly distort the aggregate output; more adaptive weighting or filtering remains an open area (Yun et al., 17 Oct 2025).
Generality to free-form tasks: Methods such as ABE have demonstrated success in machine translation (narrow output space), while broader question answering, summarization, or dialog remains more challenging due to sparser agreement and potential stalling (Wicks et al., 28 Feb 2025).
Sampling-based decoding: Most frameworks focus on greedy or other deterministic selection. Robust, fair multinomial sampling over the “agreed” event space is poorly understood and remains a future direction (Wicks et al., 28 Feb 2025, Yu et al., 2024).
Cache synchronization: Efficient KV-cache management in partial/gated ensembles is a practical complexity for frameworks such as SAFE (Yun et al., 17 Oct 2025).
Scaling to many-way ensembles: While the algorithms generalize to $n \gg 2$ models, search and alignment costs grow with $n$, requiring advanced pruning or adaptive routing (as in FusionRoute).

In sum, token-level heterogeneous-vocab ensemble methods enable robust, stepwise aggregation of LLM predictions by constructing shared event spaces (over vocabulary unions, semantic anchor projections, or surface-form agreement) and selecting adaptive points at which to ensemble based on confidence, consensus, and tokenization compatibility. The cited work reports that these frameworks exceed single-model performance and support high-throughput deployment with minimal or no additional training, which makes them a practical basis for combining diverse LLMs in open-source and industrial settings (Yu et al., 2024, Xu et al., 2024, Yun et al., 17 Oct 2025, Xiong et al., 8 Jan 2026, Yao et al., 2024, Huang et al., 2024, Wicks et al., 28 Feb 2025).