Dynamic Token Merging in MrT5
- The paper introduces a learnable deletion gate that prunes and merges tokens early in the encoder to enhance computational efficiency.
- Deletion is applied after an empirically chosen encoder layer (layer 3 by default), achieving up to 80% token deletion with only a small increase in loss.
- Empirical results demonstrate significant inference speedups and robust multilingual performance with negligible increases in cross-entropy.
Dynamic Token Merging (MrT5) refers to a mechanism for improving computational efficiency in byte-level LLMs by aggressively reducing sequence length during model processing while retaining modeling fidelity. As implemented in MrT5 (MergeT5), this approach introduces a learnable token deletion (merging) gate into the encoder, dynamically removing tokens at an early encoder stage and merging their information into remaining tokens via multi-head attention. This methodology enables byte-level models to match—often surpass—subword-tokenized architectures in robustness and downstream performance, while recovering much of their computational efficiency.
1. Motivation: Efficiency–Robustness Tradeoff in LLM Tokenization
Subword tokenization, as employed by models like T5 and mT5, maps text to short input sequences drawn from a fixed vocabulary (∼32k types for T5, ∼250k for mT5), affording efficient Transformer computation and strong downstream task performance. However, such tokenization is brittle to character-level noise (e.g., spelling errors, adversarial edits) and imposes inconsistent compression penalties across scripts, especially disadvantaging morphologically rich or non-Latin languages. In contrast, byte-level models like ByT5 operate directly over byte streams (values 0–255), providing noise robustness and consistent coverage, but suffer from sequence inflation: inputs can be up to 5× longer than their subword counterparts, so ByT5 incurs up to 10× longer inference runtimes than mT5 and ∼33% longer pre-training duration. MrT5 directly addresses this limitation, aiming to combine the sequence compression of subword methods with the robustness of byte-level representations by dynamically pruning and merging tokens during model execution (Kallini et al., 2024).
2. MrT5 Encoder Architecture and Token Merging Mechanism
MrT5 adopts the ByT5 Small encoder–decoder configuration (12 encoder layers, 4 decoder layers). The central innovation is the insertion of a deletion gate after a fixed early encoder layer (layer $\ell = 3$ by default), guided by learned token importance scores. The gate outputs a scalar deletion score for each input hidden state, $g_i = k\left(\sigma(\mathbf{w}_g^\top \mathbf{h}_i^{(\ell)} + b_g) - 1\right) \in (-k, 0]$, with weight vector $\mathbf{w}_g$, bias $b_g$, $\sigma$ the sigmoid function, and $k$ a fixed scaling constant; scores near 0 mark tokens to keep, while scores near $-k$ mark tokens for deletion.
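A minimal sketch of how such a gate could be implemented (PyTorch-style; the module name, the default k = 30, and the single linear projection are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class DeletionGate(nn.Module):
    """Scores each encoder hidden state; scores near 0 mean keep, near -k mean delete."""

    def __init__(self, d_model: int, k: float = 30.0):
        super().__init__()
        self.k = k                          # scaling constant for the score range (-k, 0]
        self.proj = nn.Linear(d_model, 1)   # plays the role of w_g and b_g above

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model), taken from encoder layer l.
        # Returns gate scores of shape (batch, seq_len) in the interval (-k, 0].
        return self.k * (torch.sigmoid(self.proj(hidden_states)).squeeze(-1) - 1.0)
```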
During training, "soft" deletion is realized by injecting as a bias into self-attention logits of layer , nulling attention contributions from tokens targeted for deletion. At inference, token removal is "hard"—all tokens are physically excluded, yielding a compact sequence of length for subsequent layers. Information from deleted tokens is funneled to remaining ones through pre-pruning attention, thus effecting a "merging" of content.
Table: MrT5 Gate Placement Impact
| Gate Layer | Effect on Loss (Given Deletion Ratio) |
|---|---|
| 1–2 | Sharp degradation |
| 3–4 | Nearly optimal loss |
| >4 | Marginal improvement |
Placing the deletion gate at layer 3 balances contextualization with maximized downstream compute savings (Kallini et al., 2024).
3. Training Objectives and Ratio Control
Pre-training employs the span corruption objective mirrored from ByT5/mT5: masking ~15% of the bytes in spans (avg. length 20 bytes), which are replaced by sentinels, then requiring the decoder to reconstruct the masked content. The main loss is the cross-entropy over masked bytes,

$$\mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\bigl(y_t \mid y_{<t}, x\bigr),$$

supplemented by a gate regularizer promoting greater deletion (the mean gate score, driven toward $-k$),

$$\mathcal{L}_{G} = \frac{1}{n} \sum_{i=1}^{n} g_i,$$

forming the total loss

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha\, \mathcal{L}_{G},$$

where $\alpha \ge 0$ mediates the efficiency–fidelity trade-off. Optionally, a proportional controller tunes $\alpha$ online to maintain a target deletion (compression) ratio.
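A compact sketch of the combined objective and the proportional α-update, under the same gate-score convention as above; the use of ignore_index=-100 for non-target positions, the function names, and the gain k_p are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def mrt5_style_loss(lm_logits, labels, gate_scores, alpha):
    """Span-corruption cross-entropy plus the deletion-promoting gate regularizer.
    gate_scores lie in (-k, 0]; a lower mean means more tokens marked for deletion."""
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    gate_reg = gate_scores.mean()          # driven toward -k as deletion increases
    return ce + alpha * gate_reg

def update_alpha(alpha, observed_deletion_ratio, target_deletion_ratio, k_p=1e-4):
    """Proportional controller: if too few tokens are deleted, increase alpha;
    if deletion overshoots the target, decrease it."""
    error = target_deletion_ratio - observed_deletion_ratio
    return max(0.0, alpha + k_p * error)
```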
4. Empirical Results: Pre-training Compression and Downstream Performance
MrT5 achieves substantial compression with minor performance degradation. On English C4:
| Model | Deletion Rate | ΔCross-Entropy (nats) | Inference Speedup |
|---|---|---|---|
| ByT5 | 0% | 0 (0.7805 baseline) | 0% |
| MrT5 (α=0.006) | ~40% | +0.0095 | 22% |
| MrT5 (α=0.008) | ~57% | +0.0145 | 27.5% |
| MrT5 (α=0.01) | ~63% | +0.0195 | 29.6% |
| MrT5 (α=0.014) | ~80% | +0.0295 | 39.9% |
On multilingual mC4, zero-shot English-only MrT5 (α=0.01) deletes 63% of tokens in English (ΔCE +0.020 nats), 40–50% in Latin scripts with minimal loss, and only ~25% in Chinese, demonstrating adaptation to orthography. Multilingual training (across 15 languages, α=0.012) achieves uniform 50–65% deletion and negligible (<0.03 nats) cross-entropy increases (Kallini et al., 2024).
On XNLI (MNLI-finetuned, 15 languages), MrT5 reduces sequence length by 53% with a 38% runtime reduction. Accuracy on English increases by 2.4 points over ByT5 (78.88% vs. 76.47%), while multilingual average drops 1.7 points (49.63% vs. 51.34%). For character-level tasks (contextual spelling correction, word search), MrT5 provides 30–55% runtime speed-up with ≤2 point accuracy loss.
5. Analysis: Deletion Dynamics and Efficiency
MrT5's deletion gate demonstrates context-sensitive pruning: on 1,000 English sentences, per-sample loss increases are uncorrelated (r = −0.014) with deletion rate, compared to a positive correlation (r ≈ +0.30) for random deletion. This indicates adaptive, non-heuristic token selection. Moving the gate earlier than layer 3 is empirically detrimental, while later placements yield marginal gains.
With a deletion rate around 60% and gate at layer 3, total encoder–decoder compute is reduced by approximately 50%; even higher compression rates permit up to ~3× speed-up in principle. Information merging occurs via attention redistribution from deleted to retained tokens, preserving key context while discarding redundancy (Kallini et al., 2024).
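As a rough illustration of where the ~50% figure comes from, here is a back-of-envelope sketch that treats every encoder layer as equal cost and splits each layer into a length-linear part (feed-forward, projections) and a length-quadratic part (self-attention); the 2:1 split between the two parts and the omission of decoder cross-attention savings are simplifying assumptions:

```python
def encoder_cost(keep_fraction, gate_layer=3, num_layers=12, linear_share=2 / 3):
    """Relative encoder compute with hard deletion after `gate_layer`
    (1.0 = full-length ByT5-style encoder, equal cost per layer)."""
    full = gate_layer                                  # layers that see the full length n
    short = num_layers - gate_layer                    # layers that see keep_fraction * n
    linear_part = (full + short * keep_fraction) / num_layers           # FFN, projections
    quadratic_part = (full + short * keep_fraction ** 2) / num_layers   # self-attention
    return linear_share * linear_part + (1 - linear_share) * quadratic_part

# ~60% deletion (keep 40% of tokens), gate at layer 3:
print(round(encoder_cost(0.4), 2))   # -> 0.49, i.e. roughly half the encoder compute
```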
6. Relationship to Broader Dynamic Tokenization and Related Approaches
Dynamic token merging as in MrT5 is architecturally distinct from BPE-style dynamic tokenization (Feher et al., 2024) and token selection methods such as QuickMerge++ (Liu et al., 16 Aug 2025):
- Retrofitting models with batch-level dynamic BPE merging (Feher et al., 2024) achieves mean sequence length reductions of >20% across 14 languages at <2 point performance drop, leveraging pre-trained hypernetworks for embedding unseen merged tokens.
- QuickMerge++ (Liu et al., 16 Aug 2025) applies attention-norm-driven token saliency estimation and entropy-based per-example budgets, followed by differentiable merging and (optionally) a lightweight autoregressive prior, yielding up to 4× speed-up and negligible quality loss. The same idea transfers to MrT5 by inserting an inference-time, entropy/norm-guided merger after the encoder (requiring no retraining of the backbone), and it can further compress cross-attention or decoder KV caches; a schematic sketch follows below.
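By way of illustration, a schematic, training-free token-merging pass in the spirit of the norm-guided criterion (this is not the QuickMerge++ reference implementation: the fixed keep_fraction stands in for its entropy-based budget, the hidden-state norm stands in for attention-norm saliency, and all names are assumptions):

```python
import torch
import torch.nn.functional as F

def norm_guided_merge(hidden_states: torch.Tensor, keep_fraction: float = 0.5) -> torch.Tensor:
    """Rank tokens by hidden-state norm, keep the top fraction, and fold each
    dropped token into its most similar kept token by running average."""
    # hidden_states: (seq_len, d_model) for a single example.
    seq_len, _ = hidden_states.shape
    n_keep = max(1, int(seq_len * keep_fraction))
    saliency = hidden_states.norm(dim=-1)                        # (seq_len,)
    keep_idx = saliency.topk(n_keep).indices.sort().values       # preserve original order
    keep_set = set(keep_idx.tolist())
    drop_idx = [i for i in range(seq_len) if i not in keep_set]

    merged = hidden_states[keep_idx].clone()                     # (n_keep, d_model)
    if drop_idx:
        # Assign each dropped token to its nearest kept token by cosine similarity.
        sims = F.cosine_similarity(hidden_states[drop_idx][:, None, :],
                                   hidden_states[keep_idx][None, :, :], dim=-1)
        assign = sims.argmax(dim=-1)                             # (num_dropped,)
        counts = torch.ones(n_keep)
        for d, a in zip(drop_idx, assign.tolist()):
            merged[a] = (merged[a] * counts[a] + hidden_states[d]) / (counts[a] + 1)
            counts[a] += 1
    return merged
```

In an encoder–decoder setting, such a pass would run once on the final encoder states, shrinking the keys and values that the decoder cross-attends to.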
A summary of comparative aspects:
| Method | Deletion/Merge Criterion | Learnable? | Model Re-training |
|---|---|---|---|
| MrT5 | Learned gate on context | Yes | Yes |
| Dyn. Tokenizer | BPE-style, batch statistics | No (per run) | No (adapter only) |
| QuickMerge++ | Attention norm/entropy | No (inference) | Optionally AR prior |
MrT5 is unique in integrating token merging into model training via a differentiable policy, whereas batch dynamic BPE and QuickMerge++ primarily operate at runtime.
7. Implications, Limitations, and Further Directions
MrT5 recovers a significant fraction of subword model efficiency (20–40% inference speed-ups on pre-training data, 30–55% on downstream tasks) while retaining the robustness and script invariance of byte-level modeling. Multilingual MrT5 automatically learns language-specific deletion rates reflecting script and morphological complexity. The compute–accuracy trade-off is directly tunable via the regularizer weight α.
A plausible implication is that dynamic, learned token merging enables byte-level models to serve as competitive, fairness-aligned universal encoders, closing the gap with subword-tokenized approaches especially for non-Latin languages. Limitations include possible downstream score deficit for highly truncated sequences, sensitivity to gate placement, and the model's reliance on the ability of attention layers to successfully merge information prior to deletion.
Dynamic token merging remains an active area, with variants such as QuickMerge++ suggesting potential for further efficiency gains and broader application across modalities via inference-time, attention-guided compression and integration with lightweight autoregressive priors (Liu et al., 16 Aug 2025). Cross-comparison with per-batch dynamic BPE merging (Feher et al., 2024) highlights complementary axes of compression—future directions could include synergistic integration or hierarchical compression schedules.
References
- "MrT5: Dynamic Token Merging for Efficient Byte-level LLMs" (Kallini et al., 2024)
- "Retrofitting LLMs with Dynamic Tokenization" (Feher et al., 2024)
- "QuickMerge++: Fast Token Merging with Autoregressive Prior" (Liu et al., 16 Aug 2025)