Dynamic Token Merging in MrT5
- The paper introduces a learnable deletion gate that prunes and merges tokens early in the encoder to enhance computational efficiency.
- Deletion is applied after an empirically chosen encoder layer (layer 3 by default), achieving up to 80% token deletion with only a small increase in loss.
- Empirical results demonstrate significant inference speedups and robust multilingual performance with negligible increases in cross-entropy.
Dynamic Token Merging (MrT5) refers to a mechanism for improving computational efficiency in byte-level LLMs by aggressively reducing sequence length during model processing while retaining modeling fidelity. As implemented in MrT5 (MergeT5), this approach introduces a learnable token deletion (merging) gate into the encoder, dynamically removing tokens at an early encoder stage and merging their information into remaining tokens via multi-head attention. This methodology enables byte-level models to match—often surpass—subword-tokenized architectures in robustness and downstream performance, while recovering much of their computational efficiency.
1. Motivation: Efficiency–Robustness Tradeoff in LLM Tokenization
Subword tokenization, as employed by models like T5 and mT5, maps text to short input sequences drawn from a fixed vocabulary (∼32k types for T5, ∼250k for mT5), affording efficient Transformer computation and strong downstream task performance. However, such tokenization is brittle to character-level noise (e.g., spelling errors, adversarial edits) and imposes inconsistent compression penalties across scripts, especially disadvantaging morphologically rich or non-Latin languages. In contrast, byte-level models like ByT5 operate directly over byte streams (values 0–255), providing noise robustness and consistent coverage, but suffer from sequence inflation: inputs can be up to 5× longer than their subword counterparts, so ByT5 incurs up to 10× longer inference runtimes than mT5 and ∼33% longer pre-training duration. MrT5 directly addresses this limitation, aiming to combine the sequence compression of subword methods with the robustness of byte-level representations by dynamically pruning and merging tokens during model execution (Kallini et al., 2024).
2. MrT5 Encoder Architecture and Token Merging Mechanism
MrT5 adopts the ByT5 Small encoder–decoder configuration (12 encoder layers, 4 decoder layers). The central innovation is the insertion of a deletion gate after a fixed early encoder layer (layer $\ell = 3$ by default), guided by learned token importance scores. The gate outputs a scalar deletion score for each input hidden state, $g_i = k\left(\sigma(\mathbf{w}_g^\top \mathbf{h}_i^{(\ell)} + b_g) - 1\right) \in (-k, 0]$, with weight vector $\mathbf{w}_g$, bias $b_g$, $\sigma$ the sigmoid function, and $k$ a fixed scaling constant; scores near 0 mark tokens to keep, while scores near $-k$ mark tokens for deletion.
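A minimal sketch of how such a gate could be implemented (PyTorch-style; the module name, the default k = 30, and the single linear projection are illustrative assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn

class DeletionGate(nn.Module):
    """Scores each encoder hidden state; scores near 0 mean keep, near -k mean delete."""

    def __init__(self, d_model: int, k: float = 30.0):
        super().__init__()
        self.k = k                          # scaling constant for the score range (-k, 0]
        self.proj = nn.Linear(d_model, 1)   # plays the role of w_g and b_g above

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model), taken from encoder layer l.
        # Returns gate scores of shape (batch, seq_len) in the interval (-k, 0].
        return self.k * (torch.sigmoid(self.proj(hidden_states)).squeeze(-1) - 1.0)
```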
During training, "soft" deletion is realized by injecting as a bias into self-attention logits of layer , nulling attention contributions from tokens targeted for deletion. At inference, token removal is "hard"—all tokens are physically excluded, yielding a compact sequence of length for subsequent layers. Information from deleted tokens is funneled to remaining ones through pre-pruning attention, thus effecting a "merging" of content.
Table: MrT5 Gate Placement Impact
| Gate Layer | Effect on Loss (Given Deletion Ratio) |
|---|---|
| 1–2 | Sharp degradation |
| 3–4 | Nearly optimal loss |
| >4 | Marginal improvement |
Placing the deletion gate at layer 3 balances contextualization with maximized downstream compute savings (Kallini et al., 2024).
3. Training Objectives and Ratio Control
Pre-training employs the span corruption objective mirrored from ByT5/mT5: masking ~15% of the bytes in spans (avg. length 20 bytes), which are replaced by sentinels, then requiring the decoder to reconstruct the masked content. The main loss is the cross-entropy over masked bytes,

$$\mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\bigl(y_t \mid y_{<t}, x\bigr),$$

supplemented by a gate regularizer promoting greater deletion (the mean gate score, driven toward $-k$),

$$\mathcal{L}_{G} = \frac{1}{n} \sum_{i=1}^{n} g_i,$$

forming the total loss

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha\, \mathcal{L}_{G},$$

where $\alpha \ge 0$ mediates the efficiency–fidelity trade-off. Optionally, a proportional controller tunes $\alpha$ online to maintain a target deletion (compression) ratio.
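A compact sketch of the combined objective and the proportional α-update, under the same gate-score convention as above; the use of ignore_index=-100 for non-target positions, the function names, and the gain k_p are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def mrt5_style_loss(lm_logits, labels, gate_scores, alpha):
    """Span-corruption cross-entropy plus the deletion-promoting gate regularizer.
    gate_scores lie in (-k, 0]; a lower mean means more tokens marked for deletion."""
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    gate_reg = gate_scores.mean()          # driven toward -k as deletion increases
    return ce + alpha * gate_reg

def update_alpha(alpha, observed_deletion_ratio, target_deletion_ratio, k_p=1e-4):
    """Proportional controller: if too few tokens are deleted, increase alpha;
    if deletion overshoots the target, decrease it."""
    error = target_deletion_ratio - observed_deletion_ratio
    return max(0.0, alpha + k_p * error)
```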
4. Empirical Results: Pre-training Compression and Downstream Performance
MrT5 achieves substantial compression with minor performance degradation. On English C4:
| Model | Deletion Rate | ΔCross-Entropy (nats) | Inference Speedup |
|---|---|---|---|
| ByT5 | 0% | 0 (0.7805 baseline) | 0% |
| MrT5 (α=0.006) | ~40% | +0.0095 | 22% |
| MrT5 (α=0.008) | ~57% | +0.0145 | 27.5% |
| MrT5 (α=0.01) | ~63% | +0.0195 | 29.6% |
| MrT5 (α=0.014) | ~80% | +0.0295 | 39.9% |
On multilingual mC4, zero-shot English-only MrT5 (α=0.01) deletes 63% of tokens in English (ΔCE +0.020 nats), 40–50% in Latin scripts with minimal loss, and only ~25% in Chinese, demonstrating adaptation to orthography. Multilingual training (across 15 languages, α=0.012) achieves uniform 50–65% deletion and negligible (<0.03 nats) cross-entropy increases (Kallini et al., 2024).
On XNLI (MNLI-finetuned, 15 languages), MrT5 reduces sequence length by 53% with a 38% runtime reduction. Accuracy on English increases by 2.4 points over ByT5 (78.88% vs. 76.47%), while multilingual average drops 1.7 points (49.63% vs. 51.34%). For character-level tasks (contextual spelling correction, word search), MrT5 provides 30–55% runtime speed-up with ≤2 point accuracy loss.
5. Analysis: Deletion Dynamics and Efficiency
MrT5's deletion gate demonstrates context-sensitive pruning: on 1,000 English sentences, per-sample loss increases are uncorrelated (r = −0.014) with deletion rate, compared to a positive correlation (r ≈ +0.30) for random deletion. This indicates adaptive, non-heuristic token selection. Moving the gate earlier than layer 3 is empirically detrimental, while later placements yield marginal gains.
With a deletion rate around 60% and gate at layer 3, total encoder–decoder compute is reduced by approximately 50%; even higher compression rates permit up to ~3× speed-up in principle. Information merging occurs via attention redistribution from deleted to retained tokens, preserving key context while discarding redundancy (Kallini et al., 2024).
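As a rough illustration of where the ~50% figure comes from, here is a back-of-envelope sketch that treats every encoder layer as equal cost and splits each layer into a length-linear part (feed-forward, projections) and a length-quadratic part (self-attention); the 2:1 split between the two parts and the omission of decoder cross-attention savings are simplifying assumptions:

```python
def encoder_cost(keep_fraction, gate_layer=3, num_layers=12, linear_share=2 / 3):
    """Relative encoder compute with hard deletion after `gate_layer`
    (1.0 = full-length ByT5-style encoder, equal cost per layer)."""
    full = gate_layer                                  # layers that see the full length n
    short = num_layers - gate_layer                    # layers that see keep_fraction * n
    linear_part = (full + short * keep_fraction) / num_layers           # FFN, projections
    quadratic_part = (full + short * keep_fraction ** 2) / num_layers   # self-attention
    return linear_share * linear_part + (1 - linear_share) * quadratic_part

# ~60% deletion (keep 40% of tokens), gate at layer 3:
print(round(encoder_cost(0.4), 2))   # -> 0.49, i.e. roughly half the encoder compute
```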
6. Relationship to Broader Dynamic Tokenization and Related Approaches
Dynamic token merging as in MrT5 is architecturally distinct from BPE-style dynamic tokenization (Feher et al., 2024) and token selection methods such as QuickMerge++ (Liu et al., 16 Aug 2025):
- Retrofitting models with batch-level dynamic BPE merging (Feher et al., 2024) achieves mean sequence length reductions of >20% across 14 languages at <2 point performance drop, leveraging pre-trained hypernetworks for embedding unseen merged tokens.
- QuickMerge++ (Liu et al., 16 Aug 2025) applies attention-norm-driven token saliency estimation and entropy-based per-example budgets, followed by differentiable merging and (optionally) a lightweight autoregressive prior, yielding up to 4× speed-up and negligible quality loss. The same idea transfers to MrT5 by inserting an inference-time, entropy/norm-guided merger after the encoder (requiring no retraining of the backbone), and it can further compress cross-attention or decoder KV caches; a schematic sketch follows below.
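By way of illustration, a schematic, training-free token-merging pass in the spirit of the norm-guided criterion (this is not the QuickMerge++ reference implementation: the fixed keep_fraction stands in for its entropy-based budget, the hidden-state norm stands in for attention-norm saliency, and all names are assumptions):

```python
import torch
import torch.nn.functional as F

def norm_guided_merge(hidden_states: torch.Tensor, keep_fraction: float = 0.5) -> torch.Tensor:
    """Rank tokens by hidden-state norm, keep the top fraction, and fold each
    dropped token into its most similar kept token by running average."""
    # hidden_states: (seq_len, d_model) for a single example.
    seq_len, _ = hidden_states.shape
    n_keep = max(1, int(seq_len * keep_fraction))
    saliency = hidden_states.norm(dim=-1)                        # (seq_len,)
    keep_idx = saliency.topk(n_keep).indices.sort().values       # preserve original order
    keep_set = set(keep_idx.tolist())
    drop_idx = [i for i in range(seq_len) if i not in keep_set]

    merged = hidden_states[keep_idx].clone()                     # (n_keep, d_model)
    if drop_idx:
        # Assign each dropped token to its nearest kept token by cosine similarity.
        sims = F.cosine_similarity(hidden_states[drop_idx][:, None, :],
                                   hidden_states[keep_idx][None, :, :], dim=-1)
        assign = sims.argmax(dim=-1)                             # (num_dropped,)
        counts = torch.ones(n_keep)
        for d, a in zip(drop_idx, assign.tolist()):
            merged[a] = (merged[a] * counts[a] + hidden_states[d]) / (counts[a] + 1)
            counts[a] += 1
    return merged
```

In an encoder–decoder setting, such a pass would run once on the final encoder states, shrinking the keys and values that the decoder cross-attends to.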
A summary of comparative aspects:
| Method | Deletion/Merge Criterion | Learnable? | Model Re-training |
|---|---|---|---|
| MrT5 | Learned gate on context | Yes | Yes |
| Dyn. Tokenizer | BPE-style, batch statistics | No (per run) | No (adapter only) |
| QuickMerge++ | Attention norm/entropy | No (inference) | Optionally AR prior |
MrT5 is unique in integrating token merging into model training via a differentiable policy, whereas batch dynamic BPE and QuickMerge++ primarily operate at runtime.
7. Implications, Limitations, and Further Directions
MrT5 recovers a significant fraction of subword model efficiency (20–40% inference speed-ups on pre-training data, 30–55% on downstream tasks) while retaining the robustness and script invariance of byte-level modeling. Multilingual MrT5 automatically learns language-specific deletion rates reflecting script and morphological complexity. The compute–accuracy trade-off is directly tunable via the regularizer weight α.
A plausible implication is that dynamic, learned token merging enables byte-level models to serve as competitive, fairness-aligned universal encoders, closing the gap with subword-tokenized approaches especially for non-Latin languages. Limitations include possible downstream score deficit for highly truncated sequences, sensitivity to gate placement, and the model's reliance on the ability of attention layers to successfully merge information prior to deletion.
Dynamic token merging remains an active area, with variants such as QuickMerge++ suggesting potential for further efficiency gains and broader application across modalities via inference-time, attention-guided compression and integration with lightweight autoregressive priors (Liu et al., 16 Aug 2025). Cross-comparison with per-batch dynamic BPE merging (Feher et al., 2024) highlights complementary axes of compression—future directions could include synergistic integration or hierarchical compression schedules.
References
- "MrT5: Dynamic Token Merging for Efficient Byte-level LLMs" (Kallini et al., 2024)
- "Retrofitting LLMs with Dynamic Tokenization" (Feher et al., 2024)
- "QuickMerge++: Fast Token Merging with Autoregressive Prior" (Liu et al., 16 Aug 2025)