Token Merging in Transformers
- Token merging is a technique that condenses redundant or similar tokens to reduce sequence length and computational cost.
- It employs methods such as cosine similarity, clustering, and adaptive scheduling to maintain attention quality and scalability.
- Applications span vision, language, time series, and multimodal tasks, achieving significant throughput gains with minimal accuracy loss.
Token merging is a class of architectural and inference-time techniques for reducing the effective sequence length in transformer and state-space models, primarily to accelerate computation and decrease memory and energy consumption. By adaptively collapsing redundant or semantically similar tokens into fewer representative tokens, these methods can cut the quadratic complexity of attention and enable efficient, scalable application to vision, language, sequence, and multimodal tasks with minimal accuracy loss. Token merging has rapidly evolved beyond early heuristic approaches, now encompassing energy- and importance-aware algorithms, integration with quantization and clustering, local and global policies, and domain-specific variants for dense prediction, time series, and code.
1. Mathematical Principles and Core Algorithms
Token merging operates by combining subsets of tokens in sequence models into single “super-tokens,” reducing sequence length and hence the cost of subsequent layers. The core components across most approaches are:
- Similarity Computation: Tokens are mapped to an embedding or key space, and similarity is measured via cosine similarity, dot product, or other learned metrics. For example, ToMe scores token pairs by the cosine similarity of their attention keys, $\mathrm{sim}(i, j) = \hat{k}_i \cdot \hat{k}_j$, where $\hat{k}_i, \hat{k}_j$ are normalized key vectors (Bolya et al., 2022).
- Matching Strategy: Pairs or groups of similar tokens are determined via matching algorithms. Classic bipartite soft matching partitions tokens into two groups (e.g., “even” and “odd”) and merges only across groups, while more advanced approaches use hierarchical agglomerative clustering with single, complete, or average linkage to select merge pairs or clusters (Haurum et al., 2024).
- Token Merge Operator: Fused tokens are computed as weighted averages, with options for norm-preserving spherical interpolation (e.g., MLERP in ToFu) or component-wise fusion (e.g., max-per-dimension in CubistMerge) (Kim et al., 2023, Gong et al., 26 Sep 2025).
- Downstream Attention Update: Merged tokens often represent the union of several patches or positions. To account for their increased “mass,” the attention softmax can be modified using proportional attention, i.e., $A = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d} + \log s\right)$, where $s$ tracks the size or multiplicity of each token (Bolya et al., 2022). A code sketch of the matching, merging, and proportional-attention hook follows this list.
- Adaptive and Layerwise Schedules: Merging can be uniformly distributed across layers, concentrated in early/mid/late blocks, or dynamically adapted per layer or per input sequence via multi-objective optimization or importance signals (Erak et al., 11 Sep 2025, Wu et al., 2024).
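The following is a minimal, self-contained sketch of the core pipeline just described: cosine similarity over attention keys, bipartite soft matching, size-weighted merging, and a hook for proportional attention. It is illustrative rather than the reference ToMe implementation; it assumes a single sequence, float “size” counters initialized to 1, and omits protections for special tokens such as the class token.

```python
import torch

def bipartite_soft_matching(keys: torch.Tensor, r: int):
    """keys: (N, d) attention keys of one sequence; r: number of tokens to merge away."""
    k = keys / keys.norm(dim=-1, keepdim=True)          # cosine-normalize the keys
    a, b = k[0::2], k[1::2]                              # alternate split into sets A and B
    scores = a @ b.T                                     # (|A|, |B|) cosine similarities
    best_val, best_dst = scores.max(dim=-1)              # most similar B-partner per A-token
    order = best_val.argsort(descending=True)            # most redundant A-tokens first
    merged_src = order[:r]                               # A-tokens merged into their partner
    kept_src = order[r:]                                 # A-tokens kept unchanged
    return merged_src, best_dst[merged_src], kept_src

def merge(x: torch.Tensor, size: torch.Tensor, merged_src, dst, kept_src):
    """x: (N, d) token features; size: (N,) float multiplicity per token (initially 1)."""
    a, b = x[0::2], x[1::2]
    sa, sb = size[0::2], size[1::2]
    acc = b * sb[:, None]                                # size-weighted sums for set B
    acc = acc.index_add(0, dst, a[merged_src] * sa[merged_src][:, None])
    sb = sb.index_add(0, dst, sa[merged_src])
    b = acc / sb[:, None]                                # merged token = weighted mean of constituents
    x_out = torch.cat([a[kept_src], b], dim=0)
    size_out = torch.cat([sa[kept_src], sb], dim=0)
    return x_out, size_out                               # add log(size_out) to attention logits downstream

# Usage: x_out, s_out = merge(x, torch.ones(x.size(0)), *bipartite_soft_matching(keys, r=16))
```

Applying this per layer shortens the sequence progressively, and adding log(size) to the attention logits implements the proportional attention described above.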
2. Representative Methods and Algorithms
Several algorithmic approaches have been introduced and studied in the literature:
| Method | Merge Criterion | Merge Grouping / Schedule | Notable Feature |
|---|---|---|---|
| ToMe (Bolya et al., 2022) | Cosine similarity (keys) | Bipartite soft matching, fixed # per layer | Fast, highly parallel, no training |
| ATC (Haurum et al., 2024) | Cosine distance | Agglomerative hierarchical clustering | Superior at low keep rates |
| PiToMe (Tran et al., 2024) | Cluster “energy” (graph) | Preserves low-energy tokens, merges high-energy clusters | Spectrum preservation, informative token retention |
| QuickMerge++ (Liu et al., 16 Aug 2025) | Attention entropy | Entropy-based budgeting, AR prior | AR compatible, salience weighted |
| ALGM (Norouzi et al., 2024) | Cosine similarity | Two-stage: local early, global mid | Semantic segmentation, adaptivity |
| CubistMerge (Gong et al., 26 Sep 2025) | Local path graph | 2D reduction, local bipartite | Preserves strict spatial grid |
| DTEM (Lee et al., 2024) | Learned decoupled embedding | Differentiable relaxed matching | End-to-end trainable grouping |
| MergeDNA (Li et al., 17 Nov 2025) | Local window similarity | Hierarchical, context-aware | Dynamic tokenization for DNA |
There exist numerous domain-specific variants, e.g., Co-Me for geometric transformers (confidence-guided) (Chen et al., 18 Nov 2025), A-ToMe for adjacent token merging in speech (Li et al., 2023), and VQ-integrated methods such as MergeVQ for masked image modeling (Li et al., 1 Apr 2025).
3. Integration with Model Architectures and Tasks
Token merging is applicable, with minimal modifications, to a broad range of model families and tasks:
- Vision Transformers and Dense Tasks: Merging is used both for classification (e.g., ViT, DeiT), where global similarity suffices, and for dense prediction (segmentation, object detection), where spatial structure and local/semantic detail must be preserved (Norouzi et al., 2024, Montello et al., 2 Oct 2025).
- Structured and Spatial ViTs: For models relying on windowed attention or 2D positional priors (e.g., Swin, SAM, DINOv3), spatially-structured merging (e.g., CubistMerge: row/column reduction, local matching) is necessary to ensure compatibility with window partitioning and grid-based biases (Gong et al., 26 Sep 2025).
- Language and Code Models: In code, merging subtokens that form a semantic unit (BPE fragments of identifiers) via averaging or an attention-weighted sum compresses sequence length without retraining the backbone (Saad et al., 19 Jul 2025); a toy averaging sketch follows this list.
- Time Series, Genomics, and SSMs: Local window-constrained merging (e.g., in MergeDNA for DNA, or local/causal merging for time series) maintains linear complexity while providing learned, data-dependent compression (Götz et al., 2024, Li et al., 17 Nov 2025, Park et al., 19 Aug 2025).
- Autoregressive and Diffusion Models: Importance-guided merging (via classifier-free guidance in diffusion, or attention entropy in AR transformers) allows dynamic token budgeting and consistent generation quality (Wu et al., 2024, Liu et al., 16 Aug 2025).
- Semantic Communication: Token merging with layerwise, Pareto-optimized budgets enables runtime adaptation to system constraints, such as wireless SNR, and efficient on-device inference (Erak et al., 11 Sep 2025).
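As a concrete illustration of the subtoken-merging idea for code models mentioned above, the toy sketch below averages the embeddings of BPE fragments that belong to the same semantic unit (e.g., one identifier). The grouping input and function name are hypothetical placeholders, and the cited method may instead use attention-weighted sums.

```python
import torch

def merge_subtoken_groups(emb: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
    """emb: (N, d) subtoken embeddings; group_ids: (N,) long tensor assigning each
    subtoken to a semantic unit (e.g., an identifier), numbered 0..G-1."""
    num_groups = int(group_ids.max().item()) + 1
    sums = torch.zeros(num_groups, emb.size(-1), dtype=emb.dtype).index_add(0, group_ids, emb)
    counts = torch.zeros(num_groups, dtype=emb.dtype).index_add(
        0, group_ids, torch.ones(emb.size(0), dtype=emb.dtype))
    return sums / counts[:, None]                        # one averaged token per semantic unit

# Example: subtokens ["get", "User", "Name", "("] of "getUserName(" with
# group_ids = tensor([0, 0, 0, 1]) collapse to two tokens: the identifier and "(".
```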
4. Empirical Benefits and Performance Trade-offs
Token merging consistently delivers substantial reductions in computational cost, memory usage, and latency, with minimal or even positive effects on task performance:
- Throughput Gains: ToMe roughly doubles throughput on ViT-L and ViT-H and delivers substantial speedups on DeiT-S, with only a fraction-of-a-percent accuracy loss (Bolya et al., 2022). ATC offers superior accuracy retention at high merge rates, outperforming ToMe by several percentage points at very low keep rates on NABirds (Haurum et al., 2024).
- Semantic Segmentation Speedup: ALGM improves throughput on ADE20K while maintaining or improving mIoU, with an adaptive speed/accuracy trade-off (Norouzi et al., 2024). Segformer++ achieves a 3–4× speedup on Cityscapes with only a small mIoU drop (Kienzle et al., 2024).
- Fine Detail Preservation: IBTM outperforms ToMeSD in image/video generation, preserving high-information regions and improving FID and LPIPS metrics, especially under aggressive token reduction (Wu et al., 2024).
- Domain-Specific Benefits: In ASR, A-ToMe substantially reduces token count and GPU inference time with negligible WER degradation (Li et al., 2023). MergeDNA markedly reduces the quadratic attention cost and sets new state-of-the-art results on DNA benchmarks (Li et al., 17 Nov 2025). ClustViT lowers GFLOPs and accelerates segmentation with only a minor mIoU loss (Montello et al., 2 Oct 2025).
5. Domain and Task-Specific Variants
Recent research has proposed merging schemes tailored to the constraints of specific model classes:
- SSM-Based Vision Models: MaMe exploits the SSM state-transition Δ as an informativeness measure, penalizing merges across highly informative tokens to maintain sequential modeling fidelity (Park et al., 19 Aug 2025).
- Dense Prediction and Segmentation: Two-stage and semantically-supervised merges (ALGM, ClustViT) combine local or mask-guided clustering with unmerging or regeneration steps, reliably preserving boundary detail and spatial coverage while accelerating computation (Norouzi et al., 2024, Montello et al., 2 Oct 2025).
- Spatial ViTs: CubistMerge enforces reduced token grids, local bipartite matching, and component-wise max fusion to maintain compatibility with windowed or RoPE-based models, enabling token count reduction without sacrificing positional bias or spatial structure (Gong et al., 26 Sep 2025). A toy grid-fusion sketch follows this list.
- Genomics and Long Sequences: MergeDNA stacks differentiable local merging layers to induce a data-driven, dynamic tokenizer, with joint sequence chunking and pretraining under merged token reconstruction and adaptive masking objectives (Li et al., 17 Nov 2025).
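As a toy illustration of spatially structured, component-wise max fusion in the spirit of CubistMerge, the sketch below halves a 2D token grid along its width by fusing horizontally adjacent pairs, so the output remains a regular grid usable by windowed attention or 2D positional priors. The fixed pairing rule and reduction axis are simplifying assumptions; the actual method selects merge partners via similarity-based local bipartite matching.

```python
import torch

def max_fuse_width(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (H, W, d) grid of token features with even W; returns an (H, W//2, d) grid."""
    H, W, d = tokens.shape
    pairs = tokens.reshape(H, W // 2, 2, d)              # group horizontally adjacent pairs
    return pairs.max(dim=2).values                       # component-wise max fusion per pair
```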
6. Limitations, Challenges, and Future Directions
While token merging is widely effective, several open issues and limitations remain:
- Information Loss at High Merge Rates: Aggressive merging can induce noticeable performance degradation, especially in tasks requiring fine-grained details (e.g., detailed reconstruction, dense segmentation). Spectral analyses (PiToMe, MergeDNA) and norm-preserving fusion (MLERP) aim to mitigate this (Tran et al., 2024, Li et al., 17 Nov 2025, Kim et al., 2023).
- Compatibility with Non-Standard Architectures: Some spatial ViT variants or autoregressive decoders require specially structured merging to maintain attention mask or spatial layout invariants (Gong et al., 26 Sep 2025, Liu et al., 16 Aug 2025).
- Dynamic or Learned Budgets: Most methods use static schedules or hyperparameters; active research investigates Bayesian optimization of per-layer budgets (edge ViTs, semantic communication) and adaptive policies based on input statistics (entropy, attention, importance) (Erak et al., 11 Sep 2025, Liu et al., 16 Aug 2025, Wu et al., 2024). A hedged entropy-budget sketch follows this list.
- Extensibility to Arbitrary Modalities: Existing merging relies primarily on similarity in key or embedding space; several works propose integrating geometric, spatial, or downstream-task priors (ToSA with spatial tokens; ClustViT with pseudo-cluster labels) (Huang et al., 24 Jun 2025, Montello et al., 2 Oct 2025).
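As one hedged illustration of an input-adaptive policy of the kind discussed above, the sketch below derives a merge budget from attention entropy, loosely inspired by entropy-based budgeting in QuickMerge++. The entropy-to-budget mapping and parameter names are assumptions for illustration, not a published formula.

```python
import math
import torch

def entropy_merge_budget(attn: torch.Tensor, n_tokens: int,
                         min_ratio: float = 0.1, max_ratio: float = 0.5) -> int:
    """attn: (heads, N, N) post-softmax attention; returns the number of tokens to merge."""
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1)   # per-query attention entropy
    norm = (entropy.mean() / math.log(attn.size(-1))).item()     # normalized to [0, 1]
    ratio = max_ratio - (max_ratio - min_ratio) * norm           # peaked attention -> merge more
    return int(ratio * n_tokens)
```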
7. Impact and Research Trajectory
Token merging has rapidly transitioned from a simple off-the-shelf plug-in (ToMe) to a versatile, domain-adaptive, and theoretically principled ecosystem of algorithms, with strong empirical utility across vision, language, speech, time series, genomics, and generative modeling. Research trends point toward:
- End-to-end differentiable and feature-decoupled merging (Lee et al., 2024)
- Energy and importance-aware policies preserving task-critical tokens (Tran et al., 2024, Wu et al., 2024)
- Integration with quantization, clustering, and task-specific regeneration (Li et al., 1 Apr 2025, Montello et al., 2 Oct 2025)
- Robustness to extreme compression and application to resource-constrained deployment (Erak et al., 11 Sep 2025, Wu et al., 2024)
- Application to adaptive and hierarchical tokenization in non-canonical domains (Li et al., 17 Nov 2025, Götz et al., 2024)
This convergence of efficient inference, spectral/structural preservation, and downstream task robustness positions token merging as a central technology in the next generation of efficient transformer and sequence models.