Token Merging in Transformers
- Token merging is a technique that condenses redundant or similar tokens to reduce sequence length and computational cost.
- It employs methods such as cosine similarity, clustering, and adaptive scheduling to maintain attention quality and scalability.
- Applications span vision, language, time series, and multimodal tasks, achieving significant throughput gains with minimal accuracy loss.
Token merging is a class of architectural and inference-time techniques for reducing the effective sequence length in transformer and state-space models, primarily to accelerate computation and decrease memory and energy consumption. By adaptively collapsing redundant or semantically similar tokens into fewer representative tokens, these methods can cut the quadratic complexity of attention and enable efficient, scalable application to vision, language, sequence, and multimodal tasks with minimal accuracy loss. Token merging has rapidly evolved beyond early heuristic approaches, now encompassing energy- and importance-aware algorithms, integration with quantization and clustering, local and global policies, and domain-specific variants for dense prediction, time series, and code.
1. Mathematical Principles and Core Algorithms
Token merging operates by combining subsets of tokens in sequence models into single “super-tokens,” reducing sequence length and hence the cost of subsequent layers. The core components across most approaches are:
- Similarity Computation: Tokens are mapped to an embedding or key space, and similarity is measured via cosine similarity, dot product, or other learned metrics. For example, ToMe scores token pairs by the cosine similarity of their attention keys, $\mathrm{sim}(i, j) = \hat{k}_i \cdot \hat{k}_j$, where $\hat{k}_i, \hat{k}_j$ are normalized key vectors (Bolya et al., 2022).
- Matching Strategy: Pairs or groups of similar tokens are determined via matching algorithms. Classic bipartite soft matching partitions tokens into two groups (e.g., “even” and “odd”) and merges only across groups, while more advanced approaches use hierarchical agglomerative clustering with single, complete, or average linkage to select merge pairs or clusters (Haurum et al., 2024).
- Token Merge Operator: Fused tokens are computed as weighted averages, with options for norm-preserving spherical interpolation (e.g., MLERP in ToFu) or component-wise fusion (e.g., max-per-dimension in CubistMerge) (Kim et al., 2023, Gong et al., 26 Sep 2025).
- Downstream Attention Update: Merged tokens often represent the union of several patches or positions. To account for their increased “mass,” the attention softmax can be modified using proportional attention, i.e., $A = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d} + \log s\right)$, where $s$ tracks the size or multiplicity of each token (Bolya et al., 2022). A code sketch of the matching, merging, and proportional-attention hook follows this list.
- Adaptive and Layerwise Schedules: Merging can be uniformly distributed across layers, concentrated in early/mid/late blocks, or dynamically adapted per layer or per input sequence via multi-objective optimization or importance signals (Erak et al., 11 Sep 2025, Wu et al., 2024).
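The following is a minimal, self-contained sketch of the core pipeline just described: cosine similarity over attention keys, bipartite soft matching, size-weighted merging, and a hook for proportional attention. It is illustrative rather than the reference ToMe implementation; it assumes a single sequence, float “size” counters initialized to 1, and omits protections for special tokens such as the class token.

```python
import torch

def bipartite_soft_matching(keys: torch.Tensor, r: int):
    """keys: (N, d) attention keys of one sequence; r: number of tokens to merge away."""
    k = keys / keys.norm(dim=-1, keepdim=True)          # cosine-normalize the keys
    a, b = k[0::2], k[1::2]                              # alternate split into sets A and B
    scores = a @ b.T                                     # (|A|, |B|) cosine similarities
    best_val, best_dst = scores.max(dim=-1)              # most similar B-partner per A-token
    order = best_val.argsort(descending=True)            # most redundant A-tokens first
    merged_src = order[:r]                               # A-tokens merged into their partner
    kept_src = order[r:]                                 # A-tokens kept unchanged
    return merged_src, best_dst[merged_src], kept_src

def merge(x: torch.Tensor, size: torch.Tensor, merged_src, dst, kept_src):
    """x: (N, d) token features; size: (N,) float multiplicity per token (initially 1)."""
    a, b = x[0::2], x[1::2]
    sa, sb = size[0::2], size[1::2]
    acc = b * sb[:, None]                                # size-weighted sums for set B
    acc = acc.index_add(0, dst, a[merged_src] * sa[merged_src][:, None])
    sb = sb.index_add(0, dst, sa[merged_src])
    b = acc / sb[:, None]                                # merged token = weighted mean of constituents
    x_out = torch.cat([a[kept_src], b], dim=0)
    size_out = torch.cat([sa[kept_src], sb], dim=0)
    return x_out, size_out                               # add log(size_out) to attention logits downstream

# Usage: x_out, s_out = merge(x, torch.ones(x.size(0)), *bipartite_soft_matching(keys, r=16))
```

Applying this per layer shortens the sequence progressively, and adding log(size) to the attention logits implements the proportional attention described above.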
2. Representative Methods and Algorithms
Several algorithmic approaches have been introduced and studied in the literature:
| Method | Merge Criterion | Merge Grouping / Schedule | Notable Feature |
|---|---|---|---|
| ToMe (Bolya et al., 2022) | Cosine similarity (keys) | Bipartite soft matching, fixed # per layer | Fast, highly parallel, no training |
| ATC (Haurum et al., 2024) | Cosine distance | Agglomerative hierarchical clustering | Superior at low keep rates |
| PiToMe (Tran et al., 2024) | Cluster “energy” (graph) | Preserves low-energy tokens, merges high-energy clusters | Spectrum preservation, informative token retention |
| QuickMerge++ (Liu et al., 16 Aug 2025) | Attention entropy | Entropy-based budgeting, AR prior | AR compatible, salience weighted |
| ALGM (Norouzi et al., 2024) | Cosine similarity | Two-stage: local early, global mid | Semantic segmentation, adaptivity |
| CubistMerge (Gong et al., 26 Sep 2025) | Local path graph | 2D reduction, local bipartite | Preserves strict spatial grid |
| DTEM (Lee et al., 2024) | Learned decoupled embedding | Differentiable relaxed matching | End-to-end trainable grouping |
| MergeDNA (Li et al., 17 Nov 2025) | Local window similarity | Hierarchical, context-aware | Dynamic tokenization for DNA |
There exist numerous domain-specific variants, e.g., Co-Me for geometric transformers (confidence-guided) (Chen et al., 18 Nov 2025), A-ToMe for adjacent token merging in speech (Li et al., 2023), and VQ-integrated methods such as MergeVQ for masked image modeling (Li et al., 1 Apr 2025).
3. Integration with Model Architectures and Tasks
Token merging is applicable, with minimal modifications, to a broad range of model families and tasks:
- Vision Transformers and Dense Tasks: Merging is used both for classification (e.g., ViT, DeiT), where global similarity suffices, and for dense prediction (segmentation, object detection), where spatial structure and local/semantic detail must be preserved (Norouzi et al., 2024, Montello et al., 2 Oct 2025).
- Structured and Spatial ViTs: For models relying on windowed attention or 2D positional priors (e.g., Swin, SAM, DINOv3), spatially-structured merging (e.g., CubistMerge: row/column reduction, local matching) is necessary to ensure compatibility with window partitioning and grid-based biases (Gong et al., 26 Sep 2025).
- Language and Code Models: In code, merging subtokens that form a semantic unit (BPE fragments of identifiers) via averaging or an attention-weighted sum compresses sequence length without retraining the backbone (Saad et al., 19 Jul 2025); a toy averaging sketch follows this list.
- Time Series, Genomics, and SSMs: Local window-constrained merging (e.g., in MergeDNA for DNA, or local/causal merging for time series) maintains linear complexity while providing learned, data-dependent compression (Götz et al., 2024, Li et al., 17 Nov 2025, Park et al., 19 Aug 2025).
- Autoregressive and Diffusion Models: Importance-guided merging (via classifier-free guidance in diffusion, or attention entropy in AR transformers) allows dynamic token budgeting and consistent generation quality (Wu et al., 2024, Liu et al., 16 Aug 2025).
- Semantic Communication: Token merging with layerwise, Pareto-optimized budgets enables runtime adaptation to system constraints, such as wireless SNR, and efficient on-device inference (Erak et al., 11 Sep 2025).
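As a concrete illustration of the subtoken-merging idea for code models mentioned above, the toy sketch below averages the embeddings of BPE fragments that belong to the same semantic unit (e.g., one identifier). The grouping input and function name are hypothetical placeholders, and the cited method may instead use attention-weighted sums.

```python
import torch

def merge_subtoken_groups(emb: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
    """emb: (N, d) subtoken embeddings; group_ids: (N,) long tensor assigning each
    subtoken to a semantic unit (e.g., an identifier), numbered 0..G-1."""
    num_groups = int(group_ids.max().item()) + 1
    sums = torch.zeros(num_groups, emb.size(-1), dtype=emb.dtype).index_add(0, group_ids, emb)
    counts = torch.zeros(num_groups, dtype=emb.dtype).index_add(
        0, group_ids, torch.ones(emb.size(0), dtype=emb.dtype))
    return sums / counts[:, None]                        # one averaged token per semantic unit

# Example: subtokens ["get", "User", "Name", "("] of "getUserName(" with
# group_ids = tensor([0, 0, 0, 1]) collapse to two tokens: the identifier and "(".
```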
4. Empirical Benefits and Performance Trade-offs
Token merging consistently delivers substantial reductions in computational cost, memory usage, and latency, with minimal or even positive effects on task performance:
- Throughput Gains: ToMe roughly doubles throughput on ViT-L and ViT-H and delivers substantial speedups on DeiT-S, with only a fraction-of-a-percent accuracy loss (Bolya et al., 2022). ATC offers superior accuracy retention at high merge rates, outperforming ToMe by several percentage points at very low keep rates on NABirds (Haurum et al., 2024).
- Semantic Segmentation Speedup: ALGM improves throughput on ADE20K while maintaining or improving mIoU, with an adaptive speed/accuracy trade-off (Norouzi et al., 2024). Segformer++ achieves a 3–4× speedup on Cityscapes with only a small mIoU drop (Kienzle et al., 2024).
- Fine Detail Preservation: IBTM outperforms ToMeSD in image/video generation, preserving high-information regions and improving FID and LPIPS metrics, especially under aggressive token reduction (Wu et al., 2024).
- Domain-Specific Benefits: In ASR, A-ToMe substantially reduces token count and GPU inference time with negligible WER degradation (Li et al., 2023). MergeDNA markedly reduces the quadratic attention cost and sets new state-of-the-art results on DNA benchmarks (Li et al., 17 Nov 2025). ClustViT lowers GFLOPs and accelerates segmentation with only a minor mIoU loss (Montello et al., 2 Oct 2025).
5. Domain and Task-Specific Variants
Recent research has proposed merging schemes tailored to the constraints of specific model classes:
- SSM-Based Vision Models: MaMe exploits the SSM state-transition Δ as an informativeness measure, penalizing merges across highly informative tokens to maintain sequential modeling fidelity (Park et al., 19 Aug 2025).
- Dense Prediction and Segmentation: Two-stage and semantically-supervised merges (ALGM, ClustViT) combine local or mask-guided clustering with unmerging or regeneration steps, reliably preserving boundary detail and spatial coverage while accelerating computation (Norouzi et al., 2024, Montello et al., 2 Oct 2025).
- Spatial ViTs: CubistMerge enforces reduced token grids, local bipartite matching, and component-wise max fusion to maintain compatibility with windowed or RoPE-based models, enabling token count reduction without sacrificing positional bias or spatial structure (Gong et al., 26 Sep 2025). A toy grid-fusion sketch follows this list.
- Genomics and Long Sequences: MergeDNA stacks differentiable local merging layers to induce a data-driven, dynamic tokenizer, with joint sequence chunking and pretraining under merged token reconstruction and adaptive masking objectives (Li et al., 17 Nov 2025).
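As a toy illustration of spatially structured, component-wise max fusion in the spirit of CubistMerge, the sketch below halves a 2D token grid along its width by fusing horizontally adjacent pairs, so the output remains a regular grid usable by windowed attention or 2D positional priors. The fixed pairing rule and reduction axis are simplifying assumptions; the actual method selects merge partners via similarity-based local bipartite matching.

```python
import torch

def max_fuse_width(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (H, W, d) grid of token features with even W; returns an (H, W//2, d) grid."""
    H, W, d = tokens.shape
    pairs = tokens.reshape(H, W // 2, 2, d)              # group horizontally adjacent pairs
    return pairs.max(dim=2).values                       # component-wise max fusion per pair
```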
6. Limitations, Challenges, and Future Directions
While token merging is widely effective, several open issues and limitations remain:
- Information Loss at High Merge Rates: Aggressive merging can induce noticeable performance degradation, especially in tasks requiring fine-grained details (e.g., detailed reconstruction, dense segmentation). Spectral analyses (PiToMe, MergeDNA) and norm-preserving fusion (MLERP) aim to mitigate this (Tran et al., 2024, Li et al., 17 Nov 2025, Kim et al., 2023).
- Compatibility with Non-Standard Architectures: Some spatial ViT variants or autoregressive decoders require specially structured merging to maintain attention mask or spatial layout invariants (Gong et al., 26 Sep 2025, Liu et al., 16 Aug 2025).
- Dynamic or Learned Budgets: Most methods use static schedules or hyperparameters; active research investigates Bayesian optimization of per-layer budgets (edge ViTs, semantic communication) and adaptive policies based on input statistics (entropy, attention, importance) (Erak et al., 11 Sep 2025, Liu et al., 16 Aug 2025, Wu et al., 2024). A hedged entropy-budget sketch follows this list.
- Extensibility to Arbitrary Modalities: Existing merging relies primarily on similarity in key or embedding space; several works propose integrating geometric, spatial, or downstream-task priors (ToSA with spatial tokens; ClustViT with pseudo-cluster labels) (Huang et al., 24 Jun 2025, Montello et al., 2 Oct 2025).
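As one hedged illustration of an input-adaptive policy of the kind discussed above, the sketch below derives a merge budget from attention entropy, loosely inspired by entropy-based budgeting in QuickMerge++. The entropy-to-budget mapping and parameter names are assumptions for illustration, not a published formula.

```python
import math
import torch

def entropy_merge_budget(attn: torch.Tensor, n_tokens: int,
                         min_ratio: float = 0.1, max_ratio: float = 0.5) -> int:
    """attn: (heads, N, N) post-softmax attention; returns the number of tokens to merge."""
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1)   # per-query attention entropy
    norm = (entropy.mean() / math.log(attn.size(-1))).item()     # normalized to [0, 1]
    ratio = max_ratio - (max_ratio - min_ratio) * norm           # peaked attention -> merge more
    return int(ratio * n_tokens)
```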
7. Impact and Research Trajectory
Token merging has rapidly transitioned from a simple off-the-shelf plug-in (ToMe) to a versatile, domain-adaptive, and theoretically principled ecosystem of algorithms, with strong empirical utility across vision, language, speech, time series, genomics, and generative modeling. Research trends point toward:
- End-to-end differentiable and feature-decoupled merging (Lee et al., 2024)
- Energy and importance-aware policies preserving task-critical tokens (Tran et al., 2024, Wu et al., 2024)
- Integration with quantization, clustering, and task-specific regeneration (Li et al., 1 Apr 2025, Montello et al., 2 Oct 2025)
- Robustness to extreme compression and application to resource-constrained deployment (Erak et al., 11 Sep 2025, Wu et al., 2024)
- Application to adaptive and hierarchical tokenization in non-canonical domains (Li et al., 17 Nov 2025, Götz et al., 2024)
This convergence of efficient inference, spectral/structural preservation, and downstream task robustness positions token merging as a central technology in the next generation of efficient transformer and sequence models.