Token Merging in Transformers

Updated 13 April 2026
  • Token merging is a technique that condenses redundant or similar tokens to reduce sequence length and computational cost.
  • It employs methods such as cosine similarity, clustering, and adaptive scheduling to maintain attention quality and scalability.
  • Applications span vision, language, time series, and multimodal tasks, achieving significant throughput gains with minimal accuracy loss.

Token merging is a class of architectural and inference-time techniques for reducing the effective sequence length in transformer and state-space models, primarily to accelerate computation and decrease memory and energy consumption. By adaptively collapsing redundant or semantically similar tokens into fewer representative tokens, these methods can cut the quadratic complexity of attention and enable efficient, scalable application to vision, language, sequence, and multimodal tasks with minimal accuracy loss. Token merging has rapidly evolved beyond early heuristic approaches, now encompassing energy- and importance-aware algorithms, integration with quantization and clustering, local and global policies, and domain-specific variants for dense prediction, time series, and code.

1. Mathematical Principles and Core Algorithms

Token merging operates by combining subsets of tokens in sequence models into single “super-tokens,” reducing sequence length and hence the cost of subsequent layers. The core components across most approaches are:

  • Similarity Computation: Tokens are mapped to an embedding or key space, and similarity is measured via cosine similarity, dot product, or other learned metrics. For example, ToMe merges tokens $(i, j)$ using $S_{ij} = \cos(k_i, k_j)$, where $k_i$ are normalized key vectors (Bolya et al., 2022).
  • Matching Strategy: Pairs or groups of similar tokens are determined via matching algorithms. Classic bipartite soft matching partitions tokens into two groups (e.g., “even” and “odd”) and merges only across groups, while more advanced approaches use hierarchical agglomerative clustering with single, complete, or average linkage to select merge pairs or clusters (Haurum et al., 2024).
  • Token Merge Operator: Fused tokens are computed as weighted averages, with options for norm-preserving spherical interpolation (e.g., MLERP in ToFu) or component-wise fusion (e.g., max-per-dimension in CubistMerge) (Kim et al., 2023, Gong et al., 26 Sep 2025).
  • Downstream Attention Update: Merged tokens often represent the union of several patches or positions. To account for their increased “mass,” the attention softmax can be modified using proportional attention, i.e., $A = \mathrm{softmax}(QK^{\top}/\sqrt{d} + \log s)$, where $s$ tracks the size or multiplicity of each token (Bolya et al., 2022).
  • Adaptive and Layerwise Schedules: Merging can be uniformly distributed across layers, concentrated in early/mid/late blocks, or dynamically adapted per layer or per input sequence via multi-objective optimization or importance signals (Erak et al., 11 Sep 2025, Wu et al., 2024).
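The similarity, bipartite matching, merge, and proportional-attention steps above can be sketched in a few lines of NumPy. This is an illustrative single-head, unbatched sketch in the spirit of ToMe (Bolya et al., 2022); the function names, the even/odd split, and the size-weighted averaging details are our own simplifications, not the reference implementation:

```python
import numpy as np

def bipartite_soft_merge(x, keys, sizes, r):
    """Merge the r most similar token pairs across an even/odd bipartite split.

    x     : (n, d) token features
    keys  : (n, d) attention keys used as the similarity space
    sizes : (n,)   how many original tokens each current token represents
    r     : number of tokens to remove
    """
    n, d = x.shape
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    a, b = np.arange(0, n, 2), np.arange(1, n, 2)     # partition into sets A and B
    sim = k[a] @ k[b].T                               # cosine similarity, A x B
    best = sim.argmax(axis=1)                         # best partner in B per A token
    score = sim[np.arange(len(a)), best]
    merge = np.argsort(-score)[:r]                    # the r most similar A tokens merge
    keep = np.setdiff1d(np.arange(len(a)), merge)

    xb, sb = x[b].copy(), sizes[b].astype(float).copy()
    for i in merge:                                   # size-weighted average into B
        j, sa = best[i], sizes[a[i]]
        xb[j] = (xb[j] * sb[j] + x[a[i]] * sa) / (sb[j] + sa)
        sb[j] += sa
    return np.concatenate([x[a][keep], xb]), np.concatenate([sizes[a][keep], sb])

def proportional_attention(q, k, v, s):
    """Attention with a +log(s) bias so super-tokens keep their merged mass."""
    logits = q @ k.T / np.sqrt(q.shape[1]) + np.log(s)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ v
```

Merging with a size-weighted average keeps each super-token equal to the mean of the original tokens it represents, and the returned sizes feed the $\log s$ bias in proportional attention downstream.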

2. Representative Methods and Algorithms

Several algorithmic approaches have been introduced and studied in the literature:

| Method | Merge Criterion | Merge Grouping / Schedule | Notable Feature |
|---|---|---|---|
| ToMe (Bolya et al., 2022) | Cosine similarity (keys) | Bipartite soft matching, fixed # per layer | Fast, highly parallel, no training |
| ATC (Haurum et al., 2024) | Cosine distance | Agglomerative hierarchical clustering | Superior at low keep rates |
| PiToMe (Tran et al., 2024) | Cluster “energy” (graph) | Preserves low-energy tokens, merges high-energy clusters | Spectrum preservation, informative token retention |
| QuickMerge++ (Liu et al., 16 Aug 2025) | Attention entropy | Entropy-based budgeting, AR prior | AR compatible, salience weighted |
| ALGM (Norouzi et al., 2024) | Cosine similarity | Two-stage: local early, global mid | Semantic segmentation, adaptivity |
| CubistMerge (Gong et al., 26 Sep 2025) | Local path graph | 2D reduction, local bipartite | Preserves strict spatial grid |
| DTEM (Lee et al., 2024) | Learned decoupled embedding | Differentiable relaxed matching | End-to-end trainable grouping |
| MergeDNA (Li et al., 17 Nov 2025) | Local window similarity | Hierarchical, context-aware | Dynamic tokenization for DNA |

There exist numerous domain-specific variants, e.g., Co-Me for geometric transformers (confidence-guided) (Chen et al., 18 Nov 2025), A-ToMe for adjacent token merging in speech (Li et al., 2023), and VQ-integrated methods such as MergeVQ for masked image modeling (Li et al., 1 Apr 2025).
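As a concrete contrast to bipartite matching, ATC-style merging can be prototyped directly with SciPy's hierarchical clustering. The cosine distance and the single/complete/average linkage options follow Haurum et al. (2024); the function itself is an illustrative sketch under those assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def agglomerative_merge(x, n_keep, method="average"):
    """Cluster n tokens down to at most n_keep clusters and average each cluster.

    x      : (n, d) token features
    n_keep : target number of retained super-tokens
    method : linkage criterion ("single", "complete", or "average")
    """
    z = linkage(x, method=method, metric="cosine")          # full merge tree
    labels = fcluster(z, t=n_keep, criterion="maxclust")    # cut tree at n_keep
    return np.stack([x[labels == c].mean(axis=0) for c in np.unique(labels)])
```

Because clustering considers all tokens jointly rather than a fixed bipartite split, it can retain accuracy better at very low keep rates, at the cost of a less parallel merge step.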

3. Integration with Model Architectures and Tasks

Token merging is applicable, with minimal modifications, to a broad range of model families and tasks:

  • Vision Transformers and Dense Tasks: Merging is used both for classification (e.g., ViT, DeiT), where global similarity suffices, and for dense prediction (segmentation, object detection), where spatial structure and local/semantic detail must be preserved (Norouzi et al., 2024, Montello et al., 2 Oct 2025).
  • Structured and Spatial ViTs: For models relying on windowed attention or 2D positional priors (e.g., Swin, SAM, DINOv3), spatially-structured merging (e.g., CubistMerge: row/column reduction, local matching) is necessary to ensure compatibility with window partitioning and grid-based biases (Gong et al., 26 Sep 2025).
  • Language and Code Models: In code, merging subtokens forming a semantic unit (BPE fragments of identifiers) by averaging or attention-weighted sum compresses sequence length without retraining the backbone (Saad et al., 19 Jul 2025).
  • Time Series, Genomics, and SSMs: Local window-constrained merging (e.g., in MergeDNA for DNA, or local/causal merging for time series) maintains linear complexity while providing learned, data-dependent compression (Götz et al., 2024, Li et al., 17 Nov 2025, Park et al., 19 Aug 2025).
  • Autoregressive and Diffusion Models: Importance-guided merging (via classifier-free guidance in diffusion, or attention entropy in AR transformers) allows dynamic token budgeting and consistent generation quality (Wu et al., 2024, Liu et al., 16 Aug 2025).
  • Semantic Communication: Token merging with layerwise, Pareto-optimized budgets enables runtime adaptation to system constraints, such as wireless SNR, and efficient on-device inference (Erak et al., 11 Sep 2025).
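For the language-and-code case, subtoken merging can be illustrated with a small pooling routine. The averaging of BPE fragments into one identifier-level token follows the scheme described by Saad et al. (2025), but the “##” continuation-marker convention (WordPiece-style) and all names here are assumptions for illustration:

```python
import numpy as np

def merge_identifier_subtokens(tokens, embeddings):
    """Average runs of continuation subtokens (marked '##') into one vector."""
    merged_tokens, groups = [], []
    for tok, vec in zip(tokens, embeddings):
        if tok.startswith("##") and merged_tokens:
            merged_tokens[-1] += tok[2:]      # rejoin the identifier's surface form
            groups[-1].append(vec)            # pool its fragment embeddings
        else:
            merged_tokens.append(tok)
            groups.append([vec])
    return merged_tokens, np.stack([np.mean(g, axis=0) for g in groups])
```

A call on a fragmented identifier shortens the sequence the backbone sees without any retraining, e.g. `["get", "##User", "##Name", "(", ")"]` collapses to `["getUserName", "(", ")"]` with the first vector being the mean of the three fragment embeddings.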

4. Empirical Benefits and Performance Trade-offs

Token merging consistently delivers substantial reductions in computational cost, memory usage, and latency, with minimal or even positive effects on task performance:

  • Throughput Gains: ToMe achieves 2×–2.2× throughput on ViT-L and ViT-H, and 1.7× on DeiT-S, with <0.5% accuracy loss (Bolya et al., 2022). ATC offers superior accuracy retention at high merge rates (e.g., +9.6 pp over ToMe at low keep rates on NABirds) (Haurum et al., 2024).
  • Semantic Segmentation Speedup: ALGM improves throughput on ADE20K while also improving mIoU, with an adaptive speed–accuracy trade-off (Norouzi et al., 2024). Segformer++ delivers substantial speedups on Cityscapes with only a marginal mIoU drop (Kienzle et al., 2024).
  • Fine Detail Preservation: IBTM outperforms ToMeSD in image and video generation, preserving high-information regions and improving FID and LPIPS metrics, especially under aggressive token reduction (Wu et al., 2024).
  • Domain-Specific Benefits: In ASR, A-ToMe substantially reduces token count and GPU latency with little WER degradation (Li et al., 2023). MergeDNA cuts the quadratic attention cost and sets new state-of-the-art results on DNA benchmarks (Li et al., 17 Nov 2025). ClustViT lowers GFLOPs and accelerates segmentation with minimal mIoU loss (Montello et al., 2 Oct 2025).

5. Domain and Task-Specific Variants

Recent research has proposed merging schemes tailored to the constraints of specific model classes:

  • SSM-Based Vision Models: MaMe exploits the SSM state-transition Δ as an informativeness measure, penalizing merges across highly informative tokens to maintain sequential modeling fidelity (Park et al., 19 Aug 2025).
  • Dense Prediction and Segmentation: Two-stage and semantically-supervised merges (ALGM, ClustViT) combine local or mask-guided clustering with unmerging or regeneration steps, reliably preserving boundary detail and spatial coverage while accelerating computation (Norouzi et al., 2024, Montello et al., 2 Oct 2025).
  • Spatial ViTs: CubistMerge enforces reduced token grids, local bipartite matching, and component-wise max fusion to maintain compatibility with windowed or RoPE-based models, enabling token count reduction without sacrificing positional bias or spatial structure (Gong et al., 26 Sep 2025).
  • Genomics and Long Sequences: MergeDNA stacks differentiable local merging layers to induce a data-driven, dynamic tokenizer, with joint sequence chunking and pretraining under merged token reconstruction and adaptive masking objectives (Li et al., 17 Nov 2025).
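The grid-preserving constraint of CubistMerge-style merging can be illustrated with a minimal spatial reduction. Real CubistMerge selects merge pairs via local bipartite matching on a path graph (Gong et al., 26 Sep 2025); the fixed adjacent-pair reduction and the function name below are simplifications, kept only to show how component-wise max fusion leaves a strict, smaller 2D grid:

```python
import numpy as np

def spatial_max_merge_rows(grid):
    """Halve a token grid along its first axis via per-dimension max fusion.

    grid : (h, w, d) array of tokens laid out on a 2D grid, h even.
    Returns an (h//2, w, d) grid, so windowed attention and 2D positional
    biases still see a regular token lattice.
    """
    h, w, d = grid.shape
    paired = grid.reshape(h // 2, 2, w, d)    # group adjacent rows pairwise
    return paired.max(axis=1)                 # component-wise max per pair
```

Unlike similarity-ordered merging, every output token here keeps a well-defined grid coordinate, which is what makes such schemes compatible with Swin-style window partitioning and RoPE-based positional encodings.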

6. Limitations, Challenges, and Future Directions

While token merging is widely effective, several open issues remain, including preserving spatial and positional structure in windowed or RoPE-based architectures, retaining fine detail at aggressive keep rates, and maintaining compatibility with autoregressive decoding.

7. Impact and Research Trajectory

Token merging has rapidly transitioned from a simple off-the-shelf plug-in (ToMe) to a versatile, domain-adaptive, and theoretically principled ecosystem of algorithms, with strong empirical utility across vision, language, speech, time series, genomics, and generative modeling. Research trends point toward end-to-end trainable merging, importance-aware and adaptive token budgets, and structure-preserving variants for specific model families.

This convergence of efficient inference, spectral/structural preservation, and downstream task robustness positions token merging as a central technology in the next generation of efficient transformer and sequence models.
