Visual Token Merge Methods

Updated 8 July 2025
  • Visual Token Merge is a technique that combines redundant tokens into fewer, more informative ones to optimize self-attention in Vision Transformers.
  • It employs both learned and similarity-based merging strategies, using spatial cues and metric-based matching to preserve essential image details.
  • The approach enables efficient image classification, segmentation, and multimodal tasks by reducing FLOPs and memory footprint with minimal accuracy loss.

Visual token merge refers to a family of approaches in which multiple visual tokens—the basic vector representations corresponding to patches in an image or elements in other visual modalities—are combined or "merged" into fewer, more informative tokens at intermediate stages of a Vision Transformer (ViT) or related architecture. This reduces the computational and memory cost of the self-attention mechanism without significantly diminishing model accuracy, and often improves throughput substantially in large-scale or resource-constrained deployments.

1. Fundamental Principles and Motivations

The primary motivation for visual token merging arises from the quadratic computational complexity of the self-attention operation in transformers: handling N tokens involves O(N²) operations per layer. In practice, input images (or videos, documents, etc.) are split into regularly spaced patches or tokens, many of which contain redundant or less informative data—typically backgrounds or repeated textures. Merging such tokens early or mid-way through the network maintains overall information content while mitigating computational bottlenecks (Bolya et al., 2022).
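To make the quadratic scaling concrete, here is a back-of-envelope sketch. The 196-token count corresponds to a standard ViT-B/16 on a 224×224 input; the FLOPs formula counts only the QKᵀ and attention-value products and ignores projections:

```python
# Rough per-layer cost of self-attention before and after merging.
def attention_flops(n_tokens: int, dim: int = 768) -> int:
    """Approximate FLOPs of one attention layer: the QK^T and AV
    products each cost about n_tokens^2 * dim multiply-adds."""
    return 2 * n_tokens ** 2 * dim

full = attention_flops(196)    # all 14x14 patch tokens kept
half = attention_flops(98)     # half the tokens merged away
print(f"per-layer attention reduction: {full / half:.1f}x")  # ~4.0x
```

Because the cost is quadratic, halving the token count cuts per-layer attention cost by roughly 4×, which is why even moderate merge ratios yield large end-to-end speedups.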

Two major design paradigms have emerged:

  • Training-free (similarity-based) merging: tokens are matched by feature similarity using heuristic rules and merged on frozen, pretrained models as a drop-in acceleration (e.g., ToMe).
  • Learned merging: merging masks, coefficients, or dedicated embeddings are trained jointly with, or on top of, the backbone (e.g., PatchMerger, LTM-Transformer, DTEM).

The process may target generic visual content or be specialized for modalities such as video (Lee et al., 31 Oct 2024), segmentation (Norouzi et al., 14 Jun 2024), vision-LLMs (Cao et al., 2023), documents (Zhai et al., 2023), or multimodal reasoning (Fu et al., 30 Dec 2024, Pippi et al., 6 Mar 2025).

2. Core Algorithms and Module Designs

The precise merging operation varies, but most implementations share technical similarities:

  • Token Similarity Computation: Similarity, most often cosine similarity or a dot product in the token embedding space (usually after normalization or projection), is computed between all or selected pairs of tokens. Some methods also learn dedicated embeddings specifically for merging (Lee et al., 13 Dec 2024).
  • Matching and Assignment: Algorithms select which tokens to merge based on:
    • Bipartite Soft Matching (BSM): Partition tokens into two sets and match similar pairs across the sets (Bolya et al., 2022, Tran et al., 25 May 2024); a runnable sketch of this step follows the list.
    • Clustering: Group tokens into clusters based on density or distance (e.g., DPC-KNN in TCFormer (Zeng et al., 2022)).
    • Windowed/local merging: Merge tokens only within local spatial neighborhoods, especially in early layers (e.g., CLAP in ALGM (Norouzi et al., 14 Jun 2024), WiCo (Li et al., 5 Apr 2025)).
    • Energy-based selection: Compute a redundancy "energy score" over a neighborhood graph to protect outlier/unique tokens (Tran et al., 25 May 2024).
    • Saliency or importance guidance: Prioritize tokens for merging/pruning based on downstream task relevance (e.g., text-guided importance in PuMer (Cao et al., 2023)).
  • Merging Operation: Once pairs/groups are selected:
    • Weighted Averaging: Combine the feature vectors of merged tokens, typically weighted by an importance or size factor; corrections such as proportional attention in ToMe (Bolya et al., 2022) preserve the merged tokens' influence in subsequent layers.
    • Norm-Preserving or MLERP Merging: Adjust the merged vector's norm (magnitude) via a formula akin to spherical linear interpolation to avoid distributional drift (Kim et al., 2023).
    • Learnable Masking: Use mask weights output from an MLP to specify which token groups to merge and with what coefficients (Wang et al., 21 Jul 2024).
    • Concatenation for window-based methods: Concatenate features in a window, then project with an MLP (Li et al., 5 Apr 2025).
  • Layer Placement: Merging can occur once (midway, as in PatchMerger (Renggli et al., 2022)), multiple times (progressively, e.g., ToMe, DSM), or adaptively at different places for local/global merging (ALGM (Norouzi et al., 14 Jun 2024), SDTM (Fang et al., 16 May 2025)).
  • Adaptive/learnable scheduling: The ratio/threshold of merging may be set adaptively per layer, per input, or per training epoch, sometimes by analyzing statistics such as cosine similarity distributions (Norouzi et al., 14 Jun 2024, Fang et al., 16 May 2025).
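The following PyTorch sketch ties the similarity, matching, and averaging steps together. It is a simplified, single-image illustration in the spirit of ToMe's bipartite soft matching, not the paper's implementation (ToMe matches on attention keys, runs batched, and feeds the tracked sizes into proportional attention); the norm-preserving helper is likewise a hypothetical simplification of MLERP-style merging (Kim et al., 2023):

```python
import torch
import torch.nn.functional as F

def bipartite_soft_matching_merge(x: torch.Tensor, size: torch.Tensor, r: int):
    """Merge r tokens via bipartite soft matching with size-weighted averaging.

    x:    (N, D) token features
    size: (N,)   float count of original patches each token already represents
    r:    number of tokens to remove at this layer
    """
    a, sa = x[0::2], size[0::2].clone()          # alternate tokens into set A
    b, sb = x[1::2].clone(), size[1::2].clone()  # ... and set B

    # Cosine similarity between every A token and every B token.
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best_val, best_idx = sim.max(dim=-1)         # each A token's closest B match

    order = best_val.argsort(descending=True)
    merged_a, kept_a = order[:r], order[r:]      # merge the r most redundant A tokens

    # Fold each selected A token into its B match with a size-weighted average,
    # so a token standing for many patches keeps proportional influence.
    for i in merged_a.tolist():
        j = best_idx[i].item()
        total = sa[i] + sb[j]
        b[j] = (sa[i] * a[i] + sb[j] * b[j]) / total
        sb[j] = total

    return torch.cat([a[kept_a], b]), torch.cat([sa[kept_a], sb])


def norm_preserving_merge(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Hypothetical simplification of norm-preserving (MLERP-style) merging:
    average the direction, then rescale to the mean input norm so merged
    tokens do not drift toward smaller magnitudes than unmerged ones."""
    direction = F.normalize(u + v, dim=-1)
    target = 0.5 * (u.norm(dim=-1, keepdim=True) + v.norm(dim=-1, keepdim=True))
    return direction * target
```

A typical call initializes size = torch.ones(N) and threads the returned sizes through successive layers, so a token that already represents many patches keeps proportional weight in later merges and in attention.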

3. Practical Applications Across Visual Tasks

Visual token merging has been exploited in a variety of tasks and architectures:

| Application Domain | Example Methods | Distinct Features |
| --- | --- | --- |
| Image Classification | PatchMerger, ToMe, DSM, PiToMe, LTM | Substantial FLOPs/runtime reduction (up to 50–70%) with minimal accuracy loss (Bolya et al., 2022, Tran et al., 25 May 2024, Wang et al., 21 Jul 2024) |
| Semantic Segmentation | ALGM, Prune and Merge | Two-stage local–global strategies for dense prediction, improved mIoU (Norouzi et al., 14 Jun 2024, Mao et al., 30 Mar 2025) |
| Video Understanding | Learnable VTM, FrameFusion | Region- and saliency-based merging, up to 84% memory savings, 6.9× throughput (Lee et al., 31 Oct 2024, Fu et al., 30 Dec 2024) |
| Document Understanding | Fast-StrucTexT | Modality-guided dynamic merging for multi-granularity content (Zhai et al., 2023) |
| Vision-Language | PuMer, ToFu, WiCo | Text-guided or adaptive merging to retain critical cross-modal signals (Cao et al., 2023, Pippi et al., 6 Mar 2025, Li et al., 5 Apr 2025) |
| Image Generation | MergeVQ, SDTM | Merging for efficient generative modeling, preserving details via smart recovery (MergeVQ) and stage-specific strategies (SDTM) (Li et al., 1 Apr 2025, Fang et al., 16 May 2025) |

In all cases, models employing token merging have demonstrated substantial inference speedups (typically 1.6–2.2×, up to nearly 7× in video), dramatic reductions in FLOPs and memory, and only negligible drops (often <1%, sometimes none or even a small gain) in top-1 classification accuracy, mIoU, or other quantitative task scores.

4. Trade-Offs, Limitations, and Design Considerations

Several empirical and theoretical findings have influenced best practices in token merging:

  • Placement and Timing: Merging too early (when patch features are low-level and less semantically meaningful) risks premature information loss (Heo et al., 2023, Renggli et al., 2022). Methods such as delayed spatial merging (DSM) and adaptively scheduled merging (SDTM) delay token reduction or perform it hierarchically (local first, global later) (Norouzi et al., 14 Jun 2024, Fang et al., 16 May 2025); a toy scheduling policy is sketched after this list.
  • Balancing Aggressiveness and Accuracy: Reducing tokens too aggressively (e.g., dropping below 8–32 tokens) can degrade fine-grained detail and performance, especially in small models or in tasks requiring spatial precision (Renggli et al., 2022, Norouzi et al., 14 Jun 2024).
  • Spatial Awareness and Structure Preservation: Relying exclusively on visual similarity can result in merging spatially distant (but visually similar) tokens or mixing content from different semantic objects. Recent approaches (ToSA (Huang et al., 24 Jun 2025), ALGM (Norouzi et al., 14 Jun 2024)) incorporate spatial priors—using depth images or patch positions—to better preserve scene layout.
  • Task-Specific Customization: Some tasks (e.g., human-centric pose estimation in TCFormer (Zeng et al., 2022), document layout analysis in Fast-StrucTexT (Zhai et al., 2023)) require dedicated merging strategies (e.g., importance-guided clustering, modality-guided dynamic merging) to preserve subtleties unique to their data.
  • Learnability and Training: Merging can be:
    • Static/post-hoc: Applied at inference on frozen models (e.g., ToMe, DSM, PiToMe), supporting drop-in acceleration.
    • Trainable/end-to-end: Parameters for merging are learned jointly with the model (LTM-Transformer, PatchMerger).
    • Modular/fine-tunable: Approaches such as DTEM (Lee et al., 13 Dec 2024) learn dedicated embeddings for merging, decoupled from the main ViT features and trainable either standalone or with full network fine-tuning.
  • Distributional Shift Correction: Norm mismatches from naive averaging can result in drift in the token distribution; MLERP (Kim et al., 2023) and similar norm-preserving strategies mitigate this.
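As a concrete illustration of adaptive scheduling, the sketch below derives a per-layer merge count from the cosine-similarity statistics of the current tokens. The policy (the threshold tau, the base ratio, and the adjacent-pair statistic) is illustrative, not any single paper's rule:

```python
import torch
import torch.nn.functional as F

def adaptive_merge_count(x: torch.Tensor, base_ratio: float = 0.3,
                         tau: float = 0.85) -> int:
    """Pick how many tokens to merge this layer from similarity statistics:
    estimate redundancy as the fraction of adjacent-token pairs whose cosine
    similarity exceeds tau, then scale a base merge ratio by that fraction."""
    xn = F.normalize(x, dim=-1)
    sim = (xn[:-1] * xn[1:]).sum(dim=-1)   # similarity of sequence-adjacent tokens
    redundancy = (sim > tau).float().mean().item()
    return int(base_ratio * redundancy * x.shape[0])
```

The returned count can be passed as r to a merging routine such as the bipartite matching sketch in Section 2; layers dominated by near-duplicate background tokens then merge more aggressively than layers with diverse content.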

5. Recent Advances and Theoretical Understandings

Recent research (2023–2025) has produced several notable insights:

  • Energy-Based and Spectral Preservation: PiToMe introduces an "energy" score over a neighborhood graph to distinguish redundant regions from unique content, yielding improved spectral fidelity and state-of-the-art accuracy/efficiency trade-offs over previous bipartite matching approaches (Tran et al., 25 May 2024); a rough sketch of such a score follows this list.
  • Spectrum Conservation: Theoretical results demonstrate that if token merging preserves the eigenvalue spectrum of the normalized Laplacian associated with the token interaction graph, then intrinsic structure and downstream performance are maintained (Tran et al., 25 May 2024).
  • Combinatorial and Adaptive Merging Strategies: SDTM for diffusion transformers applies structure-then-detail merging based on denoising priors, dynamically adjusting merging strategies as image features shift from global structure to fine detail during generation (Fang et al., 16 May 2025).
  • Spatially Aware and Modality Flexible Merging: ToSA integrates spatial information from depth data, employed more heavily in early layers, and transitions dynamically to semantic similarity in deeper layers, addressing the limitations of pure visual similarity especially for tasks involving object counting or spatial reasoning (Huang et al., 24 Jun 2025).
  • Unified Token Compression: The Prune and Merge approach uses learnable merge and reconstruct matrices to not only compress tokens but also enable their restoration, balancing efficiency and information preservation (Mao et al., 30 Mar 2025).
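The sketch below shows one way an energy-style redundancy score could be computed; it is inspired by, but not identical to, PiToMe's formulation (the k, margin, and mean-of-ReLU choices here are assumptions):

```python
import torch
import torch.nn.functional as F

def energy_scores(x: torch.Tensor, k: int = 8, margin: float = 0.5) -> torch.Tensor:
    """Energy-style redundancy score over a similarity neighborhood graph:
    tokens sitting in dense, highly similar neighborhoods score high (safe
    to merge); isolated, unique tokens score low and are protected."""
    xn = F.normalize(x, dim=-1)
    sim = xn @ xn.T
    sim.fill_diagonal_(-1.0)                   # ignore self-similarity
    neighbors = sim.topk(k, dim=-1).values     # each token's k most similar tokens
    return F.relu(neighbors - margin).mean(dim=-1)
```

Merging only the highest-scoring tokens protects outliers, which is one intuitive reading of why such scores help preserve the structure of the token interaction graph.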

6. Methodological Comparisons

Visual token merge is frequently contrasted with token pruning (dropping tokens outright) and methods such as sequential pooling or dynamic attention. The hybrid "fusion" approaches (ToFu (Kim et al., 2023, Pippi et al., 6 Mar 2025)) combine both, using pruning in sensitive early layers and merging (possibly with norm correction) in later layers, achieving superior accuracy–efficiency trade-offs.
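Below is a minimal sketch of such a hybrid schedule, assuming the bipartite merging function from Section 2; the early-layer importance proxy (feature norm) is a placeholder rather than what ToFu or FrameFusion actually use:

```python
import torch

def reduce_tokens(x: torch.Tensor, size: torch.Tensor,
                  layer_idx: int, n_layers: int, r: int):
    """Hybrid policy: prune in the sensitive early layers, where averaging
    low-level features can blur information, and merge in later layers,
    where combining semantically similar tokens is safer."""
    if layer_idx < n_layers // 2:
        keep = x.norm(dim=-1).argsort(descending=True)[: x.shape[0] - r]
        return x[keep], size[keep]              # drop the r least "important" tokens
    return bipartite_soft_matching_merge(x, size, r)   # sketch from Section 2
```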

A representative summary table:

| Method Category | Core Principle | Distinctive Features | Example Papers |
| --- | --- | --- | --- |
| Pruning | Discard tokens | Fast, but may irreversibly lose information | DynamicViT, PuMer |
| Merging | Combine tokens | Preserves information, reduces redundancy | PatchMerger, ToMe, PiToMe |
| Clustering | Group tokens | Flexible, e.g., DPC-KNN for adaptive shapes | TCFormer |
| Fusion/Hybrid | Combine approaches | Pruning + merging with norm/saliency correction | ToFu, FrameFusion |
| Spatial/Structural | Use extra spatial info | Integrates spatial/structural priors | ALGM, ToSA |
| Learned/IB-based | Supervised mask | Minimizes an information bottleneck (IB) loss, adapts to the downstream task | LTM-Transformer |

7. Outlook and Future Directions

Open directions include tighter theoretical characterizations of what merging preserves, stronger spatial and cross-modal priors, and schedules that adapt merging to the input, task, and generation stage. More broadly, visual token merging has moved from an efficiency-oriented engineering solution to a rich research area involving algorithmic innovation, theoretical guarantees, and task-sensitive customization, and it now supports efficient, scalable transformer inference across diverse domains.
