Visual Token Merge Methods
- Visual Token Merge is a technique that combines redundant tokens into fewer, more informative ones to optimize self-attention in Vision Transformers.
- It employs both learned and similarity-based merging strategies, using spatial cues and metric-based matching to preserve essential image details.
- The approach enables efficient image classification, segmentation, and multimodal tasks by reducing FLOPs and memory footprint with minimal accuracy loss.
Visual token merge refers to a family of approaches in which multiple visual tokens—the basic vector representations corresponding to patches in an image or elements in other visual modalities—are combined or "merged" into fewer, more informative tokens at intermediate stages of a Vision Transformer (ViT) or related architecture. This process aims to reduce the computational and memory cost of self-attention without significantly diminishing model accuracy, and often improves throughput for large-scale or resource-constrained deployments.
1. Fundamental Principles and Motivations
The primary motivation for visual token merging arises from the quadratic computational complexity of self-attention in transformers: handling N tokens involves O(N²) pairwise interactions per layer. In practice, input images (or videos, documents, etc.) are split into regularly spaced patches or tokens, many of which contain redundant or less informative data—typically backgrounds or repeated textures. Merging such tokens early or midway through the network preserves most of the information content while mitigating this computational bottleneck (2210.09461).
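As a back-of-the-envelope illustration of this scaling (generic arithmetic, not a figure taken from any of the cited papers): halving the number of tokens roughly quarters the attention cost, while the per-token MLP cost shrinks only linearly.

```latex
\mathrm{Cost}_{\mathrm{attn}} \propto N^2 d, \qquad
\frac{\mathrm{Cost}_{\mathrm{attn}}(N)}{\mathrm{Cost}_{\mathrm{attn}}(N/2)} = 4, \qquad
\frac{\mathrm{Cost}_{\mathrm{MLP}}(N)}{\mathrm{Cost}_{\mathrm{MLP}}(N/2)} = 2.
```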
Two major design paradigms have emerged:
- Learned merging, as in PatchMerger (2202.12015) or LTM-Transformer (2407.15219), where the loci and mechanism of merging are parameterized and learned during training.
- Similarity-based, non-parametric merging (e.g., ToMe (2210.09461), DSM (2303.02331), PiToMe (2405.16148)), which exploits pre-existing token similarities to merge tokens in a deterministic or data-driven fashion—sometimes post hoc, without retraining.
The process may target generic visual content or be specialized for particular modalities and tasks, such as video (2410.23782), segmentation (2406.09936), vision-language models (2305.17530), documents (2305.11392), or multimodal reasoning (2501.01986, 2503.04444).
2. Core Algorithms and Module Designs
The precise merging operation varies, but most implementations share a common set of technical building blocks; a minimal code sketch combining several of them follows this list:
- Token Similarity Computation: Similarity, most often cosine similarity or dot product in the token embedding space (usually after normalization or projection), is computed between all pairs or between selected pairs of tokens. Some methods also use learnable embeddings dedicated to merging (2412.10569).
- Matching and Assignment: Algorithms select which tokens to merge based on:
- Bipartite Soft Matching (BSM): Partition tokens into two sets and match similar pairs across the sets (2210.09461, 2405.16148).
- Clustering: Group tokens into clusters based on density or distance (e.g., DPC-kNN in TCFormer (2204.08680)).
- Windowed/local merging: Merge tokens only within local spatial neighborhoods, especially in early layers (e.g., CLAP in ALGM (2406.09936), WiCo (2504.04024)).
- Energy-based selection: Compute a redundancy "energy score" over a neighborhood graph to protect outlier/unique tokens (2405.16148).
- Saliency or importance guidance: Prioritize tokens for merging/pruning based on downstream task relevance (e.g., text-guided importance in PuMer (2305.17530)).
- Merging Operation: Once pairs/groups are selected:
- Weighted Averaging: Combine the feature vectors of merged tokens, typically weighted by an importance or size factor; corrections such as proportional attention in ToMe (2210.09461) let merged tokens retain their original influence in subsequent layers.
- Norm-Preserving or MLERP Merging: Adjust the merged vector's norm (magnitude) via a formula akin to spherical linear interpolation to avoid distributional drift (2312.01026).
- Learnable Masking: Use mask weights output from an MLP to specify which token groups to merge and with what coefficients (2407.15219).
- Concatenation for window-based methods: Concatenate features in a window, then project with an MLP (2504.04024).
- Layer Placement: Merging can occur once (midway, as in PatchMerger (2202.12015)), multiple times (progressively, e.g., ToMe, DSM), or adaptively at different places for local/global merging (ALGM (2406.09936), SDTM (2505.11707)).
- Adaptive/learnable scheduling: The ratio/threshold of merging may be set adaptively per layer, per input, or per training epoch, sometimes by analyzing statistics such as cosine similarity distributions (2406.09936, 2505.11707).
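To make the similarity, matching, and merging steps above concrete, here is a minimal PyTorch sketch of bipartite soft matching with size-weighted averaging and proportional attention in the spirit of ToMe (2210.09461). It is an illustrative reimplementation under simplifying assumptions, not the authors' code: the function names and the fixed per-layer merge count `r` are ours, and details of the official implementation (e.g., class-token handling) are omitted.

```python
import torch
import torch.nn.functional as F


def bipartite_soft_match_merge(x, size, r):
    """Bipartite soft matching + size-weighted averaging (ToMe-style sketch).

    x:    (B, N, C) token features
    size: (B, N, 1) number of original patches each token already represents
    r:    number of tokens to remove in this layer
    Returns (x, size) with N - r tokens.
    """
    C = x.shape[-1]

    # 1) Similarity in a normalized space (cosine similarity between tokens).
    feats = F.normalize(x, dim=-1)

    # 2) Alternately partition tokens into set A (even) and set B (odd),
    #    and score every A token against every B token.
    a, b = feats[:, ::2], feats[:, 1::2]
    scores = a @ b.transpose(-1, -2)                 # (B, N_a, N_b)

    # 3) Each A token proposes its most similar B token; the r strongest
    #    proposals become merge edges, the remaining A tokens are kept as-is.
    best_val, best_idx = scores.max(dim=-1)          # (B, N_a)
    order = best_val.argsort(dim=-1, descending=True)
    merge_src, keep_src = order[:, :r], order[:, r:]

    x_a, x_b = x[:, ::2], x[:, 1::2]
    s_a, s_b = size[:, ::2], size[:, 1::2]

    def gather(t, idx):
        return t.gather(1, idx.unsqueeze(-1).expand(-1, -1, t.shape[-1]))

    x_keep, s_keep = gather(x_a, keep_src), gather(s_a, keep_src)
    x_merge, s_merge = gather(x_a, merge_src), gather(s_a, merge_src)
    dst = best_idx.gather(1, merge_src)              # matched B-token indices

    # 4) Size-weighted averaging: fold each merged A token into its B target.
    dst_idx = dst.unsqueeze(-1).expand(-1, -1, C)
    x_b = (x_b * s_b).scatter_add(1, dst_idx, x_merge * s_merge)
    s_b = s_b.scatter_add(1, dst.unsqueeze(-1), s_merge)
    x_b = x_b / s_b

    return torch.cat([x_keep, x_b], dim=1), torch.cat([s_keep, s_b], dim=1)


def proportional_attention_bias(size):
    """Additive attention-logit bias log(size): a merged token standing for s
    original patches counts s times as much as a key (proportional attention)."""
    return size.log().squeeze(-1)[:, None, None, :]  # (B, 1, 1, N)
```

In a ViT block this would typically be applied once per layer (e.g., between attention and the MLP) with a small `r`, while the log-size bias is added to the attention logits so that a merged token keeps the cumulative weight of the patches it represents.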
3. Practical Applications Across Visual Tasks
Visual token merging has been exploited in a variety of tasks and architectures:
| Application Domain | Example Methods | Distinct Features |
|---|---|---|
| Image Classification | PatchMerger, ToMe, DSM, PiToMe, LTM | Substantial FLOPs/runtime reduction (up to 50–70%) with minimal accuracy loss (2210.09461, 2405.16148, 2407.15219). |
| Semantic Segmentation | ALGM, Prune and Merge | Two-stage local-global strategies for dense prediction, improved mIoU (2406.09936, 2503.23455). |
| Video Understanding | Learnable VTM, FrameFusion | Region- and saliency-based merging, up to 84% memory savings, 6.9× throughput (2410.23782, 2501.01986). |
| Document Understanding | Fast-StrucTexT | Modality-guided dynamic merging for multi-granularity content (2305.11392). |
| Vision-Language | PuMer, ToFu, WiCo | Text-guided or adaptive merging to retain critical cross-modal signals (2305.17530, 2503.04444, 2504.04024). |
| Image Generation | MergeVQ, SDTM | Merging for efficient generative modeling, preserving details via smart recovery (MergeVQ) and stage-specific strategies (SDTM) (2504.00999, 2505.11707). |
Across these applications, models employing token merging have demonstrated substantial inference speedups (typically 1.6–2.2×, and up to nearly 7× for video), dramatic reductions in FLOPs and memory, and only small drops (often under 1%, sometimes none or even a slight gain) in top-1 classification accuracy, mIoU, or other quantitative task scores.
4. Trade-Offs, Limitations, and Design Considerations
Several empirical and theoretical findings have influenced best practices in token merging:
- Placement and Timing: Merging too early (when patch features are low-level and less semantically meaningful) risks premature information loss (2303.02331, 2202.12015). Methods such as delayed spatial merging (DSM) and adaptively scheduled merging (SDTM) delay token reduction or perform it hierarchically (local first, global later) (2406.09936, 2505.11707).
- Balancing Aggressiveness and Accuracy: Reducing tokens too aggressively (e.g., dropping below 8–32 tokens) can degrade fine-grained detail and performance, especially in small models or in tasks requiring spatial precision (2202.12015, 2406.09936).
- Spatial Awareness and Structure Preservation: Relying exclusively on visual similarity can result in merging spatially distant (but visually similar) tokens or mixing content from different semantic objects. Recent approaches (ToSA (2506.20066), ALGM (2406.09936)) incorporate spatial priors—using depth images or patch positions—to better preserve scene layout.
- Task-Specific Customization: Some tasks (e.g., human-centric pose estimation in TCFormer (2204.08680), document layout analysis in Fast-StrucTexT (2305.11392)) require dedicated merging strategies (e.g., importance-guided clustering, modality-guided dynamic merging) to preserve subtleties unique to their data.
- Learnability and Training: Merging can be:
- Static/post-hoc: Applied at inference on frozen models (e.g., ToMe, DSM, PiToMe), supporting drop-in acceleration.
- Trainable/end-to-end: Parameters for merging are learned jointly with the model (LTM-Transformer, PatchMerger).
- Modular/fine-tunable: Approaches such as DTEM (2412.10569) learn dedicated embeddings for merging, decoupled from the main ViT features and trainable either standalone or with full network fine-tuning.
- Distributional Shift Correction: Norm mismatches from naive averaging can cause the token distribution to drift; MLERP (2312.01026) and similar norm-preserving strategies mitigate this (see the sketch below).
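The exact MLERP formula is given in (2312.01026); the following is only a minimal sketch of the norm-preservation idea under our own simplification: keep the mean direction of the merged tokens, but rescale it to the weighted mean of the input norms so the merged vector does not shrink the way a plain mean of imperfectly aligned vectors does.

```python
import torch
import torch.nn.functional as F


def norm_preserving_merge(tokens, weights=None):
    """Merge a group of token vectors while roughly preserving norm statistics.

    tokens:  (k, C) vectors to be merged into a single token
    weights: optional (k,) importance/size weights
    A plain weighted mean shrinks the norm whenever the inputs are not
    perfectly aligned, shifting the token distribution; here we keep the mean
    direction but rescale it to the weighted mean of the input norms.
    (Illustrative only; not the exact MLERP formula from 2312.01026.)
    """
    if weights is None:
        weights = torch.ones(tokens.shape[0], device=tokens.device)
    w = weights / weights.sum()

    mean_vec = (w.unsqueeze(-1) * tokens).sum(dim=0)    # plain weighted mean
    direction = F.normalize(mean_vec, dim=-1)           # unit direction
    target_norm = (w * tokens.norm(dim=-1)).sum()       # norm to preserve
    return direction * target_norm
```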
5. Recent Advances and Theoretical Understandings
Recent research (2023–2025) has produced several notable insights:
- Energy-Based and Spectral Preservation: PiToMe introduces an "energy" score over a neighborhood graph to distinguish redundant regions from unique content, yielding improved spectral fidelity and state-of-the-art accuracy/efficiency trade-offs over previous bipartite matching approaches (2405.16148); a schematic illustration of the energy idea follows this list.
- Spectrum Conservation: Theoretical results demonstrate that if token merging preserves the eigenvalue spectrum of the normalized Laplacian associated with the token interaction graph, then intrinsic structure and downstream performance are maintained (2405.16148).
- Combinatorial and Adaptive Merging Strategies: SDTM for diffusion transformers applies structure-then-detail merging based on denoising priors, dynamically adjusting merging strategies as image features shift from global structure to fine detail during generation (2505.11707).
- Spatially Aware and Modality Flexible Merging: ToSA integrates spatial information from depth data, employed more heavily in early layers, and transitions dynamically to semantic similarity in deeper layers, addressing the limitations of pure visual similarity especially for tasks involving object counting or spatial reasoning (2506.20066).
- Unified Token Compression: The Prune and Merge approach uses learnable merge and reconstruct matrices to not only compress tokens but also enable their restoration, balancing efficiency and information preservation (2503.23455).
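PiToMe's precise energy formulation is defined in (2405.16148); the sketch below only illustrates the general recipe under our own assumptions (a cosine-similarity neighborhood graph and a margin-clipped mean similarity as the score): tokens strongly connected to many neighbors are treated as redundant and merged first, while weakly connected outlier tokens are protected.

```python
import torch
import torch.nn.functional as F


def energy_scores(x, margin=0.0):
    """Per-token redundancy 'energy' over a similarity neighborhood graph.

    x: (B, N, C) token features. A token that is highly similar to many
    others sits in a redundant region and receives a high score; isolated
    (unique) tokens receive a low score and are protected from merging.
    Illustrative scoring only; see PiToMe (2405.16148) for the exact form.
    """
    feats = F.normalize(x, dim=-1)
    sim = feats @ feats.transpose(-1, -2)                     # (B, N, N)
    eye = torch.eye(sim.shape[-1], device=x.device, dtype=sim.dtype)
    sim = sim - eye                                           # drop self-edges
    neighbors = torch.clamp(sim - margin, min=0.0)            # thresholded graph
    return neighbors.mean(dim=-1)                             # (B, N)


def select_merge_candidates(x, r, margin=0.0):
    """Indices of the r most redundant (highest-energy) tokens to merge away."""
    return energy_scores(x, margin).topk(r, dim=-1).indices   # (B, r)
```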
6. Methodological Comparisons
Visual token merge is frequently contrasted with token pruning (dropping tokens outright) and methods such as sequential pooling or dynamic attention. The hybrid "fusion" approaches (ToFu (2312.01026, 2503.04444)) combine both, using pruning in sensitive early layers and merging (possibly with norm correction) in later layers, achieving superior accuracy–efficiency trade-offs.
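As a rough illustration of such a hybrid schedule (a toy example of ours, building on the `bipartite_soft_match_merge` sketch from Section 2; the switch layer and the norm-based importance proxy are assumptions, not values from the ToFu papers):

```python
def reduce_tokens(x, size, layer, r, switch_layer=4):
    """Hybrid token reduction: prune in early layers, merge in later ones.

    Early layers drop the r least "important" tokens outright (token norm is
    used here as a toy importance proxy; real methods use attention- or
    text-guided saliency), while later layers consolidate r tokens via the
    bipartite_soft_match_merge sketch above instead of discarding them.
    """
    if layer < switch_layer:
        keep = x.norm(dim=-1).topk(x.shape[1] - r, dim=-1).indices
        idx = keep.unsqueeze(-1)
        return (x.gather(1, idx.expand(-1, -1, x.shape[-1])),
                size.gather(1, idx))
    return bipartite_soft_match_merge(x, size, r)
```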
A representative summary table:
| Method Category | Core Principle | Distinctive Features | Example Papers |
|---|---|---|---|
| Pruning | Discard tokens | Fast, but may irreversibly lose information | DynamicViT, PuMer |
| Merging | Combine tokens | Preserves information, reduces redundancy | PatchMerger, ToMe, PiToMe |
| Clustering | Group tokens | Flexible, e.g., DPC-kNN for adaptive shapes | TCFormer |
| Fusion/Hybrid | Combine approaches | Pruning + merging + norm/saliency correction | ToFu, FrameFusion |
| Spatial/Structural | Use extra spatial info | Integrates spatial/structural priors | ALGM, ToSA |
| Learned/IB-based | Supervised mask | Minimizes an information-bottleneck (IB) loss, adapts to the downstream task | LTM-Transformer |
7. Outlook and Future Directions
Open directions in visual token merge research include:
- Adaptive and Input-Dependent Merging: Further development of content-adaptive thresholding, batch-wise adaptation, and task-driven spatial/semantic fusion (2406.09936, 2506.20066).
- Integration with Model Compression: Synergies with pruning, quantization, or knowledge distillation for more aggressive compression in edge deployments (2503.23455).
- Beyond Vision: Application to multimodal and vision-language models at scale, including multi-image and video LMMs (2501.01986, 2503.04444).
- Task-Specific and Structure-Aware Strategies: Deeper integration of domain priors (e.g., denoising priors in generative models, spatial priors in embodied QA) for merging that is robust to task requirements (2505.11707).
- Theoretical Analyses: Further formalization of the trade-offs involved, including spectral graph properties and information bottleneck perspectives (2405.16148, 2407.15219).
Visual token merging has moved from an efficiency-oriented engineering solution to a rich research area involving algorithmic innovations, theoretical guarantees, and task-sensitive customization, and now supports efficient, scalable transformer inference across diverse domains.