Dynamic Token Merging in Transformers
- Dynamic token merging is a method that adaptively reduces token count in sequence models by merging similar tokens based on input-dependent criteria.
- It employs techniques like clustering, submodular maximization, and learned thresholds to balance efficiency gains with minimal information loss.
- Empirical evaluations demonstrate significant speedup, memory savings, and reduced FLOPs in diverse applications such as vision, text, and genomics with near-baseline performance.
Dynamic token merging encompasses a family of methods designed to adaptively reduce, at run time, the number of tokens processed in sequence models—transformers in particular—while minimizing information loss and preserving model fidelity. The approach underpins advances across vision, text, video, genomics, and multimodal processing, where the quadratic (or higher) complexity of self-attention or related mechanisms sets fundamental performance bottlenecks. Dynamic merging departs from static or fixed schedule schemes by leveraging input-dependent criteria—token similarity, saliency, informativeness, and context structure—allowing the model to adapt the compression pattern to each sample or time step. Techniques in this class are realized algorithmically via clustering (greedy or agglomerative), submodular selection, decoupled embedding projections, saliency-driven masking, and other optimization-based or learnable policies, with practical deployment ranging from pure inference-time modules to end-to-end differentiable architectures.
1. Foundational Principles and Mechanisms
Dynamic token merging is predicated on the observation that, given redundancy in high-dimensional input spaces, many tokens (patches, subwords, frame regions) in deep networks encode overlapping or uninformative content. Instead of uniform pruning or fixed-ratio merging, dynamic approaches evaluate token-wise or pairwise criteria such as:
- Cosine similarity or normed dot product of token embeddings or attention keys, as in ToMe (Bolya et al., 2022), DyMU (Wang et al., 23 Apr 2025), and CA-ToMe (Saghatchian et al., 1 Jan 2025), identifying tokens that can be safely merged by weighted averaging.
- Submodular maximization to select token subsets of maximal coverage/diversity under a similarity kernel, as in ToMA (Lu et al., 13 Sep 2025); the facility location function is maximized subject to a token budget, tractably approximated by greedy selection.
- Learned thresholds or saliency heads that modulate per-token reduction based on endogenous importance estimates—e.g., LTMP (Bonnaerens et al., 2023) introduces learned pruning and merging thresholds, while saliency-predictive heads control selective aggregation in video and image transformers (Lee et al., 2024).
- Locality and context-structured criteria, such as positionally-restricted merging in time series (Götz et al., 2024) or spatial-constrained bipartite matching in CubistMerge (Gong et al., 26 Sep 2025).
The merging operation itself ranges from simple size-weighted or norm-weighted averages (preserving feature scale for proportional attention), through max-magnitude-per-dimension reduction (CubistMerge) (Gong et al., 26 Sep 2025), to full attention-like matrix transformations (ToMA) (Lu et al., 13 Sep 2025).
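The size-weighted variant can be made concrete with a small sketch. The following NumPy routine is a simplified, single-layer illustration of ToMe-style bipartite merging, not any paper's reference implementation; all names, shapes, and the choice of `r` are illustrative. It pairs alternating tokens by cosine similarity and fuses the `r` most similar pairs into their partners by size-weighted averaging:

```python
import numpy as np

def bipartite_merge(tokens, sizes, r):
    """Pair alternating tokens by cosine similarity and fuse the r most
    similar pairs into their partners via a size-weighted average."""
    a, b = tokens[0::2], tokens[1::2]
    sa, sb = sizes[0::2].astype(float), sizes[1::2].astype(float)
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                       # cosine similarity, A x B
    dst = sim.argmax(axis=1)              # best partner in B for each A token
    order = np.argsort(-sim.max(axis=1))  # A tokens ranked by similarity
    merge_idx, keep_idx = order[:r], order[r:]
    out_b, size_b = b.astype(float).copy(), sb.copy()
    for i in merge_idx:                   # incremental size-weighted average
        j = dst[i]
        w = sa[i] + size_b[j]
        out_b[j] = (sa[i] * a[i] + size_b[j] * out_b[j]) / w
        size_b[j] = w
    return (np.concatenate([a[keep_idx], out_b]),
            np.concatenate([sa[keep_idx], size_b]))
```

The returned `sizes` record how many original tokens each merged token represents; proportional attention later uses them to re-weight the softmax.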
Policies can be enhanced by:
- Adaptive scheduling per layer or timestep—e.g., proportionally more aggressive merging in stages of high redundancy, or with varying thresholds based on content complexity (Wang et al., 23 Apr 2025, Chang et al., 15 Nov 2025, Fang et al., 16 May 2025).
- Reuse and caching of merge assignments across neighboring layers or timesteps to minimize redundant computation and exploit slow-changing redundancy patterns (Lu et al., 13 Sep 2025, Saghatchian et al., 1 Jan 2025).
- Differentiable relaxation of the discrete selection and merging decisions, using soft masks, Gumbel-softmax, or similar techniques, enabling direct gradient-based optimization for task performance (Lee et al., 2024, Li et al., 17 Nov 2025).
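The differentiable relaxation in the last point can be sketched as follows. This NumPy fragment shows only the forward pass (a real implementation would use an autodiff framework); the temperature, slot count, and shapes are illustrative assumptions, not taken from any cited method. It replaces a hard token-to-slot assignment with a Gumbel-softmax sample that sharpens toward one-hot as the temperature decreases:

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relax a categorical choice: add Gumbel noise, then apply a
    temperature-scaled softmax so the decision stays differentiable."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))                # 6 tokens, dim 4 (illustrative)
logits = rng.normal(size=(6, 3))                # per-token scores for 3 merge slots
assign = gumbel_softmax(logits, tau=0.5, rng=rng)  # (6, 3), rows sum to 1
merged = assign.T @ tokens / assign.T.sum(axis=1, keepdims=True)  # (3, 4)
```

Because `assign` is a soft matrix rather than a hard index, gradients of a downstream task loss flow back into `logits`, letting the selection policy be trained end to end.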
2. Algorithmic Realizations and System Integration
Dynamic token merging methods are implemented as modular algorithmic blocks, inserted at varying depths in transformer or sequence-processing pipelines. The canonical pipeline consists of:
- At selected layers, grouping tokens into candidate merge pairs or clusters via local or global similarity search, greedy bipartite matching, submodular maximization, or clustering (agglomerative, K-means).
- Aggregating each group into a new token, typically via a weighted sum, arithmetic mean, or specialized merge function (e.g., max-magnitude per channel), combined with update of the associated feature size for proportional attention.
- (If applicable) Adjusting attention normalizations or positional embeddings to account for the compressed sequence (Bolya et al., 2022, Gong et al., 26 Sep 2025).
- Propagating the reduced sequence to subsequent network layers; in diffusion models or autoregressive settings, integrating merged tokens with gating or dynamic scheduling to respect causality or denoising priors (Fang et al., 16 May 2025, Chang et al., 15 Nov 2025).
- For zero-shot or post-training methods, merging is performed without retraining backbone parameters, while for methods such as DTEM (Lee et al., 2024), LTMP (Bonnaerens et al., 2023), or MergeDNA (Li et al., 17 Nov 2025), trainable merge modules are optimized by task loss with (optionally) decoupled or partially frozen backbone weights.
A sample pipeline for ToMA (Lu et al., 13 Sep 2025) is illustrative:
```
For each transformer block:
    If recompute assignment:
        Pick k centers via submodular greedy selection on similarity S
        Assign all tokens to nearest center
    Merge: group tokens by assignment, reduce Q/K/V via segmented sum
    Run attention on merged tokens
    Unmerge: scatter outputs to original slots via assignment index
```
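A runnable approximation of this pipeline is given below as a NumPy sketch under simplifying assumptions: a precomputed, clipped cosine-similarity matrix, mean aggregation in place of segmented Q/K/V sums, and the attention step elided. None of this is the official ToMA code; names and shapes are illustrative.

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedily maximize F(C) = sum_i max_{c in C} sim[i, c], the
    facility-location objective; greedy gives a (1 - 1/e) guarantee."""
    best = np.zeros(sim.shape[0])   # coverage so far (sim >= 0 assumed)
    centers = []
    for _ in range(k):
        gains = np.clip(sim - best[:, None], 0.0, None).sum(axis=0)
        c = int(np.argmax(gains))
        centers.append(c)
        best = np.maximum(best, sim[:, c])
    return centers

def merge_and_unmerge(x, sim, k):
    centers = greedy_facility_location(sim, k)
    assign = np.argmax(sim[:, centers], axis=1)   # nearest center per token
    merged = np.stack([x[assign == j].mean(axis=0) for j in range(k)])
    # ... attention would run on the k merged tokens here ...
    return merged[assign]                          # unmerge: scatter to slots
```

Caching the `centers`/`assign` pair and reusing it across neighboring blocks is what amortizes the selection cost in practice.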
3. Computational Complexity and Efficiency
The principal motivation for dynamic token merging is the reduction of computational and memory complexity from O(N²) to approximately O(M²), where M < N is the adaptive token count after merging at each layer. Notably:
- Practical speedup reaches 2–6× for latency and throughput with negligible or modest accuracy drops, as confirmed on high-resolution image and video pipelines (Bolya et al., 2022, Lee et al., 2024, Wang et al., 23 Apr 2025, Zhang et al., 21 Mar 2025).
- Memory savings can reach 80–90% with aggressive reduction, as in long-form video and segmentation tasks (Lee et al., 2024).
- The overhead of assignment selection and merging is typically subdominant: for ToMA and DyMU, optimization eliminates nearly all overhead relative to standard attention (Lu et al., 13 Sep 2025, Wang et al., 23 Apr 2025); in CA-ToMe (Saghatchian et al., 1 Jan 2025), caching further amortizes the remaining cost.
- Layerwise scheduling and adaptivity matter: merging early yields more computational savings but risks information loss, while merging late may preserve fidelity but with reduced efficiency gains (Bolya et al., 2022, Fang et al., 16 May 2025).
In diffusion models, ToMA achieves a 24% (SDXL) and 23% (FLUX) generation latency reduction at controlled perceptual delta (Δ < 0.07 by DINO) (Lu et al., 13 Sep 2025). In MLLMs and vision-LLMs, dynamic adaptation to content complexity enables a 32–85% reduction in average token count with near-baseline performance (Wang et al., 23 Apr 2025).
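The quadratic scaling behind these numbers can be checked with a back-of-envelope cost model, assuming attention FLOPs are dominated by the QKᵀ and AV matmuls (roughly 2·n²·d per head, projections excluded); the token count and head dimension below are illustrative:

```python
def attn_flops(n_tokens, head_dim):
    """Approximate FLOPs of one attention head: QK^T plus AV matmuls."""
    return 2 * n_tokens ** 2 * head_dim

n, d = 4096, 64                 # illustrative token count and head dim
m = n // 2                      # 50% of tokens merged away
speedup = attn_flops(n, d) / attn_flops(m, d)
# speedup == (n / m) ** 2 == 4.0: halving tokens quarters attention cost
```

The (N/M)² relation explains why even moderate merge ratios translate into the 2–6× wall-clock gains reported above, once assignment overhead is kept subdominant.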
4. Empirical Evaluation, Trade-offs, and Ablation Studies
Empirical results establish that dynamic token merging delivers state-of-the-art tradeoffs across classification, generation, and structured understanding tasks in both vision and multimodal domains. Highlights:
- ImageNet Classification: Off-the-shelf methods (ATC (Haurum et al., 2024), ToMe (Bolya et al., 2022), CubistMerge (Gong et al., 26 Sep 2025)) maintain or improve top-1 accuracy with 30–60% reduction in FLOPs; learned or decoupled embedding schemes (DTEM (Lee et al., 2024), LTMP (Bonnaerens et al., 2023)) further enhance the accuracy/FLOPs frontier, requiring only single-epoch fine-tuning.
- Long-form Video: Hierarchy- and saliency-guided mergers (Video Token Merging (Lee et al., 2024), DyTo (Zhang et al., 2024), Token Dynamics (Zhang et al., 21 Mar 2025)) compress inputs to as little as 0.07% of the original token count with <1.2% accuracy drop, representing a roughly six-order-of-magnitude reduction in quadratic attention cost and a 2× speed improvement.
- Text and Language: Dynamic subword merging (Feher et al., 2024) and learned byte-level deletion (Kallini et al., 2024) achieve 20–60% sequence reduction with <2% or even negligible downstream degradation, addressing cross-lingual and noisy input challenges.
- SSMs and Time Series: Local or Δ-weighted merging (Götz et al., 2024, Park et al., 19 Aug 2025) retains sequential and causal properties, with merge scheduling and spectral adaptation yielding order-of-magnitude improvements suitable for long chronosequences or genomic data.
- Diffusion Models: Methods including ToMA (Lu et al., 13 Sep 2025), D³ToM (Chang et al., 15 Nov 2025), SDTM (Fang et al., 16 May 2025), CA-ToMe (Saghatchian et al., 1 Jan 2025), and their ablations show that the right combination of dynamic scheduling, token selection, caching, and post-processing can achieve >1.2× acceleration with FID, CLIP, or mIoU nearly matching full-token baselines.
Ablation studies consistently show that:
- Adaptive selection criteria, whether submodular or saliency-driven, vastly outperform uniform or heuristically fixed merging.
- Reuse/caching across layers or time steps is critical to actual efficiency gains, especially in diffusion settings (Lu et al., 13 Sep 2025, Saghatchian et al., 1 Jan 2025).
- Feature-fidelity is preserved when merges respect proportional weighting and structure (e.g., in proportional attention (Bolya et al., 2022), or spatial/temporal structure (Gong et al., 26 Sep 2025, Zhang et al., 21 Mar 2025)).
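Proportional attention itself is easy to state concretely: adding log sⱼ to the attention logits makes a merged token that stands for sⱼ originals carry sⱼ-fold weight in the softmax. A minimal single-head NumPy sketch (shapes illustrative, no tricks beyond max-subtraction for stability):

```python
import numpy as np

def proportional_attention(q, k, v, sizes):
    """Single-head attention over merged tokens: the log(size) bias lets a
    token representing s originals count s times in the softmax."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + np.log(sizes)[None, :]
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Sanity check: merging two identical key/value tokens into one of size 2
# reproduces the unmerged attention output exactly.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(3, 4)); v = rng.normal(size=(3, 4))
full = proportional_attention(q, np.vstack([k, k[-1:]]),
                              np.vstack([v, v[-1:]]), np.ones(4))
merged = proportional_attention(q, k, v, np.array([1.0, 1.0, 2.0]))
```

The exact-equivalence property holds only when the merged tokens were truly identical; for approximately similar tokens, the log-size bias bounds rather than eliminates the discrepancy.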
5. Modalities, Extensions, and Cross-Domain Adaptation
Dynamic token merging is empirically validated across domains:
- Vision (images, segmentation, detection): Integration into ViTs, SAM, Mask2Former, and state-of-the-art segmentation/recognition pipelines, with spatial-structural preservation (CubistMerge (Gong et al., 26 Sep 2025)) delivering domain-agnostic, training-free speedups.
- Video and Multimodal: Saliency-focused and clustering-based schemes enable scaling to hundreds of frames and thousands of tokens per sample, necessary for structured QA, summarization, and generation (Zhang et al., 21 Mar 2025, Zhang et al., 2024, Lee et al., 2024).
- Text and Multilingual: Subword or byte-level dynamic merging (retrofit BPE (Feher et al., 2024), MrT5 (Kallini et al., 2024)) enables improved compression and fairness across languages, with adaptive hypernetwork-based embedding generation supporting on-the-fly vocabulary extension.
- Genomics: Context-aware dynamic tokenization (MergeDNA (Li et al., 17 Nov 2025)) based on hierarchical local-window merging integrates smoothly with masked pre-training, achieving top results across diverse nucleotide benchmarks.
- Time Series/State Space: Local, causal, or Δ-weighted schemes exploit temporal redundancy, maintaining causality and adapting seamlessly to both SSMs and attention-based models (Park et al., 19 Aug 2025, Götz et al., 2024).
Dynamic methods are agnostic to backbone architecture, extensible across tasks, and compatible with off-the-shelf or fine-tuned models, with or without access to end-to-end retraining.
6. Limitations, Open Problems, and Future Directions
Despite strong empirical results, key challenges and open problems include:
- Implementation Efficiency: Token assignment, grouping, and merging steps introduce non-trivial overhead, particularly on GPU. Efficient realization with parallel matrix ops and fused kernels (as in ToMA (Lu et al., 13 Sep 2025)) remains an active area, especially at scale and in memory-constrained regimes.
- Scheduling and Budgeting: Optimal allocation of merge ratios per layer, block, or time step is highly architecture-dependent and sensitive to task; dynamic and input-adaptive scheduling, while effective, may require careful calibration or meta-optimization (Erak et al., 11 Sep 2025, Wang et al., 23 Apr 2025).
- Information Loss and Fidelity: Aggressive reduction inevitably entails some loss. Advanced policies such as submodular maximization, saliency or structure-aware merging, and proportional attention help bound degradation, but formal guarantees are limited.
- Compatibility with Learned Positional and Structure Priors: In spatial and sequentially-structured architectures, preservation of layout and positional encoding is crucial; not all merging strategies are compatible with, e.g., RoPE or decomposed relative positional embeddings (Gong et al., 26 Sep 2025).
- Batched/Parallel Processing: Algorithms such as agglomerative clustering are challenging to batch efficiently without GPU-optimized libraries (Haurum et al., 2024).
- Extension to Causal/Decoder Architectures: Preserving strict autoregressive causality in decoders, especially with bidirectional or local merging, demands careful masking or architectural modification (Götz et al., 2024).
Future work includes principled dynamic scheduling via meta-learning, spectral characterization of mergeability (Götz et al., 2024), fully differentiable and context-adaptive merging in domain-agnostic pipelines (Lee et al., 2024, Li et al., 17 Nov 2025), and further integration in state space, multimodal, and reinforcement learning architectures.
7. Representative Algorithms and Comparative Table
| Method | Merge Criterion | Adaptivity | Training Required | Empirical Efficiency | Primary Domain | Reference |
|---|---|---|---|---|---|---|
| ToMe | Cosine similarity, bipartite | Per-input | No | 2× throughput, <0.3% drop | Images, video, audio | (Bolya et al., 2022) |
| ATC | Agglomerative clustering | Per-input | No | ↑ accuracy at low rates | Images, synthesis, detection | (Haurum et al., 2024) |
| LTMP | Learned thresholds, masking | Per-input | 1-epoch | ↑ SOTA accuracy, fast tune | Image classification | (Bonnaerens et al., 2023) |
| ToMA | Submodular selection, linear | Per-step, cache | No | 24% faster SDXL, <0.07Δ DINO | Diffusion image generation | (Lu et al., 13 Sep 2025) |
| DyMU (DToMe) | Similarity, complexity-adaptive | Per-image | No | 32–85% fewer tokens, ≈100% perf | Image, video | (Wang et al., 23 Apr 2025) |
| DTEM | Decoupled, learnable embedding | Per-input | Modular/Full | +0.3–1% vs ToMe/EViT | Classification/Segm/Caption | (Lee et al., 2024) |
| Video Token Merger | Saliency-driven, learned head | Per-scene | Yes (saliency only) | 84% mem, 6.9× speedup | Long video (LVU, COIN, etc.) | (Lee et al., 2024) |
| Token Dynamics | K-means hashing, map+cross-att | Per-video (adaptive/global) | No | Tokens 0.07%, ≤1.13% drop | Video LLMs | (Zhang et al., 21 Mar 2025) |
| MergeDNA | Local-window, differentiable | Hierarchical | Yes | +1.6% SOTA accuracy | Genomics and multi-omics | (Li et al., 17 Nov 2025) |
| CA-ToMe | Sim threshold, EMA, cache | Adaptive, cache | No | 1.25× SD1.5, ~0 FID drop | Diffusion, denoising | (Saghatchian et al., 1 Jan 2025) |
This table summarizes the principal dynamic token merging methods by merge criterion, adaptivity, training requirement, representative efficiency gains, and application domain.
Dynamic token merging thus represents a unified framework for principled, input-adaptive token reduction across sequence modeling domains, significantly mitigating the compute bottlenecks of self-attention and related mechanisms without compromising the semantic or structural fidelity of model outputs.