
Dynamic Token Merging in Transformers

Updated 4 March 2026
  • Dynamic token merging is a method that adaptively reduces token count in sequence models by merging similar tokens based on input-dependent criteria.
  • It employs techniques like clustering, submodular maximization, and learned thresholds to balance efficiency gains with minimal information loss.
  • Empirical evaluations demonstrate significant speedup, memory savings, and reduced FLOPs in diverse applications such as vision, text, and genomics with near-baseline performance.

Dynamic token merging encompasses a family of methods designed to adaptively reduce, at run time, the number of tokens processed in sequence models—transformers in particular—while minimizing information loss and preserving model fidelity. The approach underpins advances across vision, text, video, genomics, and multimodal processing, where the quadratic (or higher) complexity of self-attention or related mechanisms sets fundamental performance bottlenecks. Dynamic merging departs from static or fixed schedule schemes by leveraging input-dependent criteria—token similarity, saliency, informativeness, and context structure—allowing the model to adapt the compression pattern to each sample or time step. Techniques in this class are realized algorithmically via clustering (greedy or agglomerative), submodular selection, decoupled embedding projections, saliency-driven masking, and other optimization-based or learnable policies, with practical deployment ranging from pure inference-time modules to end-to-end differentiable architectures.

1. Foundational Principles and Mechanisms

Dynamic token merging is predicated on the observation that, given redundancy in high-dimensional input spaces, many tokens (patches, subwords, frame regions) in deep networks encode overlapping or uninformative content. Instead of uniform pruning or fixed-ratio merging, dynamic approaches evaluate token-wise or pairwise criteria such as:

  • Cosine similarity or normed dot product of token embeddings or attention keys, as in ToMe (Bolya et al., 2022), DyMU (Wang et al., 23 Apr 2025), and CA-ToMe (Saghatchian et al., 1 Jan 2025), identifying tokens that can be safely merged by weighted averaging.
  • Submodular maximization to select token subsets of maximal coverage/diversity under a similarity kernel, as in ToMA (Lu et al., 13 Sep 2025); the facility-location function $f(S) = \sum_{v} \max_{u \in S} s(u, v)$ is maximized subject to a token budget, tractably approximated by greedy selection.
  • Learned thresholds or saliency heads that modulate per-token reduction based on endogenous importance estimates—e.g., LTMP (Bonnaerens et al., 2023) introduces learned pruning and merging thresholds, while saliency-predictive heads control selective aggregation in video and image transformers (Lee et al., 2024).
  • Locality and context-structured criteria, such as positionally-restricted merging in time series (Götz et al., 2024) or spatial-constrained bipartite matching in CubistMerge (Gong et al., 26 Sep 2025).
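As an illustration of the submodular criterion above, the facility-location objective can be maximized by a simple greedy loop over a cosine-similarity kernel. The following is a toy numpy sketch; the function and variable names are illustrative, not from any released implementation:

```python
import numpy as np

def greedy_facility_location(tokens: np.ndarray, k: int) -> list[int]:
    """Greedily pick k 'center' tokens maximizing f(S) = sum_v max_{u in S} s(u, v)."""
    # Cosine-similarity kernel s(u, v) between all token pairs.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                      # (N, N)

    n = tokens.shape[0]
    centers: list[int] = []
    best_cover = np.full(n, -np.inf)             # max_{u in S} s(u, v) so far
    for _ in range(k):
        # Objective value f(S ∪ {u}) for every candidate u; since f(S) is
        # fixed, the argmax also maximizes the marginal gain.
        gains = np.maximum(sim, best_cover).sum(axis=1)
        gains[centers] = -np.inf                 # never re-pick a center
        u = int(np.argmax(gains))
        centers.append(u)
        best_cover = np.maximum(best_cover, sim[u])
    return centers
```

After selection, each remaining token is assigned to its most similar center, which yields the merge groups.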

The merging operation itself ranges from simple size-weighted or norm-weighted averages (preserving feature scale for proportional attention), through max-magnitude-per-dimension reduction (CubistMerge) (Gong et al., 26 Sep 2025), to full attention-like matrix transformations (ToMA) (Lu et al., 13 Sep 2025).
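A minimal sketch of the size-weighted averaging step, in the style of ToMe's bipartite soft matching (alternate-split into two sets, fold the r most similar pairs together while tracking each token's accumulated size); this is a simplified illustration under those assumptions, not the authors' released code:

```python
import numpy as np

def bipartite_merge(x: np.ndarray, size: np.ndarray, r: int):
    """Merge the r most-similar (A, B) token pairs by size-weighted average.

    x: (N, d) token features; size: (N,) current token 'weights'.
    Returns N - r tokens and their updated sizes.
    """
    a, b = x[0::2].copy(), x[1::2].copy()        # alternate split into sets A, B
    sa, sb = size[0::2].copy(), size[1::2].copy()
    # Cosine similarity of every A token to every B token.
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    scores = an @ bn.T                           # (|A|, |B|)
    best_dst = scores.argmax(axis=1)             # each A token's best B partner
    best_sim = scores.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]        # r most similar A tokens

    keep = np.ones(len(a), dtype=bool)
    keep[merge_idx] = False
    for i in merge_idx:                          # fold each merged A token into B
        j = best_dst[i]
        tot = sa[i] + sb[j]
        b[j] = (a[i] * sa[i] + b[j] * sb[j]) / tot   # size-weighted average
        sb[j] = tot
    return np.concatenate([a[keep], b]), np.concatenate([sa[keep], sb])
```

Tracking `size` is what lets later layers apply proportional attention: a token that absorbed m originals keeps weight m rather than being treated as a single token.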

Policies can further be enhanced by caching merge assignments across steps or layers (as in CA-ToMe and ToMA), by proportional-attention corrections that account for merged token sizes, and by per-layer scheduling of merge budgets.

2. Algorithmic Realizations and System Integration

Dynamic token merging methods are implemented as modular algorithmic blocks, inserted at varying depths in transformer or sequence-processing pipelines. The canonical pipeline consists of:

  • At selected layers, grouping tokens into candidate merge pairs or clusters via local or global similarity search, greedy bipartite matching, submodular maximization, or clustering (agglomerative, K-means).
  • Aggregating each group into a new token, typically via a weighted sum, arithmetic mean, or specialized merge function (e.g., max-magnitude per channel), combined with update of the associated feature size for proportional attention.
  • (If applicable) Adjusting attention normalizations or positional embeddings to account for the compressed sequence (Bolya et al., 2022, Gong et al., 26 Sep 2025).
  • Propagating the reduced sequence to subsequent network layers; in diffusion models or autoregressive settings, integrating merged tokens with gating or dynamic scheduling to respect causality or denoising priors (Fang et al., 16 May 2025, Chang et al., 15 Nov 2025).
  • For zero-shot or post-training methods, merging is performed without retraining backbone parameters, while for methods such as DTEM (Lee et al., 2024), LTMP (Bonnaerens et al., 2023), or MergeDNA (Li et al., 17 Nov 2025), trainable merge modules are optimized by task loss with (optionally) decoupled or partially frozen backbone weights.
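The proportional-attention adjustment in the pipeline above can be sketched as a log-size bias on the attention logits, so that a token standing in for m originals receives m times the softmax mass (following the idea introduced with ToMe; the function below is an illustrative assumption, not a specific library API):

```python
import numpy as np

def proportional_attention(q: np.ndarray, k: np.ndarray,
                           v: np.ndarray, size: np.ndarray) -> np.ndarray:
    """q: (Nq, d); k, v: (Nk, d); size: (Nk,) tokens absorbed into each key."""
    d = q.shape[-1]
    # Adding log(size) to a key's logit multiplies its softmax weight by size.
    logits = q @ k.T / np.sqrt(d) + np.log(size)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With all sizes equal to 1 this reduces to standard scaled dot-product attention.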

A sample pipeline for ToMA (Lu et al., 13 Sep 2025) is illustrative:

For each transformer block:
    If recompute assignment:
        Pick k centers via submodular greedy selection on similarity S
        Assign all tokens to nearest center
    Merge: group tokens by assignment, reduce Q/K/V via segmented sum
    Run attention on merged tokens
    Unmerge: scatter outputs to original slots via assignment index

Merging and unmerging are fused into efficient batched matrix operations, fully compatible with high-performance attention implementations (e.g., FlashAttention) and avoiding inefficiencies such as per-token scatter/gather or sorting.
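Assuming each token carries a hard assignment to one of k centers, the fused merge/unmerge pattern reduces to two matrix multiplications with a one-hot assignment matrix, which is what makes the step batch- and GPU-friendly. A toy sketch (names are illustrative):

```python
import numpy as np

def merge_unmerge(x: np.ndarray, assign: np.ndarray, k: int):
    """x: (N, d) tokens; assign: (N,) index of each token's center in [0, k)."""
    onehot = np.eye(k)[assign]                   # (N, k) assignment matrix
    counts = onehot.sum(axis=0, keepdims=True)   # tokens per group
    merged = (onehot.T @ x) / counts.T           # segmented mean -> (k, d)
    # ... attention would run here on the k merged tokens ...
    unmerged = onehot @ merged                   # scatter back -> (N, d)
    return merged, unmerged
```

In practice the one-hot matrix is kept implicit (segment indices plus a segmented-sum kernel), but the algebra is the same.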

3. Computational Complexity and Efficiency

The principal motivation for dynamic token merging is the reduction of computational and memory complexity from $O(N^2 d)$ to approximately $O(k^2 d)$, with $k \ll N$ the adaptive token count after merging at each layer. Notably:

In diffusion models, ToMA achieves 24% (SDXL) and 23% (FLUX) generation-latency reductions at controlled perceptual change ($\Delta < 0.07$ by DINO) (Lu et al., 13 Sep 2025). In MLLMs and vision-LLMs, dynamic adaptation to content complexity enables a 32–85% reduction in average token count with near-baseline performance (Wang et al., 23 Apr 2025).
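As a back-of-envelope check of the quadratic saving, the dominant $N^2 d$ attention term shrinks by a factor of $(N/k)^2$ when merging reduces N tokens to k:

```python
# Illustrative arithmetic only: counts just the dominant N^2 * d attention term,
# ignoring projections, MLPs, and the (small) cost of the merge step itself.
def attn_flops(n: int, d: int) -> int:
    return n * n * d

n, k, d = 4096, 1024, 64
print(attn_flops(n, d) / attn_flops(k, d))   # (4096/1024)^2 -> 16.0
```

The merge/unmerge overhead must stay well below this saving for the method to pay off, which is why fused batched implementations matter.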

4. Empirical Evaluation, Trade-offs, and Ablation Studies

Empirical results establish that dynamic token merging delivers state-of-the-art trade-offs across classification, generation, and structured-understanding tasks in both vision and multimodal domains; representative efficiency and accuracy figures are collected in the comparative table of Section 7.

Ablation studies consistently show that input-adaptive merge criteria outperform fixed-ratio baselines at matched token budgets, and that size-aware (proportional) attention is important for preserving accuracy under aggressive merging.

5. Modalities, Extensions, and Cross-Domain Adaptation

Dynamic token merging is empirically validated across domains:

  • Vision (images, segmentation, detection): Integration into ViTs, SAM, Mask2Former, and state-of-the-art segmentation/recognition pipelines, with spatial-structural preservation (CubistMerge (Gong et al., 26 Sep 2025)) delivering domain-agnostic, training-free speedups.
  • Video and Multimodal: Saliency-focused and clustering-based schemes enable scaling to hundreds of frames and thousands of tokens per sample, necessary for structured QA, summarization, and generation (Zhang et al., 21 Mar 2025, Zhang et al., 2024, Lee et al., 2024).
  • Text and Multilingual: Subword or byte-level dynamic merging (retrofit BPE (Feher et al., 2024), MrT5 (Kallini et al., 2024)) enables improved compression and fairness across languages, with adaptive hypernetwork-based embedding generation supporting on-the-fly vocabulary extension.
  • Genomics: Context-aware dynamic tokenization (MergeDNA (Li et al., 17 Nov 2025)) based on hierarchical local-window merging integrates smoothly with masked pre-training, achieving top results across diverse nucleotide benchmarks.
  • Time Series/State Space: Local, causal, or Δ-weighted schemes exploit temporal redundancy, maintaining causality and adapting seamlessly to both SSMs and attention-based models (Park et al., 19 Aug 2025, Götz et al., 2024).

Dynamic methods are agnostic to backbone architecture, extensible across tasks, and compatible with off-the-shelf or fine-tuned models, with or without access to end-to-end retraining.

6. Limitations, Open Problems, and Future Directions

Despite strong empirical results, key challenges and open problems include:

  • Implementation Efficiency: Token assignment, grouping, and merging steps introduce non-trivial overhead, particularly on GPU. Efficient realization with parallel matrix ops and fused kernels (as in ToMA (Lu et al., 13 Sep 2025)) remains an active area, especially at scale and in memory-constrained regimes.
  • Scheduling and Budgeting: Optimal allocation of merge ratios per layer, block, or time step is highly architecture-dependent and sensitive to task; dynamic and input-adaptive scheduling, while effective, may require careful calibration or meta-optimization (Erak et al., 11 Sep 2025, Wang et al., 23 Apr 2025).
  • Information Loss and Fidelity: Aggressive reduction inevitably entails some loss. Advanced policies such as submodular maximization, saliency or structure-aware merging, and proportional attention help bound degradation, but formal guarantees are limited.
  • Compatibility with Learned Positional and Structure Priors: In spatial and sequentially-structured architectures, preservation of layout and positional encoding is crucial; not all merging strategies are compatible with, e.g., RoPE or decomposed relative positional embeddings (Gong et al., 26 Sep 2025).
  • Batched/Parallel Processing: Algorithms such as agglomerative clustering are challenging to batch efficiently without GPU-optimized libraries (Haurum et al., 2024).
  • Extension to Causal/Decoder Architectures: Preserving strict autoregressive causality in decoders, especially with bidirectional or local merging, demands careful masking or architectural modification (Götz et al., 2024).

Future work includes principled dynamic scheduling via meta-learning, spectral characterization of mergeability (Götz et al., 2024), fully differentiable and context-adaptive merging in domain-agnostic pipelines (Lee et al., 2024, Li et al., 17 Nov 2025), and further integration in state space, multimodal, and reinforcement learning architectures.

7. Representative Algorithms and Comparative Table

| Method | Merge Criterion | Adaptivity | Training Required | Empirical Efficiency | Primary Domain | Reference |
|---|---|---|---|---|---|---|
| ToMe | Cosine similarity, bipartite | Per-input | No | 2× throughput, <0.3% drop | Images, video, audio | (Bolya et al., 2022) |
| ATC | Agglomerative clustering | Per-input | No | ↑ accuracy at low rates | Images, synthesis, detection | (Haurum et al., 2024) |
| LTMP | Learned thresholds, masking | Per-input | 1 epoch | ↑ SOTA accuracy, fast tune | Image classification | (Bonnaerens et al., 2023) |
| ToMA | Submodular selection, linear | Per-step, cached | No | 24% faster SDXL, <0.07 Δ DINO | Diffusion image generation | (Lu et al., 13 Sep 2025) |
| DyMU (DToMe) | Similarity, complexity-adaptive | Per-image | No | 32–85% fewer tokens, ≈100% perf. | Image, video | (Wang et al., 23 Apr 2025) |
| DTEM | Decoupled, learnable embedding | Per-input | Modular/full | +0.3–1% vs. ToMe/EViT | Classification/segmentation/captioning | (Lee et al., 2024) |
| Video Token Merger | Saliency-driven, learned head | Per-scene | Yes (saliency head only) | 84% mem, 6.9× speedup | Long video (LVU, COIN, etc.) | (Lee et al., 2024) |
| Token Dynamics | K-means hashing, map + cross-attention | Per-video (adaptive/global) | No | Tokens reduced to 0.07%, ≤1.13% drop | Video LLMs | (Zhang et al., 21 Mar 2025) |
| MergeDNA | Local-window, differentiable | Hierarchical | Yes | +1.6% SOTA accuracy | Genomics and multi-omics | (Li et al., 17 Nov 2025) |
| CA-ToMe | Similarity threshold, EMA, cache | Adaptive, cached | No | 1.25× faster SD1.5, ~0 FID drop | Diffusion, denoising | (Saghatchian et al., 1 Jan 2025) |

This table summarizes several principal dynamic token merging methods, including their merge criteria, adaptivity, training requirements, representative efficiency gains, and application domains.


Dynamic token merging thus represents a unified framework for principled, input-adaptive token reduction across sequence modeling domains, significantly mitigating the compute bottlenecks of self-attention and related mechanisms without compromising the semantic or structural fidelity of model outputs.
