
Dynamic Token Merging Framework

Updated 7 April 2026
  • Dynamic token merging is a strategy that adaptively reduces token sequences based on content-aware signals to cut Transformer self-attention compute.
  • It leverages semantic, spatial, and saliency cues for per-sample and per-layer merging, balancing efficiency with minimal performance loss.
  • The framework supports hierarchical and domain-specific merging techniques, achieving significant FLOPs reduction in vision and language applications.

Dynamic token merging refers to a collection of algorithmic strategies designed to adaptively compress token sequences within Transformer-based architectures by merging, pruning, or transforming tokens at runtime or during lightweight post-processing. These frameworks target the reduction of quadratic self-attention complexity, improve throughput, and enable extreme-scale or latency-sensitive applications in vision, language, and multimodal domains. Dynamic token merging distinguishes itself by eschewing static, hand-tuned or one-off reductions in favor of mechanisms that decide per-sample, per-layer, or per-timestep how aggressively to merge or retain tokens, frequently according to input complexity, salience, or spatial priors.

1. Core Principles and Motivations

The canonical motivation for dynamic token merging frameworks is the excessive computational and memory cost of Transformer self-attention, which scales as $O(N^2 D)$ in the number of tokens $N$ and embedding dimension $D$, and is particularly acute in high-resolution vision, video, and long-context sequence applications. Instead of fixed token pruning or static pooling, dynamic strategies leverage adaptive signals, such as similarity, salience, spatial structure, or task-level priors, to merge only redundant or low-importance tokens. Key conceptual advances include content-aware merging signals (semantic, spatial, and saliency cues), per-sample and per-layer adaptivity of the merge budget, and hierarchical or domain-specific merging schemes.
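As a back-of-the-envelope illustration of the quadratic scaling above (the numbers are illustrative, not drawn from any cited paper): halving the token count quarters the attention FLOPs.

```python
def attention_flops(n_tokens: int, dim: int) -> int:
    """Rough FLOPs of one self-attention block: the two N x N x D
    matmuls (QK^T and AV), each ~2*N*N*D multiply-adds; projections
    and softmax are ignored."""
    return 2 * (2 * n_tokens * n_tokens * dim)

full = attention_flops(1024, 768)   # e.g. a 32x32 patch grid
half = attention_flops(512, 768)    # after merging away half the tokens
print(f"{full / half:.1f}x fewer attention FLOPs")  # -> 4.0x
```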

2. Methodological Families

Dynamic token merging encompasses a variety of algorithmic approaches across vision, language, and multimodal modeling. Representative methodologies include:

  • Similarity-Based Bipartite Matching: At each layer, partition tokens into disjoint sets (e.g., A, B) and merge the most similar pairs according to a measured similarity (e.g., cosine or dot-product in the key/query space). Merge quotas and selection can be statically scheduled or dynamically determined (Bolya et al., 2022, Wang et al., 23 Apr 2025, Heo et al., 2023, Huang et al., 24 Jun 2025); a minimal sketch of this family appears after this list.
  • Saliency- or Entropy-Guided Selection: Tokens are assigned per-sample saliency or importance scores, often derived from attention matrix entropy, norm magnitudes, or learned saliency heads. Merging and retention budgets are set adaptively based on input entropy/complexity (Lee et al., 2024, Liu et al., 16 Aug 2025).
  • Spatially Preserving/Windowed Merging: To maintain compatibility with window-attention or spatial architectures (e.g., SAM, Swin), merging is performed within local windows or follows spatial reduction strategies maintaining 2D layouts (Gong et al., 26 Sep 2025, Huang et al., 24 Jun 2025, Kienzle et al., 2024).
  • Many-to-Many “Token Transforming”: Generalizes merging and pruning as a matrix transformation $Y = TX$, where $T$ (not necessarily a block or diagonal matrix) is constructed from attention and similarity patterns, supporting non-exclusive mapping (Zeng et al., 6 Jun 2025).
  • Hash Table and Index Map for Video: Extreme token reduction in video employs K-Means clustering over patch tokens to create a compact token base and a grid-level index map for motion trajectory, preserving spatial-temporal structure even under severe compression (Zhang et al., 21 Mar 2025).
  • Hierarchical or Multi-step Frameworks: Multiple dynamic steps (expansion, merging, expansion-unmerging) may be composed to first select or densify informative regions before merging, or compress tokens and later expand for compatibility with downstream components (e.g., in LLMs or VLMs) (Wang et al., 23 Apr 2025, Ye et al., 10 Dec 2025).
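To make the bipartite-matching family concrete, the sketch below implements a simplified, single-sample version of ToMe-style merging (Bolya et al., 2022). It is an illustrative approximation: real implementations typically match on attention keys, operate on batches, resolve destination conflicts with size-weighted scatter-adds, and track token sizes for proportional attention.

```python
import torch

def bipartite_soft_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar (A, B) token pairs, ToMe-style.

    x: (N, D) token features for one sample; returns (N - r, D).
    """
    a, b = x[0::2], x[1::2]                        # alternate tokens into sets A and B
    a_n = a / a.norm(dim=-1, keepdim=True)
    b_n = b / b.norm(dim=-1, keepdim=True)
    scores = a_n @ b_n.T                           # cosine similarity, (|A|, |B|)

    best_val, best_dst = scores.max(dim=-1)        # best partner in B for each A token
    order = best_val.argsort(descending=True)      # greedy: most similar pairs first
    src = order[:r]                                # A tokens merged away
    kept = order[r:]                               # A tokens kept unchanged

    merged_b = b.clone()
    # Simple mean fusion of each merged A token into its B destination.
    # (If two A tokens share a destination, the last write wins here; a
    # full implementation accumulates with scatter-add and size weights.)
    merged_b[best_dst[src]] = (b[best_dst[src]] + a[src]) / 2
    return torch.cat([a[kept], merged_b], dim=0)   # N - r tokens survive

tokens = torch.randn(196, 768)                     # e.g. a 14x14 ViT patch grid
print(bipartite_soft_matching(tokens, r=16).shape) # torch.Size([180, 768])
```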

3. Algorithmic Illustrations and Pseudocode Structures

Canonical dynamic merging procedures follow a generic per-layer architecture:

  1. Similarity/Saliency Calculation: Compute semantic (and possibly spatial/geometric) similarities between tokens, or derive saliency/importance scores as a function of attention, feature statistics, or external priors (Huang et al., 24 Jun 2025, Lee et al., 2024).
  2. Candidate Pair Selection: For merging, perform bipartite matching (typically greedy or conflict-avoiding) to select the top $r$ pairs based on similarity, or sample/retain tokens stochastically in proportion to their saliency (Bolya et al., 2022, Lee et al., 2024).
  3. Fusion/Merging Operation: For each chosen pair $(i, j)$, generate a merged token via size- or norm-weighted averaging, max-magnitude per dimension, or addition, and update metadata (e.g., token sizes, ancestry/source maps) (Wang et al., 23 Apr 2025, Gong et al., 26 Sep 2025). Proportional attention or log-size corrections are applied post-merge to preserve correct weighting in subsequent attention blocks (Bolya et al., 2022, Wang et al., 23 Apr 2025); see the sketch after this list.
  4. Dynamic Budget/Schedule Control: Merge quotas per layer may be controlled by static schedules, complexity-adaptive rules, or explicit multi-objective optimizers (e.g., Bayesian optimization to fit a Pareto frontier) (Erak et al., 11 Sep 2025, Lee et al., 2024).
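The fusion step (3) can be made concrete with a small sketch: a single pair is fused by size-weighted averaging with the per-token size metadata updated, followed by a proportional-attention variant that adds the log-size correction to the attention logits (Bolya et al., 2022). Shapes and function names are illustrative assumptions, not any paper's API.

```python
import torch
import torch.nn.functional as F

def merge_pair(x, size, i, j):
    """Fuse token i into token j by size-weighted averaging (step 3).

    x: (N, D) features; size: (N,) count of original tokens each entry
    represents. Returns the reduced (N-1, D) features and (N-1,) sizes.
    """
    s = size[i] + size[j]
    x = x.clone(); size = size.clone()
    x[j] = (size[i] * x[i] + size[j] * x[j]) / s
    size[j] = s
    keep = torch.arange(x.shape[0]) != i           # drop the absorbed token i
    return x[keep], size[keep]

def proportional_attention(q, k, v, size):
    """Self-attention with a +log(size) bias on the key logits, so each
    merged token counts as `size` original tokens post-merge."""
    logits = q @ k.T / q.shape[-1] ** 0.5 + size.log()[None, :]
    return F.softmax(logits, dim=-1) @ v

x, size = torch.randn(8, 16), torch.ones(8)
x, size = merge_pair(x, size, i=2, j=5)            # 7 tokens remain; merged one has size 2
out = proportional_attention(x, x, x, size)        # (7, 16)
print(out.shape, size.tolist())
```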

Published pseudocode follows this modular structure, with frameworks differing in merging criteria, similarity computation, and selection logic. For example, the virtual token unmerging (VTU) module enables merged-token sequences to be re-expanded, maintaining full downstream compatibility (especially in VLMs and LLMs) through efficient remapping and attention reconstruction (Wang et al., 23 Apr 2025).
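As a sketch of the unmerging idea (not DyMU's actual API), the function below re-expands a merged sequence to its original length by indexing with a source map recorded during merging; the real VTU module additionally reconstructs the attention computation efficiently rather than materializing the full-length sequence.

```python
import torch

def virtual_unmerge(x_merged: torch.Tensor, source_map: torch.Tensor) -> torch.Tensor:
    """Broadcast each merged token back to the original positions of its
    constituents. source_map[p] holds, for each original position p, the
    index of the merged token that position was folded into."""
    return x_merged[source_map]                      # (N_orig, D)

x_merged = torch.randn(60, 768)                      # 60 surviving tokens
source_map = torch.randint(0, 60, (196,))            # map for 196 original positions
print(virtual_unmerge(x_merged, source_map).shape)   # torch.Size([196, 768])
```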

4. Practical Deployments and Experimental Outcomes

Dynamic token merging frameworks have been validated across a wide range of domains:

| Domain | Notable Frameworks | Key Results |
|---|---|---|
| Vision (image) | ToMe, ToSA, DSM, DyMU | ImageNet/COCO: 40–70% FLOPs reduction, <1% accuracy loss, up to 1.8× speedup |
| Video | VTM, Token Dynamics, DyTo | LVU/COIN: >6× speedup, up to 99.93% accuracy with 15–33% of tokens retained |
| Diffusion gen. | SDTM, CA-ToMe, D³ToM | Stable Diffusion: NN0–NN1 speedup, NN2 FID increase |
| Language | MrT5, Dynamic Tokenization | XNLI/UNER: 20–75% length reduction, NN3 accuracy/F1 drop, large multilingual gains |
| VLMs | DyMU, TEAM-VLA, MergeVQ | LLaVA-Bench/LIBERO: NN4–NN5 speedup, NN6 relative accuracy |
| Semantic comm. | Adaptive Pareto-Optimal | Transmission cost drops at fixed accuracy via SNR-adaptive merging (Erak et al., 11 Sep 2025) |

Empirical studies across these frameworks consistently show that dynamic token merging yields substantial efficiency gains for Transformer models, strongly reducing wall-clock latency, memory, and FLOP budgets with minimal or negligible degradation in downstream accuracy or generation metrics (Huang et al., 24 Jun 2025, Wang et al., 23 Apr 2025, Lee et al., 2024, Fang et al., 16 May 2025, Erak et al., 11 Sep 2025). Fine-grained merging schedules, content- and entropy-adaptive budgets, and spatial structure awareness consistently contribute to the best trade-offs (Huang et al., 24 Jun 2025, Gong et al., 26 Sep 2025, Li et al., 17 Nov 2025, Kienzle et al., 2024).

5. Key Technical Variants and Comparative Insights

A representative taxonomy of dynamic token merging strategies includes:

| Method/Framework | Core Strategy | Special Features | Ref. |
|---|---|---|---|
| ToMe | Bipartite similarity, static $r$ | Proportional attention, no retraining needed | (Bolya et al., 2022) |
| ToSA | Fused semantic/spatial similarity | Depth-based spatial tokens, α schedule, ViT acceleration | (Huang et al., 24 Jun 2025) |
| Dynamic VTM | Saliency-guided, dynamic quota | Learnable saliency head, average-pool merging, layerwise γ, α | (Lee et al., 2024) |
| SDTM | Structure/detail phase merging | Attention-driven, local-global hybrid, prompt reweighting | (Fang et al., 16 May 2025) |
| DyMU | Dynamic per-image threshold | Complexity-adaptive, virtual unmerging for LLM compatibility | (Wang et al., 23 Apr 2025) |
| Token Transforming | Sparse many-to-many transform | Unified framework, attention-derived selection, dense tasks | (Zeng et al., 6 Jun 2025) |
| CubistMerge | 2D path-graph spatial merging | Spatial grid preserved, max-magnitude merge | (Gong et al., 26 Sep 2025) |
| MrT5 | Learned delete gate (T5 encoder) | Soft/hard deletion, multilingual, information merging | (Kallini et al., 2024) |
| Dynamic Tokenization | BPE-style batch merges | Embedding hypernetwork, retrofits fixed LM (LoRA fine-tune) | (Feher et al., 2024) |

Empirical ablations underscore that merging based on combined semantic and spatial signals (Huang et al., 24 Jun 2025, Gong et al., 26 Sep 2025), as well as adaptive budgeting (entropy, per-layer or per-sample) (Wang et al., 23 Apr 2025, Liu et al., 16 Aug 2025, Erak et al., 11 Sep 2025), outperforms uniform or static strategies. Virtual unmerging and source mapping are critical for compatibility with downstream components expecting fixed-length input (Wang et al., 23 Apr 2025, Li et al., 1 Apr 2025).
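A minimal sketch of entropy-adaptive budgeting, under assumed bounds and a linear mapping (the cited frameworks use their own learned or scheduled rules): low attention entropy signals redundancy and permits a larger merge quota.

```python
import torch

def entropy_adaptive_quota(attn: torch.Tensor, r_min: int = 4, r_max: int = 32) -> int:
    """Pick a per-sample merge quota from attention entropy.

    attn: (N, N) post-softmax attention for one head; returns r in
    [r_min, r_max], larger when the distribution is low-entropy.
    """
    p = attn.clamp_min(1e-9)
    row_entropy = -(p * p.log()).sum(dim=-1)          # (N,)
    max_entropy = torch.log(torch.tensor(float(attn.shape[-1])))
    e = (row_entropy.mean() / max_entropy).clamp(0.0, 1.0)
    return round(r_max - (r_max - r_min) * e.item())  # high entropy -> small quota

attn = torch.softmax(torch.randn(196, 196), dim=-1)
print(entropy_adaptive_quota(attn))                   # somewhere in [4, 32]
```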

6. Limitations, Trade-offs, and Extensibility

Dynamic token merging frameworks, while highly effective for accelerating transformers, carry specific limitations and avenues for future development:

  • Compression extremes: Aggressive merging to ultra-low token counts (e.g., NN7) can cause accuracy or fidelity degradation, especially on fine-grained tasks. Extensions using distillation, hybrid quantization, or learned controllers may compensate (Fang et al., 16 May 2025, Zhang et al., 21 Mar 2025).
  • Compatibility: Most methods are plug-and-play on standard ViT architectures; spatial- or RoPE-specific variants such as CubistMerge are required for advanced spatial backbones (Gong et al., 26 Sep 2025).
  • Overhead: Dynamic strategies can introduce algorithmic or implementation overhead (clustering, matching), which must be mitigated by efficient GPU kernels or approximations for real-time settings (Heo et al., 2023, Liu et al., 16 Aug 2025).
  • Semantic Preservation: Preserving critical rare or context-specific details requires sufficiently informative similarity, saliency, or external guidance. Prompt- or action-aware reweighting, as in SDTM and TEAM-VLA, improves robustness (Fang et al., 16 May 2025, Ye et al., 10 Dec 2025).
  • Extensibility: Dynamic token merging is increasingly being generalized to multi-modal, hierarchical, and streaming settings (e.g., multi-modal VLMs, real-time robotics, video LLMs, edge semantic communication), and paired with Pareto-front optimization for downstream efficiency-accuracy trade-off control (Erak et al., 11 Sep 2025, Zhang et al., 21 Mar 2025).

7. Cross-domain and Future Applications

Dynamic token merging is rapidly becoming foundational for efficient deep sequence modeling across vision, language, genomics, and multi-modal AI.

The unifying trajectory in the field is to make adaptive, context- or task-aware token compression a “first-class” architectural primitive for state-of-the-art Transformer pipelines across all major modalities, with fully pluggable, training-free instantiations now dominating practical and empirical benchmarks.
