Dynamic Token Merging Framework
- Dynamic token merging is a strategy that adaptively reduces token sequences based on content-aware signals to cut Transformer self-attention compute.
- It leverages semantic, spatial, and saliency cues for per-sample and per-layer merging, balancing efficiency with minimal performance loss.
- The framework supports hierarchical and domain-specific merging techniques, achieving significant FLOPs reduction in vision and language applications.
Dynamic token merging refers to a collection of algorithmic strategies designed to adaptively compress token sequences within Transformer-based architectures by merging, pruning, or transforming tokens at runtime or during lightweight post-processing. These frameworks target the reduction of quadratic self-attention complexity, improve throughput, and enable extreme-scale or latency-sensitive applications in vision, language, and multimodal domains. Dynamic token merging distinguishes itself by eschewing static, hand-tuned or one-off reductions in favor of mechanisms that decide per-sample, per-layer, or per-timestep how aggressively to merge or retain tokens, frequently according to input complexity, salience, or spatial priors.
1. Core Principles and Motivations
The canonical motivation for dynamic token merging frameworks is to address the excessive computational and memory cost of Transformer self-attention, which scales as O(N²d) in the number of tokens N and embedding dimension d, a cost that is particularly acute in high-resolution vision, video, and long-context sequence applications. Instead of fixed token pruning or static pooling, dynamic strategies leverage adaptive signals—such as similarity, salience, spatial structure, or task-level priors—to merge only redundant or low-importance tokens. Key conceptual advances include:
- Adaptive/Content-aware Compression: Token reduction rates or merge targets are decided dynamically based on per-input or per-layer statistics (e.g., token similarity, entropy, or saliency scores), avoiding uniform loss of critical information (Wang et al., 23 Apr 2025, Huang et al., 24 Jun 2025, Lee et al., 2024, Erak et al., 11 Sep 2025).
- Semantic and Spatial Awareness: Integration of both semantic content and explicit spatial information (e.g., grid coordinates, depth, or geometry) enhances the fidelity of structural information during merging (Huang et al., 24 Jun 2025, Gong et al., 26 Sep 2025, Fang et al., 16 May 2025).
- Hierarchical and Stage-wise Processing: Multi-level or delayed merging schemes, often initiated only after feature convergence in lower layers, preserve early fine-grained information and promote robust abstraction (Heo et al., 2023, Gong et al., 26 Sep 2025, Zhang et al., 2024).
- Task- and Domain-specific Extensions: Custom dynamic schemes, such as action-guided or decoder-importance merging, align the compression process with task objectives such as control, generation, or semantic communication (Ye et al., 10 Dec 2025, Chang et al., 15 Nov 2025, Li et al., 1 Apr 2025).
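As a rough illustration of where the savings come from, the following sketch estimates per-block self-attention FLOPs before and after halving the token count (the constant factors and the example token/dimension values are simplified assumptions, not figures from any cited framework):

```python
def attention_flops(n_tokens: int, dim: int) -> int:
    """Approximate FLOPs of one self-attention block:
    QK^T scores and attention-weighted values (each ~n^2 * d),
    plus the four n*d*d projections (Q, K, V, output)."""
    return 2 * n_tokens**2 * dim + 4 * n_tokens * dim**2

full = attention_flops(196, 768)   # e.g., a ViT-B/16 on a 224x224 image
merged = attention_flops(98, 768)  # after merging away half the tokens
print(f"FLOPs reduction: {1 - merged / full:.1%}")
```

Because the quadratic term dominates only at long sequence lengths, the realized savings depend on both N and d, which is why merging pays off most in high-resolution and long-context regimes.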
2. Methodological Families
Dynamic token merging encompasses a variety of algorithmic approaches across vision, language, and multimodal modeling. Representative methodologies include:
- Similarity-Based Bipartite Matching: At each layer, partition tokens into disjoint sets (e.g., A, B) and merge the most similar pairs according to a measured similarity (e.g., cosine or dot-product in the key/query space). Merge quotas and selection can be statically scheduled or dynamically determined (Bolya et al., 2022, Wang et al., 23 Apr 2025, Heo et al., 2023, Huang et al., 24 Jun 2025).
- Saliency- or Entropy-Guided Selection: Tokens are assigned per-sample saliency or importance scores, often derived from attention matrix entropy, norm magnitudes, or learned saliency heads. Merging and retention budgets are set adaptively based on input entropy/complexity (Lee et al., 2024, Liu et al., 16 Aug 2025).
- Spatially Preserving/Windowed Merging: To maintain compatibility with window-attention or spatial architectures (e.g., SAM, Swin), merging is performed within local windows or follows spatial reduction strategies maintaining 2D layouts (Gong et al., 26 Sep 2025, Huang et al., 24 Jun 2025, Kienzle et al., 2024).
- Many-to-Many “Token Transforming”: Generalizes merging and pruning as a matrix transformation (e.g., Y = MX), where the transform matrix M (not necessarily a block or diagonal matrix) is constructed from attention and similarity patterns, supporting non-exclusive, many-to-many token mappings (Zeng et al., 6 Jun 2025).
- Hash Table and Index Map for Video: Extreme token reduction in video employs K-Means clustering over patch tokens to create a compact token base and a grid-level index map for motion trajectory, preserving spatial-temporal structure even under severe compression (Zhang et al., 21 Mar 2025).
- Hierarchical or Multi-step Frameworks: Multiple dynamic steps (expansion, merging, expansion-unmerging) may be composed to first select or densify informative regions before merging, or compress tokens and later expand for compatibility with downstream components (e.g., in LLMs or VLMs) (Wang et al., 23 Apr 2025, Ye et al., 10 Dec 2025).
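The similarity-based bipartite matching family above can be sketched in a few lines. This is a minimal, assumed implementation of the alternating-partition idea (function name and partition scheme follow the common even/odd convention; it is not a verbatim reproduction of any cited method):

```python
import numpy as np

def bipartite_soft_matching(keys: np.ndarray, r: int):
    """One bipartite merging step (sketch). Tokens are split into
    disjoint sets A (even indices) and B (odd indices); each A-token
    proposes its most similar B-partner by cosine similarity, and the
    r highest-scoring edges are selected for merging.
    `keys` is (n, d); returns (A-token indices, matched B-token indices)."""
    k = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
    a, b = k[0::2], k[1::2]           # disjoint partition avoids conflicts
    scores = a @ b.T                  # cosine similarity matrix
    best_b = scores.argmax(axis=-1)   # each A-token's best partner in B
    best_score = scores.max(axis=-1)
    merge_a = np.argsort(-best_score)[:r]  # keep only the top-r edges
    return merge_a, best_b[merge_a]
```

The disjoint A/B split is what makes the matching conflict-free without an expensive global assignment; the quota r can be a static schedule or set dynamically, as discussed below.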
3. Algorithmic Illustrations and Pseudocode Structures
Canonical dynamic merging procedures follow a generic per-layer architecture:
- Similarity/Saliency Calculation: Compute semantic (and possibly spatial/geometric) similarities between tokens, or derive saliency/importance scores as a function of attention, feature statistics, or external priors (Huang et al., 24 Jun 2025, Lee et al., 2024).
- Candidate Pair Selection: For merging, perform bipartite matching (typically greedy or conflict-avoiding) to select the top pairs based on similarity, or sample/retain tokens stochastically proportional to their saliency (Bolya et al., 2022, Lee et al., 2024).
- Fusion/Merging Operation: For each chosen pair of tokens, generate a merged token via size- or norm-weighted averaging, max-magnitude per dimension, or addition, and update metadata (e.g., token sizes, ancestry/source maps) (Wang et al., 23 Apr 2025, Gong et al., 26 Sep 2025). Proportional attention or log-size corrections are applied post-merge to preserve correct weighting in subsequent attention blocks (Bolya et al., 2022, Wang et al., 23 Apr 2025).
- Dynamic Budget/Schedule Control: Merge quotas per layer may be controlled by static schedules, complexity-adaptive rules, or explicit multi-objective optimizers (e.g., Bayesian optimization to fit a Pareto frontier) (Erak et al., 11 Sep 2025, Lee et al., 2024).
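The fusion and proportional-attention steps above can be sketched as follows. Size-weighted averaging is just one of the fusion choices named in the literature, and the helper names here are assumptions for illustration:

```python
import numpy as np

def merge_pairs(x, sizes, pairs):
    """Size-weighted fusion (sketch): each (src, dst) pair folds token
    src into token dst via a weighted average, accumulating token sizes
    so later layers can apply proportional attention."""
    x, sizes = x.copy(), sizes.copy()
    for src, dst in pairs:
        total = sizes[src] + sizes[dst]
        x[dst] = (sizes[src] * x[src] + sizes[dst] * x[dst]) / total
        sizes[dst] = total
    keep = np.setdiff1d(np.arange(len(x)), [s for s, _ in pairs])
    return x[keep], sizes[keep]

def proportional_attention(q, k, sizes):
    """Attention with the log-size correction, so a merged token
    contributes as if its constituent tokens were still present."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + np.log(sizes)[None, :]
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

Tracking `sizes` (and, in some frameworks, a source map) is what keeps repeated merging unbiased across layers: without the log-size bias, a token representing many patches would be attended to no more strongly than a singleton.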
Pseudocode follows a modular structure, differing in merging criteria, similarity computation, and selection logic. For example, the virtual token unmerging (VTU) module enables merged-token sequences to be expanded, maintaining full downstream compatibility (especially in VLMs and LLMs) through efficient remapping and attention reconstruction (Wang et al., 23 Apr 2025).
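A minimal illustration of the unmerging idea, expanding a merged sequence back to full length via a source map, might look like the following (the function name and the map layout are assumptions for illustration, not the VTU implementation, which additionally reconstructs attention efficiently):

```python
import numpy as np

def unmerge(merged: np.ndarray, source_map: np.ndarray) -> np.ndarray:
    """Expand a merged token sequence to full length by copying each
    merged token back to every original position it absorbed.
    `source_map[i]` is the merged-token index for original position i."""
    return merged[source_map]

merged = np.array([[1.0, 0.0], [0.0, 1.0]])
source_map = np.array([0, 0, 1])   # original tokens 0 and 1 were merged
full = unmerge(merged, source_map) # restored to shape (3, 2)
```

The expanded sequence has the fixed length a downstream LLM or VLM expects, at the cost of duplicated rows rather than lost positions.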
4. Practical Deployments and Experimental Outcomes
Dynamic token merging frameworks have been validated across a wide range of domains:
| Domain | Notable Frameworks | Key Results |
|---|---|---|
| Vision (image) | ToMe, ToSA, DSM, DyMU | ImageNet/COCO: 40–70% FLOPs reduction with minimal accuracy loss and corresponding speedups |
| Video | VTM, Token Dynamics, DyTo | LVU/COIN: notable speedups; up to 99.93% accuracy with 15–33% of tokens retained |
| Diffusion Gen. | SDTM, CA-ToMe, D³ToM | Stable Diffusion: speedups with a FID increase of about 2 |
| Language | MrT5, Dynamic Tokenization | XNLI/UNER: 20–75% length reduction with about a 3-point accuracy/F1 drop; large multilingual gains |
| VLMs | DyMU, TEAM-VLA, MergeVQ | LLaVA-Bench/LIBERO: 4–5× speedup at about 6% relative accuracy change |
| Semantic Comm. | Adaptive Pareto-Optimal | Transmission cost drops at fixed accuracy via SNR-adaptive merging (Erak et al., 11 Sep 2025) |
Empirical studies across these frameworks consistently show that dynamic token merging yields substantial efficiency gains for Transformer models, sharply reducing wall-clock latency, memory, and FLOP budgets with minimal or negligible degradation in downstream accuracy or generation metrics (Huang et al., 24 Jun 2025, Wang et al., 23 Apr 2025, Lee et al., 2024, Fang et al., 16 May 2025, Erak et al., 11 Sep 2025). Fine-grained merging schedules, content- and entropy-adaptive budgets, and spatial structure awareness consistently contribute to the best trade-offs (Huang et al., 24 Jun 2025, Gong et al., 26 Sep 2025, Li et al., 17 Nov 2025, Kienzle et al., 2024).
5. Key Technical Variants and Comparative Insights
A representative taxonomy of dynamic token merging strategies includes:
| Method/Framework | Core Strategy | Special Features | Ref. |
|---|---|---|---|
| ToMe | Bipartite similarity, static r | Proportional attention, no retraining needed | (Bolya et al., 2022) |
| ToSA | Fused semantic/spatial similarity | Depth-based spatial tokens, α schedule, ViT acceleration | (Huang et al., 24 Jun 2025) |
| Dynamic VTM | Saliency-guided, dynamic quota | Learnable saliency head, average-pool merging, layerwise γ,α | (Lee et al., 2024) |
| SDTM | Structure/detail phase merging | Attention-driven, local-global hybrid, prompt reweighting | (Fang et al., 16 May 2025) |
| DyMU | Dynamic per-image threshold | Complexity-adaptive, virtual unmerging for LLM compatibility | (Wang et al., 23 Apr 2025) |
| Token Transforming | Sparse many-to-many transform | Unified framework, attention-derived selection, dense tasks | (Zeng et al., 6 Jun 2025) |
| CubistMerge | 2D path-graph spatial merging | Spatial grid preserved, max-magnitude merge | (Gong et al., 26 Sep 2025) |
| MrT5 | Learned delete gate (T5 encoder) | Soft/hard deletion, multilingual, information merging | (Kallini et al., 2024) |
| Dynamic Tokenization | BPE-style, batch merges | Embedding hypernetwork, retrofits fixed-LM (LORA fine-tune) | (Feher et al., 2024) |
Empirical ablations underscore that merging based on combined semantic and spatial signals (Huang et al., 24 Jun 2025, Gong et al., 26 Sep 2025), as well as adaptive budgeting (entropy, per-layer or per-sample) (Wang et al., 23 Apr 2025, Liu et al., 16 Aug 2025, Erak et al., 11 Sep 2025), outperforms uniform or static strategies. Virtual unmerging and source mapping are critical for compatibility with downstream components expecting fixed-length input (Wang et al., 23 Apr 2025, Li et al., 1 Apr 2025).
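The entropy-based adaptive budgeting mentioned above can be sketched as a simple rule: peaked attention suggests a few tokens dominate (merge more of the rest), while near-uniform attention suggests information is spread out (merge less). The function name, the linear mapping, and the quota bounds are illustrative assumptions:

```python
import numpy as np

def adaptive_merge_budget(attn: np.ndarray, r_min: int, r_max: int) -> int:
    """Entropy-adaptive merge quota (sketch). `attn` is an (n, n)
    row-stochastic attention matrix; the mean row entropy, normalized
    by its maximum log(n), is mapped linearly onto [r_min, r_max]."""
    p = np.clip(attn, 1e-12, None)             # avoid log(0)
    row_entropy = -(p * np.log(p)).sum(axis=-1).mean()
    frac = row_entropy / np.log(attn.shape[-1])  # 0 = peaked, 1 = uniform
    return int(round(r_max - frac * (r_max - r_min)))
```

Real frameworks replace this linear map with learned heads, per-layer schedules, or multi-objective search, but the signal being thresholded is of this kind.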
6. Limitations, Trade-offs, and Extensibility
Dynamic token merging frameworks, while highly effective for accelerating transformers, carry specific limitations and avenues for future development:
- Compression extremes: Aggressive merging down to ultra-low token counts (e.g., as few as 7 tokens) can cause accuracy or fidelity degradation, especially on fine-grained tasks. Extensions using distillation, hybrid quantization, or learned controllers may compensate (Fang et al., 16 May 2025, Zhang et al., 21 Mar 2025).
- Compatibility: Most methods are plug-and-play on standard ViT architectures; spatial- or RoPE-specific variants such as CubistMerge are required for advanced spatial backbones (Gong et al., 26 Sep 2025).
- Overhead: Dynamic strategies can introduce algorithmic or implementation overhead (clustering, matching), which must be mitigated by efficient GPU kernels or approximations for real-time settings (Heo et al., 2023, Liu et al., 16 Aug 2025).
- Semantic Preservation: Preserving critical rare or context-specific details requires sufficiently informative similarity, saliency, or external guidance. Prompt- or action-aware reweighting—e.g., in SDTM and TEAM-VLA—improves robustness (Fang et al., 16 May 2025, Ye et al., 10 Dec 2025).
- Extensibility: Dynamic token merging is increasingly being generalized to multi-modal, hierarchical, and streaming settings (e.g., multi-modal VLMs, real-time robotics, video LLMs, edge semantic communication), and paired with Pareto-front optimization for downstream efficiency-accuracy trade-off control (Erak et al., 11 Sep 2025, Zhang et al., 21 Mar 2025).
7. Cross-domain and Future Applications
Dynamic token merging is rapidly becoming foundational for efficient deep sequence modeling across vision, language, genomics, and multi-modal AI:
- Vision: Acceleration and scaling of ViTs, dense segmentation, detection, and generative diffusion models with negligible loss (Huang et al., 24 Jun 2025, Fang et al., 16 May 2025, Kienzle et al., 2024).
- Video: Streaming adaptation (DyTo, Token Dynamics) yields up to 8× reduction in token count for LLM conditioning without substantial performance drop (Zhang et al., 21 Mar 2025, Zhang et al., 2024).
- Language: Dynamic tokenization, learned or retrofitted for batch-specific subword merging, balances multilingual equity and inference cost (Feher et al., 2024, Kallini et al., 2024).
- Genomics: Hierarchical local-global dynamic tokenizers (MergeDNA) outperform fixed-vocabulary baselines and large DNA foundation models on biological sequence understanding (Li et al., 17 Nov 2025).
- Semantic Communication: Adaptive Pareto-optimal per-layer policies, selected by Bayesian optimization or mapped from environment conditions (e.g., channel SNR), trade accuracy and transmission cost on the fly (Erak et al., 11 Sep 2025).
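A condition-mapped policy of the kind described for semantic communication might be no more than a lookup from channel SNR to a per-layer merge schedule. Everything below (thresholds, quotas, names) is an illustrative placeholder for policies that would be found offline, e.g., on a Bayesian-optimized Pareto front:

```python
import bisect

# Hypothetical offline-optimized policies: each tuple gives the number
# of tokens to merge in three successive layers; higher SNR tolerates
# more aggressive merging here purely for illustration.
SNR_THRESHOLDS_DB = [5.0, 15.0]
POLICIES = [(4, 4, 4), (8, 8, 8), (16, 16, 16)]

def select_policy(snr_db: float):
    """Pick the merge schedule for the current channel condition."""
    return POLICIES[bisect.bisect_right(SNR_THRESHOLDS_DB, snr_db)]
```

At runtime the sender measures SNR, selects a schedule, and merges accordingly, trading accuracy against transmission cost without retraining.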
- Action-Robotic Perception: Expansion-merge pipelines coupled with action-aware merging (TEAM-VLA) enable real-time inference in large VLA and robotic models (Ye et al., 10 Dec 2025).
The unifying trajectory in the field is to deliver adaptive, context or task-aware token compression as a “first-class” architectural primitive for state-of-the-art Transformer pipelines across all major modalities, with fully pluggable and training-free instantiations now dominating practical and empirical benchmarks.