
Dynamic Patch Merging in Vision Transformers

Updated 12 April 2026
  • Dynamic Patch Merging (dCTS) is a technique that adaptively merges image patches in vision transformers based on content significance and learned heuristics.
  • It employs strategies such as learned policy networks, clustering with importance scoring, and boundary prediction to efficiently reduce token counts while retaining key features.
  • dCTS achieves significant computational gains—up to 3.5× speedup and reduced memory usage—while maintaining competitive accuracy across segmentation and generative tasks.

Dynamic Patch Merging (dCTS) is a family of algorithms and architectural modules that adaptively reduce the number of tokens in vision transformers and related architectures. These methods dynamically group, pool, or merge input patches (or tokens) using sample-dependent heuristics, learned scoring functions, or content-aware predictors. Unlike static downsampling or fixed pooling strategies, dCTS mechanisms concentrate representation capacity on semantically or structurally complex regions while aggressively merging homogeneous or low-importance areas. This yields computational, memory, and throughput gains across domains including high-resolution semantic segmentation, vision-language pretraining, and generative diffusion modeling (Szczepanski et al., 17 Sep 2025, Peng et al., 29 Oct 2025, Sun et al., 2024, Kim et al., 19 Feb 2026).

1. Core Principles and Variants of Dynamic Patch Merging

The unifying tenet of dCTS approaches is the dynamic, input-dependent adaptation of tokenization granularity within transformer-based vision architectures. Rather than partitioning an input image into a uniformly spaced, fixed grid of small patches (e.g., 16×16 in standard ViT), dCTS modules determine, at runtime or via learned policies, which image regions can be efficiently represented by merged "superpatches" or segment-wise aggregations. Key implementation strategies include:

  • Learned Policy Networks: As in STEP, a lightweight CNN-based policy network (EfficientNet-Lite0) computes homogeneity or similarity scores over recursive patch groups. Windows exceeding an adaptive threshold are merged, ensuring uniform spatial size after raster concatenation and resizing (Szczepanski et al., 17 Sep 2025).
  • Feature Clustering and Importance Scoring: In Patch-Merging Transformer (PMT), tokens are clustered according to a DPC-KNN density and distance product metric; importance-weighted merging and residual feature updates via cross-attention preserve information from dense or intricate regions (Sun et al., 2024).
  • Boundary-Predictor and Differentiable Pooling: DRIP and similar frameworks use a shallow MLP with Gumbel-Sigmoid gating to assign binary boundaries between tokens, segmenting and pooling embeddings dynamically. This boundary rate aligns with semantic or object structure, facilitating interpretable reductions (Peng et al., 29 Oct 2025).
  • Dynamic Patch Scheduling in Generative Models: DDiT varies the patch size per diffusion timestep according to latent 'acceleration' (third-order finite differences of denoising latents), adapting computation and granularity throughout the generation process (Kim et al., 19 Feb 2026).

A summary of leading dCTS system characteristics is presented below:

| Framework | Granularity Choice | Merge Trigger | Pooling Method |
|---|---|---|---|
| STEP/dCTS | Hierarchical patch windows | Similarity score (CNN) > $\tau$ | Raster concat + resize |
| PMT/dCTS | Token clusters (variable) | DPC-KNN importance | Weighted softmax + cross-attn |
| DRIP/dCTS | Sequential segments | Boundary MLP (Gumbel-Sigmoid) | Average pooling |
| DDiT/dCTS | Patch size per timestep | Per-step latent statistics | Patch-embed + LoRA |

2. Mathematical Formulations and Algorithmic Workflows

dCTS algorithms are characterized by content-dependent, granularity-adaptive processing, typically implemented by a two-stage procedure: (1) importance or boundary prediction followed by (2) segment formation and pooling.

2.1 STEP/dCTS Formulation

  • Similarity score for a window $\mathcal{W}$:

$$S = \sigma\big(W_p^T \, \mathrm{embedding}(\mathcal{W})\big) \in [0,1]$$

  • Merge decision: If $S > \tau_\text{window}$, all $n$ patches in $\mathcal{W}$ are merged.
  • Superpatch construction: Raster-concatenate input patches, bilinear resize to 16×16, linearly embed to token space.

No merge-specific loss is used; dCTS is trained end-to-end with standard segmentation losses (Szczepanski et al., 17 Sep 2025).
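The score, merge decision, and superpatch construction above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the STEP implementation: the two-feature `embedding` (a bias term plus the window's pixel standard deviation) and the weights `w_p` are hypothetical stand-ins for the CNN policy network, only a single 2×2 grouping level is shown rather than recursive groups, and the linear embedding to token space is omitted.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of a 2-D array, sampling on an evenly spaced grid."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def similarity_score(window, w_p):
    """S = sigmoid(w_p^T embedding(window)); the embedding here is hypothetical."""
    feat = np.array([1.0, window.std()])  # assumption: bias + pixel std
    return 1.0 / (1.0 + np.exp(-(np.asarray(w_p) @ feat)))

def merge_patches(image, patch=16, group=2, tau=0.5, w_p=(4.0, -40.0)):
    """Merge each (group x group) window of patches when S > tau.

    Assumes image height and width are multiples of patch * group.
    Returns flattened patch-sized tokens and a per-token 'merged' flag.
    """
    win = patch * group
    tokens, merged = [], []
    for r in range(0, image.shape[0], win):
        for c in range(0, image.shape[1], win):
            window = image[r:r + win, c:c + win]
            if similarity_score(window, w_p) > tau:
                # Superpatch: the whole window resized down to one patch.
                tokens.append(bilinear_resize(window, patch, patch).ravel())
                merged.append(True)
            else:
                # Keep the original patches of this window, in raster order.
                for pr in range(0, win, patch):
                    for pc in range(0, win, patch):
                        tokens.append(window[pr:pr + patch, pc:pc + patch].ravel())
                        merged.append(False)
    return np.stack(tokens), np.array(merged)

# Toy image: flat left half (mergeable), noisy right half (kept fine-grained).
rng = np.random.default_rng(0)
image = np.full((64, 64), 0.5)
image[:, 32:] = rng.uniform(size=(64, 32))
tokens, merged = merge_patches(image)
print(tokens.shape, int(merged.sum()))
```

Because every output token has the same patch-sized shape, the downstream encoder can consume merged and unmerged tokens identically, which is the point of resizing superpatches back to the canonical size.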

2.2 PMT/dCTS Cluster-Based Merging

  • Local density: $\rho_i = \exp\big[ -\frac{1}{k} \sum_{x_j \in \text{KNN}(x_i)} \|x_i - x_j\|^2 \big]$
  • Distance to higher-density point: $\delta_i$, defined piecewise based on neighborhood density.
  • Importance: the product of density and distance, $\rho_i \, \delta_i$.
  • Selection: The top-ranked tokens (by importance) are cluster centers.
  • Merging: Within clusters, importance-weighted sum via softmax scores.
  • Residual update: Cross-attention combines merged tokens with the full set of original tokens (Sun et al., 2024).
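A compact NumPy sketch of the clustering and merging steps listed above follows. It is illustrative only: the piecewise rule for $\delta_i$ uses the common DPC-KNN convention (distance to the nearest denser token, or the largest distance for the densest token), the neighbour set excludes the token itself, tokens are assigned to their nearest center, and the cross-attention residual update is left out. The function name and the parameters `num_centers` and `k` are assumptions made for this example.

```python
import numpy as np

def dpc_knn_merge(x, num_centers, k=5):
    """Cluster tokens x of shape (n, d) with a DPC-KNN style score and merge them."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # squared distances (n, n)
    dist = np.sqrt(d2)

    # Local density: exp of minus the mean squared distance to the k nearest
    # neighbours (column 0 of the sorted rows is the token itself).
    knn = np.sort(d2, axis=1)[:, 1:k + 1]
    rho = np.exp(-knn.mean(axis=1))

    # delta_i: distance to the nearest token with higher density; for the
    # densest token(s), fall back to the largest distance in that row.
    higher = rho[None, :] > rho[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta = np.where(np.isinf(delta), dist.max(axis=1), delta)

    score = rho * delta                        # importance = density * distance
    centers = np.argsort(-score)[:num_centers]

    # Assign every token to its nearest center; centers keep their own cluster.
    assign = dist[:, centers].argmin(axis=1)
    assign[centers] = np.arange(num_centers)

    # Importance-weighted merge: softmax over the scores inside each cluster.
    merged = np.zeros((num_centers, x.shape[1]))
    for c in range(num_centers):
        idx = np.where(assign == c)[0]
        w = np.exp(score[idx] - score[idx].max())
        merged[c] = (w / w.sum()) @ x[idx]
    return merged, assign, centers

# Two well-separated blobs of 20 tokens each should yield one cluster per blob.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.1, size=(20, 2)),
                    rng.normal(10.0, 0.1, size=(20, 2))])
merged, assign, centers = dpc_knn_merge(x, num_centers=2)
print(merged.shape, assign[:20], assign[20:])
```

The pairwise distance matrix makes this sketch quadratic in the number of tokens, which is fine for illustration but is exactly the kind of overhead a practical implementation would need to control.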

2.3 DRIP/dCTS Segmentwise Pooling

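  • Boundary prediction: For a sequence of input tokens, a 2-layer MLP outputs one boundary logit per token, which is translated into a boundary probability.
  • Gumbel-Sigmoid sampling: The boundary probabilities are relaxed with Gumbel-Sigmoid sampling and used to produce binary boundaries.
  • Pooling: Tokens between consecutive boundaries form a segment, and each segment is average-pooled into a single embedding (Peng et al., 29 Oct 2025).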
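The boundary sampling and segment pooling steps can be sketched as below. This is a generic Gumbel-Sigmoid relaxation followed by per-segment averaging, written under stated assumptions rather than taken from DRIP: the boundary logits are supplied directly instead of coming from the MLP, the `temperature` value and the 0.5 threshold are hypothetical, a boundary of 1 is taken to mark the last token of a segment, and the straight-through gradient trick that would normally be used during training is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gumbel_sigmoid(logits, temperature, rng):
    """Relaxed Bernoulli sample: sigmoid((logits + logistic noise) / temperature).

    The difference of two Gumbel(0, 1) samples is Logistic(0, 1) noise,
    which is drawn here directly from a uniform sample.
    """
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)
    soft = sigmoid((logits + noise) / temperature)
    hard = (soft > 0.5).astype(np.float64)  # assumption: threshold at 0.5
    return soft, hard

def segment_mean_pool(tokens, boundaries):
    """Average-pool tokens within segments.

    boundaries[i] == 1 marks token i as the last token of its segment
    (a convention chosen for this sketch).
    """
    seg_id = np.concatenate([[0], np.cumsum(boundaries[:-1])]).astype(int)
    n_seg = seg_id[-1] + 1
    sums = np.zeros((n_seg, tokens.shape[1]))
    np.add.at(sums, seg_id, tokens)            # scatter-add rows per segment
    counts = np.bincount(seg_id, minlength=n_seg)[:, None]
    return sums / counts, seg_id

# Six tokens, boundaries after tokens 1 and 4: segments {0,1}, {2,3,4}, {5}.
tokens = np.arange(12, dtype=np.float64).reshape(6, 2)
boundaries = np.array([0, 1, 0, 0, 1, 0])
pooled, seg_id = segment_mean_pool(tokens, boundaries)
print(seg_id, pooled)
```

The number of pooled tokens equals the number of sampled boundaries plus one, so the sequence length after pooling varies per input.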

2.4 DDiT/dCTS Adaptive Patch Scheduling

  • For each diffusion inference step, compute the third-order finite difference of the denoising latents across consecutive steps.
  • Partition the result into patches at each candidate patch size; for each size, compute the ρ-th percentile of the per-patch standard deviations.
  • Choice: Select the largest candidate patch size whose percentile statistic satisfies the threshold criterion; otherwise use the finest patching.
  • Patch-embed, apply transformer step, and proceed to next timestep (Kim et al., 19 Feb 2026).
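The scheduling logic can be illustrated with a short NumPy sketch. Several details are assumptions made for the example rather than facts from the source: the backward-difference stencil, the candidate sizes `(4, 2, 1)`, the percentile `q`, the `threshold` value, and the choice that a larger patch is accepted when the percentile statistic falls below the threshold.

```python
import numpy as np

def third_order_difference(latents):
    """Third-order backward difference from the four most recent latents.

    latents = [z_{k-3}, z_{k-2}, z_{k-1}, z_k], all of the same shape.
    """
    z3, z2, z1, z0 = latents
    return z0 - 3.0 * z1 + 3.0 * z2 - z3

def patch_std_percentile(signal, patch, q):
    """q-th percentile of the per-patch standard deviation of a 2-D signal."""
    h, w = signal.shape
    cropped = signal[:h - h % patch, :w - w % patch]
    blocks = cropped.reshape(h // patch, patch, w // patch, patch)
    return np.percentile(blocks.std(axis=(1, 3)), q)

def choose_patch_size(signal, candidates=(4, 2, 1), q=90, threshold=0.05):
    """Pick the largest candidate whose statistic is below the threshold.

    Falls back to the finest (smallest) candidate when none qualifies.
    """
    finest = min(candidates)
    for patch in sorted(candidates, reverse=True):
        if patch != finest and patch_std_percentile(signal, patch, q) < threshold:
            return patch
    return finest

# A latent trajectory that is quadratic in the step index has zero third
# difference, so the coarsest patch is chosen; pure noise keeps the finest.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 8, 8))
smooth = third_order_difference([a + b * k + c * k**2 for k in range(4)])
print(choose_patch_size(smooth), choose_patch_size(rng.normal(size=(8, 8))))
```

Raising `threshold` admits larger patches more often, which corresponds to trading quality for speed.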

3. Integration into Transformer Architectures

Integration strategies for dCTS modules depend on architectural and task-specific requirements:

  • ViT/Semantic Segmentation: Merged patches or superpatches are resized to a canonical size (e.g., 16×16), linearly embedded, and fed to downstream transformer encoder blocks. Learned position embeddings are transferred by bilinear interpolation, maintaining spatial geometry (Szczepanski et al., 17 Sep 2025, Sun et al., 2024).
  • Dual- or Multi-Scale Backbones: dCTS augments or replaces hierarchical downsampling with dynamic, scene-dependent granularity selection, supporting both single-branch and multi-branch architectures for segmentation (Sun et al., 2024).
  • Vision-Language Pretraining: DRIP's dynamically pooled token set is padded and masked for continuity in transformer operations, enabling flexible batch processing despite variable token count across examples (Peng et al., 29 Oct 2025).
  • Diffusion Models: DDiT introduces separate linear patch embeddings per patch size, interpolated positional encodings, and LoRA adapters for minimal fine-tuning. At inference, patch-embed modules are switched dynamically, while attention and feed-forward layers remain unchanged (Kim et al., 19 Feb 2026).
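The padding-and-masking strategy mentioned for variable token counts can be sketched as follows. This is a generic illustration, not DRIP's code: the helper names are hypothetical, and the attention uses identity projections so that the effect of the key mask is easy to inspect.

```python
import numpy as np

def pad_and_mask(token_sets):
    """Stack variable-length token sets (each (n_i, d)) into a padded batch.

    Returns the zero-padded batch (B, L, d) and a boolean mask (B, L)
    that is True for real tokens and False for padding.
    """
    max_len = max(t.shape[0] for t in token_sets)
    dim = token_sets[0].shape[1]
    batch = np.zeros((len(token_sets), max_len, dim))
    mask = np.zeros((len(token_sets), max_len), dtype=bool)
    for i, t in enumerate(token_sets):
        batch[i, :t.shape[0]] = t
        mask[i, :t.shape[0]] = True
    return batch, mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention that ignores padded keys.

    Rows for padded queries are still computed and should be discarded
    by the caller using the same mask.
    """
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = np.where(mask[:, None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Two examples that ended up with 3 and 5 tokens after dynamic pooling.
rng = np.random.default_rng(0)
sets = [rng.normal(size=(3, 4)), rng.normal(size=(5, 4))]
batch, mask = pad_and_mask(sets)
out, weights = masked_attention(batch, batch, batch, mask)
print(batch.shape, mask.sum(axis=1))
```

Padding costs some wasted compute on the shorter examples, so batching inputs with similar token counts is a natural companion to this scheme.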

4. Computational Complexity and Efficiency Gains

dCTS modules universally target substantial reductions in token count, FLOPs, and memory usage without significant performance loss.

  • STEP/dCTS (ViT-Large on 1024×1024 images):
    • Tokens: about 2.5× reduction
    • FLOPs: about 2.6× reduction
    • Throughput: about 3.4× boost
    • mIoU: roughly 1.5% drop for segmentation (Szczepanski et al., 17 Sep 2025).
  • PMT/dCTS (Cityscapes):
    • Memory usage: lower for the single-branch model than for dual-branch comparators, with higher mIoU (Sun et al., 2024).
  • DRIP/dCTS (ViT-B-16):
    • 4× pooling: about 1.33× speedup
    • Top-1 ImageNet drop of 0.6% at 4× reduction (Peng et al., 29 Oct 2025).
  • DDiT/dCTS (Diffusion):
    • Up to 3.52× speedup in image generation, with a limited FID increase on FLUX-1.Dev (Kim et al., 19 Feb 2026).

A table summarizing measured reductions:

| Method & Domain | Token Reduction | FLOP Reduction | Throughput Speedup | Principal Performance Impact |
|---|---|---|---|---|
| STEP/dCTS (Szczepanski et al., 17 Sep 2025) | 2.5× | 2.6× | 3.4× | ~1.5% mIoU drop |
| PMT/dCTS (Sun et al., 2024) | Adaptive | ~2.1× | - | −2.7% mIoU if dCTS removed |
| DRIP/dCTS (Peng et al., 29 Oct 2025) | 4–10× | 1.33–1.77× | up to 1.8× | ~1% top-1 drop (ImageNet) |
| DDiT/dCTS (Kim et al., 19 Feb 2026) | Adaptive | up to 3.52× | up to 3.52× | Limited FID increase (FLUX-1) |
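To see why token reduction translates into compute savings, a back-of-the-envelope estimate helps. The sketch below uses the standard multiply-accumulate count for one transformer encoder layer (projections and MLP scale linearly with the token count, attention scales quadratically). It is an idealized, encoder-only estimate: the 16×16 patch grid and hidden size 1024 are assumptions chosen to resemble a ViT-Large on a 1024×1024 image, and the result is not expected to reproduce any figure reported above, since end-to-end measurements also include components outside the encoder layers.

```python
def encoder_layer_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer encoder layer."""
    proj = 4 * num_tokens * dim * dim              # Q, K, V and output projections
    attn = 2 * num_tokens * num_tokens * dim       # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim   # the two MLP matmuls
    return proj + attn + mlp

def compute_reduction(num_tokens, dim, token_reduction):
    """Ratio of per-layer cost before and after reducing the token count."""
    reduced = round(num_tokens / token_reduction)
    return encoder_layer_macs(num_tokens, dim) / encoder_layer_macs(reduced, dim)

# Assumed setting: 1024x1024 image, 16x16 patches, hidden size 1024.
num_tokens = (1024 // 16) ** 2
ratio = compute_reduction(num_tokens, dim=1024, token_reduction=2.5)
print(f"{num_tokens} tokens, per-layer compute reduction ~ {ratio:.2f}x")
```

The per-layer reduction always lies between the token reduction factor (purely linear cost) and its square (purely quadratic cost), with the attention term mattering more as the token count grows relative to the hidden size.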

5. Empirical Results and Ablation Studies

Experiments across segmentation, vision-language pretraining, and diffusion synthesis tasks consistently report substantial savings in computational cost with minimal to moderate baseline metric degradation:

  • STEP/dCTS: A small absolute mIoU drop (about 1.5% in the efficiency summary) across Cityscapes, COCOStuff10k, and ADE20K. Most merges occur at the 2×2 patch level; larger merges are rare, especially in visually complex scenes (Szczepanski et al., 17 Sep 2025).
  • PMT/dCTS: On DeepGlobe, dCTS outperforms static convolutional downsampling by 2.7% mIoU; memory use is consistently lower than dual-branch comparators with state-of-the-art performance (Sun et al., 2024).
  • DRIP/dCTS: 4× token reduction yields only 0.6% ImageNet top-1 loss; CLIP zero-shot retrieval is also retained. Replacing dynamic pooling with static pooling reduces accuracy, highlighting the merit of adaptive merging (Peng et al., 29 Oct 2025).
  • DDiT/dCTS: Best results are obtained with the order-3 finite-difference "acceleration" signal for patch scheduling. Threshold tuning trades off speed for quality, but even at 3.5× acceleration, CLIP similarity drops only marginally (Kim et al., 19 Feb 2026).

A plausible implication is that content-adaptive pooling or merging inherently preserves salient structures in deep representations, limiting the performance penalty even at aggressive compression levels.

6. Applications and Generalization

Dynamic patch merging has proven effective in a range of domains:

  • Semantic Segmentation: Enables practical training and inference on high-resolution images, with adaptive focus on object boundaries and fine-grained details (Szczepanski et al., 17 Sep 2025, Sun et al., 2024).
  • Vision-Language Models: DRIP demonstrates efficient pretraining and continual domain adaptation (e.g., BioCLIP) with substantial memory and FLOPs savings, agnostic to backbone architecture (Peng et al., 29 Oct 2025).
  • Diffusion-Based Generative Models: DDiT achieves significant test-time speedup for both image and video synthesis, maintaining generation quality and consistency with textual or visual prompts (Kim et al., 19 Feb 2026).
  • General Vision Tasks: dCTS modules can be readily integrated into detection (small object emphasis), depth estimation (discontinuity-aware pooling), and video processing (motion boundary preservation) (Sun et al., 2024).

The evidence suggests that dCTS is a general, extensible paradigm for dynamic token adaptation in large-scale vision transformers, complementing or surpassing static and two-branch granularity strategies.

7. Limitations, Challenges, and Future Directions

While dCTS modules achieve favorable trade-offs between computational efficiency and representation integrity, some important considerations include:

  • Trade-off Tuning: Thresholds and boundary rates introduce task- and data-dependent hyperparameters; careful calibration is required for optimal balance (Szczepanski et al., 17 Sep 2025, Peng et al., 29 Oct 2025).
  • Complexity of Dynamic Scheduling: For generative diffusion models, the scheduling of patch sizes introduces new layers of inference-time complexity and potential instability, though LoRA adapters and residual skips help regularize training (Kim et al., 19 Feb 2026).
  • Information Loss in Extreme Compression: Aggressive token reduction (e.g., 10× in DRIP) yields non-negligible accuracy drops, highlighting a limit to achievable compression without substantial information loss (Peng et al., 29 Oct 2025).
  • Scalability of Pooling Functions: For large candidate patch grids or deep transformer stacks, clustering and importance metric computation may introduce additional overhead, and the scalability of assignment algorithms can become a bottleneck (Sun et al., 2024).

Future research is expected to focus on learned, end-to-end differentiable token repartitioning, improved interpretability of merging decisions, and extension to multimodal and temporally dynamic vision tasks leveraging the flexibility of dCTS paradigms.
