DynamicViT: Adaptive Vision Transformer Pruning

Updated 22 May 2026

DynamicViT is a method for adaptive token sparsification in vision transformers, reducing computation by selectively pruning less informative tokens while preserving accuracy.
It integrates lightweight prediction modules that assess token importance at intermediate layers and progressively prune tokens based on set keep ratios.
DynamicViT achieves substantial efficiency improvements by reducing FLOPs and increasing throughput with minimal accuracy loss, and is optimized for hardware-friendly deployment.

DynamicViT refers to a class of methods for efficient vision transformers based on input-adaptive token sparsification: they dynamically estimate token importance at intermediate transformer layers and selectively prune less informative tokens for subsequent computation. The DynamicViT paradigm preserves accuracy while significantly reducing FLOPs and improving throughput. The term commonly denotes the framework originally presented by Rao et al., and also underpins further generalizations to spatially-adaptive computation in both transformers and hierarchical architectures (Rao et al., 2021, Rao et al., 2022).

1. Empirical Motivation and Core Observation

Analysis of standard ViT/DeiT architectures reveals that the final class prediction depends predominantly on a small subset of the input patch tokens; visualizations such as Grad-CAM or attention rollout demonstrate that most tokens have negligible impact on the final decision (Rao et al., 2021, Rao et al., 2022). This empirical sparsity of attention suggests that dynamically discarding redundant tokens during inference can eliminate unnecessary computation with minimal accuracy degradation. Because transformers natively handle variable-length sequences, dynamic token removal is not constrained by tensor reshaping requirements that occur in convolution-based networks.

2. DynamicViT Framework: Architecture and Mechanism

DynamicViT augments a pre-trained transformer backbone with lightweight, learnable "prediction modules" at several depths. These modules estimate token-level importance scores and make stagewise, input-conditional pruning decisions.

Prediction Module Computation:

Let $X \in \mathbb{R}^{N \times C}$ denote the current set of active token embeddings and $\hat{D} \in \{0,1\}^N$ the binary keep mask.

Compute local embeddings for each token: $z^{\text{local}}_i = \mathrm{MLP}_{\text{loc}}(x_i) \in \mathbb{R}^{C'}$ (typically $C' = C/2$).
Compute global context: $z^{\text{global}} = \frac{\sum_i \hat{D}_i\, \mathrm{MLP}_{\text{glob}}(x_i)}{\sum_i \hat{D}_i} \in \mathbb{R}^{C'}$.
Concatenate: $z_i = [z^{\text{local}}_i \, \| \, z^{\text{global}}]$.
Predict per-token drop/keep probabilities from the stacked features $Z = [z_1; \dots; z_N]$: $\pi = \mathrm{Softmax}(\mathrm{MLP}_{\text{pred}}(Z)) \in \mathbb{R}^{N \times 2}$.
Discrete pruning by Gumbel-Softmax sampling, reading off the keep column: $D = \mathrm{GumbelSoftmax}(\pi)_{*,1} \in \{0,1\}^N$.

After each prediction stage, the cumulative mask is updated: $\hat{D} \leftarrow \hat{D} \odot D$. Pruned tokens do not reappear in later stages. The backbone transformer blocks are run only on the surviving tokens, with attention computation masked accordingly.
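As an illustration of the sampling and mask-update steps above, the following is a minimal sketch in plain Python rather than a tensor library. The function names (`masked_mean`, `gumbel_hard_keep`, `update_mask`), the temperature `tau`, and the `eps` guard are hypothetical choices for this example. The sketch covers only the forward pass: it draws a hard keep/drop decision per token and does not implement the gradient estimator needed for training.

```python
import math
import random


def masked_mean(features, mask):
    """Global context: average the feature vectors of tokens with mask == 1."""
    kept = [f for f, m in zip(features, mask) if m == 1]
    dim = len(features[0])
    return [sum(f[d] for f in kept) / len(kept) for d in range(dim)]


def gumbel_hard_keep(pi, rng, tau=1.0):
    """Draw a hard keep (1) / drop (0) decision per token.

    pi[i] = (p_drop, p_keep). Adding Gumbel(0, 1) noise to the log-probabilities
    and taking the argmax samples from that two-way distribution.
    """
    eps = 1e-12  # guards log(0); an assumption of this sketch
    decisions = []
    for p_drop, p_keep in pi:
        noisy = []
        for p in (p_drop, p_keep):
            u = max(rng.random(), eps)
            gumbel = -math.log(-math.log(u))
            noisy.append((math.log(max(p, eps)) + gumbel) / tau)
        decisions.append(1 if noisy[1] > noisy[0] else 0)
    return decisions


def update_mask(cumulative, decisions):
    """Elementwise product: a token pruned earlier can never be revived."""
    return [m * d for m, d in zip(cumulative, decisions)]


rng = random.Random(0)
mask = [1, 0, 1, 1]  # token 1 was pruned at an earlier stage
pi = [(0.1, 0.9), (0.2, 0.8), (0.7, 0.3), (0.5, 0.5)]
mask = update_mask(mask, gumbel_hard_keep(pi, rng))
print(mask)  # token 1 stays 0 whatever was sampled
print(masked_mean([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1, 0, 1]))  # [3.0, 4.0]
```

Because the update is a product of binary masks, the set of active tokens can only shrink from one stage to the next.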

3. Attention Masking and Progressive Hierarchical Pruning

In each designated sparsification stage, a fixed target ratio $\rho^{(s)}$ determines the number of tokens to keep. The importance scores are used to keep the top $\lfloor \rho^{(s)} N \rfloor$ tokens and mask out the rest.
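The top-scoring selection described above can be sketched as follows, assuming per-token keep scores are already available. The helper name `topk_keep_mask` and the tie-breaking rule (lower index wins) are choices made for this illustration, not details taken from the source.

```python
def topk_keep_mask(keep_scores, active_mask, num_keep):
    """Keep the num_keep highest-scoring tokens among those still active."""
    candidates = [i for i, m in enumerate(active_mask) if m == 1]
    # sorted() is stable, so equal scores keep the lower index first.
    ranked = sorted(candidates, key=lambda i: -keep_scores[i])
    kept = set(ranked[:num_keep])
    return [1 if i in kept else 0 for i in range(len(keep_scores))]


scores = [0.9, 0.1, 0.8, 0.4, 0.7]
stage1 = topk_keep_mask(scores, [1, 1, 1, 1, 1], num_keep=3)
stage2 = topk_keep_mask(scores, stage1, num_keep=2)
print(stage1)  # [1, 0, 1, 0, 1]
print(stage2)  # [1, 0, 1, 0, 0]
```

Restricting the candidates to active tokens means a later stage can only select from what an earlier stage kept.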

Attention masking implementation:

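Given queries and keys $Q, K \in \mathbb{R}^{N \times C}$, generate the standard attention logits $P = QK^\top / \sqrt{C} \in \mathbb{R}^{N \times N}$. Construct the binary attention mask: $G_{ij} = 1$ if $i = j$, and $G_{ij} = \hat{D}_j$ otherwise. Masked attention is computed as

$\tilde{A}_{ij} = \frac{\exp(P_{ij})\, G_{ij}}{\sum_{k=1}^{N} \exp(P_{ik})\, G_{ik}}, \quad 1 \le i, j \le N$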

This ensures that pruned tokens do not contribute to the updated features of any other token. The diagonal self-loop is retained so that every softmax row stays well defined, which prevents numerical instability.
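A minimal sketch of the masked softmax follows, in plain Python and under one stated assumption about the mask: entry $(i, j)$ is 1 on the diagonal and equals the keep flag of key token $j$ elsewhere, which is how we read the formulation in Rao et al., 2021. The function name `masked_attention` is hypothetical.

```python
import math


def masked_attention(logits, keep_mask):
    """Row-wise softmax over logits in which pruned key tokens get zero weight.

    The mask entry for (i, j) is 1 on the diagonal and keep_mask[j] elsewhere,
    so each token can always attend to itself.
    """
    n = len(logits)
    attention = []
    for i in range(n):
        gate = [1 if i == j else keep_mask[j] for j in range(n)]
        # Subtract the largest unmasked logit before exponentiating (stability).
        row_max = max(logits[i][j] for j in range(n) if gate[j] == 1)
        weights = [math.exp(logits[i][j] - row_max) if gate[j] == 1 else 0.0
                   for j in range(n)]
        total = sum(weights)  # >= 1: the entry at row_max contributes exp(0)
        attention.append([w / total for w in weights])
    return attention


logits = [[2.0, 5.0, 2.0],
          [1.0, 1.0, 1.0],
          [0.0, 9.0, 0.0]]
attn = masked_attention(logits, keep_mask=[1, 0, 1])  # token 1 is pruned
print(attn[0])  # [0.5, 0.0, 0.5]
print(attn[2])  # [0.5, 0.0, 0.5]
```

The large logits pointing at token 1 have no effect on rows 0 and 2, and the attention matrix keeps its full $N \times N$ shape, so no tokens need to be physically removed while masking.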

Hierarchical pruning is performed in $S$ stages, typically with $\rho^{(s)} = \rho^{s}$ for some base keep-ratio $\rho$. The progressive nature allows smoother adaptation and mitigates accuracy loss relative to a one-shot pruning at early layers.
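To make the progressive schedule concrete, here is a small worked calculation. It assumes a geometric schedule in which stage $s$ keeps a fraction $\rho^s$ of the original tokens, three stages, a base ratio of 0.7, and 196 patch tokens (a 14×14 grid); the token count and the floor rounding are assumptions for illustration, and `keep_schedule` is a hypothetical helper.

```python
import math


def keep_schedule(num_tokens, base_ratio, num_stages):
    """Tokens retained after each stage when the stage-s ratio is base_ratio ** s."""
    return [math.floor(num_tokens * base_ratio ** s)
            for s in range(1, num_stages + 1)]


schedule = keep_schedule(num_tokens=196, base_ratio=0.7, num_stages=3)
print(schedule)                # [137, 96, 67]
print(round(1 - 0.7 ** 3, 3))  # 0.657, i.e. about 66% of tokens pruned
```

Under these assumptions roughly two thirds of the tokens are gone after the last stage, but the earliest blocks still see the full sequence.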

4. Training Procedure and Objectives

DynamicViT is trained by extending the cross-entropy loss with several auxiliary objectives, facilitating stable end-to-end optimization of both classification and the token pruning mechanism (Rao et al., 2021, Rao et al., 2022):

Classification loss: Standard cross-entropy at the output head.
Self-distillation loss: $\mathcal{L}_{\text{distill}}$ matches the token features after sparsification with those in a corresponding non-pruned ("teacher") network for the surviving tokens.
Prediction-match KL loss: $\mathcal{L}_{\text{KL}}$ aligns predicted class distributions between DynamicViT and its dense teacher.
Ratio supervision: $\mathcal{L}_{\text{ratio}}$ penalizes deviations from the target pruning ratio at each stage.
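The ratio term and the way the objectives combine can be sketched as below. This is an illustration under assumptions: the ratio penalty is taken to be a mean squared gap between the target keep ratio and the realised keep fraction, the total is taken to be a weighted sum, and the default weights (`w_kl`, `w_distill`, `w_ratio`) are placeholders that the text above does not specify. The other three loss values are passed in as plain numbers.

```python
def ratio_loss(masks_per_stage, target_ratios):
    """Mean squared gap between target and realised keep fraction.

    masks_per_stage[s][b] is the cumulative keep mask of sample b after stage s.
    """
    total, count = 0.0, 0
    for masks, rho in zip(masks_per_stage, target_ratios):
        for mask in masks:
            kept_fraction = sum(mask) / len(mask)
            total += (rho - kept_fraction) ** 2
            count += 1
    return total / count


def total_loss(cls, kl, distill, ratio, w_kl=0.5, w_distill=0.5, w_ratio=2.0):
    """Weighted sum of the four objectives (weights are assumed placeholders)."""
    return cls + w_kl * kl + w_distill * distill + w_ratio * ratio


masks = [
    [[1, 1, 1, 0], [1, 1, 0, 0]],  # stage 1, two samples
    [[1, 1, 0, 0], [1, 0, 0, 0]],  # stage 2, two samples
]
r = ratio_loss(masks, target_ratios=[0.75, 0.5])
print(r)  # 0.03125
print(total_loss(cls=1.0, kl=0.2, distill=0.4, ratio=r))
```

In the example the first sample meets both targets exactly, so the whole penalty comes from the second sample, which keeps too few tokens at each stage.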

Backbone weights are initialized from a pre-trained ViT/DeiT model; prediction modules are trained from scratch. The full model is trained for 30 epochs (or longer for hierarchical variants), with Gumbel-Softmax sampling during training for differentiability and deterministic selection of the highest-scoring tokens at inference.

5. Extension: Dynamic Spatial Sparsification in Hierarchical Models

DynamicViT has been generalized to hierarchical vision architectures and convolutional models via "dynamic spatial sparsification" (Rao et al., 2022). In these settings, the spatial structure of the feature map must be maintained. Instead of physically dropping patch tokens, each spatial location is assigned either a fast path (lightweight linear or bottleneck operation for redundant locations) or a slow path (full MLP or convolution for salient locations).

This asymmetric computation enables unstructured dynamic computation per spatial position while retaining the tensor shape required for downstream blocks.
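The two-path idea can be sketched with a toy layer that routes each spatial location to one of two functions while leaving the number of locations unchanged. Everything here is a stand-in: `two_path_layer`, `slow_path`, and `fast_path` are hypothetical names, and the arithmetic inside the two paths only mimics "expensive" versus "cheap" processing.

```python
def two_path_layer(features, slow_mask, slow_fn, fast_fn):
    """Apply slow_fn where slow_mask is 1 and fast_fn elsewhere; length is unchanged."""
    return [slow_fn(x) if m == 1 else fast_fn(x)
            for x, m in zip(features, slow_mask)]


def slow_path(x):
    # Stand-in for the full MLP / convolution: scale by 2, then ReLU.
    return [max(0.0, 2.0 * v) for v in x]


def fast_path(x):
    # Stand-in for the lightweight linear projection: a single scaling.
    return [0.5 * v for v in x]


features = [[1.0, -2.0], [0.5, 0.5], [3.0, 1.0]]  # three spatial locations
out = two_path_layer(features, [1, 0, 1], slow_path, fast_path)
print(out)  # [[2.0, 0.0], [0.25, 0.25], [6.0, 2.0]]
```

Every location still produces an output of the same width, so a downstream block that expects a regular feature map is unaffected by which path each location took.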

6. Efficiency, Performance, and Practical Considerations

DynamicViT achieves substantial efficiency gains:

On DeiT-S with $\rho = 0.7$, DynamicViT prunes up to 66% of tokens hierarchically, reducing FLOPs by 31–37%, increasing throughput by 40–54%, and incurring a top-1 accuracy drop of no more than 0.5% (Rao et al., 2021, Rao et al., 2022).
Hardware-friendliness is realized by always maintaining fixed-shape $N \times C$ tensors and masking attention, allowing standard cuBLAS/cuDNN batched kernels to be applied without gather/scatter operations.
In the reported comparisons, DynamicViT matches or outperforms static and structural sparsification as well as random and attention-score-based token removal, which indicates that the learned, input-dependent predictor is what preserves the accuracy–efficiency trade-off.

A representative table summarizing key results:

Model/Setting	Top-1 Accuracy (%)	GFLOPs	Throughput (im/s)
DeiT-S, baseline (ρ=1.0)	79.8	4.6	1338
DynamicViT, ρ=0.7 (default)	79.3 (−0.5)	2.9	2062 (+54%)
LV-ViT-M, baseline (ρ=1.0)	84.0	12.7	(not specified)
DynamicViT-LV-M, ρ=0.7	83.8 (−0.2)	8.5	(not specified)
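As a quick arithmetic check, the relative changes implied by the table rows can be recomputed directly from the listed values. The helper `percent_change` is a hypothetical name used only for this check.

```python
def percent_change(baseline, value):
    """Signed change of value relative to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


print(round(percent_change(4.6, 2.9)))    # -37  (DeiT-S GFLOPs)
print(round(percent_change(1338, 2062)))  # 54   (DeiT-S throughput)
print(round(percent_change(12.7, 8.5)))   # -33  (LV-ViT-M GFLOPs)
```

The DeiT-S figures agree with the upper ends of the FLOPs and throughput ranges quoted above.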

DynamicViT also applies to dense prediction tasks (segmentation/detection) and convolutional/hierarchical backbones, achieving sizeable FLOPs reductions and throughput gains with negligible accuracy loss.

7. Limitations and Prospective Extensions

DynamicViT reduces computational cost (FLOPs) and inference time but does not reduce model parameter count or memory footprint. The prediction modules themselves add a modest overhead. On resource-constrained hardware, further compression (e.g., channel pruning, quantization) may be required (Rao et al., 2022). Current deployments predominantly target classification; extensions to detection, segmentation, and video, as well as dynamic re-expansion or hybrid sparsification schedules, are prospective directions. Mechanisms to jointly optimize sparsification schedules and integrate with structured-pruning or neural architecture search could broaden DynamicViT's applicability.

Dynamic token sparsification remains a competitive and scalable approach for model acceleration in modern vision transformers and related architectures, with robust empirical validation and hardware-aligned implementation paths (Rao et al., 2021, Rao et al., 2022).

References

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification (2021)

Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks (2022)
