Dynamic Token Dropping (DTD)
- Dynamic Token Dropping (DTD) is a class of algorithms that dynamically remove tokens based on data-dependent metrics, reducing computational overhead while preserving output fidelity.
- It employs methods like per-token scoring, threshold-based decisions, and early exit mechanisms to efficiently manage tokens in vision, language, multimodal, and generative models.
- Empirical results demonstrate that DTD can achieve significant compute reductions (up to 90% in some cases) while incurring minimal loss in accuracy or output quality.
Dynamic Token Dropping (DTD) is a class of algorithms that adaptively remove tokens from Transformer-based models during training or inference, reducing computational cost while preserving, or only minimally affecting, output fidelity. Unlike static pruning or uniform token reduction, DTD mechanisms exploit data-dependent cues, such as token confidence, attention, or inter-frame redundancy, to identify uninformative tokens early in the computation graph and bypass unnecessary processing. DTD has been deployed successfully in vision, language, multimodal, and generative models, yielding significant acceleration and resource savings across diverse architectures (Liu et al., 2023, Patel et al., 17 Nov 2025, Tang et al., 2023, Ye et al., 2021, Hou et al., 2022, Arif et al., 20 Aug 2024, Liang et al., 24 Jan 2025, Chang et al., 8 Dec 2024, Xu et al., 2023).
1. Taxonomy and Fundamental Methodologies
DTD encompasses a set of methods, each characterized by its token evaluation metric, granularity of dropping, and reintroduction strategy. The archetype involves three main steps:
- Per-token scoring: Auxiliary heads or internal activations (e.g., class confidence, attention weights, change over time) estimate the "importance" or "difficulty" of each token.
- Drop decision: Using a parameterized threshold, mask, or learned/policy-driven rule, tokens deemed uninformative are dropped (i.e., excluded from subsequent modules).
- Bypass or early exit: Dropped tokens are either routed directly to output heads, idled (frozen but preserved for possible later reactivation), or merged/cached for downstream restoration.
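A minimal sketch of this three-step archetype in PyTorch-style Python follows; the callables `blocks` and `score_fn` and the single scalar `threshold` are illustrative stand-ins for whatever scoring metric and drop rule a given method uses, not the interface of any cited implementation.

```python
import torch

def dtd_forward(tokens, blocks, score_fn, threshold):
    """Generic DTD archetype: score -> drop decision -> bypass.

    tokens:    (N, d) token embeddings for one example
    blocks:    list of callables, each mapping (n, d) -> (n, d)
    score_fn:  callable mapping (N, d) -> (N,) per-token importance scores
    threshold: tokens scoring below it are dropped (bypassed to the output head)
    Illustrative sketch of the common pattern, not any single published method.
    """
    N = tokens.shape[0]
    active = torch.ones(N, dtype=torch.bool)        # tokens still receiving compute
    for block in blocks:
        # 1. Per-token scoring (confidence, attention, temporal change, ...).
        scores = score_fn(tokens)
        # 2. Drop decision, combined multiplicatively so drops are monotone.
        active = active & (scores >= threshold)
        # 3. Bypass: only surviving tokens are processed; dropped tokens are
        #    carried forward unchanged for the decoder / output head.
        if active.any():
            tokens = tokens.clone()
            tokens[active] = block(tokens[active])
    return tokens, active
```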
Variants include:
- Early exit via confidence thresholds: Auxiliary segmentation/classification heads output per-token probability scores; tokens with maximum predicted class probability exceeding a threshold are dropped from further attention blocks (Liu et al., 2023, Tang et al., 2023).
- Reinforcement learning–based token policy: A lightweight policy network, trained to maximize a tradeoff between output accuracy and compute cost, decides per-token early exit at multiple points (Ye et al., 2021).
- Temporal dynamics and frame similarity: In video and diffusion models, cosine similarity between patch tokens across timesteps is used for online frame-wise redundancy-based dropping (Patel et al., 17 Nov 2025, Zhang et al., 31 Dec 2024).
- Attention-guided budget allocation: In high-resolution or multi-partition images, a global token budget is dynamically split among partitions based on aggregate attention metrics, and the most informative tokens in each partition are retained (Arif et al., 20 Aug 2024); a budget-split sketch appears after this list.
- Attention-distribution–driven rate adaptation: The global proportion of attention spent on visual tokens is monitored during generation, and used to dynamically adjust the rate of pruning via a predictor network (Liang et al., 24 Jan 2025).
- Spatial and temporal schedule: In generative transformers, fixed or schedule-driven token pooling (not salience-based) is used for spatial and denoising-stage-wise dynamic density control (Chang et al., 8 Dec 2024).
- Token idling versus permanent dropping: Instead of irreversible dropping, tokens are "idled" (preserved for later reappraisal) to alleviate error accumulation from early mispruning (Xu et al., 2023).
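To make the contrast with permanent dropping concrete, here is a toy per-layer idling step, loosely patterned on the idle-and-reactivate idea; `cls_attn_scores` and `keep_ratio` are illustrative names, and this is not the published IdleViT implementation.

```python
import torch

def idle_select_layer(tokens, layer, cls_attn_scores, keep_ratio=0.7):
    """One idling step: process the top-K tokens, set the rest aside unchanged.

    Because idled tokens remain in the sequence, a later layer can re-select
    them if their scores rise, unlike permanent dropping.
    tokens: (N, d); cls_attn_scores: (N,) attention the CLS token pays to each token.
    Toy sketch loosely patterned on idle-and-reactivate, not the IdleViT code.
    """
    N = tokens.shape[0]
    k = max(1, int(keep_ratio * N))
    keep_idx = cls_attn_scores.topk(k).indices      # per-layer top-K selection
    out = tokens.clone()                            # idled tokens pass through untouched
    out[keep_idx] = layer(tokens[keep_idx])         # compute only for selected tokens
    return out, keep_idx
```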
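The attention-guided budget allocation variant can likewise be sketched as a two-stage procedure: split a global token budget across partitions in proportion to attention mass, then keep the top-scoring tokens within each partition. The helper names and the flooring-plus-remainder scheme below are assumptions for illustration, not the exact HiRED algorithm.

```python
import torch

def partition_budgets(partition_attn, total_budget):
    """Split a global token budget across partitions in proportion to attention mass.

    partition_attn: (P,) aggregate attention received by each partition's tokens
    total_budget:   total number of visual tokens to retain
    Illustrative flooring-plus-remainder scheme, not the exact HiRED procedure.
    """
    weights = partition_attn / partition_attn.sum().clamp_min(1e-8)
    budgets = (weights * total_budget).floor().long()
    remainder = total_budget - int(budgets.sum())
    if remainder > 0:                               # hand leftover slots to the top partitions
        order = weights.argsort(descending=True)
        budgets[order[:remainder]] += 1
    return budgets

def keep_top_tokens(token_scores, budget):
    """Within one partition, keep the `budget` highest-scoring tokens."""
    k = min(int(budget), token_scores.numel())
    keep = torch.zeros_like(token_scores, dtype=torch.bool)
    if k > 0:
        keep[token_scores.topk(k).indices] = True
    return keep
```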
2. Detailed Algorithms and Mathematical Formulations
The following summarizes DTD core procedures in several application domains:
Vision Transformers for Semantic Segmentation
- Token-pass decision via auxiliary heads: After a Transformer block, the output token sequence is reshaped into a feature map. A convolutional auxiliary head projects each token $i$ to class logits $z_i$; a softmax over classes gives per-token class probabilities $p_i = \operatorname{softmax}(z_i)$, and the token's confidence is $c_i = \max_k p_{i,k}$.
- Thresholding and token passing: Tokens whose confidence $c_i$ exceeds a single threshold $\tau$ are dropped, and the drop mask is propagated multiplicatively across stages to ensure monotonicity (once dropped, a token stays dropped). Tokens that survive are passed to the next self-attention block; dropped tokens are forwarded directly (bypassed) to the decoder (Liu et al., 2023).
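A compact sketch of this confidence-based early exit for a single stage; the 1×1 convolutional `aux_head`, the grid shape `(H, W)`, and the threshold `tau` are illustrative choices, and the code approximates rather than reproduces the cited method.

```python
import torch
import torch.nn as nn

def confidence_drop_mask(tokens, aux_head, H, W, tau=0.95, prev_mask=None):
    """Early-exit keep-mask for one segmentation ViT stage.

    tokens:   (N, d) patch tokens with N = H * W
    aux_head: small conv head mapping (1, d, H, W) -> (1, num_classes, H, W)
    tau:      confidence threshold; prev_mask: keep-mask from the previous stage,
              combined multiplicatively so dropped tokens never return.
    Illustrative sketch, not the cited implementation.
    """
    d = tokens.shape[1]
    fmap = tokens.t().reshape(1, d, H, W)          # token sequence -> feature map
    probs = aux_head(fmap).softmax(dim=1)          # per-pixel class probabilities p_i
    conf = probs.max(dim=1).values.reshape(-1)     # confidence c_i = max_k p_{i,k}
    keep = conf < tau                              # confident ("easy") tokens exit early
    if prev_mask is not None:
        keep = keep & prev_mask
    return keep

# Illustrative auxiliary head: a 1x1 convolution over the token feature map.
aux_head = nn.Conv2d(in_channels=768, out_channels=150, kernel_size=1)
```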
NLP: BERT Pretraining and Inference
- MLM loss–based dropping: During pretraining, a cumulative-loss vector is maintained for each vocabulary token, smoothed via an exponential moving average of the form $s(w) \leftarrow \beta\, s(w) + (1-\beta)\,\ell_{\mathrm{MLM}}(w)$, where $\ell_{\mathrm{MLM}}(w)$ is the masked language modeling negative log-likelihood for token $w$. In each batch, the tokens with the top-$M$ cumulative losses are kept; the rest are dropped from the middle Transformer layers and later reintroduced before the final encoding layer to restore the full sequence length (Hou et al., 2022).
- Reinforcement learning–based DTD: States encode mini-batch token embeddings; actions are per-token binary select/skip decisions. A lightweight policy network outputs per-token keep probabilities, and a policy-gradient objective maximizes the expected task reward less a compute penalty (Ye et al., 2021).
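A sketch of the loss-tracking bookkeeping behind the MLM-loss-based variant, assuming an exponential-moving-average update and a per-sequence top-M keep rule; `beta` and `top_m` are illustrative hyperparameters, not values from the cited paper.

```python
import torch

class VocabLossTracker:
    """Smoothed MLM loss per vocabulary id, used to pick which tokens to drop.

    "Hard" tokens (high smoothed loss) are kept in all layers; "easy" ones are
    dropped from the middle layers and reinserted before the final layer.
    `beta` and `top_m` are illustrative, not values from the cited paper.
    """

    def __init__(self, vocab_size, beta=0.9):
        self.beta = beta
        self.cum_loss = torch.zeros(vocab_size)

    def update(self, token_ids, mlm_nll):
        # token_ids, mlm_nll: (num_masked,) vocabulary ids and per-token NLL from
        # the current batch (duplicate ids simply overwrite in this sketch).
        self.cum_loss[token_ids] = (
            self.beta * self.cum_loss[token_ids] + (1.0 - self.beta) * mlm_nll
        )

    def keep_mask(self, sequence_ids, top_m):
        # Keep the top-M tokens of one sequence by smoothed loss; the rest are
        # dropped from the middle Transformer layers.
        seq_losses = self.cum_loss[sequence_ids]
        keep = torch.zeros_like(sequence_ids, dtype=torch.bool)
        keep[seq_losses.topk(min(top_m, len(sequence_ids))).indices] = True
        return keep
```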
Video and Diffusion Models
- Cosine similarity–based frame pruning: For video or temporal diffusion, the patch token $x_t^{(i)}$ at position $i$ in frame $t$ is compared to its counterpart $x_{t-1}^{(i)}$ using cosine similarity $\operatorname{sim}\big(x_t^{(i)}, x_{t-1}^{(i)}\big) = \frac{x_t^{(i)} \cdot x_{t-1}^{(i)}}{\lVert x_t^{(i)}\rVert\,\lVert x_{t-1}^{(i)}\rVert}$; if the similarity exceeds a threshold $\tau$, that token is dropped for frame $t$ (Patel et al., 17 Nov 2025, Zhang et al., 31 Dec 2024).
- Temporal dynamics metric: In diffusion, the "DiffScore" of a token is its mean channel-wise absolute change over adjacent denoising steps, $\mathrm{DiffScore}(x_i) = \frac{1}{C}\sum_{c=1}^{C}\big|x_{i,c}^{(t)} - x_{i,c}^{(t-1)}\big|$; tokens with the lowest "dynamism" are pruned (Zhang et al., 31 Dec 2024).
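Both temporal criteria reduce to a per-token comparison against the previous frame or denoising step. The following sketch shows a cosine-similarity keep mask and a DiffScore computation; `sim_threshold` and `min_keep` are illustrative parameters, not values from the cited papers.

```python
import torch
import torch.nn.functional as F

def temporal_keep_mask(curr, prev, sim_threshold=0.95, min_keep=1):
    """Drop tokens nearly identical to their counterpart in the previous frame/step.

    curr, prev: (N, d) patch tokens of the current and previous frame (or step).
    A token is dropped when cosine similarity exceeds `sim_threshold`; at least
    `min_keep` tokens are always retained. Illustrative sketch.
    """
    sim = F.cosine_similarity(curr, prev, dim=-1)       # (N,)
    keep = sim <= sim_threshold
    if keep.sum() < min_keep:
        keep[sim.argsort()[:min_keep]] = True           # force-keep the most-changed tokens
    return keep

def diff_score(curr, prev):
    """DiffScore: mean channel-wise absolute change between adjacent steps (higher = more dynamic)."""
    return (curr - prev).abs().mean(dim=-1)             # (N,)
```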
3. Implementation Trade-offs and System Considerations
- Sparse attention efficiency: Computing attention only among "kept" tokens reduces per-block cost from $O(N^2 d)$ to $O(N_\ell^2 d)$, where $N_\ell$ is the number of surviving tokens at layer $\ell$ and $d$ the embedding dimension. Support for gather/scatter operations and batching is critical for throughput; masking and reconstruction are hardware-friendly (Liu et al., 2023, Hou et al., 2022). A gather/scatter sketch follows this list.
- Drop-in and training-free variants: Some methods (e.g., video DTD, DaTo) require no retraining or fine-tuning, relying purely on online observed metrics (cosine similarity, DiffScore) (Patel et al., 17 Nov 2025, Zhang et al., 31 Dec 2024).
- Auxiliary heads and light predictors: In segmentation and VLMs, auxiliary heads or small MLPs introduce negligible overhead (<1% parameters/FLOPs) for decision-making (Liu et al., 2023, Liang et al., 24 Jan 2025).
- Gradual rate scheduling: Some DTDs use schedule-driven or predictor-driven adaptive drop rates as a function of model depth, denoising step, or generation progress, aligning compute allocation with phase-specific token salience (Chang et al., 8 Dec 2024, Liang et al., 24 Jan 2025).
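A minimal gather-compute-scatter sketch of the sparse-attention pattern referenced in the first item above; sharing one `keep_mask` across the batch is a simplifying assumption, and `attn_layer` stands for any attention module, not a specific library API.

```python
import torch

def attention_over_kept(tokens, keep_mask, attn_layer):
    """Gather surviving tokens, run attention among them only, scatter results back.

    tokens: (B, N, d); keep_mask: (N,) bool, shared across the batch for simplicity.
    Attention cost drops from O(N^2 d) to O(N_l^2 d) for N_l kept tokens.
    `attn_layer` is any callable mapping (B, n, d) -> (B, n, d). Illustrative sketch;
    real systems use batched gather/scatter with per-example masks and padding.
    """
    kept = attn_layer(tokens[:, keep_mask, :])     # attention only among kept tokens
    out = tokens.clone()
    out[:, keep_mask, :] = kept                    # scatter back into the full sequence
    return out
```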
4. Empirical Impact and Performance Analysis
Empirical benchmarks consistently demonstrate that DTD achieves significant compute, memory, and latency reduction:
| Model/Domain | Compute Reduction | Speedup | Output Degradation | Reference |
|---|---|---|---|---|
| ViT-Base Segmentation | 40–60% FLOPs | 2× throughput | ≤0.8% mIoU | (Liu et al., 2023) |
| BERT Pretraining | 25% pretrain FLOPs | 25% wall-clock | No GLUE, SQuAD loss | (Hou et al., 2022) |
| Stable Diffusion (DaTo) | 7–9× runtime reduction | 7–9× throughput | FID improves by 0.33–2.17 | (Zhang et al., 31 Dec 2024) |
| Video VQA (CacheFlow) | 70–87% fewer tokens | 1.4–6.6× memory savings | Matches/exceeds accuracy | (Patel et al., 17 Nov 2025) |
| High-res VLMs (HiRED) | 80–90% tokens pruned | 4.7× throughput | <3% accuracy loss* | (Arif et al., 20 Aug 2024) |
| VLMs (DyRate) | FLOPs reduced to ~66% (−33%) | 1.5–1.75× | Matches/exceeds baselines | (Liang et al., 24 Jan 2025) |
| ImageNet ViTs (IdleViT) | 24–48% MACs | 22–64% speedup | <0.2% top-1 accuracy loss | (Xu et al., 2023) |
| FlexDiT (Diffusion) | 38–55% FLOPs | 69–175% throughput | +0.09 to –0.24 FID change | (Chang et al., 8 Dec 2024) |
*Note: In HiRED, at 20% token budget, DocVQA accuracy drops 17% compared to full tokens. VQA-v2 and TextVQA losses are minimal.
A plausible implication is that the majority of tokens in high-dimensional vision/language tasks are either redundant or only locally informative, and their exclusion can be adaptively scheduled with marginal loss in global task performance.
5. Comparative Perspective and Limitations
DTD offers several advantages over alternative sparsification methods:
- Versus static pruning/merging: DTD is input- and context-dependent, offering better accuracy-efficiency trade-offs and less risk of over-pruning critical tokens (Zhang et al., 31 Dec 2024, Patel et al., 17 Nov 2025).
- Versus pure memory-caching: Joint use of DTD with caching (e.g., DaTo, CacheFlow) avoids loss of diversity and recurrent over-smoothing typical of aggressive feature reuse alone (Zhang et al., 31 Dec 2024).
- Versus permanent dropping: Idling or early exit with reintroduction or selection (e.g., IdleViT) can recover tokens erroneously dropped by earlier layers, mitigating error compounding (Xu et al., 2023).
Limitations documented include:
- Static thresholds (e.g., a fixed similarity threshold $\tau$) may be suboptimal under rapid scene transitions (Patel et al., 17 Nov 2025).
- Pruned tokens may miss rare or fast-evolving phenomena (CacheFlow), or accuracy can degrade on fine-grained tasks at very low token budgets (HiRED) (Arif et al., 20 Aug 2024).
- Some scoring rules, such as cumulative MLM loss, treat token importance as context-agnostic and may not generalize to all domains (Hou et al., 2022).
- Idle-token strategies incur small memory penalties, and do not reduce memory footprint as aggressively as irreversible dropping (Xu et al., 2023).
6. Extensions, Generalization, and Future Directions
Research indicates DTD is extensible across architectures and tasks:
- LLMs: DTD can be plugged into Transformers for sequence modeling, contextual drop, adaptive depth, or layer skipping; this is demonstrated for both pretraining (BERT) and downstream RL-tuned inference (TR-BERT) (Hou et al., 2022, Ye et al., 2021).
- Vision and multi-modal tasks: DTD is effective for semantic segmentation, video understanding, and VLM decoding (Liu et al., 2023, Patel et al., 17 Nov 2025, Arif et al., 20 Aug 2024, Liang et al., 24 Jan 2025).
- Diffusion and generative models: Token-level sparsification yields high acceleration with minimal or even negative FID loss when paired with dynamic recomputation and pooling/upsampling modules (Chang et al., 8 Dec 2024, Zhang et al., 31 Dec 2024).
- Unsupervised and streaming scenarios: DTD in CacheFlow is fully training-free and operates in a streaming fashion, making it suitable for live, long-range, or data-scarce deployments (Patel et al., 17 Nov 2025).
- Combinatorial scheduling: Schedules for pruning ratios can be optimized offline via Pareto (NSGA-II) search, driven by reinforcement learning, or adaptively regressed during inference for fine-grained efficiency–utility trade-off (Zhang et al., 31 Dec 2024, Ye et al., 2021, Liang et al., 24 Jan 2025).
- Reactivation or idling: Allowing idled tokens to be re-evaluated layerwise is a direction with evidence for improved recovery and semantic fidelity (Xu et al., 2023).
Anticipated future work may focus on:
- Adaptive threshold scheduling for more robust pruning under non-stationary input,
- Multi-modal signals (e.g., joint vision-audio or cross-modal token importance estimation),
- Hierarchical or non-uniform block allocation strategies,
- Integration with quantization, weight pruning, or other "structured" sparsity methods.
7. Summary Table: Core Techniques by Domain
| Domain | Token Importance Measure | Drop/Keep Decision | Restoration/Output | Reference |
|---|---|---|---|---|
| Semantic Segmentation | Aux-head class confidence | Threshold | Decoded w/ full map | (Liu et al., 2023, Tang et al., 2023) |
| BERT Pretraining | Cumulative MLM loss | Top-M per batch | Reinsert dropped tokens at last layer | (Hou et al., 2022) |
| Video VQA/Streaming | Cosine sim to prev frame | Fixed threshold | Force min one keep | (Patel et al., 17 Nov 2025) |
| Diffusion/Image Gen | DiffScore (temporal abs diff) | Top-K patch/region | Copy via base tokens | (Zhang et al., 31 Dec 2024) |
| High-res VLMs (HiRED) | CLS attn, multi-partition | Partitioned budget | LLM projection | (Arif et al., 20 Aug 2024) |
| VLM Decoding (DyRate) | Attention dist. stats | Linear predictor+GS | Dynamic mask, per-step | (Liang et al., 24 Jan 2025) |
| Generic ViT, IdleViT | CLS class attn per token | Top-K per-layer | Idle & re-activate | (Xu et al., 2023) |
| FlexDiT (Diffusion) | N/A (uniform pool/restore) | Scheduled sparsity | Restoration, upsampling | (Chang et al., 8 Dec 2024) |
All techniques highlight the central role of adaptively eliminating redundant or overconfident tokens to match model complexity with input instance complexity, thereby advancing scalable and efficient large-model inference and training.