Progressive Token Pruning in Transformers
- Progressive token pruning is a dynamic method that scores and discards redundant tokens across transformer layers to reduce computational cost without sacrificing accuracy.
- It leverages input-dependent metrics like attention probabilities and entropy, applying stagewise pruning to adaptively streamline processing.
- The technique is applied in NLP, vision, and multimodal models, offering plug-and-play efficiency improvements and potential performance enhancements.
Progressive token pruning is a class of techniques for dynamically reducing the set of tokens processed by a model, typically a transformer, across multiple layers or stages, based on input-dependent importance criteria. The approach aims to mitigate the quadratic computational and memory cost of self-attention by discarding, preserving, or merging tokens judged redundant or uninformative, with decisions made progressively throughout the model rather than via static, one-shot filtering. Variants have been developed for natural language processing, vision, multimodal models, and large language models (LLMs), and progressive token pruning is central to frameworks designed for efficiency, memory savings, and, in some cases, improved task performance.
1. Principles of Progressive Token Pruning
Progressive token pruning differs from static and weight-based pruning in several core aspects:
- Dynamic, Input-Dependent Selection: Instead of removing model weights or input tokens up front, tokens are scored for importance on-the-fly (e.g., via attention probabilities, similarity metrics, entropy, or auxiliary modules) at each layer, and pruning or merging decisions are adapted as feature representations evolve (Wang et al., 2020, Kim et al., 2023, Li et al., 28 Jul 2025).
- Stagewise or Layerwise Reduction: Pruning occurs at designated layers, often in several stages. Tokens removed early are not reconsidered in most frameworks, though some methods allow for later reactivation (Liu et al., 2023).
- Heuristics and Optimization Criteria: Importance scores may incorporate not only self-attention weights, but also cross-modal attentiveness, visual entropy, token transition magnitude/direction, or submodular alignment/diversity maximization. The latter is especially relevant for preserving semantically or contextually distinct information in vision-language or in-context learning scenarios (Li et al., 11 Aug 2025, Li et al., 28 Jul 2025).
- Progressive Quantization: Some approaches additionally adjust numerical precision in a progressive, token-wise fashion, leveraging the confidence in computed importance to trade off between DRAM bandwidth and computation (Wang et al., 2020).
This paradigm allows the token set processed by the model to shrink adaptively and non-uniformly, leading to reductions in runtime complexity, memory usage, and (potentially) energy consumption.
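To make this concrete, here is a minimal numpy sketch (not any specific published method) of a layerwise scoring-and-pruning loop: the attention each token receives is accumulated across stages and a top-k keeps only the strongest survivors. The `toy_attention` function, the keep ratios, and the column-sum scoring rule are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(tokens):
    """Single-head scaled dot-product attention probabilities (rows sum to 1)."""
    d = tokens.shape[-1]
    return softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)

def progressive_prune(tokens, keep_ratios=(0.75, 0.5)):
    """Accumulate attention-received scores across stages and keep the top-k
    surviving tokens after each stage (a cascade: dropped tokens never return)."""
    kept = np.arange(len(tokens))            # positions in the original sequence
    cumulative = np.zeros(len(tokens))       # accumulated importance per survivor

    for ratio in keep_ratios:
        attn = toy_attention(tokens)         # attention among surviving tokens
        cumulative += attn.sum(axis=0)       # column sum: attention each token receives
        k = max(1, int(round(ratio * len(tokens))))
        top = np.sort(np.argsort(-cumulative)[:k])   # top-k, original order preserved
        tokens, cumulative, kept = tokens[top], cumulative[top], kept[top]

    return tokens, kept

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                 # 16 tokens, 8-dim embeddings
pruned, surviving = progressive_prune(x)
print(surviving)                             # original positions of the surviving tokens
```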
2. Methodologies Across Modalities
Several architectures and domains have adopted progressive token pruning, each refining the approach to suit their task and input structure.
2.1 NLP and LLMs
- Cascade Token Pruning in SpAtten: Importance scores are accumulated from softmax-normalized attention probabilities across heads and layers. A global, layerwise top‑k operation discards the least important tokens in a cascade fashion such that pruned tokens remain excluded from all subsequent layers (Wang et al., 2020).
- Parallel/Tree-Based Decoding: For LLMs, ProPD applies early pruning during parallel tree decoding. Candidate token branches not sufficiently supported by shallow-layer prediction heads are removed early, followed by dynamic generation of the verification tree for maximum parallel efficiency (Zhong et al., 21 Feb 2024).
- Dynamic Token Retention and KV Cache Optimization: LazyLLM uses per-layer attention scores to compute importance, applies a top‑k percentile selection to determine pruning per layer, and manages an auxiliary cache to allow revival of previously pruned tokens, thereby avoiding reprocessing and preserving worst-case efficiency (Fu et al., 19 Jul 2024).
- Saliency-Driven Hierarchical Pruning: SDTP deploys lightweight MLPs to mimic gradient-based saliency scores, producing stagewise binary masks that progressively reduce the token set, with loss terms aligned both in the value and the ranking order of token importance (Tao et al., 6 Apr 2025).
- Early-Exit Vocabulary Pruning: Rather than pruning input tokens, some LLM frameworks prune the vocabulary used for softmax in early-exit decisions by selecting top-k candidates at early layers and restricting further softmax and confidence computations to these tokens (Vincenti et al., 24 Oct 2024).
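As a rough illustration of the last item above, the following sketch restricts later early-exit computations to a top-k candidate vocabulary chosen at the first exit; the exit interface, `k`, and the confidence threshold are illustrative assumptions, not the cited method's exact procedure.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_with_vocab_pruning(exit_hiddens, unembed, k=100, conf=0.9):
    """Select the top-k vocabulary candidates at the first exit layer, then
    restrict logit/confidence computation at every later exit to that subset.

    exit_hiddens: list of (d,) hidden states, one per candidate exit layer
    unembed:      (V, d) shared output-embedding matrix
    """
    full_logits = unembed @ exit_hiddens[0]            # one full-vocabulary pass
    candidates = np.argsort(-full_logits)[:k]          # pruned vocabulary ids

    probs = softmax(full_logits[candidates])
    for h in exit_hiddens:
        probs = softmax(unembed[candidates] @ h)       # logits over k ids only
        if probs.max() >= conf:                        # confident enough: exit early
            break
    return int(candidates[probs.argmax()])
```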
2.2 Vision Transformers and Multi-Modal Models
- Token Merging and Fusion: Prune-and-merge modules learn to combine redundant tokens (via learned matrices or slerp-type interpolations), with reconstruction pathways enabling spatial information recovery and shortcut connections for unmerged (pruned) tokens (Mao et al., 30 Mar 2025, Kim et al., 2023).
- Hybrid Pruning-Merging: ToFu combines early-layer aggressive pruning with later-layer token merging using average or MLERP spherical linear interpolation to preserve feature magnitude and direction, switching strategies progressively according to model linearity (Kim et al., 2023).
- Statistical and Distribution-Based Pruning: FitPrune considers the divergence of self- and cross-attention distributions (before and after pruning) as the objective for determining pruning ratios, adjusting per layer using binary search to satisfy a computational budget (Ye et al., 16 Sep 2024).
- Multi-Cue and Adaptive NMS: AdaptPrune injects spatial position, token similarity, and attention scores into an adaptive non-maximum suppression (NMS) framework, repeatedly selecting high-importance tokens and suppressing neighbors by spatial and feature similarity decay (Luan et al., 11 Mar 2025).
- Prompt-Guided and Contextually Adaptive Schemes: Pruning guided by prompts or cross-modal semantic alignment/diversity can adaptively focus computational resources on regions of interest or essential context in segmentation and in-context learning, as in prompt-aware segmentation schemes (Dutta et al., 19 Jun 2025) and CATP for multimodal ICL (Li et al., 11 Aug 2025).
- Token Transition and Transformation-Based Criteria: TransPrune evaluates tokens based on their feature transition (magnitude and direction) across transformer modules, combined with instruction-guided attention, aggregating scores at multiple pruning stages for robust selection (Li et al., 28 Jul 2025).
- Multi-Stage Pruning Across Pipelines: METEOR aligns multi-encoder token budgets to feature rank, applies cooperative redundancy-based pruning after fusion, and finally adopts instance-adaptive, prompt-guided pruning in the decoding stage (Liu et al., 28 Jul 2025).
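A minimal sketch of the hybrid prune-then-merge idea from this list, using simple running-average merging (rather than learned matrices or MLERP) to fold dropped tokens into their most similar kept neighbors; the keep ratio and cosine-similarity matching are illustrative assumptions.

```python
import numpy as np

def prune_and_merge(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens; fold each dropped token into its most
    similar kept token by running-average merging so its content is not lost.

    tokens: (n, d) patch embeddings
    scores: (n,) importance scores (e.g. attention received per token)
    """
    k = max(1, int(round(keep_ratio * len(tokens))))
    order = np.argsort(-scores)
    kept_idx, drop_idx = np.sort(order[:k]), order[k:]

    kept = tokens[kept_idx].copy()
    counts = np.ones(k)                                  # tokens absorbed per kept slot

    # Cosine similarity between dropped and kept tokens.
    d_tok = tokens[drop_idx]
    k_tok = tokens[kept_idx]
    d_tok = d_tok / (np.linalg.norm(d_tok, axis=-1, keepdims=True) + 1e-8)
    k_tok = k_tok / (np.linalg.norm(k_tok, axis=-1, keepdims=True) + 1e-8)
    nearest = (d_tok @ k_tok.T).argmax(axis=-1)          # best-matching kept token

    for src, dst in zip(drop_idx, nearest):
        kept[dst] = (kept[dst] * counts[dst] + tokens[src]) / (counts[dst] + 1)
        counts[dst] += 1
    return kept, kept_idx
```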
3. Algorithmic and Theoretical Foundations
3.1 Attention-Based Accumulation and Thresholds
Several techniques aggregate attention probabilities or importance signals over heads, queries, and/or layers to compute cumulative importance scores, e.g. of the form

$$s_j \;=\; \sum_{\ell}\sum_{h}\sum_{i} A^{(\ell,h)}_{i,j},$$

where $A^{(\ell,h)}_{i,j}$ is the softmax attention probability from query $i$ to token $j$ in head $h$ of layer $\ell$. These scores are then compared against global or percentile-based thresholds to select the set of tokens to retain.
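A small sketch of this accumulate-then-threshold scheme, under the assumption that per-layer multi-head attention maps are available; the percentile value is illustrative.

```python
import numpy as np

def cumulative_importance(attn_maps):
    """Sum attention probabilities over layers, heads, and query positions.

    attn_maps: list of (H, n, n) attention tensors from the layers run so far.
    Returns an (n,)-vector: the total attention each token has received.
    """
    return sum(a.sum(axis=(0, 1)) for a in attn_maps)

def keep_by_percentile(scores, keep_percent=60.0):
    """Retain the tokens whose cumulative score clears a percentile cutoff."""
    cutoff = np.percentile(scores, 100.0 - keep_percent)
    return np.flatnonzero(scores >= cutoff)
```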
3.2 Submodular Maximization and Diversity Maintenance
To preserve global information, especially in ICL or complex vision-language inputs, some frameworks maximize coverage of feature diversity via facility-location or other submodular objectives, e.g. selecting the retained set $S$ under a budget $k$ by

$$\max_{S \subseteq V,\; |S| \le k} \;\; \alpha\, f(S) + (1-\alpha)\, g(S),$$

where $f(S)$ measures semantic alignment (e.g. with the query or instruction) and $g(S)$ measures setwise diversity.
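For illustration, here is a standard greedy routine for the facility-location objective, a common submodular surrogate for setwise diversity; it omits the alignment term and any budget schedule used by the cited frameworks.

```python
import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedy maximization of F(S) = sum_i max_{j in S} sim(i, j): each round
    adds the token that most improves how well the retained set covers all
    tokens, which favors a diverse, representative selection.

    similarity: (n, n) pairwise similarity matrix (e.g. cosine similarities)
    budget:     number of tokens to retain
    """
    n = similarity.shape[0]
    selected = []
    coverage = np.zeros(n)                   # best similarity to the selected set so far

    for _ in range(min(budget, n)):
        # Marginal gain of adding each candidate column j.
        gains = np.maximum(similarity, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains[selected] = -np.inf            # never re-select a chosen token
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, similarity[:, j])

    return np.sort(np.array(selected))
```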
3.3 Entropy-Based and Saliency-Weighted Scoring
Low-level visual or patch entropy is integrated into the importance score to prevent over-pruning in flat-information regions or to prioritize tokens with high visual content, e.g. via a weighted combination of the form

$$s_i \;=\; \alpha\, a_i + (1-\alpha)\, H(x_i),$$

where $a_i$ is an attention-derived importance score and $H(x_i)$ is the entropy of patch $x_i$. Tokens are further ranked using entropy, semantic alignment, or attention change across layers.
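A toy sketch of histogram-based patch entropy and one plausible way to blend it with an attention score; the bin count and mixing weight `alpha` are illustrative assumptions rather than any paper's exact formulation.

```python
import numpy as np

def patch_entropy(patch, bins=32):
    """Shannon entropy of a patch's intensity histogram: a cheap proxy for
    how much visual content the patch carries (flat regions score low)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_weighted_score(attn_score, patch, alpha=0.5):
    """Blend an attention-derived score with patch entropy so that
    low-attention but content-rich tokens are not over-pruned."""
    return alpha * attn_score + (1.0 - alpha) * patch_entropy(patch)
```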
3.4 Training Strategies: Noise Injection and Gradient-Weighted Scoring
Some methods relax the hard pruning mask into a continuous, noise-weighted process during training (e.g., TNT), aligning with information bottleneck and rate-distortion principles. Gradients of the loss with respect to attention matrices can serve as an importance proxy for guiding the learning of pruning structures (Rao et al., 27 Nov 2024, Mao et al., 30 Mar 2025).
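A minimal numpy sketch of the noise-relaxed mask idea; in practice this would be implemented with differentiable framework operations, and the Gaussian noise, sigmoid squashing, and zero threshold here are illustrative assumptions.

```python
import numpy as np

def soft_prune_mask(scores, noise_std=1.0, temperature=1.0, training=True, rng=None):
    """Relax the hard keep/drop decision into a soft mask during training by
    injecting noise into the importance scores; threshold into a hard mask at
    inference."""
    rng = rng or np.random.default_rng()
    if training:
        noisy = scores + rng.normal(scale=noise_std, size=scores.shape)
        return 1.0 / (1.0 + np.exp(-noisy / temperature))   # soft mask in (0, 1)
    return (scores > 0.0).astype(float)                      # hard mask; zero cutoff is arbitrary
```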
4. Efficiency, Accuracy, and Empirical Results
Progressive token pruning consistently yields substantial gains in model efficiency with negligible accuracy loss, and in some cases accuracy gains, because redundant or noisy inputs are eliminated:
| Method (Domain) | Main Speedup / Reduction | Accuracy Loss | Notes |
|---|---|---|---|
| SpAtten (NLP) | DRAM access and compute reduction; end-to-end speedup | None (typical) | No accuracy loss across 30 benchmarks; cascade pruning |
| LazyLLM (LLM) | TTFT speedup (Llama 2 7B) | None | Progressive; revives pruned tokens |
| SDTP (LLM) | FLOPs reduction and speedup | None | Hierarchical, saliency-driven |
| SViT (Vision) | Inference speedup | Small mAP change | Reactivation, dynamic rate |
| FitPrune (MLLM) | FLOPs reduction | | Statistical attention fitting |
| TransPrune (LVLM) | ~50% TFLOPs reduction | | TTV/IGA, progressive multi-stage |
| VFlowOpt (LMMs) | Token pruning and speedup | | Info-flow guided, with token recycling |
| CATP (ICL LVLMs) | Token and latency reduction | | Semantic/diversity, stagewise, ICL focus |
| METEOR (Multimodal) | Token and TFLOPs reduction | | Progressive, multi-stage, multi-encoder |
| LVTP (ViT, segmentation) | GFLOPs reduction | Small mIoU change | Edge-aware, entropy-driven |
Progressive, multi-stage, and hybrid strategies are favored for maintaining accuracy at high pruning ratios, especially in dense prediction and complex multimodal tasks.
5. Integration and Deployment Considerations
- Training-Free Applicability: Most recent frameworks (e.g., FitPrune, TransPrune, AdaptPrune, LazyLLM, CATP, VFlowOpt) are designed to be plug-and-play at inference, not requiring further model fine-tuning. This lowers barriers for practical adoption in existing deployed models.
- Hardware and Algorithmic Compatibility: Methods such as SpAtten and Prune-and-Merge focus on matrix-friendly operations or are designed for hardware acceleration; others exploit on-the-fly O(n) ranking engines or cache management for worst-case performance guarantees.
- KV Cache and Memory Footprint: In LLMs, reduction of KV cache size is explicitly targeted in methods like LazyLLM and VFlowOpt, yielding direct GPU and DRAM savings.
- Interpretability and Debugging: Some approaches highlight or visualize retained/pruned tokens, facilitating model inspection and understanding of importance attribution.
- Instance and Layer Adaptivity: Dynamic, context-aware, and task-specific thresholds and pruning ratios can be set via statistical, entropy-based, or optimization-derived criteria, improving model robustness across input complexity.
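As a sketch of the KV-cache point above (assuming a simple `(tokens, heads, head_dim)` cache layout and precomputed importance scores), pruning cache entries reduces both memory footprint and the per-step attention cost of future decoding:

```python
import numpy as np

def prune_kv_cache(k_cache, v_cache, importance, keep_ratio=0.5):
    """Drop the cached keys/values of low-importance tokens so the KV cache,
    and the attention cost of all subsequent decoding steps, shrinks with the
    pruned token set.

    k_cache, v_cache: (n_tokens, n_heads, head_dim) cached keys and values
    importance:       (n_tokens,) importance scores for the cached tokens
    """
    k = max(1, int(round(keep_ratio * len(importance))))
    keep = np.sort(np.argsort(-importance)[:k])       # top-k, original order kept
    return k_cache[keep], v_cache[keep], keep
```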
6. Extensions and Future Directions
- Integrating progressive token pruning with quantization and other memory/computation reduction techniques (progressive quantization, early-exit confidence estimation, speculative decoding) remains a promising avenue for compounding efficiency gains (Wang et al., 2020, Fu et al., 19 Jul 2024).
- The combination of token transition dynamics (e.g., TTV), context-aware alignment/diversity, and plug-and-play statistical fitting offers powerful templates for pruning in expanding domains such as video, high-resolution vision, and multi-turn dialog in LVLMs.
- Increasingly, research highlights the practical need for progressive, multi-stage pruning (encoding–fusion–decoding, or input–attention–output), especially for multi-encoder and in-context scenarios with complex or adaptive token budgets (Liu et al., 28 Jul 2025, Li et al., 11 Aug 2025).
A plausible implication is that progressive token pruning will be foundational for scalable deployment of transformer-based models, particularly as models continue to grow in parameter count and context length, and as new modalities and interaction structures emerge.