Token-wise Pruning and Eviction
- Token-wise pruning/eviction is a technique that selectively removes redundant tokens during neural network inference to reduce compute and memory overhead.
- It leverages metrics such as attention saliency, KV similarity, and sensitivity scores to identify and remove low-importance tokens.
- This approach is applied across language, vision, and multimodal models, achieving significant latency and memory savings with minimal performance loss.
Token-wise Pruning/Eviction
Token-wise pruning (or eviction) refers to the selective removal of tokens during neural model inference or retrieval, with the goal of reducing computational and/or memory overhead without substantially degrading performance. It has emerged as a central paradigm for accelerating large-scale models in the language, vision, and multimodal domains. Across tasks, token-wise pruning exploits token redundancy by skipping, removing, or compressing tokens at various stages of processing, using criteria ranging from attention statistics to optimization-based sensitivity scores. The diversity of techniques and settings underscores the importance of precise, online, and often training-free mechanisms for deployable, high-throughput systems.
1. Motivation and Theoretical Foundation
In transformer-based architectures (LLMs, ViTs, LVLMs), the primary bottleneck is the quadratic cost of self-attention with respect to sequence length: compute grows quadratically and KV-cache memory grows linearly with the number of tokens. This motivates methods that prune tokens to reduce both compute and memory requirements, especially at inference, where context length or visual token count can reach thousands.
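As a back-of-the-envelope accounting (standard transformer cost analysis rather than a result of any cited method), pruning a length-$n$ sequence down to $m$ retained tokens shrinks both the attention compute and the KV-cache footprint:

```latex
% Per layer, for sequence length n and model width d (standard estimates):
%   attention compute : O(n^2 d)   -- QK^T scores and attention-weighted values
%   KV-cache memory   : O(n d)     -- cached keys and values
% After pruning to m < n retained tokens:
\mathcal{O}(n^2 d) \;\longrightarrow\; \mathcal{O}(m^2 d) \quad\text{(compute)},
\qquad
\mathcal{O}(n d) \;\longrightarrow\; \mathcal{O}(m d) \quad\text{(KV memory)}.
```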
Historical approaches focused on offline or static selection using saliency profiling on calibration sets, or simple heuristics (e.g., stop-word removal, low attention weights). However, these static schemes overfit to the calibration data and fail to generalize across inputs or adapt to runtime context (Lee et al., 8 Dec 2025). The need for robust online pruning algorithms led to (a) adaptive, per-input token evaluation, (b) pruning criteria that reflect redundancy at runtime, and (c) skip/eviction mechanisms that minimally disrupt the model's representational power.
Mathematically, pruning decisions are often cast as structured pruning problems, lossless dominance checks, or sensitivity minimization, with theoretical guarantees for score preservation in retrieval or bounded output perturbation in generation (Zong et al., 17 Apr 2025, Gu et al., 9 Oct 2025).
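To make the dominance idea concrete, the following is a plausible formalization consistent with the description in (Zong et al., 17 Apr 2025), not necessarily the paper's exact statement. In a late-interaction scorer such as ColBERT, relevance is a sum of per-query-token maxima, so a document token can be dropped whenever some other token of the same document matches every admissible query vector at least as well:

```latex
% MaxSim relevance of query q = (q_1, ..., q_K) to document D = (t_1, ..., t_N):
S(q, D) \;=\; \sum_{k=1}^{K} \; \max_{1 \le j \le N} \, \langle q_k,\, t_j \rangle .
% Token t_i is "dominated" (hence removable without changing S) if
\forall\, q \in \mathcal{Q}: \quad \max_{j \neq i} \, \langle q,\, t_j \rangle \;\ge\; \langle q,\, t_i \rangle ,
% where Q is the admissible set of query embeddings; the condition is
% checked via linear-programming feasibility.
```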
2. Pruning Criteria: Redundancy, Saliency, Similarity
The principal challenge is to measure token importance so that non-critical tokens are pruned without incurring major accuracy loss. The field has converged on several broad classes of token-importance criteria:
- Key-Value Similarity Metrics (KV Similarity): Online methods such as Token Filtering (Lee et al., 8 Dec 2025) measure the cosine similarity between token key/value pairs and anchor vectors (running means), producing a redundancy score that reflects whether a token's representation is semantically covered by earlier tokens. Variance-aware fusion dynamically weights key and value similarity based on variance across attention heads, ensuring the criterion remains stable at high pruning ratios. A minimal sketch of this criterion appears after this list.
- Attention-based Saliency: Many methods derive an "attention mass" score for each token, either as the sum of attention received by the token across all heads (He et al., 2021), or self-/cross-attention in multimodal transformers (Guo et al., 18 May 2025). Tokens with low aggregate attention are deemed uninformative. Some approaches go further, using two-stage schemes: early visual self-attention pruning, followed by cross-modal (vision–text) attention pruning for task relevance (Guo et al., 18 May 2025).
- Transition-based and Sensitivity Criteria: Transition variation evaluates how much a token's embedding changes (e.g., in norm or direction) through transformer submodules (Li et al., 28 Jul 2025). Tokens with high transition are semantically active. Sensitivity-based methods utilize zeroth-order gradient approximation to assess how small perturbations in token features affect projection-layer outputs (Kim et al., 29 Sep 2025).
- Dominance and Lossless Guarantees: In late-interaction IR (e.g., ColBERT), a token is pruned if it is "dominated", i.e., for every admissible query, some other token of the same document achieves at least as high an inner product (Zong et al., 17 Apr 2025). Dominance is checked via LP feasibility, enabling theoretical guarantees of lossless retrieval.
- Auxiliary/Task-Specific Metrics: For dense prediction (segmentation, detection), per-token difficulty is assessed via auxiliary heads or foreground/background classification (Tang et al., 2023, Sah et al., 2024). Easy background or high-confidence tokens can be pruned or exited early.
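A minimal sketch of the KV-similarity redundancy criterion described in the first bullet, in the spirit of Token Filtering (Lee et al., 8 Dec 2025); the running-mean anchors, the variance-based fusion weight, and the function name are illustrative assumptions rather than the paper's exact formulation:

```python
import torch

def kv_redundancy_scores(keys: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Redundancy score per token from key/value similarity to running-mean anchors.

    keys, values: [num_heads, seq_len, head_dim]
    Returns:      [seq_len] scores; higher means the token is more redundant
                  (its representation is already covered by earlier tokens).
    Sketch only: anchors and fusion weights are illustrative assumptions.
    """
    H, T, D = keys.shape

    def running_mean_anchors(x: torch.Tensor) -> torch.Tensor:
        # Anchor for token t = mean of tokens 0..t-1 (zeros for the first token).
        csum = torch.cumsum(x, dim=1)
        counts = torch.arange(1, T + 1, device=x.device, dtype=x.dtype).view(1, T, 1)
        anchors = torch.zeros_like(x)
        anchors[:, 1:] = csum[:, :-1] / counts[:, :-1]
        return anchors

    k_sim = torch.cosine_similarity(keys, running_mean_anchors(keys), dim=-1)      # [H, T]
    v_sim = torch.cosine_similarity(values, running_mean_anchors(values), dim=-1)  # [H, T]

    # Variance-aware fusion: down-weight whichever similarity fluctuates more
    # across attention heads, so the fused score stays stable.
    k_var = k_sim.var(dim=0, unbiased=False) + 1e-6
    v_var = v_sim.var(dim=0, unbiased=False) + 1e-6
    w_k = (1.0 / k_var) / (1.0 / k_var + 1.0 / v_var)                    # [T]
    return w_k * k_sim.mean(dim=0) + (1.0 - w_k) * v_sim.mean(dim=0)     # [T]

# Usage: keep tokens whose redundancy stays below a (possibly adaptive) threshold.
# keep_idx = (kv_redundancy_scores(K, V) < tau).nonzero(as_tuple=True)[0]
```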
3. Online, Adaptive, and Structured Pruning Algorithms
Token-wise pruning can be structured by where, when, and how the algorithm intervenes during inference:
| Method/Class | Pruning Timing | Key Mechanism |
|---|---|---|
| Token Filtering (Lee et al., 8 Dec 2025) | Per-layer (online) | KV similarity, adaptive threshold |
| STAR (Guo et al., 18 May 2025) | Multi-stage | Self- & cross-attention scoring |
| LazyLLM (Fu et al., 2024) | Stepwise (decode) | Dynamic attention scoring, aux cache |
| CoViPAL (Tang et al., 24 Aug 2025) | Pre-decoder | Contextual classifier, superv. attn |
| ZOO-Prune (Kim et al., 29 Sep 2025) | Projection layer | Zeroth-order sensitivity estimation |
| DToP (Tang et al., 2023) | Segmentation stages | Auxiliary confidence, top-k policy |
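As a concrete instance of one row above, a zeroth-order sensitivity criterion in the spirit of ZOO-Prune (Kim et al., 29 Sep 2025) can be approximated without backpropagation by probing the projection layer with symmetric perturbations; the probe count, step size, and function names below are illustrative assumptions rather than the published procedure.

```python
import torch

@torch.no_grad()
def zeroth_order_token_sensitivity(proj, tokens, num_probes: int = 4, eps: float = 1e-3):
    """Estimate per-token sensitivity of a (token-wise) projection layer's output.

    proj:   callable mapping [num_tokens, d_in] -> [num_tokens, d_out]
            (e.g., the vision-to-language projector in an LVLM)
    tokens: [num_tokens, d_in] token features
    Returns [num_tokens] scores; low-sensitivity tokens are pruning candidates.
    Sketch only: a finite-difference (zeroth-order) gradient-norm proxy,
    averaged over a few random probe directions.
    """
    scores = torch.zeros(tokens.shape[0], device=tokens.device)
    for _ in range(num_probes):
        u = torch.randn_like(tokens)
        u = u / (u.norm(dim=-1, keepdim=True) + 1e-8)
        # Symmetric finite difference along the probe direction.
        delta = proj(tokens + eps * u) - proj(tokens - eps * u)   # [T, d_out]
        scores += delta.norm(dim=-1) / (2.0 * eps)
    return scores / num_probes

# Usage (hypothetical): keep the k most output-sensitive visual tokens.
# idx = zeroth_order_token_sensitivity(projector, vis_tokens).topk(k).indices
```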
Algorithms may use hard pruning (physical removal of tokens from the sequence), soft gating (masking or rescaling hidden states), or "skip-and-carry", where tokens are not computed but their state is carried forward unchanged (Li et al., 2024). Pruning can be globally scheduled (fixed layer and ratio), layer-wise, or adaptively chosen per instance and per layer (Ye et al., 2024, Taniguchi et al., 12 Jan 2026).
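Hard pruning is the simplest of these mechanisms: the kept positions are physically gathered from the hidden states and the KV cache, so all later layers operate on a shorter sequence. A minimal sketch (tensor shapes and function name are illustrative):

```python
import torch

def hard_prune(hidden, k_cache, v_cache, keep_idx):
    """Physically remove pruned tokens from the running sequence.

    hidden:            [batch, seq_len, d_model]
    k_cache, v_cache:  [batch, num_heads, seq_len, head_dim]
    keep_idx:          sorted LongTensor of token indices to retain
                       (order preserved so positional structure survives)
    """
    hidden = hidden[:, keep_idx, :]
    k_cache = k_cache[:, :, keep_idx, :]
    v_cache = v_cache[:, :, keep_idx, :]
    return hidden, k_cache, v_cache

# Soft gating would instead keep seq_len fixed and zero or rescale the pruned
# positions; skip-and-carry skips the layer's computation for those positions
# and copies their states forward unchanged.
```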
Notable algorithmic frameworks include:
- Structured Linear Programs and Dominance (Zong et al., 17 Apr 2025): For lossless pruning in retrieval, LPs identify dominated tokens.
- Optimal Transport Pruning (Yang et al., 24 Mar 2025): Formulating token retention as minimizing reconstruction cost under transport constraints for compatibility with optimized attention kernels.
- Variance-aware Thresholding: Feedback control on per-layer skip ratios for stable budget adherence (Lee et al., 8 Dec 2025); a minimal controller sketch appears after this list.
- Auxiliary Caching and Token Revival: Retaining pruned token activations for possible later reentry, ensuring recoverability in hard contexts (Fu et al., 2024, Liu et al., 2023).
- Plug-and-Play and Training-Free Modules: Use of lightweight classifiers or greedy diversity to ensure fast, model-agnostic deployment (Tang et al., 24 Aug 2025, Kim et al., 29 Sep 2025, Liu et al., 1 Aug 2025).
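The feedback-controlled thresholding mentioned above can be realized as a small proportional controller that nudges a per-layer score threshold toward a target skip ratio; this sketch is an assumed instantiation (the gain, update rule, and bounds are not taken from the cited work):

```python
class AdaptiveThreshold:
    """Proportional feedback controller for a per-layer pruning threshold.

    Drives the observed skip ratio toward `target_ratio` so the overall token
    budget is respected even as score distributions drift from input to input.
    Illustrative sketch only.
    """

    def __init__(self, target_ratio: float = 0.5, init_threshold: float = 0.0,
                 gain: float = 0.1):
        self.target = target_ratio
        self.threshold = init_threshold
        self.gain = gain

    def step(self, scores) -> float:
        """Update the threshold from this step's per-token redundancy scores.

        scores: iterable of floats (higher = more prunable).
        Returns the threshold to use for the next pruning decision.
        """
        scores = list(scores)
        skipped = sum(s >= self.threshold for s in scores) / max(len(scores), 1)
        # Skipped too few tokens -> lower the threshold; too many -> raise it.
        self.threshold -= self.gain * (self.target - skipped)
        return self.threshold

# Usage: ctrl = AdaptiveThreshold(target_ratio=0.5)
#        tau = ctrl.step(scores); keep = [i for i, s in enumerate(scores) if s < tau]
```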
4. Empirical Performance, Throughput, and Quality-Accuracy Pareto
Token-wise pruning delivers substantial reductions in computational cost—latency, memory, and FLOPs—while minimally degrading downstream accuracy, provided appropriate criteria and scheduling:
- LLMs: Token Filtering on LLaMA-2-13B (Lee et al., 8 Dec 2025) achieves up to 46.6% latency reduction and 33.6% memory savings at a 50% pruning ratio, retaining 65.9% output accuracy (vs. 69.5% dense) and outperforming prior structured pruning methods at high sparsity.
- Late-interaction IR: Dominance-based pruning retains only 30–40% of the original token vectors, with <1% in-domain and <3% out-of-domain retrieval drop (Zong et al., 17 Apr 2025).
- Vision-language models (LVLMs): STAR (Guo et al., 18 May 2025) and HiPrune (Liu et al., 1 Aug 2025) remove 80–90% of visual tokens with <2–3% accuracy loss; HiPrune reaches a 9× FLOP reduction while keeping only 5.6% of tokens and retaining 92.5% accuracy. The training-free ZOO-Prune (Kim et al., 29 Sep 2025) prunes >90% of tokens with <5% relative accuracy drop across models.
- Dense Prediction: DToP (Tang et al., 2023) cuts compute by 20–35% with negligible segmentation mIoU loss. SViT (Liu et al., 2023) achieves a 25–46% speedup with <0.3 mAP loss on COCO detection/segmentation by preserving and reactivating pruned tokens.
- Long-Context LLMs: Adaptive Layer Selection (ASL) (Taniguchi et al., 12 Jan 2026) adaptively picks pruning layer according to rank stability, outperforming fixed-layer schemes on both accuracy and retrieval rates under tight KV budgets.
- Diffusion/Image Synthesis: DaTo (Zhang et al., 2024) integrates token pruning with feature caching for Stable Diffusion, obtaining up to a 9× end-to-end speedup with no FID degradation (and in some cases a slight improvement) by using token dynamics as the selection metric.
Trade-offs between speed and accuracy are highly tunable via hyperparameters (pruning ratio, threshold; see reported ablations for per-task regimes). Layer- and instance-wise schedules, as in ATP-LLaVA (Ye et al., 2024), show substantial gains over global fixed pruning.
5. Design Challenges and Limitations
Key issues in token-wise pruning research relate to stability, generalization, and integration:
- Stability of Importance Metrics: Static, attention-only criteria may be unstable across layers and heads, or may retain mutually redundant tokens; joint criteria (e.g., KV similarity), transition-based, or sensitivity-based approaches improve robustness (Lee et al., 8 Dec 2025, Li et al., 28 Jul 2025, Kim et al., 29 Sep 2025).
- Generalization and Input Adaptivity: Calibration-free, online evaluation avoids overfitting and adapts to per-input context, crucial for strong generalization (Lee et al., 8 Dec 2025, Fu et al., 2024).
- Budgeting and Throughput: Algorithms require strategies to balance quality and efficiency—either via feedback controllers (adaptive thresholds), search-based sparsity scheduling (Li et al., 2024), or optimization-based (OT or minimal-divergence) criteria (Yang et al., 24 Mar 2025, Ye et al., 2024).
- Compatibility and Training-Free Integration: Several frameworks are designed to be drop-in compatible with advanced inference kernels (e.g., FlashAttention), require no extra training, and can be efficiently batched (Yang et al., 24 Mar 2025, Kim et al., 29 Sep 2025).
- Preservation, Reactivation, and Context Integrity: In tasks requiring dense outputs (e.g., detection/segmentation), preserving pruned tokens in the feature map is critical for downstream recovery and context. Reactivation of pruned tokens in later layers further boosts accuracy (Liu et al., 2023).
Limitations persist in maintaining fine-grained information at high pruning ratios, handling distributional drift or new domain shifts (unless pruning is performed online), and scaling to modalities such as video or 3D. Methods relying exclusively on fixed, unidimensional attention statistics can show bias or collapse, which hybrid approaches (spatial + saliency, or diversity + importance) aim to remedy (Ye et al., 2024, Kim et al., 29 Sep 2025). For applications requiring exact retrieval, as in dominance-based pruning, the lossless-retrieval constraint may restrict practical pruning ratios (Zong et al., 17 Apr 2025).
6. Extensions, Comparative Analysis, and Future Directions
Current research directions focus on further adaptivity and hybridization:
- Joint Pruning of Model Components: Simultaneous head and token pruning within unified online frameworks.
- Dynamic, Per-Step Pruning: Runtime token reentry (auxiliary cache, context revival), per-layer dynamic scheduling (Fu et al., 2024, Kummer et al., 2024).
- Sensitivity-based and Output-Aware Pruning: Integration of output derivatives (Optimal Brain Damage/Cache) to quantify actual output perturbation (Gu et al., 9 Oct 2025), zeroth-order sensitivity, or token-level gradient proxies (Kim et al., 29 Sep 2025).
- Plug-and-Play, Model-Agnostic Pruning: Lightweight classifiers, attention-guided plug-in, and training-free wrappers for broad deployment across architectures (Tang et al., 24 Aug 2025, Liu et al., 1 Aug 2025).
- Lossless and Near-Lossless Extensions: Further formalization of pruning with retrieval/performance guarantees (dominance, linear programming) (Zong et al., 17 Apr 2025).
- Instance/Learner-Adaptive Controllers: Moves towards meta-pruning—controllers that allocate layer, ratio, and method per input or task (Taniguchi et al., 12 Jan 2026, Ye et al., 2024).
- Expanded Modalities and Tasks: Research extending to LVLM video pipelines, dense pixel/voxel 3D applications, and more computationally constrained edge deployments (Zhang et al., 2024, Sah et al., 2024).
Open problems include theoretical bounds on information loss, unified token- and head-pruning frameworks, adaptive thresholds based on downstream feedback, and application to encoder–decoder and non-autoregressive architectures. The integration of richer token-level signals (e.g., gradient sensitivities, learned policies for skip/keep) and data-driven diversity for robust coverage remain promising frontiers.
Key references: (Lee et al., 8 Dec 2025, Zong et al., 17 Apr 2025, He et al., 2021, Tang et al., 24 Aug 2025, Sah et al., 2024, Yang et al., 24 Mar 2025, Fu et al., 2024, Guo et al., 18 May 2025, Li et al., 28 Jul 2025, Taniguchi et al., 12 Jan 2026, Ye et al., 2024, Kim et al., 29 Sep 2025, Liu et al., 2023, Li et al., 2024, Zhang et al., 2024, Liu et al., 1 Aug 2025, Tang et al., 2023, Gu et al., 9 Oct 2025).