
Adaptive Token Pruning for Transformers

Updated 21 December 2025
  • Adaptive Token Pruning (ATP) is a set of techniques that dynamically selects essential tokens from transformer models to optimize inference efficiency in vision and multimodal tasks.
  • ATP methods employ attention-based scoring, adaptive thresholds, and cross-modal signals to balance computational budgets with performance, achieving significant FLOP and latency reductions.
  • These strategies integrate with transformer backbones in a plug-and-play manner, enabling scalable, high-resolution multimodal applications and real-time performance on resource-constrained hardware.

Adaptive Token Pruning (ATP) is a collection of algorithmic strategies for dynamically reducing the number of tokens processed within transformer-based models, particularly Vision Transformers (ViTs), Vision-Language Models (VLMs), and Multimodal Large Language Models (MLLMs), during inference. The central objective is to efficiently select and retain only the most informative tokens, conditioned on the input data and task context, thereby substantially reducing computational and memory requirements while minimally impacting, and sometimes even improving, model accuracy. Contemporary ATP approaches span attention-based heuristics, information-theoretic metrics, instance-adaptive gating, and multi-cue fusion, and are foundational to efficient large-scale deployment of transformer models across vision, language, and multimodal reasoning.

1. Motivations and Problem Formulation

The quadratic complexity of transformer self-attention with respect to token count renders high-resolution visual or multimodal sequences computationally expensive, especially in ViTs or LVLMs where the number of image or video tokens dominates the sequence length. Redundancy is well documented: many tokens encode low-signal background information or repeat highly similar features. The formal ATP objective is, for a model $f_\theta$ and compute budget $B$, to select a subset of tokens, specified by a binary mask $m_t$, such that the total cost satisfies $\sum_t m_t \cdot c_t \leq B$ (with $c_t$ the per-token compute cost), while maximizing task performance (Wang et al., 28 Sep 2025, Li et al., 2022). ATP distinguishes itself from static or fixed-ratio pruning by adaptively selecting retention ratios, pruning thresholds, or token subsets in a data-driven, often instance- or layer-specific manner, and typically operates during inference without retraining the backbone.
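
To make the formulation concrete, the following minimal sketch greedily retains the highest-scoring tokens while respecting the budget constraint above. The function name, the greedy rule, and the uniform cost assumption are illustrative placeholders, not a procedure taken from any of the cited papers.

```python
# Minimal sketch of the ATP selection problem: keep the highest-scoring tokens
# while respecting a compute budget B (illustrative, not a published method).
import torch

def select_tokens_under_budget(scores: torch.Tensor,
                               costs: torch.Tensor,
                               budget: float) -> torch.Tensor:
    """Return a binary mask m with sum(m * costs) <= budget.

    scores: (T,) per-token importance estimates (e.g., CLS attention mass).
    costs:  (T,) per-token compute costs (uniform for standard self-attention).
    """
    order = torch.argsort(scores, descending=True)   # greedy: best tokens first
    mask = torch.zeros_like(scores, dtype=torch.bool)
    spent = 0.0
    for idx in order:
        if spent + costs[idx] <= budget:
            mask[idx] = True
            spent += float(costs[idx])
    return mask

# Example: 576 visual tokens with uniform cost and a budget for ~25% of them.
scores = torch.rand(576)
costs = torch.ones(576)
mask = select_tokens_under_budget(scores, costs, budget=144)
print(int(mask.sum()))  # ~144 retained tokens
```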

2. Core Mechanisms and Algorithms

ATP frameworks employ diverse token importance metrics and selection mechanisms:

  • Attention-based scoring: Many ATP methods (e.g., SaiT (Li et al., 2022), AS-ViT (Liu et al., 2022)) utilize mean attention weights from the CLS token to each patch, sometimes weighted by head importance, to quantify token saliency. Tokens are then selected either by a fixed top-$k$, or adaptively by accumulating a mass threshold $M_{\rm th}$ of total attention (Li et al., 2022); a minimal sketch of this mass-threshold selection follows the list.
  • Adaptive thresholds: Learnable or instance-adaptive thresholds (as in AS-ViT) enable dynamic token retention, optimized with FLOPs-aware loss terms, so that per-image, per-stage pruning self-tunes to hit accuracy and sparsity targets (Liu et al., 2022, Ye et al., 30 Nov 2024).
  • Hybrid and cross-modal cues: Recent methods fuse CLS attention with external or cross-modal signals, such as CLIP text-image similarity, to better align pruning with the task context (e.g., VQA) (Li et al., 14 Dec 2025). Convex combinations $w_i = \alpha S_i + (1-\alpha) A_i$, where $S_i$ is the inter-modal score and $A_i$ the intra-modal score, allow flexible trade-offs between objectness and query relevance (illustrated in the second sketch at the end of this section).
  • Dynamic token schedules: Instance- and task-complexity-aware retention curves are tuned based on estimated mutual information between visual and textual tokens, yielding sample-adaptive, non-uniform layerwise pruning schedules (Wang et al., 28 Sep 2025). Logistic curves parameterized by complexity indicators ensure both computational budget adherence and task-adaptive depth-wise pruning.
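
As a concrete illustration of the first two mechanisms above, the sketch below scores patch tokens by head-averaged CLS attention and keeps the smallest set whose cumulative attention mass reaches a threshold $M_{\rm th}$. The tensor layout and the threshold value are assumptions for illustration, not the exact procedure of SaiT or AS-ViT.

```python
# Illustrative CLS-attention scoring with an adaptive mass threshold.
import torch

def mass_threshold_keep(attn: torch.Tensor, m_th: float = 0.9) -> torch.Tensor:
    """attn: (H, 1 + N, 1 + N) attention weights of one layer for one image,
    where index 0 is the CLS token and indices 1..N are patch tokens.
    Returns the indices of patch tokens to keep (count adapts per image)."""
    cls_to_patch = attn[:, 0, 1:].mean(dim=0)          # (N,) head-averaged saliency
    probs = cls_to_patch / cls_to_patch.sum()          # normalize to a distribution
    sorted_p, order = torch.sort(probs, descending=True)
    cum_mass = torch.cumsum(sorted_p, dim=0)
    k = int((cum_mass < m_th).sum().item()) + 1        # smallest set reaching M_th
    return order[:k]
```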

Implementations are often plug-and-play, requiring only selection logic or a lightweight router between model layers, and work with standard ViT or LVLM architectures.
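
The second sketch below illustrates the remaining two cues from the list: a convex fusion of inter-modal and intra-modal scores, $w_i = \alpha S_i + (1-\alpha) A_i$, and a logistic layer-wise retention schedule. The parameter names (alpha, midpoint, steepness) and the normalization are illustrative assumptions rather than settings from the cited methods.

```python
# Hedged sketch of cross-modal score fusion and a logistic retention schedule.
import math
import torch

def fused_scores(inter_modal: torch.Tensor, intra_modal: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """w_i = alpha * S_i + (1 - alpha) * A_i, with both score vectors normalized."""
    s = inter_modal / (inter_modal.sum() + 1e-8)
    a = intra_modal / (intra_modal.sum() + 1e-8)
    return alpha * s + (1.0 - alpha) * a

def logistic_retention(layer: int, num_layers: int,
                       min_keep: float = 0.1, max_keep: float = 1.0,
                       midpoint: float = 0.5, steepness: float = 10.0) -> float:
    """Fraction of tokens to keep at a given layer: high early, low late."""
    x = layer / max(num_layers - 1, 1)                  # normalized depth in [0, 1]
    drop = 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))
    return max_keep - (max_keep - min_keep) * drop
```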

3. Notable ATP Techniques Across Research

Techniques reflect significant algorithmic diversity:

| Method | Importance Metric(s) | Selection Mechanism | Adaptivity Grain |
|---|---|---|---|
| SaiT (Li et al., 2022) | Attention mass/score | Value- or mass-based top-K | Prune location, threshold |
| AS-ViT (Liu et al., 2022) | Head-weighted class attention | Learned thresholds per stage | Hierarchical, per-image |
| ATP-LLaVA (Ye et al., 30 Nov 2024) | Redundancy + spatial scoring | Learned MLP thresholds, SAP | Per-layer, per-instance |
| AutoPrune (Wang et al., 28 Sep 2025) | Mutual information (MI) | Budgeted logistic curve | Input- and layer-adaptive |
| AdaptInfer (Zhang et al., 8 Aug 2025) | Text-to-text priors + cross-attn | Layerwise dynamic guidance | Per-layer, text-adaptive |
| AdaptPrune (Luan et al., 11 Mar 2025) | Attention × spatial × similarity | Soft non-max suppression | Single-layer, multi-cue |
| CATP (Li et al., 11 Aug 2025) | Alignment, diversity, context attn diff | Greedy submodular, hybrid | Multi-image, multi-stage |
| AdaTP (Sun et al., 26 May 2025) | Debiased attention (global, local) | Segment/position-aware | Video, per-segment |

Each method targets different settings (pure image, VQA, multimodal ICL, video), and employs a complementary set of cues for ranking and pruning tokens. Some, like AIM (Zhong et al., 4 Dec 2024), further combine merging with pruning for maximal redundancy removal.
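
To illustrate how merging can complement pruning, the sketch below fuses the tokens that would otherwise be discarded into a single importance-weighted summary token. This mirrors the general merge-plus-prune idea rather than the specific AIM algorithm; the weighting rule is an assumption for illustration.

```python
# Illustrative combination of pruning with merging: low-scoring tokens are
# fused into one summary token instead of being discarded outright.
import torch

def prune_and_merge(tokens: torch.Tensor, scores: torch.Tensor, keep: int):
    """tokens: (N, D) token embeddings; scores: (N,) importance; keep: tokens to retain.
    Returns (keep + 1, D): retained tokens plus one fused token summarizing the rest."""
    order = torch.argsort(scores, descending=True)
    kept, dropped = order[:keep], order[keep:]
    weights = torch.softmax(scores[dropped], dim=0).unsqueeze(-1)   # (N - keep, 1)
    fused = (weights * tokens[dropped]).sum(dim=0, keepdim=True)    # (1, D)
    return torch.cat([tokens[kept], fused], dim=0)
```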

4. Empirical Benefits and Performance

ATP consistently achieves substantial reductions in computation and memory with minor accuracy degradation:

  • SaiT (Li et al., 2022): Achieves 39–43% FLOP reduction, 67–91% throughput increase, <0.5% accuracy drop on ViT backbones using adaptive value/mass-based pruning.
  • AS-ViT (Liu et al., 2022): 50–97% throughput increase and <1.1% top-1 loss on ImageNet with 3 adaptive thresholds and head-weighted scoring.
  • ATP-LLaVA (Ye et al., 30 Nov 2024): Reduces token count by 75% (576→144), attaining 98.1% upper-bound accuracy (only 1.9% degradation) and a 78.1% drop in computation.
  • AutoPrune (Wang et al., 28 Sep 2025): At 89% token pruning, retains 96.7% of accuracy, surpassing PyramidDrop and related baselines by 9.1% in relative accuracy.
  • AdaptInfer (Zhang et al., 8 Aug 2025): Delivers a 61.3% CUDA latency reduction while preserving 92.9% of the original performance on LLaVA-1.5-7B.
  • AdaptPrune (Luan et al., 11 Mar 2025): At 90% pruning, maintains >80% of baseline performance, outperforming FastV and closing the gap on 12 multimodal and OCR benchmarks.
  • CATP (Li et al., 11 Aug 2025): Removes 77.8% of tokens in multimodal in-context learning, with average +0.6% gain in ICL accuracy, and 10.78% reduction in latency, outperforming all prior baselines.
  • AdaTP (Sun et al., 26 May 2025): 27–28% of vanilla FLOPs is sufficient to achieve lossless or even slightly improved accuracy on video LLM benchmarks.
  • FTP (Li et al., 16 Dec 2024) (text-only LLMs): Achieves up to 1.6× speedup at 40% sparsity with negligible (<1%) accuracy loss, outperforming BlockPruner and ShortGPT by roughly 10 points in accuracy retention.

These results highlight the cross-domain consistency of ATP in preserving or enhancing model performance under strong computational constraints.

5. Architectural and Hardware Considerations

Modern ATP implementations are agnostic to the transformer backbone (ViT, LLM, VLM, MLLM), but design choices significantly impact efficiency:

  • Layer and instance adaptivity: Layer-wise schedules are crucial, with early layers in vision backbones often requiring higher token retention. Instance-wise and sample-adaptive policies (ATP-LLaVA, AutoPrune) further optimize for the diversity of user queries and image complexity.
  • Hardware-awareness: Methods like HeatViT (Dong et al., 2022) target deployment on FPGAs. A specialized attention-based multi-head token selector is implemented with polynomial-function quantization and control logic for discrete keep/prune decisions, reaching up to 65.3% compute reduction and 3.46–4.89× FPGA speedup.
  • Plug-and-play: Many ATP schemes (SaiT, AdaptInfer, AdaptPrune, CATP) do not require backbone weight updates and are realized as inference-time, training-free modules between encoder and decoder stages, facilitating rapid integration into existing pipelines.
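
A minimal, training-free example of such a plug-and-play module is sketched below: it prunes visual tokens between a vision encoder and a language decoder using precomputed importance scores. The class name and interface are hypothetical; real pipelines expose encoder attention and token embeddings in model-specific ways.

```python
# Hypothetical training-free pruning module placed between a vision encoder
# and a language decoder; backbone weights are never modified.
import torch

class VisualTokenPruner(torch.nn.Module):
    def __init__(self, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio

    @torch.no_grad()
    def forward(self, visual_tokens: torch.Tensor, scores: torch.Tensor):
        """visual_tokens: (B, N, D); scores: (B, N) importance from the encoder
        (e.g., CLS attention). Keeps the top keep_ratio fraction per sample."""
        b, n, d = visual_tokens.shape
        k = max(1, int(n * self.keep_ratio))
        top = torch.topk(scores, k, dim=1).indices                 # (B, k)
        idx = top.unsqueeze(-1).expand(-1, -1, d)                  # (B, k, D)
        return torch.gather(visual_tokens, 1, idx)                 # (B, k, D)

# Usage (hypothetical pipeline): pruned = VisualTokenPruner(0.25)(vis_tokens, cls_attn),
# with the pruned tokens then concatenated with text tokens for the decoder.
```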

6. Limitations, Challenges, and Future Directions

Despite compelling efficiency gains, several challenges and areas for future research are noted:

  • Hyperparameter tuning: The trade-offs between speed and accuracy are controlled by pruning ratios, thresholds, and fusion factors, often requiring per-task or per-dataset tuning.
  • Task and domain alignment: In multi-image, video, or in-context learning scenarios, naïve per-image pruning may break cross-modal dependencies; methods such as CATP and AdaTP explicitly address this using global and contextual cues.
  • Multi-layer and dynamic scheduling: While single-layer ATP is computationally simple, multi-layer adaptive pruning or schedules derived from input complexity (AutoPrune) offer finer trade-off control but introduce additional scheduling logic.
  • Qualitative robustness: ATP can enhance interpretability by enforcing focus on objects relevant to the query, and improves robustness to data corruptions or text distractors by suppressing spurious tokens (Li et al., 14 Dec 2025). Large-scale studies of interpretability and error cases remain open.
  • Extension to text and other modalities: While image token pruning is dominant, fine-grained token pruning for LLMs (FTP) and cross-modal scenarios (AdaptInfer) are nascent and active areas.

Future ATP research is likely to explore joint visual and textual token pruning, learned policies via reinforcement learning or meta-learning (e.g., PRANCE (Li et al., 6 Jul 2024) using PPO for combinatorial pruning/channel search), and integration with edge-efficient and hardware-adaptive transformer architectures.

7. Representative Empirical Tradeoff Table

To illustrate performance profiles, the following table summarizes select ATP results across notable models and settings:

| Model | Tokens Retained | FLOPs ↓ | Throughput / Latency | Accuracy Retention |
|---|---|---|---|---|
| SaiT (ViT) (Li et al., 2022) | ~40–50% | 39–43% | 67–91% throughput ↑ | –0.5% top-1 |
| ATP-LLaVA (Ye et al., 30 Nov 2024) | 25% (144/576) | 78% | 38.4% CUDA latency ↓ | 98.1% |
| AutoPrune (LLaVA-1.5-7B) (Wang et al., 28 Sep 2025) | 11% (64/576) | 76.8% | — | 96.7% |
| AdaptInfer (Zhang et al., 8 Aug 2025) | 11% (64/576) | — | 61.3% CUDA latency ↓ | 92.9% |
| AdaptPrune (Luan et al., 11 Mar 2025) | 10% (60/576) | 81% | ~1.6× speedup | 80–90% depending on task |
| CATP (Li et al., 11 Aug 2025) | 22.2% | 79% | 10.78% latency ↓ | +0.6% ICL gain |

These results highlight that modern ATP methods can routinely operate in extreme token-sparsity regimes (≥80% tokens pruned) with only minimal accuracy loss and, at times, performance gains due to strategic removal of redundant or misleading signals.


Adaptive Token Pruning is now a central component of efficient transformer inference, enabling scaling of multimodal architectures to high-resolution, long-sequence, and real-time settings previously precluded by prohibitive compute and memory constraints. The field continues to advance with increasingly sophisticated, context- and data-driven strategies for selecting the essential tokens at every inference step.
