
Learned Token Pruning in Transformers

Updated 22 January 2026
  • Learned Token Pruning (LTP) is a method for enhancing transformer efficiency by selectively dropping uninformative tokens during inference.
  • LTP employs learned thresholds, top-k selection, and router modules to assess token importance and reduce computational cost.
  • Empirical studies across NLP, vision, and retrieval models demonstrate up to 2.1× FLOPs reduction with minimal accuracy loss.

Learned Token Pruning (LTP) is a family of methods designed to improve the computational efficiency of transformer-based architectures by selectively dropping tokens that are deemed uninformative or redundant during inference. LTP leverages either learned thresholds, data-driven gating networks, or formal optimization criteria to dynamically decide which tokens to retain or prune at each layer of the model, with the goal of reducing memory, computation, and, in certain cases, storage requirements while maintaining model effectiveness. It has seen applications across natural language processing, vision, retrieval, and multi-modal models.

1. Formal Principles and Taxonomy

LTP methods are fundamentally characterized by three elements: importance scoring, pruning mechanism, and learning or regularization strategy.

  • Token Importance Scoring: Each token's “importance” is measured using a criterion derived from attention weights, intermediate activations, or geometric dominance in embedding space. For instance, attention-based models may aggregate attention scores directed toward each token (e.g., the amount attended by the [CLS] pooler or column-wise means in the softmax) (Kim et al., 2021, Bonnaerens et al., 2023, Yang et al., 2023); state space models use the mean of positive activations across SSM output channels (Zhan et al., 2024); late-interaction IR models employ geometric notions of “domination” in representation space (Zong et al., 17 Apr 2025).
  • Pruning Mechanism: A mask or gating function is computed per token (often per layer), either via explicit thresholding (scalar or learned, possibly per-layer), top-k selection, or a learned gating module (e.g., a router MLP (Li et al., 2024)). The mask is then used to drop, zero out, or bypass computation for pruned tokens in subsequent model stages.
  • Learning or Regularization: Thresholds or gating functions are typically trained end-to-end. Approaches include joint optimization of model weights and thresholds with a loss function augmented by a sparsity or efficiency regularizer (Kim et al., 2021, Bonnaerens et al., 2023), differentiable surrogates for step functions to enable backpropagation (e.g., straight-through estimator, hard-concrete (Li et al., 2023)), or auxiliary objectives such as distillation or ranking consistency across layers (Li et al., 2023, Zong et al., 17 Apr 2025).
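As a concrete illustration of attention-based importance scoring, the sketch below computes the column-wise mean of a softmax attention tensor, i.e., how much attention each token receives on average. The shapes and the exact aggregation rule are simplifications for illustration, not the precise criterion of any cited paper.

```python
import numpy as np

def attention_importance(attn):
    # attn: (heads, seq_len, seq_len); each row is softmax-normalized.
    # Score for token j = average attention it receives across heads
    # and query positions (column-wise mean of the attention matrix).
    return attn.mean(axis=(0, 1))  # -> (seq_len,)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6, 6))  # 4 heads, 6 tokens
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
scores = attention_importance(attn)
print(scores.shape)  # (6,); scores sum to 1 because rows are normalized
```

Because every attention row sums to one, the per-token scores form a distribution over tokens, which makes thresholds comparable across inputs of the same length.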

2. Algorithmic Instantiations and Mathematical Formulation

LTP has appeared in several architectural and algorithmic contexts:

  • Threshold-Based Pruning: At layer ℓ, the importance score A_j^{(ℓ)} for token j is compared to a threshold T_ℓ; the token is kept if A_j^{(ℓ)} ≥ T_ℓ (Kim et al., 2021, Bonnaerens et al., 2023, Yang et al., 2023). The threshold T_ℓ is learned via a differentiable surrogate such as

\tilde{m}_j^{(\ell)} = \sigma\left(\kappa\,(A_j^{(\ell)} - T_\ell)\right)

with steepness κ ≫ 1. During inference, a hard mask is applied.
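A minimal sketch of this learned-threshold mechanism; the numbers and function names are illustrative, not taken from a specific implementation:

```python
import numpy as np

def soft_mask(scores, threshold, kappa=20.0):
    """Differentiable surrogate: sigmoid(kappa * (A - T)).
    With steepness kappa >> 1 this approaches a hard step function."""
    return 1.0 / (1.0 + np.exp(-kappa * (scores - threshold)))

def hard_mask(scores, threshold):
    """Inference-time binary mask: keep token j iff A_j >= T."""
    return (scores >= threshold).astype(np.float64)

scores = np.array([0.02, 0.30, 0.09, 0.55])
T = 0.10
print(soft_mask(scores, T))   # smooth values in (0, 1), trainable via backprop
print(hard_mask(scores, T))   # [0., 1., 0., 1.] -- tokens 1 and 3 survive
```

During training the soft mask multiplies token representations so gradients flow into both the backbone and the threshold; at inference the hard mask lets pruned tokens be dropped from computation entirely.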

  • Top-K/Ranking-Based Pruning: Sort tokens by their importance score and retain a fixed number of top tokens at each layer. Used in methods incorporating hard-concrete L_0 relaxations for ranking masks (Li et al., 2023), and in certain SSM-based approaches with tunable keep-rates (Zhan et al., 2024).
  • Router-Based Token Skipping: Fine-grained, token-wise routers (small MLPs) output per-token keep/drop probabilities, taking as input low-dimensional statistics such as position, attention score, rank, and block sparsity target (Li et al., 2024). The routers are trained using a mixture of guide, sparsity, and distillation losses to respect resource budgets while preserving accuracy.
  • Lossless Pruning by Dominance: In late-interaction retrieval models (e.g., ColBERT_p), a subset of document token representations D^+ ⊂ D is sought such that

\mathrm{ColBERT}_p(Q, D) = \mathrm{ColBERT}_p(Q, D^+)

for all queries Q; lossless removal is guaranteed through an LP test for geometric dominance (based on Farkas' lemma), and regularization reshapes the token geometry to maximize the number of dominated (removable) tokens (Zong et al., 17 Apr 2025).
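The dominance condition can be illustrated with a randomized check: a document token contributes nothing to any MaxSim score if, for every query direction, some other token scores at least as high. The sketch below samples random directions as a cheap approximate stand-in for the exact LP test described above; the function name and the sampling heuristic are ours, not the authors'.

```python
import numpy as np

def approx_dominated(candidate, others, n_dirs=2000, seed=0):
    """Heuristic check that `candidate` is dominated by `others`:
    for sampled unit queries q, require max_i q . others[i] >= q . candidate.
    The exact criterion in the literature uses an LP (Farkas' lemma);
    random sampling can only certify *non*-dominance with certainty."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n_dirs, candidate.shape[0]))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    best_other = (q @ others.T).max(axis=1)
    return bool(np.all(best_other >= q @ candidate - 1e-12))

others = np.array([[1.0, 0.0], [0.0, 1.0]])
mid = np.array([0.5, 0.5])      # convex combination of others -> dominated
outside = np.array([1.0, 1.0])  # outside their convex hull -> not dominated
print(approx_dominated(mid, others), approx_dominated(outside, others))
```

A convex combination of other tokens is always dominated, since q · (Σ λ_i d_i) ≤ max_i q · d_i for any query q; a point outside the hull is caught as soon as one sampled direction exposes it.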

  • Reinforcement Learning/Games: The token-pruning policy is treated as a sequential multi-agent Markov Game, with agents representing tokens making keep/prune decisions at each layer; rewards balance accuracy and computational savings, optimized via MAPPO (Lu et al., 30 Mar 2025).
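The top-k/ranking-based variant described above reduces, at inference time, to sorting scores and gathering the highest-ranked tokens. A hedged sketch follows; preserving the original token order after selection is a common convention, but details vary by paper.

```python
import numpy as np

def topk_prune(tokens, scores, keep_rate):
    """Keep the ceil(keep_rate * n) highest-scoring tokens,
    preserving their original sequence order."""
    n = tokens.shape[0]
    k = max(1, int(np.ceil(keep_rate * n)))
    keep = np.sort(np.argsort(-scores)[:k])  # top-k indices, re-sorted
    return tokens[keep], keep

tokens = np.arange(8 * 4, dtype=float).reshape(8, 4)  # (seq_len, dim)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept, idx = topk_prune(tokens, scores, keep_rate=0.5)
print(idx)  # positions of the 4 most important tokens, in order
```

Unlike learned thresholds, top-k fixes the post-pruning sequence length in advance, which simplifies batching at the cost of input-adaptive sparsity.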

3. Experimental Results and Efficiency–Effectiveness Analysis

Empirical studies consistently show that LTP enables substantial reductions in computational cost, often with minimal accuracy trade-off.

  • NLP and Code Tasks: On GLUE, LTP achieves up to 2.1× FLOPs reduction (transformer self-attention plus feedforward) with less than 1% drop in task accuracy (Kim et al., 2021), outperforming prior static or fixed-schedule pruning approaches. In source code classification, LTP in SparseCoder delivers 4× speedups over dense baselines and halves FLOPs with <1% performance drop (Yang et al., 2023).
  • Vision Transformers: On ImageNet with a DeiT-S backbone, learned-threshold LTP obtains 76.5% top-1 accuracy at 0.5× FLOPs (vs. 79.9% at full cost), exceeding uniform top-k methods and matching or surpassing DynamicViT, EViT, and SPViT after a single fine-tuning epoch (Bonnaerens et al., 2023). RL-pruned ViTs yield 1.4–1.6× speedup with less than 0.5% accuracy loss at 30–40% token sparsity (Lu et al., 30 Mar 2025).
  • State Space Model Vision Models: SSM-specific LTP achieves 41.4% FLOPs reduction with only 0.6% drop in ImageNet top-1 accuracy; generic ViT-pruning methods fail catastrophically on SSMs, highlighting architectural dependencies (Zhan et al., 2024).
  • Retrieval Models: Lossless LTP in ColBERT_p incurs <1.5% in-domain effectiveness drop at 70% token reduction (Zong et al., 17 Apr 2025).
  • LLMs: Router-based LTP for LLMs (e.g. LLaMA2-7B) yields ~1.4–1.6× inference speedup at >20% token sparsity, retaining >98% of accuracy and surpassing existing token/block pruning baselines by ~10 absolute points in accuracy retention (Li et al., 2024).
  • Multi-modal/Task-oriented Segmentation: Task-guided LTP in segmentation achieves 25–40% FLOPs reduction in ViTs with <1% mIoU loss; language-conditioned relevance decoders retain task-critical tokens missed by vision-only heuristics (Chen et al., 2024).
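To see how a per-layer keep-rate schedule translates into theoretical FLOPs reductions of the magnitude quoted above, one can use a rough cost model in which self-attention scales as n²·d and the feed-forward block as n·d·d_ff. All constants and the schedule below are illustrative assumptions, not figures from any cited paper.

```python
def flops_reduction(keep_rates, n=512, d=768, d_ff=3072):
    """Rough per-layer FLOPs model for a transformer encoder:
    attention ~ n^2 * d, feed-forward ~ 2 * n * d * d_ff.
    keep_rates[l] is the fraction of tokens surviving into layer l."""
    def layer_cost(m):  # m = number of tokens present in this layer
        return m * m * d + 2 * m * d * d_ff
    dense = len(keep_rates) * layer_cost(n)
    pruned = sum(layer_cost(int(r * n)) for r in keep_rates)
    return dense / pruned

# A schedule that drops roughly 15% of remaining tokens every two layers:
rates = [1.0, 1.0, 0.85, 0.85, 0.7, 0.7, 0.55, 0.55, 0.45, 0.45, 0.35, 0.35]
print(f"{flops_reduction(rates):.2f}x theoretical FLOPs reduction")
```

Because the feed-forward term is linear in the token count while attention is quadratic, aggressive late-layer pruning yields most of its savings in the linear components at BERT-scale sequence lengths.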

4. Architectural Variants and Modal-Specific Adaptations

LTP methods are attuned to the properties of each underlying model family:

  • Transformers (Text/Code/Vision): Significantly benefit from LTP due to quadratic cost in sequence length. LTP exploits sparsity in attention and, with minimal architectural overhead, dynamically adapts to input content and sequence length (Kim et al., 2021, Yang et al., 2023, Bonnaerens et al., 2023).
  • Vision State Space Models (Mamba, ViM): Require pruning mechanisms that preserve the sequential recurrence structure. LTP for SSMs maintains explicit hidden-state alignment, guaranteeing accurate propagation despite sparse token sets—a requirement unnecessary for ViTs (Zhan et al., 2024).
  • Late-Interaction Retrieval: Geometric dominance-based LTP can guarantee zero loss in retrieval score, at some cost to runtime, or yield efficient approximate heuristics (Zong et al., 17 Apr 2025).
  • LLMs (token-wise skipping): Fine-grained routers trained atop frozen LLMs, using low-dimensional cues and a search-based sparsity schedule, allow for per-block, per-token computation skipping without retraining the base model (Li et al., 2024).
  • Multi-modal Integration: LTP may explicitly condition on non-visual modalities. For example, VLTP fuses language guidance with image tokens to score patch relevance for task-oriented vision applications, outperforming vision-only approaches in task alignment (Chen et al., 2024).
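A router of the kind described for LLM token skipping can be sketched as a tiny MLP over a handful of per-token features; the feature set, layer sizes, and initialization below are illustrative assumptions, not the cited architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TokenRouter:
    """Tiny per-token router: maps a low-dimensional feature vector
    (e.g., position, attention score, rank, sparsity target) to a
    keep probability. Shapes and features are illustrative."""
    def __init__(self, in_dim=4, hidden=16):
        self.W1 = rng.normal(scale=0.5, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.5, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def keep_prob(self, feats):  # feats: (n_tokens, in_dim)
        h = np.maximum(0.0, feats @ self.W1 + self.b1)            # ReLU
        return (1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2))))[:, 0]

router = TokenRouter()
feats = rng.uniform(size=(10, 4))   # 10 tokens, 4 low-dim features each
p = router.keep_prob(feats)
mask = p >= 0.5                     # hard keep/drop decision at inference
print(p.shape, mask.sum(), "tokens kept")
```

Keeping the router's input to a few scalar statistics, rather than the full hidden state, is what makes per-token, per-block routing affordable at LLM scale.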

5. Advanced Regularization, Surrogate Losses, and Training Protocols

The optimization of LTP policies leverages several advanced techniques:

  • Hard-Concrete Relaxation: Used for L_0-style regularization to make binary gating differentiable in ranking-based token selection (Li et al., 2023).
  • Ranking Distillation: Early-layer token importance scores are distilled to align with the final-layer rankings of a dense teacher, directly optimizing NDCG via pairwise losses (Li et al., 2023).
  • SVD/Nuclear Norm/Span Collapse: In lossless LTP for IR, regularizers make token representations low-rank, increasing the proportion of geometrically dominated tokens—making more tokens removable with no retrieval loss (Zong et al., 17 Apr 2025).
  • Budget-aware and FLOPs Objective: Training losses often combine task loss with explicit penalties or surrogates for divergence from user-specified or budgeted FLOPs reduction targets (Bonnaerens et al., 2023, Li et al., 2023).
  • No/Minimal Backbone Training: Some LTP methods (especially routers for LLMs and vision) train only the routing, threshold, or pruning module and freeze the pre-trained backbone, enabling rapid fine-tuning (Bonnaerens et al., 2023, Yang et al., 2023, Li et al., 2024).
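A budget-aware objective of the kind described above can be sketched as follows; the quadratic penalty form and the name `budget_loss` are our illustrative choices, not the exact losses of the cited papers.

```python
import numpy as np

def budget_loss(task_loss, soft_masks, target_keep, lam=1.0):
    """Budget-aware training objective (illustrative form):
    task loss plus a penalty on the gap between the average soft
    keep-rate across layers and a user-specified budget."""
    keep_rate = np.mean([m.mean() for m in soft_masks])
    return task_loss + lam * (keep_rate - target_keep) ** 2

# Two layers' worth of soft masks (values from a sigmoid surrogate):
masks = [np.array([0.9, 0.8, 0.1, 0.2]), np.array([0.7, 0.1, 0.1, 0.1])]
loss = budget_loss(task_loss=0.35, soft_masks=masks, target_keep=0.4, lam=2.0)
print(loss)
```

Because the penalty is differentiable in the soft masks, gradient descent can trade task accuracy against the FLOPs budget directly, pushing the learned thresholds or gates toward the target keep-rate.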

6. Challenges, Architectural Limitations, and Future Directions

Despite robust empirical results, LTP faces several open challenges:

  • Inference in Batched Environments: Many LTP schemes produce a different post-pruning sequence length for each input, which complicates efficient batching; padding all sequences in a batch to the longest one recovers regular tensor shapes but sacrifices part of the latency or computational savings (Bonnaerens et al., 2023).
  • GPU/Hardware Efficiency: Despite theoretical FLOPs reduction, practical speedups, especially on GPUs, may be limited by nonoptimal kernel implementations and memory throughput bottlenecks (Li et al., 2023).
  • Generalization Beyond Vision and Language: Architecturally maladapted pruning (naive ViT-style strategies applied to SSMs) leads to drastic accuracy degradation unless hidden-state dynamics are preserved (Zhan et al., 2024).
  • Exactness vs. Efficiency: Strong guarantees, such as lossless pruning by geometric dominance, may be expensive at inference time, suggesting future potential for amortized, approximate, or differentiable dominance surrogates (Zong et al., 17 Apr 2025).
  • Multi-modal Reasoning: Vision-language integration reveals that relevance scores must be conditioned on guidance to maintain task alignment under high pruning rates; failure to do so leads to sharp performance drop (Chen et al., 2024).
  • Scalability for LLMs and Industrial-Scale Sequences: Token routing at scale in LLMs becomes tractable with low-dimensional (four-feature) per-token summaries and guiding losses, avoiding the prohibitive cost of per-token, per-layer dense gates (Li et al., 2024).

7. Comparative Evaluation and Empirical Benchmarks

A cross-domain synthesis of key empirical results for LTP variants:

| Model (Domain)   | FLOPs Reduction | Speedup  | Accuracy Loss        | Key Reference                                  |
|------------------|-----------------|----------|----------------------|------------------------------------------------|
| BERT-Base (NLP)  | 2.1×            | 1.9–2.0× | <1% (GLUE)           | (Kim et al., 2021)                             |
| DeiT-S (Vision)  | 2×              | 1.4–1.6× | 0.4–1.3% (ImageNet)  | (Bonnaerens et al., 2023, Lu et al., 30 Mar 2025) |
| ColBERT_p (IR)   | 3× (tokens)     | —        | <1.5% (in-domain)    | (Zong et al., 17 Apr 2025)                     |
| SSM Vision       | 1.43×           | 1.4×     | 0.6% (ImageNet)      | (Zhan et al., 2024)                            |
| LLaMA2-7B (LLM)  | 1.2–1.6×        | 1.4–1.6× | 0–2%                 | (Li et al., 2024)                              |
| Segmentation     | 1.3–1.7×        | —        | ≤1% (mIoU)           | (Chen et al., 2024)                            |

The cited works employ either threshold-based, routing, geometric, RL, or ranking-based LTP, with empirical conditions carefully controlled for baseline architecture and input size.


LTP encapsulates a principled, extensively tested suite of algorithms for adaptive efficiency in token-based deep learning models. Its continued evolution targets broader architectural applicability, stronger theoretical guarantees, hardware-friendly implementation, and robust maintenance of downstream performance.
