VLM-Pruner: Efficient Vision-Language Token Pruning

Updated 9 December 2025
  • The paper introduces a centrifugal token pruning strategy that discards redundant tokens to reduce the quadratic complexity in vision-language models.
  • It uses a Buffering for Spatial Sparsity (BSS) approach to ensure dense sampling of critical object regions and preserve spatial fidelity.
  • The method achieves significant speedups (up to 1.6×) and remarkable FLOPs reduction (up to 77.91%) without additional training.

Vision-LLM Pruner (VLM-Pruner) refers to a set of algorithmic frameworks and methodologies aimed at reducing the computational cost of vision-LLMs (VLMs) through the structured, principled elimination of redundant visual tokens and, in some cases, entire transformer layers. These methods address the quadratic complexity of VLM self-attention by discarding non-informative or redundant information while preserving the capacity for fine-grained multimodal reasoning. Several distinct paradigms of VLM-Pruner have emerged, ranging from graph-based token similarity pruning and spatial redundancy buffering to meta-routing and complexity-adaptive, sample-conditioned policies.

1. Fundamental Principles and Motivations

The principal motivation for VLM-Pruner frameworks is the substantial computational overhead imposed by the large number of visual tokens processed within VLM decoders. For high-resolution images this can mean upwards of thousands of patch tokens, generating an $O(N^2)$ burden in self-attention and cross-attention mechanisms. Existing approaches to token pruning, particularly those relying only on importance scores, are susceptible to two main limitations: excessive selection of semantically duplicated tokens (overfitting to object centers) and inadequate coverage of critical peripheral or fine-grained object regions, while purely redundancy-aware methods often disperse selections non-geometrically ["VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm" (Wu et al., 2 Dec 2025)].

The VLM-Pruner paradigm seeks to balance three key desiderata:

  • Redundancy reduction: Discourage selection of semantically or spatially overlapping tokens.
  • Spatial sparsity modulation: Encourage dense sampling of critical object regions while deprioritizing widely separated, often non-informative, areas.
  • Information recovery: Mitigate loss incurred by pruning through aggregation or feature fusion from discarded tokens.

This leads to a centrifugal selection approach: selection proceeds from a small set of semantically diverse pivots, expanding outward in space and feature dimensions in a manner modulated by both semantic similarity and physical token proximity (Wu et al., 2 Dec 2025).

2. Centrifugal Token Pruning Paradigm and Buffering for Spatial Sparsity (BSS)

The VLM-Pruner pipeline is structured in three main stages (a code sketch of the full pipeline appears at the end of this section):

  1. Pivot Initialization: A small set of $\kappa$ maximally separated "pivot" tokens is selected via a max–min strategy in key space, ensuring initial semantic diversity in the retained set. For each step $t$:

$$j_t = \arg\max_{j \in \mathcal{C}} \min_{j' \in S_{t-1}} \lVert K_j - K_{j'} \rVert_2, \quad t = 2, \dots, \kappa$$

with $S_t$ the set of selected pivots and $\mathcal{C}$ the set of candidates.

  2. Greedy Expansion with BSS: Remaining tokens are added via a non-duplication score that fuses cosine similarity and normalized spatial distance. For each candidate $i \notin S$:
  • Spatial normalization: Compute

    $$\delta_i(S) = \min_{j \in S} D^{(sp)}_{ij}, \qquad \bar{\delta}_i(S) = \frac{\delta_i(S)}{D_{\max}}$$

    where $D^{(sp)}_{ij}$ is the Euclidean grid distance between tokens $i$ and $j$.

  • BSS-modulated similarity:

    $$\widetilde{M}_{ij} = M_{ij}\left(1 + \lambda \bar{\delta}_i(S)\right), \quad \lambda > 0$$

    where $M_{ij}$ is the cosine similarity between tokens compressed to $q$ screening channels, the $q$ channels being chosen by top variance.

  • Selection threshold: Accept batch elements for which

    $$\max_{j \in S} \widetilde{M}_{ij} < \tau^{(t)}$$

    with progressive annealing of the threshold $\tau^{(t)}$.

  3. Similarity-Weighted Aggregation: Discarded tokens are clustered to their most similar kept token, and their features are aggregated back using similarity-weighted averages:

$$H_j \leftarrow \beta H_j + (1 - \beta) E_j$$

where $E_j$ is the weighted sum of hidden states of the assigned discarded tokens.

The BSS criterion explicitly penalizes selection of spatially distant tokens early in the process, ensuring local density of preserved tokens and preventing the loss of fine object details ["VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm" (Wu et al., 2 Dec 2025)].
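
To make the three stages concrete, the following is a minimal NumPy sketch of the pipeline described above. It is an illustrative reconstruction, not the authors' implementation: the seed pivot, the hyperparameter defaults (`kappa`, `lam`, `beta`, `tau0`, `anneal`), and the batch-acceptance logic are assumptions, and similarity is computed over full key vectors rather than the variance-screened channels used in the paper.

```python
# Illustrative sketch of centrifugal pruning with Buffering for Spatial Sparsity (BSS).
# Hyperparameters and the seed-pivot choice are assumptions, not the paper's defaults.
import numpy as np

def cosine_sim(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    return Xn @ Xn.T

def prune_tokens(K, H, coords, R, kappa=4, lam=0.5, beta=0.8, tau0=0.9, anneal=0.02):
    """
    K      : (N, d)  key vectors of the visual tokens (similarity space).
    H      : (N, dh) hidden states (fused back after pruning).
    coords : (N, 2)  patch-grid positions of each token.
    R      : retention budget (number of tokens to keep).
    Returns the indices of kept tokens and their aggregated hidden states.
    """
    N = K.shape[0]
    M = cosine_sim(K)                       # full-key similarity; the paper screens q channels
    D_sp = np.linalg.norm(coords[:, None, :].astype(float)
                          - coords[None, :, :].astype(float), axis=-1)
    D_max = D_sp.max() + 1e-8

    # Stage 1: max-min pivot initialization in key space (seed choice is an assumption).
    selected = [int(np.argmax(np.linalg.norm(K - K.mean(axis=0), axis=1)))]
    for _ in range(1, kappa):
        d_to_sel = np.linalg.norm(K[:, None, :] - K[selected][None, :, :], axis=-1).min(axis=1)
        d_to_sel[selected] = -np.inf
        selected.append(int(np.argmax(d_to_sel)))

    # Stage 2: greedy expansion with the BSS-modulated non-duplication score.
    tau = tau0
    while len(selected) < R:
        cand = np.array([i for i in range(N) if i not in selected])
        delta_bar = D_sp[np.ix_(cand, selected)].min(axis=1) / D_max   # spatial buffer term
        sim = M[np.ix_(cand, selected)].max(axis=1)                    # worst-case redundancy
        score = sim * (1.0 + lam * delta_bar)                          # BSS modulation
        accept = cand[score < tau]
        if accept.size:
            order = np.argsort(score[score < tau])                     # least redundant first
            for a in accept[order][: R - len(selected)]:
                selected.append(int(a))
        tau += anneal                                                  # progressive annealing

    # Stage 3: similarity-weighted aggregation of discarded tokens into kept ones.
    kept = np.array(sorted(selected[:R]))
    dropped = np.setdiff1d(np.arange(N), kept)
    H_out = H[kept].astype(float)
    if dropped.size:
        assign = M[np.ix_(dropped, kept)].argmax(axis=1)               # nearest kept token
        weight = np.clip(M[dropped, kept[assign]], 1e-6, None)
        for k in range(kept.size):
            mask = assign == k
            if mask.any():
                E_k = np.average(H[dropped[mask]], axis=0, weights=weight[mask])
                H_out[k] = beta * H_out[k] + (1.0 - beta) * E_k
    return kept, H_out
```

Calling `prune_tokens(K, H, coords, R=64)` on a 576-token LLaVA-1.5-style grid reproduces the 11.1% retention budget evaluated in Section 4; `H_out` would replace the hidden states of the kept tokens before the subsequent decoder layers.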

3. Algorithmic and Implementation Structure

The VLM-Pruner algorithm is instantiated as a plug-and-play module, operating on the key/hidden states of visual tokens at a fixed decoder layer (e.g., the second layer). The primary complexity lies in the $O(N^2 q)$ computation of channel-screened similarities and in the loop-based expansion using the BSS criterion.
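
As an illustration of the channel-screening step, the short sketch below selects the highest-variance key channels before computing pairwise cosine similarity, which bounds the similarity cost at $O(N^2 q)$; the value `q = 64` is an illustrative choice, not the paper's default.

```python
import numpy as np

def screened_similarity(K, q=64):
    """Cosine similarity restricted to the q highest-variance key channels.

    q = 64 is an illustrative screening dimension, not the paper's default.
    """
    top = np.argsort(K.var(axis=0))[-q:]                   # indices of top-variance channels
    Kq = K[:, top]
    Kq = Kq / (np.linalg.norm(Kq, axis=1, keepdims=True) + 1e-8)
    return Kq @ Kq.T                                       # (N, N); roughly N^2 * q multiply-adds
```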

Selection proceeds in parallel batches, with per-batch updates to nearest-distances and similarities. Empirically, the number of loops required for full expansion is limited by threshold annealing. Final feature fusion restores object boundaries and mitigates peripheral feature loss.

The algorithm is resolution- and architecture-agnostic, functioning on both fixed- and variable-sized token sets—including support for video VLMs using spatio-temporal token grids.

4. Empirical Performance and Comparative Evaluation

VLM-Pruner has been evaluated across five vision-LLM architectures, including:

  • LLaVA-1.5-7B, 13B
  • LLaVA-Next-7B
  • Qwen2-VL-7B-Instruct
  • LLaVA-Video-7B

At an extreme 88.9% pruning ratio (retaining only 11.1% of visual tokens), VLM-Pruner consistently achieves state-of-the-art trade-offs on 13 image/video benchmarks. For example, with 64 retained tokens on LLaVA-1.5-7B, VLM-Pruner attains 95.61% of the upper-bound accuracy, compared to 93.68% for DivPrune and 92.71% for DART.
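
For context, LLaVA-1.5 represents each image with 576 visual tokens (a 24×24 patch grid), so the 64-token budget corresponds to 64/576 ≈ 11.1% retention, i.e., the 88.9% pruning ratio quoted above.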

Speedups are substantial: on LLaVA-1.5-7B, POPE latency drops from 33 min 21 s to 23 min 59 s (a 1.39× speedup, with a 77.91% FLOPs reduction). For Qwen2-VL-7B, end-to-end time is reduced by 1.57× at 11.1% visual token retention, while retaining best-in-class OCR accuracy at that speed.

Ablation studies show that max–min pivot seeding, the spatial BSS penalty in the expansion, and the final feature aggregation each individually contribute 0.2–1.2 percentage points of the absolute performance gain over baselines. The design is robust to the screening channel dimension and batch size, with default choices near-optimal across models and input resolutions (Wu et al., 2 Dec 2025).

5. Methodological Context and Relation to Other VLM Pruning Strategies

VLM-Pruner outperforms and conceptually extends several categories of prior VLM token pruning approaches:

| Approach | Redundancy-aware | Spatial-aware | Feature aggregation | Training required |
|---|---|---|---|---|
| FastV, SparseVLM | – (importance) | No | – | No |
| DART, DivPrune | Yes (diversity) | No | – | No |
| LVPruning | – (importance) | No | – | Yes (module only) |
| FoPru | – (attention) | No | – | No |
| VLM-Pruner | Yes | Yes | Yes | No |

Standard importance-based (FastV, SparseVLM) and attention-based (LVPruning, FoPru) methods sacrifice coverage by over-focusing on salient tokens. Diversity-based (DART, DivPrune) methods maximize representation spread but can under-represent core object regions due to insufficient locality control. VLM-Pruner's BSS buffering uniquely guarantees both semantic diversity and spatial fidelity without fine-tuning or retraining.

6. Scalability, Complexity Analysis, and Deployment

VLM-Pruner's time complexity is dominated by the $O(N^2 q)$ similarity computation and the $O(N \cdot R \cdot \#\text{loops})$ expansion, where $q$ is the channel-screening dimension, $N$ the number of original tokens, and $R$ the retention budget. Precomputing spatial distances and parallelizing the expansion mitigate run-time cost. Integration is feasible as a pre-attention module at practical inference points.
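
As a rough order-of-magnitude check under illustrative settings (N = 576 tokens, q = 64 screened channels, R = 64 retained tokens), the screened similarity costs about 576² × 64 ≈ 2.1 × 10⁷ multiply-adds per image, and each expansion loop about N × R ≈ 3.7 × 10⁴ comparisons, both negligible next to the decoder computation the pruning removes.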

Deployment on mobile platforms is enabled by:

  • Minimal additional FLOPs (screening + selection)
  • Non-backpropagating nature (training-free)
  • Flexibility in pruning ratios and per-sample adaptation
  • Out-of-the-box compatibility with dynamic-resolution or video tokens

By exploiting redundancy and spatial locality, VLM-Pruner delivers 1.4–1.6× end-to-end speedups on mobile devices, significant memory reduction, and minimal accuracy trade-off at high pruning rates (Wu et al., 2 Dec 2025).

7. Limitations, Open Questions, and Future Research

Key limitations include:

  • Occasional information loss in ultra-sparse settings despite feature aggregation
  • Overheads in very high-resolution or spatiotemporal settings due to $O(N^2)$ scaling, suggesting a need for further optimizations such as hierarchical or block-wise selection

Potential extensions include multi-stage or adaptive layer-wise pruning, joint optimization with quantization or distillation, and generalization to textual or joint-modality token pruning.

Unaddressed challenges remain in fully aligning per-sample optimality (as targeted by frameworks such as AutoPrune (Wang et al., 28 Sep 2025)) and fully dynamic, task-conditional pruning while maintaining architectural generality and training-free integration.


Primary Reference: "VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm" (Wu et al., 2 Dec 2025)
