
Centrifugal Token Pruning Paradigm

Updated 9 December 2025
  • Centrifugal token pruning is a novel sparsification strategy that selects tokens in a near-to-far spatial order, balancing redundancy reduction with local detail preservation.
  • It employs a Buffering for Spatial Sparsity (BSS) criterion to modulate token similarities with spatial distances, ensuring contiguous token selection and efficient feature aggregation.
  • Empirical implementations like VLM-Pruner and SPViT demonstrate high accuracy retention and resource efficiency in vision-language and transformer models even at high pruning rates.

The centrifugal token pruning paradigm is a model sparsification strategy in which token selection proceeds in a near-to-far (or inside-to-outside) spatial order, fundamentally balancing redundancy reduction with fine-grained spatial coverage. It was proposed to address limitations of traditional importance-based and redundancy-minimizing pruning schemes that overlook local spatial structure in visual transformer models and multi-modal vision-LLMs. The paradigm is characterized by explicit control over the spatial expansion of selected token sets, often through a spatial buffer or weighting mechanism, and often incorporates downstream information fusion from pruned tokens. The centrifugal paradigm has been instantiated by frameworks such as VLM-Pruner for vision-LLMs and the soft-pruning SPViT for Vision Transformers, and demonstrates strong empirical advantages in accuracy-retention and resource efficiency under high pruning rates (Wu et al., 2 Dec 2025, Kong et al., 2021).

1. Paradigm Definition and Foundations

The centrifugal token pruning paradigm aims to select a token subset $S$ of size $R$ from the ground set $V$ ($|V| = N$) that both maximizes informational diversity (by minimizing representational redundancy among selected tokens) and enforces a progressive, near-to-far spatial order in selection. In formal terms, given reduced hidden representations and learned spatial coordinates $p_i$ for tokens $i$, centrifugal selection penalizes the early selection of spatially distant tokens through a Buffering for Spatial Sparsity (BSS) criterion. The BSS-weighted token similarity is defined as

$$\widetilde M_{ij} = M_{ij} \cdot \bigl(1 + \lambda\, \bar{\delta}_i(S)\bigr)$$

where $M_{ij}$ is the cosine similarity in projected hidden space, $\lambda$ is a positive buffer weight, and $\bar{\delta}_i(S) = \min_{j\in S} D^{(sp)}_{ij} / D_{\max}$ is the normalized spatial distance from candidate $i$ to the nearest selected token, with $D^{(sp)}_{ij}$ the Euclidean distance in token grid coordinates and $D_{\max}$ the maximal observed grid distance.

The selection maximizes the objective

$$F(S) = \sum_{i\in C} \min_{j\in S} \widetilde M_{ij}$$

where $C = V \setminus S$ is the set of current candidates. This objective is approximately submodular; greedy maximization yields an effective approximation (Wu et al., 2 Dec 2025).
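Under these definitions, one plausible greedy reading adds, at each step, the candidate whose BSS-modulated similarity to the already-selected set is smallest, i.e., the token that is least redundant once the spatial buffer is applied. The sketch below is a toy, single-pivot NumPy version; the function names, the single-pivot start, and the plain sequential loop are illustrative stand-ins, not the paper's batched multi-pivot procedure:

```python
import numpy as np

def bss_similarity(M, D_sp, selected, lam):
    """BSS-modulated similarity of every token to the selected set.

    M        : (N, N) cosine similarities in projected hidden space
    D_sp     : (N, N) Euclidean distances in token-grid coordinates
    selected : list of already-selected token indices
    lam      : buffer weight lambda (lam -> 0 recovers plain redundancy)
    """
    d_bar = D_sp[:, selected].min(axis=1) / D_sp.max()  # normalized distance to S
    # Amplify similarity for spatially distant candidates, deferring them.
    return M[:, selected] * (1.0 + lam * d_bar)[:, None]

def centrifugal_select(H, P, R, lam=0.5, pivot=0):
    """Greedy near-to-far token selection (toy sketch).

    H: (N, d) token features, P: (N, 2) grid coordinates, R: tokens to keep.
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    M = Hn @ Hn.T                                    # cosine similarity
    D_sp = np.linalg.norm(P[:, None] - P[None], axis=-1)
    selected = [pivot]
    while len(selected) < R:
        Mt = bss_similarity(M, D_sp, selected, lam)
        score = Mt.min(axis=1)                       # redundancy w.r.t. S
        score[selected] = np.inf                     # never re-pick
        selected.append(int(score.argmin()))         # least-redundant candidate
    return selected
```

With a larger `lam`, spatially remote candidates receive inflated similarity scores and are deferred, producing the inside-out growth pattern the paradigm describes.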

2. Algorithmic Implementations

VLM-Pruner

VLM-Pruner realizes the centrifugal paradigm via a training-free, three-stage algorithm:

  1. Pivot Initialization: Select $\kappa$ spatially and semantically diverse initial pivots in key space using a max-min strategy.
  2. Buffered Parallel Greedy Selection: Iteratively add tokens to $S$ with high non-duplication scores and low BSS-modulated similarities to $S$, processing candidates in batches and annealing selection thresholds. Candidate similarities are recalculated with the spatial buffer, prioritizing contiguous local expansion and penalizing early selection of remote tokens.
  3. Similarity-Weighted Aggregation: After selection, for each discarded token $u$, identify the most similar retained token $j^*(u)$, cluster discarded tokens accordingly, and aggregate their hidden features back into the kept tokens using normalized similarity weights and a fusion coefficient $\beta$ (typically 0.3 for aggregation, 0.7 for preservation of the original feature).

The computational complexity per image includes $O(Nd)$ channel screening, $O(N^2 q)$ similarity matrix computation, $O(N^2)$ spatial distance computation, and $O(N|S|)$ post-selection aggregation (Wu et al., 2 Dec 2025).
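The similarity-weighted aggregation stage can be sketched as follows. The function name and argument layout here are hypothetical, and clipping the similarity weights to positive values is an assumption made to keep the toy version well-defined:

```python
import numpy as np

def aggregate_pruned(H, kept, dropped, beta=0.3):
    """Similarity-weighted aggregation of discarded tokens (toy sketch).

    Each dropped token u is assigned to its most similar kept token j*(u);
    each kept token's feature becomes a convex combination of its own feature
    (weight 1 - beta) and a similarity-weighted average of its assigned
    dropped tokens (weight beta).
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sim = Hn[dropped] @ Hn[kept].T             # (|dropped|, |kept|) cosine sims
    assign = sim.argmax(axis=1)                # j*(u): nearest kept token
    H_out = H[kept].copy()
    for k in range(len(kept)):
        cluster = np.where(assign == k)[0]     # dropped tokens mapped to kept k
        if cluster.size == 0:
            continue
        # Clip to positive values (assumption) and normalize the weights.
        w = np.clip(sim[cluster, k], 1e-8, None)
        w = w / w.sum()
        fused = w @ H[np.asarray(dropped)[cluster]]
        H_out[k] = (1 - beta) * H_out[k] + beta * fused
    return H_out
```

Setting `beta=0` recovers the kept features unchanged, matching the role of $\beta$ as a fusion coefficient.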

SPViT

SPViT applies a dynamic, attention-based, multi-head token selector as part of a soft-pruning pipeline:

  • Per-Head Scoring: For input $X \in \mathbb{R}^{N\times C}$, each head $i$ encodes local/global features, producing soft keep/prune probabilities via scoring MLPs.
  • Head-Weighted Fusion: Final keep/prune score per token is an attention-weighted average across heads.
  • Soft Pruning/Package Token: Instead of hard removal, dropped tokens' features are aggregated into a "package token" by weighted averaging, ensuring residual information is retained and available in downstream layers.
  • Latency-Aware Training: Training incorporates a latency-sparsity loss and explicit per-block latency lookup to enforce target resource constraints, following a progressive, layer-to-phase selector scheduling for optimal accuracy-speed trade-off (Kong et al., 2021).
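The per-head scoring, head-weighted fusion, and package-token steps can be rendered as a toy sketch. The linear scoring matrices below stand in for SPViT's learned MLPs and are not the paper's actual networks; the score-sorted split is likewise a simplification of the soft-pruning pipeline:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_prune(X, head_w, score_weights, keep_ratio=0.5):
    """Multi-head soft token pruning with a package token (toy sketch).

    X             : (N, C) tokens
    head_w        : (h,) head weights, assumed to sum to 1
    score_weights : list of h (C, 2) matrices standing in for the
                    per-head scoring MLPs (hypothetical stand-ins)
    """
    h = len(score_weights)
    # Per-head soft keep/prune probabilities, then head-weighted fusion.
    keep_prob = sum(head_w[i] * softmax(X @ score_weights[i], axis=1)[:, 0]
                    for i in range(h))
    n_keep = max(1, int(round(keep_ratio * X.shape[0])))
    order = np.argsort(-keep_prob)
    kept, dropped = order[:n_keep], order[n_keep:]
    # Package token: weighted average of pruned tokens, retained downstream.
    w = keep_prob[dropped]
    package = (w[:, None] * X[dropped]).sum(axis=0) / max(w.sum(), 1e-8)
    return np.vstack([X[kept], package[None]]), kept
```

The output carries `n_keep` tokens plus one package token, so downstream layers still see a summary of the pruned information.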

3. Buffering for Spatial Sparsity (BSS)

The BSS criterion serves as the central mechanism for spatial control in centrifugal pruning. By modulating hidden-state similarity with normalized spatial distance, BSS defers the selection of spatially distant tokens until local regions are sufficiently represented. Formally, for each candidate ii, the normalized spatial distance to SS is computed as

$$\bar{\delta}_i(S) = \min_{j\in S} \frac{D^{(sp)}_{ij}}{D_{\max}} \in [0,1].$$

The BSS-modulated similarity $\widetilde M_{ij}$ amplifies similarity for distant candidates, biasing the greedy selection towards spatially proximate tokens and enforcing an inside-out growth pattern. As $\lambda \to 0$, BSS reduces to classical redundancy-based selection; higher $\lambda$ enforces stronger near-to-far expansion, improving local detail retention at the expense of global diversity (Wu et al., 2 Dec 2025).
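A small numeric check of this behavior, with illustrative values only: at $\lambda = 0$ the modulated similarity equals the raw similarity, and it grows with $\lambda$ for a spatially distant candidate, deferring its selection:

```python
# Toy check of BSS modulation. All numbers are illustrative.
M_ij = 0.6                 # raw cosine similarity to the nearest kept token
D_sp, D_max = 12.0, 16.0   # grid distance to nearest kept token, max distance
delta = D_sp / D_max       # normalized spatial distance in [0, 1] -> 0.75

for lam in (0.0, 0.5, 1.0):
    M_tilde = M_ij * (1 + lam * delta)
    print(f"lambda={lam}: modulated similarity = {M_tilde:.3f}")
```

At `lam=0.0` the output equals the raw similarity 0.600; at `lam=0.5` and `lam=1.0` the distant candidate looks progressively more "redundant" (0.825 and 1.050) and is therefore selected later.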

4. Comparison with Alternative Pruning Strategies

The centrifugal paradigm contrasts with token importance-based and classical redundancy-aware pruning (e.g., top-$k$ importance or min-sum diversity). Importance-only schemes disregard inter-token redundancy, potentially retaining many spatially redundant tokens, while non-spatial diversity methods can scatter selections over the input, missing fine object details. Centrifugal buffering provides explicit coverage control, balancing diversity with local completeness.

Soft pruning frameworks such as SPViT incorporate a centrifugal progression by adaptively increasing pruning across subsequent layers or blocks. Their soft aggregation mechanisms, such as the "package token," complement the centrifugal principle by ensuring that even pruned information influences final representations, further mitigating information loss under high sparsity (Kong et al., 2021).

5. Empirical Results and Efficiency

Empirical evaluations demonstrate that centrifugal token pruning achieves high sparsity with minimal performance degradation. For VLM-Pruner, with 88.9% pruning (keeping $R = 64$ of $N = 576$ tokens), the system retains 95.61% of full-model accuracy on LLaVA-1.5-7B across 9 benchmarks, outperforming DivPrune, DART, and FastV, which achieve 93–94% retention. Computationally, VLM-Pruner achieves up to a 1.4$\times$ end-to-end speedup and reduces FLOPs by $\approx$78% at this sparsity level. Metrics for GQA, POPE, OCRBench, and OK-VQA exhibit superior or comparable performance at identical pruning rates. Similar trends are observed across larger-scale vision-language and video-oriented models (Wu et al., 2 Dec 2025).

SPViT exhibits complementary results on standard vision transformer backbones. For DeiT-T, SPViT reduces GFLOPs by 31%, mobile-CPU latency by 41%, and FPGA inference time by 36%, while incurring only a 0.1% top-1 accuracy loss on ImageNet (Kong et al., 2021).

| Model | Pruning Rate | Retained Acc. | Speedup | Backbone |
|---|---|---|---|---|
| VLM-Pruner (LLaVA) | 88.9% | 95.61% | 1.4× | LLaVA-1.5-7B |
| SPViT (DeiT-T) | ~31–43% | ≤0.5% loss | 1.4–1.6× | DeiT-T/S, Swin-T |

6. Applications and Deployment Considerations

Centrifugal token pruning is particularly suited to vision-LLMs, vision transformers, and edge deployment scenarios where computational resource constraints are acute. The approach enables real-time inference by reducing the number of tokens processed at each transformer stage without sacrificing critical spatial detail, essential for fine-grained recognition and localization tasks.

Runtime efficiencies are achieved via straightforward matrix operations without requiring unsupported operators (e.g., fast top-$k$ or complex sorting), facilitating implementation on mobile devices and FPGAs. Selector and aggregation overheads remain negligible compared to core transformer computation. Parameter configurations for image-based VLMs typically use $N = 576$ tokens (a $24\times24$ grid), $q = 256$ principal channels, buffer coefficient $\lambda = 0.5$, and fusion coefficient $\beta = 0.3$ (Wu et al., 2 Dec 2025, Kong et al., 2021).

7. Limitations, Extensions, and Future Directions

While centrifugal token pruning achieves strong results at high sparsity, coverage-vs-detail trade-offs remain. Higher buffer strengths ($\lambda$) improve local detail but may marginally lower global coverage metrics (e.g., POPE F1). Extensions integrating spatial and semantic constraints, adaptive stage-wise aggregation, or learning-based selection/fusion criteria represent plausible directions for improved flexibility. Broadening applicability to non-visual modalities where spatial locality or analogous structured relationships exist is also supported by the generalized definition of BSS-weighted similarity.

The empirical coverage gaps of alternative pruning methods under severe sparsity, as exposed in head-to-head comparisons, suggest that centrifugal selection, with explicit spatial buffering and downstream fusion, sets a new state of the art for high-efficiency transformer inference in both unimodal and multimodal domains (Wu et al., 2 Dec 2025, Kong et al., 2021).
