Buffering for Spatial Sparsity (BSS)

Updated 9 December 2025

Buffering for Spatial Sparsity (BSS) is a pruning technique that reweights token similarity using normalized spatial distances to balance redundancy reduction with spatial coverage.
It employs a centrifugal, parallel greedy selection algorithm with channel screening and selective feature fusion, dynamically adjusting thresholds for orderly token selection.
Empirical evaluations demonstrate that BSS retains over 95% accuracy at extreme sparsity while achieving significant inference speedups across various vision-language models.

Buffering for Spatial Sparsity (BSS) is a criterion introduced to address the challenge of efficiently pruning visual tokens in vision-LLMs (VLMs) while maintaining both redundancy reduction and adequate spatial coverage of semantic content. BSS operates within the VLM-Pruner framework, a training-free, centrifugal (near-to-far) token pruning algorithm that employs a parallel greedy strategy and selective feature fusion to achieve high inference speed with minimal performance degradation at extreme sparsity (Wu et al., 2 Dec 2025).

1. Mathematical Formulation of Buffering for Spatial Sparsity

Suppose an image yields a feature map of size $H \times W$ , resulting in $N = H \cdot W$ visual tokens indexed by $i \in \{0,\dots,N-1\}$ . Each token $i$ contains a $d_k$ -dimensional key vector $K_i \in \mathbb{R}^{d_k}$ and a $d$ -dimensional hidden state $H_i \in \mathbb{R}^d$ . The algorithm first projects $H$ onto its $q$ highest-variance channels (channel screening), producing $\tilde H \in \mathbb{R}^{N \times q}$ .

The pairwise cosine similarity is

$M_{ij} = \frac{\tilde H_i^\top \tilde H_j}{\|\tilde H_i\|_2 \|\tilde H_j\|_2}$

Define each token's 2D grid coordinate $p_i = (x_i, y_i)$ , with $x_i = i \bmod W$ , $y_i = \lfloor i/W\rfloor$ . The spatial distance is

$D_{ij}^{(\rm sp)} = \|p_i - p_j\|_2, \qquad D_{\max} = \sqrt{H^2 + W^2}$

Let $S \subseteq \{0,\ldots,N-1\}$ denote the active retained token set and $C$ its complement. For $i \in C$ , define the normalized nearest-neighbor spatial distance

$\delta_i(S) = \min_{j \in S} D_{ij}^{(\rm sp)},\qquad \bar\delta_i(S) = \frac{\delta_i(S)}{D_{\max}}$

BSS modulates similarity as

$\widetilde M_{ij} = M_{ij} \bigl(1+\lambda\,\bar\delta_i(S)\bigr)$

with buffering strength $\lambda \geq 0$ (authors set $\lambda=0.5$ ). This augmentation increases the apparent redundancy of candidates far from $S$ , deferring their selection. The surrogate selection score is

$r_i = 1 - \max_{j\in S} \widetilde M_{ij}$

with candidates accepted if $\max_{j\in S}\widetilde M_{ij} < \tau^{(t)}$ , with a scheduled threshold described below.

2. Algorithmic Workflow and Parallel Greedy Selection

The pruning process follows a centrifugal paradigm:

Step	Description	Hyperparameters
1	Channel screening	$q$
2	Cosine similarity $M$
3	Pivot initialization	$\kappa$ pivots; Eq. 3.6
4	Compute all $D_{ij}^{(\rm sp)}$ , $D_{\max}$
5	Set threshold/epoch	$\tau^{(0)}=0.8$ , $\Delta\tau=0.1$ , $t=0$
6	Main pruning loop	$\|S\|<R$ , buffer parameter $\lambda$

Candidates in $C$ are ranked by $r_i$ and processed in parallel batches (size $B$ ). For each batch, candidates are analyzed with up-to-date $S$ ; those meeting the threshold condition are added. If no additions occur in a round, the threshold is incremented: $\tau^{(t+1)} = \tau^{(t)} + \Delta\tau$ . The process continues until $|S|=R$ or failsafe ( $\tau > 1+\lambda$ ).

Discarded tokens are clustered by nearest pivot ( $\arg\max_{j\in S} M_{uj}$ ), and final representation aggregation uses similarity-weighted aggregation (SWA) with

$\alpha_{u\to j} = \frac{M_{u j}}{\sum_{u' \in D_j} M_{u' j} + \epsilon}, \quad E_j = \sum_{u\in D_j} \alpha_{u\to j} H_u, \quad H_j \leftarrow \beta H_j + (1-\beta)E_j$

with $\beta=0.3$ .

3. Spatial Buffering and Redundancy Modulation

The central principle of BSS is the spatial modulation of redundancy. The minimum spatial distance $\delta_i(S)$ , normalized to $[0,1]$ by $D_{\max}$ , directly controls the penalization term $1+\lambda\bar\delta_i(S)$ . Candidates spatially further from any retained token are more aggressively up-weighted in redundancy:

Early, strict $\tau$ favors tokens in local neighborhoods of current pivots (low redundancy, dense detail).
As $\tau$ increases, acceptance of more remote tokens allows progressive spatial coverage.
The buffer parameter $\lambda$ mediates the speed and strength of outward expansion: higher $\lambda$ defers far tokens more strongly.

Geometrically, the newly added tokens are preferentially those that are simultaneously low-redundancy and spatially proximal, balanced by the threshold annealing schedule. The approach enforces a principled “buffer” around $S$ , yielding orderly near-to-far token selection, as confirmed by specific qualitative examples (e.g., Fig. 3, (Wu et al., 2 Dec 2025)).

4. Trade-Offs: Redundancy Versus Spatial Coverage

Prior methods based purely on importance tend to over-select from small regions, redundantly wasting capacity. Conversely, pure diversity-based selection over-disperses, missing critical local structure.

BSS provides a calibrated compromise:

Centrifugal, near-to-far selection enables rapid local detail preservation near pivots.
Gradually reduced buffering (via increasing $\tau^{(t)}$ ) ensures global spatial coverage.
The resulting token set covers both fine object details and broader image context.

This design demonstrably mitigates the classic trade-off between redundancy reduction and spatial sparsity: local features are preserved before moving to semantically/visually distinct, distant regions.

5. Empirical Effects and Comparative Evaluation

Empirical evaluation (Wu et al., 2 Dec 2025) shows that BSS-equipped VLM-Pruner achieves high accuracy under extreme sparsity and significant improvements relative to baselines. For instance:

On LLaVA-1.5-7B at 88.9% pruning (retain 64/576 tokens), VLM-Pruner with BSS retains 95.61% of model accuracy, outperforming DivPrune (93.68%) and DART (92.71%).
On OCRBench, VLM-Pruner yields a 1.19× end-to-end inference speedup (FLOPs reduced to 22.1% of baseline) while maintaining accuracy; comparable DART speedup is 1.22× but with lower accuracy.
On Qwen2-VL-7B, equivalent sparsity yields 92.58% retained accuracy and a 1.60× speedup.
Across five VLMs and thirteen benchmarks, BSS consistently outperforms both importance-based and diversity-based approaches, with particularly strong gains at high sparsity.

6. Implementation Notes and Hyperparameters

Key hyperparameters for practical implementation include:

Buffering strength: $\lambda=0.5$
Pruning threshold schedule: $\tau^{(0)}=0.8$ , increment $\Delta\tau=0.1$
Channel screening dimension: $q$
Initial pivot count: $\kappa$
Batch size for parallelism: $B$
Similarity-weighted aggregation: $\beta=0.3$

All required operations (channel screening, nearest-neighbor search, batching, aggregation) are suitable for efficient parallelization. The process is amenable to integration into existing token-pruning pipelines for vision-LLMs.

7. Summary and Significance

Buffering for Spatial Sparsity acts as a sub-graph-modular re-weighting scheme, modifying pairwise visual token similarity with normalized spatial distance, embedded within an annealed-threshold greedy pruning framework. This approach results in orderly, centrifugal token selection that maintains both semantic and spatial coverage, addressing key shortcomings of previous redundancy- or diversity-guided pruning designs. Comprehensive evaluations affirm its effectiveness at high sparsity in multiple VLM architectures and tasks (Wu et al., 2 Dec 2025).

PDF Markdown Chat (Pro)

References (1)

VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Buffering for Spatial Sparsity (BSS).