Buffering for Spatial Sparsity (BSS)
- Buffering for Spatial Sparsity (BSS) is a pruning technique that reweights token similarity using normalized spatial distances to balance redundancy reduction with spatial coverage.
- It employs a centrifugal, parallel greedy selection algorithm with channel screening and selective feature fusion, dynamically adjusting thresholds for orderly token selection.
- Empirical evaluations demonstrate that BSS retains over 95% accuracy at extreme sparsity while achieving significant inference speedups across various vision-language models.
Buffering for Spatial Sparsity (BSS) is a criterion introduced to address the challenge of efficiently pruning visual tokens in vision-language models (VLMs) while maintaining both redundancy reduction and adequate spatial coverage of semantic content. BSS operates within the VLM-Pruner framework, a training-free, centrifugal (near-to-far) token pruning algorithm that employs a parallel greedy strategy and selective feature fusion to achieve high inference speed with minimal performance degradation at extreme sparsity (Wu et al., 2 Dec 2025).
1. Mathematical Formulation of Buffering for Spatial Sparsity
Suppose an image yields a feature map of size $H \times W$, resulting in $N = HW$ visual tokens indexed by $i \in \{1, \dots, N\}$. Each token $i$ carries a $d$-dimensional key vector $k_i$ and a $d$-dimensional hidden state $h_i$. The algorithm first projects $k_i$ onto its $d'$ highest-variance channels (channel screening), producing screened keys $\tilde{k}_i \in \mathbb{R}^{d'}$.
The pairwise cosine similarity is

$$s_{ij} = \frac{\tilde{k}_i^\top \tilde{k}_j}{\lVert \tilde{k}_i \rVert \, \lVert \tilde{k}_j \rVert}.$$
Define each token's 2D grid coordinate $(x_i, y_i)$, with $x_i$ the row index and $y_i$ the column index on the $H \times W$ grid. The spatial distance is

$$d_{ij} = \lVert (x_i, y_i) - (x_j, y_j) \rVert_2.$$
Let $\mathcal{S}$ denote the active retained token set and $\mathcal{S}^c$ its complement. For $i \in \mathcal{S}^c$, define the normalized nearest-neighbor spatial distance

$$\hat{d}_i = \frac{1}{d_{\max}} \min_{j \in \mathcal{S}} d_{ij},$$

where $d_{\max}$ is the maximal grid distance, so that $\hat{d}_i \in [0, 1]$.
BSS modulates similarity as

$$\tilde{s}_{ij} = s_{ij} + \beta \, \hat{d}_i,$$

with buffering strength $\beta > 0$ (a fixed hyperparameter set by the authors). This augmentation increases the apparent redundancy of candidates far from $\mathcal{S}$, deferring their selection. The surrogate selection score is

$$\tilde{s}_i = \max_{j \in \mathcal{S}} \tilde{s}_{ij},$$

and a candidate is accepted if $\tilde{s}_i \le \tau$, where $\tau$ is a scheduled threshold described below.
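The scoring pipeline above can be sketched in NumPy. This is a minimal illustration, not the authors' reference implementation: the function names (`channel_screen`, `bss_scores`) and the additive buffered form $s + \beta \hat{d}$ are assumptions based on the description in this section.

```python
# Sketch of BSS scoring: channel screening, cosine similarity, and the
# spatially buffered redundancy score. Names and the additive form are
# assumptions inferred from the surrounding text, not the paper's code.
import numpy as np

def channel_screen(K, d_prime):
    """Keep the d' highest-variance channels of the key matrix K (N x d)."""
    idx = np.argsort(K.var(axis=0))[-d_prime:]
    return K[:, idx]

def bss_scores(K_s, coords, retained, beta):
    """Surrogate redundancy score for every candidate token.

    K_s      : (N, d') screened keys
    coords   : (N, 2) grid coordinates (row, col)
    retained : boolean mask of the currently retained set S
    beta     : buffering strength
    """
    Kn = K_s / np.linalg.norm(K_s, axis=1, keepdims=True)
    sim = Kn @ Kn[retained].T                      # cosine sims to S
    # nearest-neighbor spatial distance to S, normalized to [0, 1]
    diff = coords[:, None, :] - coords[None, retained, :]
    d = np.linalg.norm(diff, axis=-1).min(axis=1)
    d_hat = d / np.linalg.norm(coords.max(axis=0) - coords.min(axis=0))
    # buffered similarity: far candidates look *more* redundant
    s_tilde = sim.max(axis=1) + beta * d_hat
    s_tilde[retained] = np.inf                     # already selected
    return s_tilde
```

With identical keys, the score reduces to the pure spatial buffer, so tokens nearer the retained set score lower and are accepted first, which is exactly the centrifugal ordering BSS aims for.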
2. Algorithmic Workflow and Parallel Greedy Selection
The pruning process follows a centrifugal paradigm:
| Step | Description | Hyperparameters |
|---|---|---|
| 1 | Channel screening | screened dimension $d'$ |
| 2 | Cosine similarity $s_{ij}$ | — |
| 3 | Pivot initialization | pivot count $m$; Eq. 3.6 |
| 4 | Compute all $d_{ij}$, $\hat{d}_i$ | — |
| 5 | Set threshold/epoch | initial $\tau_0$, increment $\Delta\tau$, buffer $\beta$ |
| 6 | Main pruning loop | batch size $B$, buffer parameter $\beta$ |
Candidates in $\mathcal{S}^c$ are ranked by the surrogate score $\tilde{s}_i$ and processed in parallel batches of size $B$. For each batch, candidates are evaluated against the up-to-date retained set $\mathcal{S}$; those meeting the threshold condition $\tilde{s}_i \le \tau$ are added. If no additions occur in a round, the threshold is incremented, $\tau \leftarrow \tau + \Delta\tau$. The process continues until the token budget is met or a failsafe iteration cap is reached.
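The annealed-threshold loop can be sketched as follows. This is an illustrative, self-contained approximation: the function and parameter names (`greedy_select`, `tau0`, `d_tau`, `batch`) are assumptions, and the score combines plain cosine similarity with the spatial buffer term in place of the paper's exact surrogate.

```python
# Minimal sketch of the centrifugal parallel greedy loop with threshold
# annealing. Names and default structure are assumptions based on the
# description above, not the authors' reference implementation.
import numpy as np

def greedy_select(K, coords, pivots, n_keep, beta=0.5,
                  tau0=0.3, d_tau=0.05, batch=8, max_rounds=1000):
    N = K.shape[0]
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    retained = np.zeros(N, dtype=bool)
    retained[pivots] = True
    d_max = np.linalg.norm(coords.max(0) - coords.min(0))
    tau = tau0
    for _ in range(max_rounds):                     # failsafe iteration cap
        if retained.sum() >= n_keep:
            break
        # buffered redundancy score against the current retained set
        sim = (Kn @ Kn[retained].T).max(axis=1)
        d = np.linalg.norm(coords[:, None] - coords[None, retained],
                           axis=-1).min(axis=1) / d_max
        score = np.where(retained, np.inf, sim + beta * d)
        # examine the `batch` least-redundant candidates in one round
        cand = np.argsort(score)[:batch]
        ok = cand[score[cand] <= tau]
        if ok.size == 0:
            tau += d_tau                            # anneal the threshold
        else:
            retained[ok[: n_keep - retained.sum()]] = True
    return np.flatnonzero(retained)
```

Because the score is bounded above by $1 + \beta$ while $\tau$ grows whenever a round stalls, the loop is guaranteed to fill the token budget.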
Discarded tokens are clustered by nearest pivot, and the final representation of each pivot $p$ is obtained by similarity-weighted aggregation (SWA) over its cluster $\mathcal{C}_p$:

$$h_p \leftarrow \frac{h_p + \sum_{j \in \mathcal{C}_p} s_{pj} \, h_j}{1 + \sum_{j \in \mathcal{C}_p} s_{pj}},$$

with each member weighted in proportion to its similarity to the pivot.
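A hedged sketch of the aggregation step: each discarded token is assigned to its most similar retained pivot, and each pivot's hidden state absorbs its cluster with similarity-proportional weights. The function name `swa_merge` and the exact weighting (pivot weight fixed at 1) are assumptions; the paper's precise formula may differ.

```python
# Illustrative similarity-weighted aggregation (SWA) over nearest-pivot
# clusters. An assumed generic form, not the authors' exact equation.
import numpy as np

def swa_merge(H, K, retained_idx, discarded_idx):
    """H: (N, d) hidden states, K: (N, d) keys used for similarity."""
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    sim = Kn[discarded_idx] @ Kn[retained_idx].T     # (D, R) cosine sims
    owner = sim.argmax(axis=1)                       # nearest pivot per token
    H_out = H[retained_idx].copy()
    for r in range(len(retained_idx)):
        members = np.array(discarded_idx)[owner == r]
        if members.size == 0:
            continue
        w = sim[owner == r, r]                       # member weights
        w = np.append(w, 1.0)                        # pivot's own weight
        feats = np.vstack([H[members], H_out[r][None]])
        H_out[r] = (w[:, None] * feats).sum(0) / w.sum()
    return H_out
```

When all keys are identical (similarity 1 everywhere), the merge reduces to a plain average of the pivot and its cluster members.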
3. Spatial Buffering and Redundancy Modulation
The central principle of BSS is the spatial modulation of redundancy. The minimum spatial distance $\min_{j \in \mathcal{S}} d_{ij}$, normalized to $[0, 1]$ by $d_{\max}$, directly controls the penalization term $\beta \hat{d}_i$. Candidates spatially further from any retained token are more aggressively up-weighted in redundancy:
- Early on, a strict threshold $\tau$ favors tokens in local neighborhoods of current pivots (low redundancy, dense detail).
- As $\tau$ increases, acceptance of more remote tokens allows progressive spatial coverage.
- The buffer parameter $\beta$ mediates the speed and strength of outward expansion: higher $\beta$ defers far tokens more strongly.
Geometrically, the newly added tokens are preferentially those that are simultaneously low-redundancy and spatially proximal, balanced by the threshold annealing schedule. The approach enforces a principled “buffer” around , yielding orderly near-to-far token selection, as confirmed by specific qualitative examples (e.g., Fig. 3, (Wu et al., 2 Dec 2025)).
4. Trade-Offs: Redundancy Versus Spatial Coverage
Prior methods based purely on importance tend to over-select from small regions, wasting capacity on redundant tokens. Conversely, pure diversity-based selection over-disperses and misses critical local structure.
BSS provides a calibrated compromise:
- Centrifugal, near-to-far selection enables rapid local detail preservation near pivots.
- Gradually reduced buffering (via the increasing threshold $\tau$) ensures global spatial coverage.
- The resulting token set covers both fine object details and broader image context.
This design demonstrably mitigates the classic trade-off between redundancy reduction and spatial coverage: local features are preserved before moving to semantically and visually distinct, distant regions.
5. Empirical Effects and Comparative Evaluation
Empirical evaluation (Wu et al., 2 Dec 2025) shows that BSS-equipped VLM-Pruner achieves high accuracy under extreme sparsity and significant improvements relative to baselines. For instance:
- On LLaVA-1.5-7B at 88.9% pruning (retain 64/576 tokens), VLM-Pruner with BSS retains 95.61% of model accuracy, outperforming DivPrune (93.68%) and DART (92.71%).
- On OCRBench, VLM-Pruner yields a 1.19× end-to-end inference speedup (FLOPs reduced to 22.1% of baseline) while maintaining accuracy; comparable DART speedup is 1.22× but with lower accuracy.
- On Qwen2-VL-7B, equivalent sparsity yields 92.58% retained accuracy and a 1.60× speedup.
- Across five VLMs and thirteen benchmarks, BSS consistently outperforms both importance-based and diversity-based approaches, with particularly strong gains at high sparsity.
6. Implementation Notes and Hyperparameters
Key hyperparameters for practical implementation include:
- Buffering strength $\beta$
- Pruning threshold schedule: initial value $\tau_0$ and increment $\Delta\tau$
- Channel screening dimension $d'$
- Initial pivot count $m$
- Batch size $B$ for parallel candidate processing
- Similarity-weighted aggregation over nearest-pivot clusters

The specific values used by the authors are reported in (Wu et al., 2 Dec 2025).
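For integration into a pipeline, the hyperparameters above can be collected into a single configuration object. The numeric defaults below are illustrative placeholders, not the values from the paper.

```python
# Illustrative container for the BSS hyperparameters listed above.
# All numeric defaults are placeholders, NOT the paper's settings.
from dataclasses import dataclass

@dataclass
class BSSConfig:
    beta: float = 0.5     # buffering strength (placeholder)
    tau0: float = 0.3     # initial pruning threshold (placeholder)
    d_tau: float = 0.05   # threshold increment per stalled round (placeholder)
    d_prime: int = 64     # channels kept by variance screening (placeholder)
    n_pivots: int = 4     # initial pivot count (placeholder)
    batch: int = 8        # parallel candidate batch size (placeholder)
```

A dataclass keeps the schedule parameters ($\tau_0$, $\Delta\tau$) and selection parameters ($\beta$, $B$) grouped and easy to sweep during tuning.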
All required operations (channel screening, nearest-neighbor search, batching, aggregation) are suitable for efficient parallelization. The process is amenable to integration into existing token-pruning pipelines for vision-language models.
7. Summary and Significance
Buffering for Spatial Sparsity acts as a spatially aware re-weighting scheme, modifying pairwise visual token similarity with normalized spatial distance, embedded within an annealed-threshold greedy pruning framework. This approach yields orderly, centrifugal token selection that maintains both semantic and spatial coverage, addressing key shortcomings of previous redundancy- or diversity-guided pruning designs. Comprehensive evaluations affirm its effectiveness at high sparsity across multiple VLM architectures and tasks (Wu et al., 2 Dec 2025).