SharpV: Adaptive Pruning for VideoLLMs

Updated 13 November 2025

SharpV is an adaptive framework that prunes visual tokens using spatial-temporal metrics, reducing token overhead for efficient VideoLLM inference.
The method employs self-calibration of key-value caches to mitigate feature degradation while cutting GPU memory usage and computational costs.
Empirical benchmarks show that SharpV maintains near 100% of dense accuracy with significant speed and resource improvements in video processing.

SharpV is a two-stage, information-aware visual token and memory pruning framework designed to optimize the inference efficiency and accuracy of Video LLMs (VideoLLMs). The approach strategically reduces token and memory overhead by adaptively selecting and maintaining only the most informative visual features throughout both pre-LLM and intra-LLM processing stages. SharpV distinguishes itself by operating without access to explicit attention scores, ensuring compatibility with hardware accelerators such as FlashAttention and enabling real-time, scalable deployment for long or high-resolution video inputs.

1. Quadratic Complexity in VideoLLMs and Motivation

VideoLLMs typically tokenize each input video frame into $f$ visual tokens, resulting in a total of $N = n \times f$ tokens for $n$ frames. Self-attention and cross-attention computations during both prefill and decode stages scale quadratically as $\mathcal{O}(N^2 \cdot d)$ , with $d$ denoting hidden dimensionality. Additionally, the key-value (KV) cache maintained for decoding scales similarly, making the cost prohibitive for long videos or high-resolution streams. These scaling constraints lead to excessive GPU memory consumption and computational bottlenecks (e.g., FLOPs). SharpV is developed to mitigate these challenges through adaptive, information-driven pruning while maintaining compatibility with black-box accelerators.

2. Adaptive Spatial-Temporal Token Pruning

SharpV introduces a spatial-temporal adaptive visual token pruning strategy prior to any LLM operation ("Visual SharpV"):

Each token $v_{i,t} \in \mathbb{R}^d$ (token $i$ in frame $t$ ) is scored using spatial and temporal importance metrics.
Spatial Importance: Measures token uniqueness within frame $t$ ,

$\mathcal{R}_s(i,t) = \lVert \hat{v}_{i,t} - \hat{\mu}_t \rVert_2$

where $\mu_t = \frac{1}{f} \sum_{j=1}^f v_{j,t}$ and $\hat{x} = x / \lVert x \rVert_2$ .

Temporal Importance: Measures the change from the previous frame,

$\mathcal{R}_t(i,t) = \lVert \hat{v}_{i,t} - \hat{v}_{i,t-1} \rVert_2, \quad \mathcal{R}_t(i,1) = 0$

Combined Importance Score:

$S(i,t) = \mathcal{R}_t(i,t) + w \cdot \mathcal{R}_s(i,t)$

with $w$ as a scalar modulator.

A dynamic, frame-adaptive pruning ratio $\alpha(t) \in (0,1)$ is computed via a linear “information volume” predictor,

$\alpha(t) = \sigma(W_1 f_t + b_1)$

where $f_t$ summarizes frame variation, and $\sigma(\cdot)$ is the sigmoid function. Frames with high spatial-temporal variation receive a higher pruning threshold, preserving more tokens. The top $K_t = \lceil \alpha(t) \cdot f \rceil$ tokens are retained per frame using hard or soft gating, ensuring computational complexity of $\mathcal{O}(f \cdot d)$ .

3. Key-Value (KV) Cache Pruning via Self-Calibration

After token pruning, VideoLLMs accumulate KV caches across layers $l = 1,\ldots,L$ . SharpV monitors the degradation of visual token representations by measuring cosine similarity between each layer’s visual tokens $V_l$ and their original embeddings $V_0$ : $\mathrm{sim}_l = \frac{V_l \cdot V_0}{\lVert V_l \rVert_2 \lVert V_0 \rVert_2}$ Empirical evidence indicates $\mathrm{sim}_l$ decreases from near $1.0$ in shallow layers to $\approx 0.2$ in deep layers. When similarity falls below a designated threshold $M \in [0,1]$ , representations are considered degraded.

From the perspective of the Information Bottleneck principle (minimizing $I(X;\hat{Z}) - \beta I(\hat{Z};Y)$ ), dropping or compressing degraded KV caches approximates optimal information flow: discarding information no longer relevant for generating accurate outputs while reducing memory consumption.

Rather than outright dropping caches, SharpV can “self-calibrate”,

$V'_l = \mathrm{sim}_l V_0 + (1 - \mathrm{sim}_l) V_l$

which blends deep layer features back toward their original embedding, mitigating stale drift and maintaining coherent context for downstream processing.

4. Implementation Pipeline and Compatibility

SharpV is implemented through a two-stage pipeline:

Stage	Operation	Complexity
Visual SharpV (pre-LLM)	Spatio-temporal adaptive token pruning	$\mathcal{O}(nfd)$
Memory SharpV (intra-LLM)	Layer-wise KV cache pruning/self-calib.	$\mathcal{O}(K_td)$

Visual SharpV is applied before LLM processing to reduce the input token set from $n \times f$ to $\sum_t K_t \ll n \times f$ .
During LLM inference (prefill/decode), Memory SharpV operates at each Transformer layer, computing $\mathrm{sim}_l$ and applying pruning/self-calibration in linear time.
No re-training or architecture modification of the LLM is required; the method only needs access to visual embeddings and lightweight linear layers.
The framework’s independence from explicit attention scores enables seamless compatibility with hardware accelerators such as FlashAttention, which constitute linear-time black-box kernels.

5. Empirical Performance and Benchmarking

SharpV has been integrated into representative VideoLLMs (PLLaVA-7B, LLaVA-OneVision-7B) and evaluated across four public video-language benchmarks: MVBench, VideoMME, NExTQA, and ActivityNet-QA. Key outcomes include:

Token Budget vs. Accuracy:
- Dense baseline (100% of tokens): 56–62% accuracy.
- SharpV (adaptive, ~12% token budget): maintains between 99–101% of dense accuracy, occasionally surpassing by 1–2% due to noise pruning.
Efficiency Gains:
- Memory usage reduced by 30–40% (e.g., 18GB $\to$ 9GB).
- Time-To-First-Token improved by $\sim$ 1.6 $\times$ ; Time-Per-Output-Token similarly.
- FLOPs reduced in proportion to token budget (3.4 TFlops $\to$ $\approx$ 1.1 TFlops).
Comparative Analysis:
- Uniform pruning (e.g., DyCoke at fixed 15%): results in 1–2% accuracy loss.
- Graph/clustering baselines (PruneMerge, DivPrune): introduce extra overhead with no improvement over SharpV’s tradeoff.

6. Conceptual Contributions and Theoretical Implications

SharpV demonstrates that adaptive, information-aware token and cache pruning results in strictly superior efficiency–accuracy trade-offs relative to uniform or clustering-based compression. The methodology enables hierarchical cache pruning framed as an information bottleneck process, facilitating a new understanding of information flow in generative video models; shallow layers effect rapid encoding, while deeper layers function as compression stages, analogous to memory curves in biological cognition.

This suggests broader applicability of information-theoretical perspectives to LLM architecture design and post-processing, and a plausible implication is that self-calibrated token and memory management may yield further robustness or interpretability.

7. Prospects for Enhancement and Future Work

Several avenues for extension are identified:

Replacing the linear per-frame predictor $\alpha(t)$ with a learnable variant for more expressive adaptation.
Finer-grained pruning at the token level (patches, pixels) to capture richer spatial variation.
Integrating the bottleneck objective into end-to-end model training for optimal representation selection.

SharpV establishes a minimal, training-free, attention-score-independent baseline for VideoLLM efficiency enhancement, achieving up to 8 $\times$ reduction in token utilization without loss—and sometimes a gain—in downstream reasoning performance.

PDF Markdown Chat (Pro)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to SharpV.