SharpV: Adaptive Pruning for VideoLLMs

Updated 13 November 2025
  • SharpV is an adaptive framework that prunes visual tokens using spatial-temporal metrics, reducing token overhead for efficient VideoLLM inference.
  • The method employs self-calibration of key-value caches to mitigate feature degradation while cutting GPU memory usage and computational costs.
  • Empirical benchmarks show that SharpV maintains near 100% of dense accuracy with significant speed and resource improvements in video processing.

SharpV is a two-stage, information-aware visual token and memory pruning framework designed to improve the inference efficiency and accuracy of Video LLMs (VideoLLMs). The approach reduces token and memory overhead by adaptively selecting and retaining only the most informative visual features throughout both the pre-LLM and intra-LLM processing stages. SharpV distinguishes itself by operating without access to explicit attention scores, ensuring compatibility with optimized attention kernels such as FlashAttention and enabling real-time, scalable deployment for long or high-resolution video inputs.

1. Quadratic Complexity in VideoLLMs and Motivation

VideoLLMs typically tokenize each input video frame into $f$ visual tokens, resulting in a total of $N = n \times f$ tokens for $n$ frames. Self-attention and cross-attention computations during both prefill and decode stages scale quadratically as $\mathcal{O}(N^2 \cdot d)$, with $d$ denoting the hidden dimensionality. Additionally, the key-value (KV) cache maintained for decoding grows with $N$ at every layer, making the cost prohibitive for long videos or high-resolution streams. These scaling constraints lead to excessive GPU memory consumption and computational bottlenecks (e.g., FLOPs). SharpV is developed to mitigate these challenges through adaptive, information-driven pruning while maintaining compatibility with black-box accelerators.
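To make the scaling concrete, the short calculation below plugs in hypothetical but representative values for $n$, $f$, and $d$; the numbers are illustrative assumptions, not figures from the paper.

```python
# Illustrative scaling check with hypothetical values (not taken from the paper).
n, f, d = 64, 196, 4096            # frames, visual tokens per frame, hidden size

N = n * f                          # total visual tokens entering the LLM
attn_ops = N ** 2 * d              # O(N^2 * d) attention cost, up to constant factors
kv_floats = 2 * N * d              # keys + values cached per layer

print(f"N = {N:,} visual tokens")                                     # 12,544
print(f"attention cost ~ {attn_ops:.2e} ops per layer")               # ~6.4e11
print(f"KV cache ~ {kv_floats * 2 / 1e6:.0f} MB per layer at fp16")   # ~206 MB
```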

2. Adaptive Spatial-Temporal Token Pruning

SharpV introduces a spatial-temporal adaptive visual token pruning strategy prior to any LLM operation ("Visual SharpV"):

  • Each token $v_{i,t} \in \mathbb{R}^d$ (token $i$ in frame $t$) is scored using spatial and temporal importance metrics.
  • Spatial Importance: Measures token uniqueness within frame $t$,

$$\mathcal{R}_s(i,t) = \lVert \hat{v}_{i,t} - \hat{\mu}_t \rVert_2$$

where $\mu_t = \frac{1}{f} \sum_{j=1}^{f} v_{j,t}$ and $\hat{x} = x / \lVert x \rVert_2$.

  • Temporal Importance: Measures the change from the previous frame,

$$\mathcal{R}_t(i,t) = \lVert \hat{v}_{i,t} - \hat{v}_{i,t-1} \rVert_2, \quad \mathcal{R}_t(i,1) = 0$$

  • Combined Importance Score:

$$S(i,t) = \mathcal{R}_t(i,t) + w \cdot \mathcal{R}_s(i,t)$$

with $w$ as a scalar modulator.

A dynamic, frame-adaptive pruning ratio $\alpha(t) \in (0,1)$ is computed via a linear “information volume” predictor,

$$\alpha(t) = \sigma(W_1 f_t + b_1)$$

where $f_t$ summarizes frame variation and $\sigma(\cdot)$ is the sigmoid function. Frames with high spatial-temporal variation receive a larger $\alpha(t)$, preserving more tokens. The top $K_t = \lceil \alpha(t) \cdot f \rceil$ tokens are retained per frame using hard or soft gating, keeping this stage at $\mathcal{O}(f \cdot d)$ cost per frame, as sketched below.
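A minimal PyTorch-style sketch of this pre-LLM stage follows. It mirrors the formulas above, but the tensor layout, the choice of the frame-variation summary $f_t$ (here the mean temporal score of a frame), and the predictor weights are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def visual_sharpv(tokens, w=1.0, W1=None, b1=None):
    """Adaptive spatial-temporal token pruning (illustrative sketch).

    tokens: (n, f, d) visual tokens for n frames with f tokens each.
    Returns a list of per-frame tensors containing the top-K_t kept tokens.
    """
    n, f, d = tokens.shape
    v_hat = F.normalize(tokens, dim=-1)                   # \hat{v}_{i,t}

    # Spatial importance: R_s(i,t) = || v_hat_{i,t} - mu_hat_t ||_2,
    # where mu_t is the mean token of frame t.
    mu_hat = F.normalize(tokens.mean(dim=1, keepdim=True), dim=-1)
    R_s = (v_hat - mu_hat).norm(dim=-1)                   # (n, f)

    # Temporal importance: R_t(i,t) = || v_hat_{i,t} - v_hat_{i,t-1} ||_2, with R_t(i,1) = 0.
    R_t = torch.zeros(n, f, device=tokens.device, dtype=tokens.dtype)
    R_t[1:] = (v_hat[1:] - v_hat[:-1]).norm(dim=-1)

    # Combined score S(i,t) = R_t + w * R_s.
    S = R_t + w * R_s

    # Frame-level "information volume" feature f_t: here the mean temporal score of
    # the frame (an assumption; the text only says f_t summarizes frame variation),
    # fed to the linear predictor alpha(t) = sigmoid(W1 f_t + b1).
    f_t = R_t.mean(dim=1, keepdim=True)                   # (n, 1)
    W1 = W1 if W1 is not None else torch.ones(1, 1, device=tokens.device, dtype=tokens.dtype)
    b1 = b1 if b1 is not None else torch.zeros(1, device=tokens.device, dtype=tokens.dtype)
    alpha = torch.sigmoid(f_t @ W1.T + b1).squeeze(-1)    # alpha(t) in (0, 1)

    kept = []
    for t in range(n):
        K_t = max(1, int(torch.ceil(alpha[t] * f).item()))   # K_t = ceil(alpha(t) * f)
        idx = S[t].topk(K_t).indices                          # hard top-K gating
        kept.append(tokens[t, idx])
    return kept
```

Hard top-$K$ gating is used here for simplicity; a soft-gating variant would instead re-weight tokens by their scores rather than discarding them.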

3. Key-Value (KV) Cache Pruning via Self-Calibration

After token pruning, VideoLLMs accumulate KV caches across layers $l = 1, \ldots, L$. SharpV monitors the degradation of visual token representations by measuring the cosine similarity between each layer’s visual tokens $V_l$ and their original embeddings $V_0$:

$$\mathrm{sim}_l = \frac{V_l \cdot V_0}{\lVert V_l \rVert_2 \, \lVert V_0 \rVert_2}$$

Empirical evidence indicates that $\mathrm{sim}_l$ decreases from near $1.0$ in shallow layers to $\approx 0.2$ in deep layers. When the similarity falls below a designated threshold $M \in [0,1]$, representations are considered degraded.

From the perspective of the Information Bottleneck principle (minimizing $I(X;\hat{Z}) - \beta\, I(\hat{Z};Y)$), dropping or compressing degraded KV caches approximates optimal information flow: discarding information no longer relevant for generating accurate outputs while reducing memory consumption.
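In this reading (the variable correspondence is an interpretation, not stated explicitly above), $X$ denotes the original visual inputs, $\hat{Z}$ the retained or self-calibrated KV-cache entries, and $Y$ the generated answer, so the cache-pruning step approximately solves

$$\min_{p(\hat{z} \mid x)} \; I(X;\hat{Z}) - \beta\, I(\hat{Z};Y),$$

where discarding degraded caches lowers $I(X;\hat{Z})$ (compression) and the threshold $M$ is chosen so that $I(\hat{Z};Y)$ (task relevance) is largely preserved.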

Rather than outright dropping caches, SharpV can “self-calibrate”,

$$V'_l = \mathrm{sim}_l \, V_0 + (1 - \mathrm{sim}_l) \, V_l$$

which blends deep-layer features back toward their original embeddings, mitigating representational drift and maintaining coherent context for downstream processing.
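A sketch of the layer-wise check and calibration appears below, under two assumptions the summary leaves open: that $\mathrm{sim}_l$ is reduced to a scalar by averaging token-wise cosine similarities, and that the threshold $M$ is a user-chosen constant.

```python
import torch.nn.functional as F

def memory_sharpv(V_l, V_0, threshold=0.5, calibrate=True):
    """Layer-wise degradation check with optional self-calibration (sketch).

    V_l: (N, d) visual token representations at layer l.
    V_0: (N, d) original visual embeddings before the LLM.
    threshold: degradation threshold M in [0, 1]; the value 0.5 is an assumption.
    """
    # sim_l: token-wise cosine similarity between V_l and V_0, averaged to a scalar.
    sim_l = F.cosine_similarity(V_l, V_0, dim=-1).mean()

    if sim_l >= threshold:
        return V_l, sim_l           # representations still healthy: keep as-is

    if calibrate:
        # Self-calibration: V'_l = sim_l * V_0 + (1 - sim_l) * V_l,
        # pulling degraded features back toward the original embeddings.
        return sim_l * V_0 + (1 - sim_l) * V_l, sim_l

    # Otherwise signal that the degraded visual entries can be dropped from the
    # KV cache (returning None is a convention of this sketch, not of SharpV).
    return None, sim_l
```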

4. Implementation Pipeline and Compatibility

SharpV is implemented through a two-stage pipeline:

Stage | Operation | Complexity
--- | --- | ---
Visual SharpV (pre-LLM) | Spatial-temporal adaptive token pruning | $\mathcal{O}(nfd)$
Memory SharpV (intra-LLM) | Layer-wise KV cache pruning / self-calibration | $\mathcal{O}(K_t d)$

  • Visual SharpV is applied before LLM processing to reduce the input token set from $n \times f$ to $\sum_t K_t \ll n \times f$.
  • During LLM inference (prefill/decode), Memory SharpV operates at each Transformer layer, computing $\mathrm{sim}_l$ and applying pruning or self-calibration in linear time.
  • No re-training or architecture modification of the LLM is required; the method only needs access to visual embeddings and lightweight linear layers.
  • The framework’s independence from explicit attention scores enables seamless compatibility with fused attention kernels such as FlashAttention, which act as black boxes and never materialize the attention-score matrix (a sketch composing the two stages follows this list).
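The sketch below strings the two stages together around an off-the-shelf VideoLLM. `visual_sharpv` and `memory_sharpv` refer to the illustrative helpers sketched in the earlier sections, and the model interface (a `generate` call accepting a `visual_hook` argument) is a hypothetical stand-in rather than any real library's API.

```python
import torch

def sharpv_pipeline(frame_tokens, model, threshold=0.5):
    """Two-stage SharpV pipeline (sketch; the model interface is hypothetical).

    frame_tokens: (n, f, d) visual tokens from the vision encoder / projector.
    model: a VideoLLM assumed to expose a per-layer hook over the visual slice
           of its hidden states -- this API is an assumption of the sketch.
    """
    # Stage 1 (pre-LLM): adaptive spatial-temporal token pruning.
    kept = visual_sharpv(frame_tokens)          # list of (K_t, d) tensors
    visual_input = torch.cat(kept, dim=0)       # sum_t K_t << n * f tokens
    V_0 = visual_input.detach().clone()         # reference embeddings for sim_l

    # Stage 2 (intra-LLM): per-layer degradation check and self-calibration.
    def layer_hook(V_l):
        V_cal, _ = memory_sharpv(V_l, V_0, threshold=threshold)
        return V_cal if V_cal is not None else V_l   # never drop in this sketch

    return model.generate(visual_input, visual_hook=layer_hook)
```

No retraining is involved: the hook only rewrites the visual slice of the hidden states, matching the training-free, architecture-agnostic framing above.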

5. Empirical Performance and Benchmarking

SharpV has been integrated into representative VideoLLMs (PLLaVA-7B, LLaVA-OneVision-7B) and evaluated across four public video-language benchmarks: MVBench, VideoMME, NExTQA, and ActivityNet-QA. Key outcomes include:

  • Token Budget vs. Accuracy:
    • Dense baseline (100% of tokens): 56–62% accuracy.
    • SharpV (adaptive, ~12% token budget): maintains 99–101% of dense accuracy, occasionally surpassing the dense baseline by 1–2% as noisy tokens are pruned.
  • Efficiency Gains:
    • Memory usage reduced by 30–40% (e.g., 18 GB → 9 GB).
    • Time-To-First-Token improved by ~1.6×, with Time-Per-Output-Token improving similarly.
    • FLOPs reduced in proportion to the token budget (3.4 TFLOPs → ≈1.1 TFLOPs).
  • Comparative Analysis:
    • Uniform pruning (e.g., DyCoke at a fixed 15% budget) incurs a 1–2% accuracy loss.
    • Graph/clustering baselines (PruneMerge, DivPrune) introduce extra overhead without improving on SharpV’s trade-off.

6. Conceptual Contributions and Theoretical Implications

SharpV demonstrates that adaptive, information-aware token and cache pruning yields strictly superior efficiency–accuracy trade-offs relative to uniform or clustering-based compression. Framing hierarchical cache pruning as an information-bottleneck process also offers a new view of information flow in generative video models: shallow layers perform rapid encoding, while deeper layers act as compression stages, analogous to memory curves in biological cognition.

This suggests broader applicability of information-theoretic perspectives to LLM architecture design and post-processing, and a plausible implication is that self-calibrated token and memory management may yield further gains in robustness or interpretability.

7. Prospects for Enhancement and Future Work

Several avenues for extension are identified:

  • Replacing the linear per-frame predictor $\alpha(t)$ with a learnable variant for more expressive adaptation.
  • Finer-grained pruning at the token level (patches, pixels) to capture richer spatial variation.
  • Integrating the bottleneck objective into end-to-end model training for optimal representation selection.

SharpV establishes a minimal, training-free, attention-score-independent baseline for VideoLLM efficiency enhancement, achieving up to an 8× reduction in token utilization without loss, and sometimes with a gain, in downstream reasoning performance.
