SharpV: Adaptive Pruning for VideoLLMs
- SharpV is an adaptive framework that prunes visual tokens using spatial-temporal metrics, reducing token overhead for efficient VideoLLM inference.
- The method employs self-calibration of key-value caches to mitigate feature degradation while cutting GPU memory usage and computational costs.
- Empirical benchmarks show that SharpV maintains near 100% of dense accuracy with significant speed and resource improvements in video processing.
SharpV is a two-stage, information-aware visual token and memory pruning framework designed to improve the inference efficiency of video large language models (VideoLLMs) while preserving accuracy. The approach reduces token and memory overhead by adaptively selecting and retaining only the most informative visual features throughout both pre-LLM and intra-LLM processing stages. SharpV distinguishes itself by operating without access to explicit attention scores, ensuring compatibility with hardware accelerators such as FlashAttention and enabling real-time, scalable deployment for long or high-resolution video inputs.
1. Quadratic Complexity in VideoLLMs and Motivation
VideoLLMs typically tokenize each input video frame into $N$ visual tokens, resulting in a total of $T = N \cdot F$ tokens for $F$ frames. Self-attention and cross-attention computations during both prefill and decode stages scale quadratically as $O(T^2 d)$, with $d$ denoting the hidden dimensionality. Additionally, the key-value (KV) cache maintained for decoding grows with $T$ at every layer, making the cost prohibitive for long videos or high-resolution streams. These scaling constraints lead to excessive GPU memory consumption and computational bottlenecks (e.g., FLOPs). SharpV is developed to mitigate these challenges through adaptive, information-driven pruning while maintaining compatibility with black-box accelerators.
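To make the scaling concrete, the short sketch below estimates how attention cost and KV cache size grow with frame count. The token count per frame and hidden size are hypothetical illustrative values, not figures reported for any specific model.

```python
# Back-of-the-envelope scaling for VideoLLM attention cost.
# N and d are illustrative assumptions, not values from the paper.

N = 196        # visual tokens per frame (e.g., a 14x14 patch grid)
d = 4096       # hidden dimensionality of the LLM

for F in (8, 32, 128):                    # number of sampled frames
    T = N * F                             # total visual tokens
    attn_flops = 2 * T * T * d            # ~O(T^2 d) for one attention layer
    kv_cache_mb = 2 * T * d * 2 / 2**20   # K and V in fp16 (2 bytes), one layer
    print(f"F={F:4d}  T={T:6d}  attn ~{attn_flops/1e12:.2f} TFLOPs/layer  "
          f"KV ~{kv_cache_mb:.0f} MB/layer")
```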
2. Adaptive Spatial-Temporal Token Pruning
SharpV introduces a spatial-temporal adaptive visual token pruning strategy prior to any LLM operation ("Visual SharpV"):
- Each token $v_{i,j}$ (token $j$ in frame $i$) is scored using spatial and temporal importance metrics.
- Spatial Importance: measures token uniqueness within frame $i$,
  $$s_{i,j} = 1 - \frac{1}{N-1}\sum_{k \neq j} \cos\big(v_{i,j},\, v_{i,k}\big),$$
  where $\cos(\cdot,\cdot)$ denotes cosine similarity and $N$ is the number of visual tokens per frame.
- Temporal Importance: measures the change from the corresponding token in the previous frame,
  $$t_{i,j} = 1 - \cos\big(v_{i,j},\, v_{i-1,j}\big).$$
- Combined Importance Score:
  $$I_{i,j} = s_{i,j} + \lambda\, t_{i,j},$$
  with $\lambda$ as a scalar modulator.

A dynamic, frame-adaptive pruning ratio is computed via a linear “information volume” predictor,
$$r_i = \sigma\big(w\, \delta_i + b\big),$$
where $\delta_i$ summarizes frame variation and $\sigma$ is the sigmoid function. Frames with high spatial-temporal variation receive a higher retention ratio, preserving more tokens. The top $\lceil r_i N \rceil$ tokens are retained per frame using hard or soft gating, keeping the pruning step linear in the total number of tokens, i.e., $O(NF)$ for $F$ frames. A minimal implementation sketch follows below.
3. Key-Value (KV) Cache Pruning via Self-Calibration
After token pruning, VideoLLMs accumulate KV caches across layers $\ell = 1, \dots, L$. SharpV monitors the degradation of visual token representations by measuring the cosine similarity $\mathrm{sim}_\ell = \cos\big(h_\ell^{v}, h_0^{v}\big)$ between each layer’s visual tokens $h_\ell^{v}$ and their original embeddings $h_0^{v}$. Empirical evidence indicates that $\mathrm{sim}_\ell$ decreases from near $1.0$ in shallow layers to markedly lower values in deep layers. When the similarity falls below a designated threshold $\tau$, the representations are considered degraded.
From the perspective of the Information Bottleneck principle (minimizing $I(X;Z) - \beta\, I(Z;Y)$), dropping or compressing degraded KV caches approximates optimal information flow: discarding information no longer relevant for generating accurate outputs while reducing memory consumption.
Rather than outright dropping caches, SharpV can “self-calibrate” the degraded representations, e.g., via a convex blend
$$\tilde{h}_\ell = \alpha\, h_\ell + (1 - \alpha)\, h_0,$$
which blends deep-layer features back toward their original embedding, mitigating stale drift and maintaining coherent context for downstream processing.
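A minimal sketch of the layer-wise degradation check and self-calibration step, assuming cosine similarity against the layer-0 visual embeddings and a simple convex blend; the threshold `tau` and blend weight `alpha` are illustrative hyperparameters, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_calibrate(hidden_visual, original_visual, tau=0.5, alpha=0.5):
    """Blend degraded visual hidden states back toward their original embeddings.

    hidden_visual:   (T, d) visual token hidden states at the current layer.
    original_visual: (T, d) the same tokens' embeddings before the LLM (layer 0).
    Returns the (partially) calibrated hidden states.
    """
    sim = F.cosine_similarity(hidden_visual, original_visual, dim=-1)  # (T,)
    degraded = sim < tau                                               # boolean degradation mask
    calibrated = hidden_visual.clone()
    calibrated[degraded] = (
        alpha * hidden_visual[degraded] + (1.0 - alpha) * original_visual[degraded]
    )
    return calibrated
```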
4. Implementation Pipeline and Compatibility
SharpV is implemented through a two-stage pipeline:
| Stage | Operation | Complexity |
|---|---|---|
| Visual SharpV (pre-LLM) | Spatial-temporal adaptive token pruning | Linear in the number of visual tokens, $O(NF)$ |
| Memory SharpV (intra-LLM) | Layer-wise KV cache pruning / self-calibration | Linear per layer in the retained visual tokens |
- Visual SharpV is applied before LLM processing to reduce the input token set from $N \cdot F$ tokens to a much smaller retained subset $T' \ll N \cdot F$.
- During LLM inference (prefill/decode), Memory SharpV operates at each Transformer layer, computing $\mathrm{sim}_\ell$ and applying pruning/self-calibration in linear time.
- No re-training or architecture modification of the LLM is required; the method only needs access to visual embeddings and lightweight linear layers.
- The framework’s independence from explicit attention scores enables seamless compatibility with hardware accelerators such as FlashAttention, which operate as fused, black-box kernels that never expose the attention matrix; a schematic pipeline sketch follows this list.
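As an illustration of how the two stages compose, the sketch below wires together the illustrative `prune_frame_tokens` and `self_calibrate` helpers from the earlier snippets. The model interface (`encode_frames`, `embed`, `layers`, `decode`) is hypothetical and stands in for whatever VideoLLM is being wrapped.

```python
import torch

def sharpv_inference(model, frames, prompt, tau=0.5, alpha=0.5):
    """Two-stage SharpV-style pipeline sketch (hypothetical model interface)."""
    # Stage 1: Visual SharpV -- prune visual tokens before they reach the LLM.
    visual_tokens = model.encode_frames(frames)      # (F, N, d); hypothetical encoder call
    kept = prune_frame_tokens(visual_tokens)         # list of per-frame (k_i, d) tensors
    visual_seq = torch.cat(kept, dim=0)              # (T', d) with T' << N * F

    # Stage 2: Memory SharpV -- layer-wise degradation check + self-calibration.
    hidden = model.embed(prompt, visual_seq)         # hypothetical: text + visual input sequence
    n_vis = visual_seq.shape[0]
    original_visual = hidden[:n_vis].clone()         # layer-0 visual embeddings (assumed layout)
    for layer in model.layers:                       # assumed per-layer forward access
        hidden = layer(hidden)
        hidden[:n_vis] = self_calibrate(hidden[:n_vis], original_visual, tau, alpha)
    return model.decode(hidden)                      # hypothetical decoding step
```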
5. Empirical Performance and Benchmarking
SharpV has been integrated into representative VideoLLMs (PLLaVA-7B, LLaVA-OneVision-7B) and evaluated across four public video-language benchmarks: MVBench, VideoMME, NExTQA, and ActivityNet-QA. Key outcomes include:
- Token Budget vs. Accuracy:
- Dense baseline (100% of tokens): 56–62% accuracy.
- SharpV (adaptive, ~12% token budget): maintains 99–101% of dense accuracy, occasionally surpassing the dense baseline by 1–2% because noisy, uninformative tokens are pruned away.
- Efficiency Gains:
- Memory usage reduced by 30–40% (e.g., 18 GB → 9 GB).
- Time-To-First-Token improved by roughly 1.6×; Time-Per-Output-Token improves by a similar factor.
- FLOPs reduced in proportion to the token budget (3.4 TFLOPs → 1.1 TFLOPs).
- Comparative Analysis:
- Uniform pruning (e.g., DyCoke at a fixed 15% token budget): incurs a 1–2% accuracy loss.
- Graph/clustering baselines (PruneMerge, DivPrune): introduce additional overhead without improving on SharpV’s efficiency–accuracy tradeoff.
6. Conceptual Contributions and Theoretical Implications
SharpV demonstrates that adaptive, information-aware token and cache pruning yields strictly superior efficiency–accuracy trade-offs relative to uniform or clustering-based compression. The methodology frames hierarchical cache pruning as an information bottleneck process, offering a new view of information flow in generative video models: shallow layers perform rapid encoding, while deeper layers function as compression stages, analogous to memory curves in biological cognition.
This suggests broader applicability of information-theoretic perspectives to LLM architecture design and post-processing, and a plausible implication is that self-calibrated token and memory management may yield further gains in robustness or interpretability.
7. Prospects for Enhancement and Future Work
Several avenues for extension are identified:
- Replacing the linear per-frame predictor with a learnable variant for more expressive adaptation.
- Finer-grained pruning at the token level (patches, pixels) to capture richer spatial variation.
- Integrating the bottleneck objective into end-to-end model training for optimal representation selection.
SharpV establishes a minimal, training-free, attention-score-independent baseline for VideoLLM efficiency enhancement, achieving up to an 8× reduction in token utilization without loss, and sometimes with a gain, in downstream reasoning performance.