Keyframe-oriented Vision Token Pruning (KVTP)
- The paper introduces KVTP, a framework that unifies query-guided soft keyframe selection with per-frame token pruning to reduce redundant vision tokens efficiently.
- KVTP achieves up to 80% reduction in token usage and 64% lower FLOPs while maintaining video QA accuracy within 1% of full-token baselines.
- KVTP offers plug-and-play integration with various pruning methods and extends to embodied reasoning (e.g., EgoPrune), demonstrating practical real-time efficiency.
Keyframe-oriented Vision Token Pruning (KVTP) is a framework designed to improve the efficiency of large vision-language models (VLMs) on long-form video by adaptively pruning redundant vision tokens along both spatial and temporal dimensions. Its central contribution is the unification of query-guided, soft keyframe selection with per-frame token pruning, enabling significant reductions in computation and memory while preserving the spatiotemporal and contextual information needed for tasks such as video question answering (QA) and egomotion-based reasoning (Liu et al., 13 Mar 2025, Li et al., 21 Jul 2025).
1. Motivation and Key Principles
Long-form video input presents two dominant forms of redundancy for VLMs:
- Spatial Redundancy: Within each frame, many visual tokens provide little incremental information with respect to the given query.
- Temporal Redundancy: Across the video sequence, only a small subset of frames are typically relevant to answering the query.
Previous approaches addressed these issues through either uniform, query-agnostic token pruning—which risks severing spatiotemporal coherence—or by selecting a discrete set of hard keyframes, which can disrupt essential context by discarding intermediate frames. KVTP instead employs a soft, query-conditioned mechanism, assigning a relevance-informed pruning rate to each frame and keeping a variable number of tokens per frame. This preserves temporal and logical cues, maintains high downstream accuracy, and yields substantial FLOPs and memory reductions (Liu et al., 13 Mar 2025).
2. Mathematical Foundations
Frame-wise Relevance and Pruning Rate Assignment
Let $\{f_1, \dots, f_T\}$ denote the video frames, $q$ the input query, and $E_v$, $E_t$ pre-trained encoders (e.g., SigLIP) mapping images and text into $d$-dimensional embeddings. KVTP first computes embeddings $v_i = E_v(f_i)$ and $t = E_t(q)$.
For each frame, the raw relevance logit is
$$s_i = a \cdot \cos(v_i, t) + b,$$
where $a$ and $b$ are learned scalars. A temperature-scaled softmax over the logits yields per-frame relevance weights $w_i = \operatorname{softmax}(s_i / \tau)$ with $\sum_{i=1}^{T} w_i = 1$.
To distribute pruning adaptively while meeting the overall pruning target $p$ (a user-specified average), each frame is assigned a pruning rate $p_i$ that decreases with its relevance weight $w_i$, subject to $\frac{1}{T}\sum_{i=1}^{T} p_i = p$, resulting in the retention of $k_i = (1 - p_i)\,N$ tokens in frame $i$, where $N$ is the initial token count per frame.
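A minimal PyTorch sketch of this assignment, assuming the scaled-cosine logit above and a simple proportional rule (capped at full retention) for turning relevance weights into per-frame quotas; the exact budget-allocation rule is an illustrative assumption, not the paper's formula:

```python
import torch
import torch.nn.functional as F

def assign_token_quotas(frame_emb, query_emb, a, b, tau, avg_keep_ratio, tokens_per_frame):
    """Map query-conditioned frame relevance to per-frame token quotas.

    frame_emb: (T, d) frame embeddings; query_emb: (d,) text embedding.
    a, b: learned scale/bias on the cosine logits (SigLIP-style).
    avg_keep_ratio: user-specified average fraction of tokens to keep (1 - p).
    """
    sims = F.cosine_similarity(frame_emb, query_emb.unsqueeze(0), dim=-1)   # (T,)
    logits = a * sims + b                                                   # raw relevance logits s_i
    weights = torch.softmax(logits / tau, dim=0)                            # soft relevance w_i, sums to 1
    T = frame_emb.shape[0]
    # Distribute the global budget in proportion to relevance, capped at 100% retention.
    # keep_ratio corresponds to 1 - p_i in the text; the cap means the realized average
    # can fall slightly below the target, which a real implementation would redistribute.
    keep_ratio = torch.clamp(T * avg_keep_ratio * weights, max=1.0)
    quotas = (keep_ratio * tokens_per_frame).round().long()                 # k_i tokens kept in frame i
    return quotas
```

For example, `assign_token_quotas(v, t, a=10.0, b=-5.0, tau=0.5, avg_keep_ratio=0.2, tokens_per_frame=196)` would keep roughly 20% of tokens overall while concentrating the budget on query-relevant frames.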
Relevance Predictor: Fine-Tuning Objective
Fine-tuning involves minimizing a contrastive-style loss augmented with temporal context:
- Ground-truth relevance labels from GPT-4o-annotated captions are normalized;
- Embeddings are fused locally (within clips) and globally (across the entire video) using learned temperatures and mixture weights for context sensitivity;
- Loss combines per-frame logits via sigmoid cross-entropy against normalized relevance targets, using both local and global temporal features.
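A highly simplified sketch of such an objective, under loose assumptions: local context is approximated by averaging each frame embedding with its clip neighbours, global context by the video-mean embedding, and the two similarity streams are mixed with a learned weight before a per-frame sigmoid cross-entropy. The window size, fusion form, and symbol names are illustrative, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def relevance_finetune_loss(frame_emb, query_emb, targets, tau_local, tau_global, alpha):
    """Sketch of the relevance predictor's fine-tuning objective.

    frame_emb: (T, d) SigLIP frame embeddings; query_emb: (d,) query embedding;
    targets:   (T,) normalized GPT-4o relevance labels in [0, 1].
    tau_local / tau_global are learned temperatures, alpha a learned mixture
    weight (names and fusion form are assumptions of this sketch).
    """
    # Local (clip-level) context: average each frame with its immediate neighbours.
    local = F.avg_pool1d(frame_emb.t().unsqueeze(0), kernel_size=3,
                         stride=1, padding=1).squeeze(0).t()               # (T, d)
    # Global context: the video-mean embedding, broadcast to every frame.
    global_ctx = frame_emb.mean(dim=0, keepdim=True).expand_as(frame_emb)  # (T, d)
    sim_local = F.cosine_similarity(local, query_emb.unsqueeze(0), dim=-1) / tau_local
    sim_global = F.cosine_similarity(global_ctx, query_emb.unsqueeze(0), dim=-1) / tau_global
    logits = alpha * sim_local + (1.0 - alpha) * sim_global                # (T,)
    # Per-frame sigmoid cross-entropy against the normalized relevance targets.
    return F.binary_cross_entropy_with_logits(logits, targets)
```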
Token Importance Scoring
For each frame, token importance is computed using a token aggregator (e.g., ToMe, PruMerge, or FastV), and the top $k_i$ tokens by importance are retained, where $k_i$ is the frame's quota from the pruning-rate assignment above.
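The per-frame pruning step can be sketched as follows; the importance score used here (distinctiveness from the frame's mean token) is only an illustrative stand-in for whichever aggregator is plugged in.

```python
import torch
import torch.nn.functional as F

def prune_frame_tokens(tokens, k):
    """Keep the k highest-importance tokens of a single frame.

    tokens: (N, d) vision tokens of one frame; k: that frame's quota k_i.
    The importance score below is a simple stand-in; a plugged-in aggregator
    such as ToMe, PruMerge, or FastV would supply its own scoring.
    """
    mean_tok = tokens.mean(dim=0, keepdim=True)
    importance = 1.0 - F.cosine_similarity(tokens, mean_tok, dim=-1)  # (N,) distinctiveness
    keep = importance.topk(min(k, tokens.shape[0])).indices
    keep = keep.sort().values  # restore the original spatial order of kept tokens
    return tokens[keep]
```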
3. Inference Algorithm
The KVTP inference workflow comprises the following stages:
- Query-Conditioned Frame Relevance: Predict frame relevance to the query via the fine-tuned SigLIP encoder.
- Pruning Rate Determination: Map relevance scores to per-frame pruning rates, yielding variable token quotas for each frame.
- Token Pruning: Within each frame, compute per-token importance and keep only the most informative tokens.
- Fusion and Downstream Processing: Collate all retained tokens with text tokens and feed the sequence to the LLM backbone for QA or reasoning.
Pseudocode formalizes these steps, showing a clear division between frame relevance assessment, per-frame pruning, and sequence assembly for model input (Liu et al., 13 Mar 2025).
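A hedged Python rendering of that workflow, reusing the helpers from the sketches above; `vision_encoder`, `text_encoder`, `token_encoder`, and `llm` are hypothetical callables standing in for the SigLIP towers, the VLM's vision tokenizer, and the LLM backbone.

```python
import torch

def kvtp_inference(frames, query, vision_encoder, text_encoder, token_encoder, llm,
                   a, b, tau, avg_keep_ratio):
    """End-to-end KVTP inference sketch (model handles are placeholders)."""
    # 1) Query-conditioned frame relevance.
    frame_emb = vision_encoder(frames)        # (T, d) pooled frame embeddings
    query_emb = text_encoder(query)           # (d,)  query embedding
    # 2) Per-frame pruning rates -> token quotas (helper sketched in Section 2).
    patch_tokens = token_encoder(frames)      # (T, N, d_llm) vision tokens per frame
    quotas = assign_token_quotas(frame_emb, query_emb, a, b, tau,
                                 avg_keep_ratio, patch_tokens.shape[1])
    # 3) Keep only the most informative tokens within each frame's quota.
    kept = [prune_frame_tokens(patch_tokens[i], int(quotas[i]))
            for i in range(patch_tokens.shape[0])]
    # 4) Fuse retained vision tokens with the text and run the LLM backbone.
    vision_seq = torch.cat(kept, dim=0)
    return llm(vision_tokens=vision_seq, text=query)
```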
4. Evaluation: SparseKV-QA Benchmark and Results
SparseKV-QA Benchmark
To assess KVTP in realistic, sparse-event scenarios, the SparseKV-QA benchmark was constructed by curating and reorganizing seven video QA corpora, with sampling (up to 128 frames/video), clip-based captioning, and GPT-4o-driven frame-level relevance scoring. The final evaluation set spans 20,050 long videos (mean length 451 s, sparsity 3.56), ensuring a realistic distribution of crucial and redundant events.
Experimental Comparisons
KVTP was evaluated using LLaVA-Video (7B and 72B) with various token pruning baselines. Key findings include:
- Token Usage and FLOPs: KVTP reduces vision token count by up to 80% and FLOPs by approximately 64%.
- Accuracy Preservation: QA accuracy drops less than 1% compared to full-token baselines. For some domains, e.g., VideoMME and EgoSchema, KVTP + PruMerge slightly outperforms the unpruned model at substantially reduced cost.
Table: LLaVA-Video-7B, QA accuracy (%) and relative FLOPs on three datasets
| Method | FLOPs (% of full) | VideoMME | EgoSchema | NextQA |
|---|---|---|---|---|
| Full | 100% | 62.63 | 54.17 | 78.51 |
| PruMerge + KVTP | 36% | 63.29 | 54.71 | 76.76 |
| ToMe + KVTP | 38% | 62.36 | 53.24 | 75.88 |
| Random + KVTP | 36% | 60.16 | 52.73 | 76.50 |
| KeyVideoLLM | 36% | 51.32 | 46.78 | 64.33 |
KVTP consistently boosts the performance of all token-level pruning methods, often recovering lost accuracy from aggressive pruning and performing on par with or surpassing full computation in certain settings. (Liu et al., 13 Mar 2025)
5. Ablations, Qualitative Analysis, and Sensitivity
- Sparsity Sensitivity: KVTP's advantage increases in sparser videos; performance gains over vanilla PruMerge are more prominent as the proportion of critical frames decreases.
- Soft versus Hard Selection: The temperature of the softmax used in frame selection is crucial; hard top-k selection severely degrades accuracy, while moderate temperature preserves context.
- Predictor Fine-tuning: Fine-tuning only the SigLIP context fusion head achieves superior relevance prediction over alternatives, despite requiring fewer additional parameters (0.88B vs. 7.88B for full retraining).
- Qualitative Example: In complex event reasoning (pouring-water task), KVTP preserves both tokens in highly relevant frames and a few from adjacent frames, thus supporting correct answers where uniform pruning or hard keyframe selection fail.
6. KVTP Extensions: EgoPrune and Embodied Reasoning
KVTP has been adapted to egomotion video reasoning in the EgoPrune framework for embodied agents (Li et al., 21 Jul 2025). The instantiation emphasizes geometric alignment and training-free deployment:
- Keyframe Selection: Utilizes overlap-based sampling from EmbodiedR to retain only those frames with significant viewpoint changes, reducing redundancy without information loss.
- Perspective-Aware Redundancy Filtering (PARF): Estimates homographies between consecutive keyframes, aligns patches across views, and discards spatially redundant tokens via thresholded cosine similarity.
- MMR Token Selection: A Maximal Marginal Relevance procedure balances token-to-query relevance and intra-token diversity, greedily selecting the most informative and non-redundant tokens for downstream input.
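The MMR step lends itself to a compact sketch. Token and query embeddings are assumed to live in a shared similarity space, and the trade-off weight `lam` is illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def mmr_select(tokens, query_emb, k, lam=0.7):
    """Maximal Marginal Relevance token selection (illustrative sketch).

    tokens: (N, d) candidate vision tokens; query_emb: (d,) query embedding
    assumed to be projected into the same space. Greedily picks k tokens,
    trading off query relevance against redundancy with already-chosen tokens.
    """
    if k <= 0:
        return tokens[:0]
    rel = F.cosine_similarity(tokens, query_emb.unsqueeze(0), dim=-1)             # (N,) relevance
    sim = F.cosine_similarity(tokens.unsqueeze(1), tokens.unsqueeze(0), dim=-1)   # (N, N) pairwise
    selected = [int(rel.argmax())]                       # seed with the most relevant token
    candidates = set(range(tokens.shape[0])) - set(selected)
    while len(selected) < k and candidates:
        cand = torch.tensor(sorted(candidates))
        redundancy = sim[cand][:, selected].max(dim=1).values   # max similarity to chosen set
        scores = lam * rel[cand] - (1.0 - lam) * redundancy
        best = int(cand[scores.argmax()])
        selected.append(best)
        candidates.remove(best)
    return tokens[selected]
```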
Empirical results on VSI-Bench and UrbanVideo-Bench demonstrate that EgoPrune matches or slightly exceeds the full-token baseline in accuracy even at 30–50% retention, with 20–40% lower FLOPs, 20–30% latency reduction, and 25–35% reduced memory use. Deployment on Jetson Orin NX confirms real-time embodied agent applicability (Li et al., 21 Jul 2025).
7. Impact, Limitations, and Future Directions
KVTP provides a generalizable and efficient mechanism for context-aware compression in multimodal transformers:
- Plug-and-play Integration: Its soft, query-guided pruning is compatible with a range of token-level pruning methods and vision-language backbone architectures.
- Parameter-Efficiency: Fine-tuning is lightweight (<1B additional parameters).
- Redundancy Handling: By bridging token-level and keyframe-level pruning, KVTP is robust for both generic video QA and specialized egomotion video reasoning.
Limitations include the computational overhead of embedding large numbers of frames for relevance assessment. Extensions may involve hierarchical or cascaded predictors, end-to-end joint training with the VLM backbone, or applications beyond QA, such as video captioning, retrieval, and multimodal moment localization. These advances suggest broader applicability for KVTP in the scalable, real-world deployment of vision-language models across diverse domains (Liu et al., 13 Mar 2025, Li et al., 21 Jul 2025).